<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alexandre Vazquez</title>
    <description>The latest articles on Forem by Alexandre Vazquez (@alexandrev).</description>
    <link>https://forem.com/alexandrev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F167984%2F9c789f5c-7dab-4a86-aece-8bf66ea955bd.jpeg</url>
      <title>Forem: Alexandre Vazquez</title>
      <link>https://forem.com/alexandrev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alexandrev"/>
    <language>en</language>
    <item>
      <title>XSLT grouping with xsl:for-each-group: complete guide</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Mon, 11 May 2026 09:00:00 +0000</pubDate>
      <link>https://forem.com/alexandrev/xslt-grouping-with-xslfor-each-group-complete-guide-7lb</link>
      <guid>https://forem.com/alexandrev/xslt-grouping-with-xslfor-each-group-complete-guide-7lb</guid>
<description>&lt;p&gt;Grouping is one of the most powerful features introduced in XSLT 2.0. In XSLT 1.0, grouping required the Muenchian method — a clever but verbose technique built on &lt;code&gt;xsl:key&lt;/code&gt; definitions and &lt;code&gt;generate-id()&lt;/code&gt; comparisons. In 2.0, &lt;code&gt;xsl:for-each-group&lt;/code&gt; makes grouping straightforward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic grouping with group-by
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;group-by&lt;/code&gt; groups nodes that share the same value for a given expression. The result is one iteration per distinct group value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;
  DE120
  US85
  DE200
  FR60
  US140

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stylesheet:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;










&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;
  2320
  160
  2225

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key functions inside &lt;code&gt;for-each-group&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;current-grouping-key()&lt;/code&gt; — returns the value that defines the current group&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;current-group()&lt;/code&gt; — returns the sequence of all nodes in the current group&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Nested grouping
&lt;/h2&gt;

&lt;p&gt;Groups can be nested. Group orders by country, then within each country by status (assuming each &lt;code&gt;order&lt;/code&gt; also carries a &lt;code&gt;status&lt;/code&gt; element):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;








&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  group-adjacent
&lt;/h2&gt;

&lt;p&gt;Groups consecutive nodes that share the same key value. Unlike &lt;code&gt;group-by&lt;/code&gt;, it starts a new group when the key changes, even if the same key appeared earlier. This is useful for processing structured text or segmented data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;
  Starting
  Processing
  Failed
  Retrying
  Done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;




&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces three blocks: an INFO block (positions 1-2), an ERROR block (3-4), and a second INFO block (5). With &lt;code&gt;group-by&lt;/code&gt;, the two INFO blocks would be merged into one.&lt;/p&gt;
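
&lt;p&gt;The result, using the &lt;code&gt;block&lt;/code&gt; wrapper from the rule above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;block level="INFO"&amp;gt;
  &amp;lt;line level="INFO"&amp;gt;Starting&amp;lt;/line&amp;gt;
  &amp;lt;line level="INFO"&amp;gt;Processing&amp;lt;/line&amp;gt;
&amp;lt;/block&amp;gt;
&amp;lt;block level="ERROR"&amp;gt;
  &amp;lt;line level="ERROR"&amp;gt;Failed&amp;lt;/line&amp;gt;
  &amp;lt;line level="ERROR"&amp;gt;Retrying&amp;lt;/line&amp;gt;
&amp;lt;/block&amp;gt;
&amp;lt;block level="INFO"&amp;gt;
  &amp;lt;line level="INFO"&amp;gt;Done&amp;lt;/line&amp;gt;
&amp;lt;/block&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;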

&lt;h2&gt;
  
  
  group-starting-with and group-ending-with
&lt;/h2&gt;

&lt;p&gt;These group nodes based on a pattern match rather than a key value. Every time a node matches the pattern, a new group starts (or ends).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;group-starting-with example&lt;/strong&gt; — treat every &lt;code&gt;h2&lt;/code&gt; element as the start of a section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

    &lt;span class="nt"&gt;&amp;lt;xsl:value-of&lt;/span&gt; &lt;span class="na"&gt;select=&lt;/span&gt;&lt;span class="s"&gt;"self::h2"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;group-ending-with example&lt;/strong&gt; — group lines until a blank line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;




&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Computing aggregates
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;current-group()&lt;/code&gt; returns a sequence, so you can apply any XPath aggregate function directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;







&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Try it in XSLT Playground
&lt;/h2&gt;

&lt;p&gt;Paste any of the examples above into &lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; with version set to 2.0 or 3.0. Grouping is one of the features that benefits most from live testing — you can immediately see how changing the grouping key or switching between &lt;code&gt;group-by&lt;/code&gt; and &lt;code&gt;group-adjacent&lt;/code&gt; affects the output structure.&lt;/p&gt;

</description>
      <category>coding</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>XSLT template matching explained with examples</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Thu, 07 May 2026 09:00:00 +0000</pubDate>
      <link>https://forem.com/alexandrev/xslt-template-matching-explained-with-examples-1f2n</link>
      <guid>https://forem.com/alexandrev/xslt-template-matching-explained-with-examples-1f2n</guid>
      <description>&lt;p&gt;Template matching is the mechanism that drives every XSLT transformation. Understanding how the processor selects templates — and what happens when multiple templates could match — is the difference between a stylesheet that works reliably and one that produces surprising output. This post covers everything you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  How match patterns work
&lt;/h2&gt;

&lt;p&gt;When the processor visits a node, it evaluates every template's &lt;code&gt;match&lt;/code&gt; attribute as an XPath pattern. A pattern is a restricted form of XPath that tests properties of a node rather than selecting nodes from a starting point. If the pattern is satisfied, that template is a candidate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;





&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Priority and conflict resolution
&lt;/h2&gt;

&lt;p&gt;More than one template can match the same node. The processor resolves the conflict using priority. Each pattern has a default priority calculated by the spec:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern type&lt;/th&gt;
&lt;th&gt;Default priority&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;node()&lt;/code&gt; or &lt;code&gt;*&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;-0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;element-name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prefix:element-name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;a/b&lt;/code&gt; (path)&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;a[predicate]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@attr&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;More specific patterns automatically get higher priority. You can override this with the &lt;code&gt;priority&lt;/code&gt; attribute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If two templates have equal computed priority, the spec allows the processor to signal an error or to recover by picking the template declared last (Saxon recovers and issues a warning by default). Always assign explicit priorities when you have competing templates.&lt;/p&gt;

&lt;h2&gt;
  
  
  The built-in templates
&lt;/h2&gt;

&lt;p&gt;XSLT has default templates for every node type. If no explicit template matches a node, the built-in fires. For elements and the document root, the built-in calls &lt;code&gt;apply-templates&lt;/code&gt; on all children. For text and attribute nodes, it outputs the string value.&lt;/p&gt;

&lt;p&gt;This means that without any templates at all, the processor will walk the entire tree and output all text content. Understanding this explains why simple stylesheets can produce unexpected extra text — a text node matched nothing explicit, and the built-in output it.&lt;/p&gt;
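
&lt;p&gt;A quick way to see this: run a stylesheet with no templates at all against a small input (element names here are illustrative). The output is just the concatenated text nodes, &lt;code&gt;XSLT&lt;/code&gt; and &lt;code&gt;Kay&lt;/code&gt;, plus the whitespace between them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;book&amp;gt;
  &amp;lt;title&amp;gt;XSLT&amp;lt;/title&amp;gt;
  &amp;lt;author&amp;gt;Kay&amp;lt;/author&amp;gt;
&amp;lt;/book&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;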

&lt;p&gt;To suppress text output globally, add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This overrides the built-in with an empty template, producing no output for text nodes that are not handled elsewhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modes
&lt;/h2&gt;

&lt;p&gt;Modes let you have multiple templates for the same node that serve different purposes. A mode is a named context for a set of templates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;







    ## 





&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Call with mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Modes are especially useful when you need to process the same nodes in multiple places in the output with different logic each time.&lt;/p&gt;

&lt;h2&gt;
  
  
  apply-templates vs for-each
&lt;/h2&gt;

&lt;p&gt;Both iterate over a set of nodes. The difference is that &lt;code&gt;apply-templates&lt;/code&gt; dispatches to the best matching template for each node, while &lt;code&gt;for-each&lt;/code&gt; stays in the current context and does not do template lookup.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;apply-templates&lt;/code&gt; when you want polymorphism — different node types handled differently. Use &lt;code&gt;for-each&lt;/code&gt; when you are doing a simple iteration over a homogeneous set and do not need dispatch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;




  - 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing patterns in XSLT Playground
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; is a fast way to experiment with matching rules. Paste a stylesheet with multiple competing templates and check which one fires. Add `` in each template to trace which one the processor picks. The trace panel shows all messages in order so you can follow the dispatch chain.&lt;/p&gt;

&lt;p&gt;Solid understanding of template matching pays off every time you work on a complex stylesheet. Once you know how priorities and built-ins interact, most "unexpected output" bugs become obvious.&lt;/p&gt;

</description>
      <category>coding</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Prometheus Alertmanager vs Grafana Alerting (2026): Architecture, Features, and When to Use Each</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Tue, 05 May 2026 11:00:00 +0000</pubDate>
      <link>https://forem.com/alexandrev/prometheus-alertmanager-vs-grafana-alerting-2026-architecture-features-and-when-to-use-each-48d7</link>
      <guid>https://forem.com/alexandrev/prometheus-alertmanager-vs-grafana-alerting-2026-architecture-features-and-when-to-use-each-48d7</guid>
      <description>&lt;h1&gt;
  
  
  Prometheus Alertmanager vs Grafana Alerting (2026): Architecture, Features, and When to Use Each
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Most observability stacks running in production for over a year end up with alerting spread across two systems: Prometheus Alertmanager handling metric-based alerts and Grafana Alerting managing everything else. This creates the "alerting consolidation problem" where on-call teams receive duplicated pages, silencing rules live in two places, and nobody is certain which system is authoritative.&lt;/p&gt;

&lt;p&gt;The question is straightforward: should you standardize on Prometheus Alertmanager, move everything into Grafana Alerting, or deliberately run both? The answer depends on your datasource mix, your GitOps maturity, and how your organization manages on-call routing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prometheus Alertmanager: The Standalone Receiver
&lt;/h3&gt;

&lt;p&gt;Alertmanager is a dedicated, standalone component in the Prometheus ecosystem. It does not evaluate alert rules itself. Instead, Prometheus (or compatible senders like Thanos Ruler, Cortex, or Mimir Ruler) evaluates PromQL expressions and pushes firing alerts to the Alertmanager API. Alertmanager then handles deduplication, grouping, inhibition, silencing, and notification delivery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Simplified Prometheus → Alertmanager flow
#
# [Prometheus] --evaluates rules--&amp;gt; [firing alerts]
#        |
#        +--POST /api/v2/alerts--&amp;gt; [Alertmanager]
#                                      |
#                          +-----------+-----------+
#                          |           |           |
#                       [Slack]    [PagerDuty]  [Email]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire configuration lives in a single YAML file (&lt;code&gt;alertmanager.yml&lt;/code&gt;). This includes the routing tree, receiver definitions, inhibition rules, and silence templates. There is no database, no UI-driven state — just a config file and an optional local storage directory for notification state and silences. This makes it trivially reproducible and ideal for GitOps workflows.&lt;/p&gt;

&lt;p&gt;For high availability, you run multiple Alertmanager instances in a gossip-based cluster. They use a mesh protocol to share silence and notification state, ensuring that failover does not result in duplicate or lost notifications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grafana Alerting: The Integrated Platform
&lt;/h3&gt;

&lt;p&gt;Grafana Alerting (sometimes called "Grafana Unified Alerting," introduced in Grafana 8) takes a different architectural approach. It embeds the entire alerting lifecycle — rule evaluation, state management, routing, and notification — inside the Grafana server process. Under the hood, it uses a fork of Alertmanager for the routing and notification layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Simplified Grafana Alerting flow
#
# [Grafana Server]
#   ├── Rule Evaluation Engine
#   │     ├── queries Prometheus
#   │     ├── queries Loki
#   │     ├── queries CloudWatch
#   │     └── queries any supported datasource
#   │
#   ├── Alert State Manager (internal)
#   │
#   └── Embedded Alertmanager (routing + notifications)
#           |
#           +-----------+-----------+
#           |           |           |
#        [Slack]    [PagerDuty]  [Email]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical distinction is that Grafana Alerting evaluates alert rules itself, querying any configured datasource — not just Prometheus. It can fire alerts based on Loki log queries, Elasticsearch searches, CloudWatch metrics, PostgreSQL queries, or any of the 100+ datasource plugins available in Grafana. Rule definitions, contact points, notification policies, and mute timings are stored in the Grafana database (or provisioned via YAML files and the Grafana API).&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Prometheus Alertmanager&lt;/th&gt;
&lt;th&gt;Grafana Alerting&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datasources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prometheus-compatible only (Prometheus, Thanos, Mimir, VictoriaMetrics)&lt;/td&gt;
&lt;td&gt;Any Grafana datasource (Prometheus, Loki, Elasticsearch, CloudWatch, SQL databases, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rule evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External (Prometheus/Ruler evaluates rules and pushes alerts)&lt;/td&gt;
&lt;td&gt;Built-in (Grafana evaluates rules directly)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Routing tree&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hierarchical YAML-based routing with match/match_re, continue, group_by&lt;/td&gt;
&lt;td&gt;Notification policies with label matchers, nested policies, mute timings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grouping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full support via group_by, group_wait, group_interval&lt;/td&gt;
&lt;td&gt;Full support via notification policies with equivalent controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inhibition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native inhibition rules (suppress alerts when a related alert is firing)&lt;/td&gt;
&lt;td&gt;Supported since Grafana 10.3 but less flexible than Alertmanager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Silencing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Label-based silences via API or UI, time-limited&lt;/td&gt;
&lt;td&gt;Mute timings (recurring schedules) and silences (ad-hoc, label-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Notification channels&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Email, Slack, PagerDuty, OpsGenie, VictorOps, webhook, WeChat, Telegram, SNS, Webex&lt;/td&gt;
&lt;td&gt;All of the above plus Teams, Discord, Google Chat, LINE, Threema, Oncall, and more via contact points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Templating&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Go templates in notification config&lt;/td&gt;
&lt;td&gt;Go templates with access to Grafana template variables and functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-tenancy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not built-in; achieved via separate instances or Mimir Alertmanager&lt;/td&gt;
&lt;td&gt;Native multi-tenancy via Grafana organizations and RBAC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gossip-based cluster (peer mesh, well-proven)&lt;/td&gt;
&lt;td&gt;Database-backed HA with peer discovery between Grafana instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configuration model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single YAML file, fully declarative&lt;/td&gt;
&lt;td&gt;UI + API + provisioning YAML files, stored in database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitOps compatibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent — config file lives in version control natively&lt;/td&gt;
&lt;td&gt;Possible via provisioning files or Terraform provider, but requires extra tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;External alert sources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any system that can POST to the Alertmanager API&lt;/td&gt;
&lt;td&gt;Supported via the Grafana Alerting API (external alerts can be pushed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Managed service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Available via Grafana Cloud (as Mimir Alertmanager), Amazon Managed Prometheus&lt;/td&gt;
&lt;td&gt;Available via Grafana Cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Alertmanager Strengths
&lt;/h2&gt;

&lt;p&gt;Alertmanager has been a production staple since 2015. Over a decade of use across thousands of organizations has made it one of the most battle-tested components in the CNCF ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Declarative, GitOps-Native Configuration
&lt;/h3&gt;

&lt;p&gt;The entire Alertmanager configuration is a single YAML file. There is no hidden state in a database, no click-driven configuration that someone forgets to document. You check it into Git, review it in a pull request, and deploy it through your CI/CD pipeline like any other infrastructure code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# alertmanager.yml — everything in one file&lt;/span&gt;
&lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resolve_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;slack_api_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://hooks.slack.com/services/T00/B00/XXX"&lt;/span&gt;

&lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-team&lt;/span&gt;
  &lt;span class="na"&gt;group_by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;alertname&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;group_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
  &lt;span class="na"&gt;group_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;repeat_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4h&lt;/span&gt;
  &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pagerduty-oncall&lt;/span&gt;
      &lt;span class="na"&gt;group_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match_re&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^(payments|checkout)$"&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments-slack&lt;/span&gt;
      &lt;span class="na"&gt;continue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-team&lt;/span&gt;
    &lt;span class="na"&gt;slack_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#platform-alerts"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pagerduty-oncall&lt;/span&gt;
    &lt;span class="na"&gt;pagerduty_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;service_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments-slack&lt;/span&gt;
    &lt;span class="na"&gt;slack_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#payments-oncall"&lt;/span&gt;

&lt;span class="na"&gt;inhibit_rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
    &lt;span class="na"&gt;target_match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
    &lt;span class="na"&gt;equal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;alertname&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every change is auditable. Rollbacks are a &lt;code&gt;git revert&lt;/code&gt; away. This matters enormously when you are debugging why an alert did not fire at 3 AM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lightweight and Single-Purpose
&lt;/h3&gt;

&lt;p&gt;Alertmanager does one thing: route and deliver notifications. It has no dashboard, no query engine, no datasource plugins. This single-purpose design makes it operationally simple. Resource consumption is minimal — a small Alertmanager instance handles thousands of active alerts on a few hundred megabytes of memory. It starts in milliseconds and requires almost no maintenance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mature Inhibition and Routing
&lt;/h3&gt;

&lt;p&gt;Alertmanager's inhibition rules are first-class citizens. You can suppress downstream warnings when a critical alert is already firing, preventing alert storms from overwhelming your on-call team. The hierarchical routing tree with &lt;code&gt;continue&lt;/code&gt; flags allows for nuanced delivery: send to the team channel AND escalate to PagerDuty simultaneously, with different grouping strategies at each level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Proven High Availability
&lt;/h3&gt;

&lt;p&gt;The gossip-based HA cluster has been stable for years. Running three Alertmanager replicas, with every Prometheus instance configured to send alerts to all of them (via static addresses or Kubernetes headless-service discovery, not a load balancer), gives you reliable notification delivery without shared storage. The gossip protocol handles deduplication across instances automatically, which is the hardest part of distributed alerting.&lt;/p&gt;
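
&lt;p&gt;A minimal sketch of one replica's startup flags (the peer hostnames are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replica 0 of a three-node cluster; each replica lists the others as peers
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1.alertmanager:9094 \
  --cluster.peer=alertmanager-2.alertmanager:9094
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;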

&lt;h2&gt;
  
  
  Grafana Alerting Strengths
&lt;/h2&gt;

&lt;p&gt;Grafana Alerting has matured considerably since its rocky introduction in Grafana 8. By Grafana 11 and 12, it has become a legitimate production alerting platform with capabilities that Alertmanager cannot match on its own.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Datasource Alert Rules
&lt;/h3&gt;

&lt;p&gt;This is Grafana Alerting's strongest differentiator. You can write alert rules that query Loki for error log spikes, CloudWatch for AWS resource utilization, Elasticsearch for application errors, or a PostgreSQL database for business metrics — all from the same alerting system. If your observability stack includes more than just Prometheus, this eliminates the need for separate alerting tools per datasource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Grafana alert rule provisioning example — alerting on Loki log errors&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;orgId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;application-errors&lt;/span&gt;
    &lt;span class="na"&gt;folder&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Production&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki-error-spike&lt;/span&gt;
        &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;service"&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;C&lt;/span&gt;
        &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;refId: A\            datasourceUid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki-prod&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sum(rate({app="payment-service"}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"ERROR"&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;[5m]))'&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;refId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;B&lt;/span&gt;
            &lt;span class="na"&gt;datasourceUid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__expr__"&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reduce&lt;/span&gt;
              &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A&lt;/span&gt;
              &lt;span class="na"&gt;reducer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;last&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;refId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;C&lt;/span&gt;
            &lt;span class="na"&gt;datasourceUid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__expr__"&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;threshold&lt;/span&gt;
              &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;B&lt;/span&gt;
              &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;evaluator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gt&lt;/span&gt;
                    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;10&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
          &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is something Alertmanager simply cannot do. Alertmanager only receives pre-evaluated alerts — it has no concept of datasources or query execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unified UI for Alert Management
&lt;/h3&gt;

&lt;p&gt;Grafana provides a single pane of glass for alert rule creation, visualization, notification policy management, contact point configuration, and silence management. For teams where not every engineer is comfortable editing YAML routing trees, the visual notification policy editor significantly reduces the barrier to entry. You can see the state of every alert rule, its evaluation history, and the exact notification path it will take — all without leaving the browser.&lt;/p&gt;

&lt;h3&gt;
  
  
  Native Multi-Tenancy and RBAC
&lt;/h3&gt;

&lt;p&gt;Grafana's organization model and role-based access control extend naturally to alerting. Different teams can manage their own alert rules, contact points, and notification policies within their organization or folder scope, without seeing or interfering with other teams. Achieving this with standalone Alertmanager requires either running separate instances per tenant or using Mimir's multi-tenant Alertmanager.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mute Timings and Richer Scheduling
&lt;/h3&gt;

&lt;p&gt;While Alertmanager supports silences (ad-hoc, time-limited suppressions), Grafana Alerting adds mute timings — recurring time-based windows where notifications are suppressed. This is useful for scheduled maintenance windows, business-hours-only alerting, or suppressing non-critical alerts on weekends. Alertmanager requires external tooling or manual silence creation for recurring windows.&lt;/p&gt;
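
&lt;p&gt;As a sketch, a recurring weekend window can be file-provisioned roughly like this (key names follow Grafana's alerting provisioning format; verify against your Grafana version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Illustrative provisioning file: mute non-critical alerts on weekends
apiVersion: 1
muteTimes:
  - orgId: 1
    name: weekend-quiet-hours
    time_intervals:
      - weekdays: ["saturday", "sunday"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;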

&lt;h3&gt;
  
  
  Grafana Cloud as a Managed Option
&lt;/h3&gt;

&lt;p&gt;For teams that want to avoid managing alerting infrastructure entirely, Grafana Cloud provides a fully managed Grafana Alerting stack. This includes HA, state persistence, and notification delivery without any self-hosted components. The Grafana Cloud alerting stack also includes a managed Mimir Alertmanager, which means you can use Prometheus-native alerting rules if you prefer that model while still benefiting from the managed infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Prometheus Alertmanager
&lt;/h2&gt;

&lt;p&gt;Alertmanager is the right choice when the following conditions describe your environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your metrics stack is Prometheus-native.&lt;/strong&gt; If all your alert rules are PromQL expressions evaluated by Prometheus, Thanos Ruler, or Mimir Ruler, Alertmanager is the natural fit. There is no added value in routing those alerts through Grafana.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitOps is non-negotiable.&lt;/strong&gt; If every infrastructure change must go through a pull request and be fully declarative, Alertmanager's single-file configuration model is significantly easier to manage than Grafana's database-backed state. Tools like &lt;code&gt;amtool&lt;/code&gt; provide config validation in CI pipelines, as sketched after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need fine-grained routing with inhibition.&lt;/strong&gt; Complex routing trees with multiple levels of grouping, inhibition rules, and &lt;code&gt;continue&lt;/code&gt; flags are more naturally expressed in Alertmanager's YAML format. The routing logic has been stable and well-documented for years.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You run microservices with per-team routing.&lt;/strong&gt; If each team owns its routing subtree and the routing logic is complex, Alertmanager's hierarchical model scales better than UI-driven configuration. Teams can own their section of the config file via CODEOWNERS in Git.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want minimal operational overhead.&lt;/strong&gt; Alertmanager is a single binary with minimal resource requirements. There is no database to back up, no migrations to run, and no UI framework to keep updated.&lt;/li&gt;
&lt;/ul&gt;
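
&lt;p&gt;The CI validation mentioned above can be as simple as this (pipeline wiring is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fail the pipeline if the routing config is invalid
amtool check-config alertmanager.yml

# Preview where an alert with these labels would be routed
amtool config routes test --config.file=alertmanager.yml severity=critical team=payments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;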

&lt;h2&gt;
  
  
  When to Use Grafana Alerting
&lt;/h2&gt;

&lt;p&gt;Grafana Alerting is the right choice when these conditions apply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You alert on more than just Prometheus metrics.&lt;/strong&gt; If you need alert rules based on Loki logs, Elasticsearch queries, CloudWatch metrics, or database queries, Grafana Alerting is the only option that handles all of these natively. The alternative is running separate alerting tools per datasource, which is worse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your team prefers UI-driven configuration.&lt;/strong&gt; Not every engineer wants to edit YAML routing trees. If your organization values a visual interface for managing alerts, contact points, and notification policies, Grafana's UI is a major productivity advantage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You are using Grafana Cloud.&lt;/strong&gt; If you are already on Grafana Cloud, using its built-in alerting is the path of least resistance. You get HA, managed notification delivery, and a unified experience without running any additional infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy is a requirement.&lt;/strong&gt; If multiple teams need isolated alerting configurations with RBAC, Grafana's native organization and folder-based access model is significantly easier to set up than running per-tenant Alertmanager instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want mute timings for recurring maintenance windows.&lt;/strong&gt; If your team regularly needs to suppress alerts during scheduled windows (deploy windows, batch processing hours, weekend non-critical suppression), Grafana's mute timings feature is more ergonomic than creating and managing recurring silences in Alertmanager.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Running Both Together: The Hybrid Pattern
&lt;/h2&gt;

&lt;p&gt;In practice, many production environments run both Alertmanager and Grafana Alerting. This is not necessarily a mistake — it can be a deliberate architectural choice when done with clear boundaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Hybrid Architecture
&lt;/h3&gt;

&lt;p&gt;The most common pattern looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus Alertmanager&lt;/strong&gt; handles all metric-based alerts. PromQL rules are evaluated by Prometheus or a long-term storage ruler (Thanos, Mimir). Alertmanager owns routing, grouping, and notification for these alerts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana Alerting&lt;/strong&gt; handles non-Prometheus alerts: log-based alerts from Loki, business metrics from SQL datasources, and cross-datasource correlation rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key to making this work without chaos is establishing clear ownership rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Ownership boundaries for hybrid alerting
#
# Prometheus Alertmanager owns:
#   - All PromQL-based alert rules
#   - Infrastructure alerts (node, kubelet, etcd, CoreDNS)
#   - Application SLO/SLI alerts based on metrics
#
# Grafana Alerting owns:
#   - Log-based alert rules (Loki, Elasticsearch)
#   - Business metric alerts (SQL datasources)
#   - Cross-datasource correlation rules
#   - Alerts for teams that prefer UI-driven management
#
# Shared:
#   - Contact points / receivers use the same Slack channels and PagerDuty services
#   - On-call rotations are managed externally (PagerDuty, Grafana OnCall)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both systems can deliver to the same notification channels. The critical discipline is ensuring that silencing and maintenance windows are applied in both systems when needed. This is the primary operational cost of the hybrid approach.&lt;/p&gt;
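
&lt;p&gt;Concretely, a maintenance silence has to be created twice. A sketch, with illustrative hostnames and token auth (the Grafana path targets its embedded, Alertmanager-compatible API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1) Silence in Prometheus Alertmanager
amtool silence add service=payments \
  --alertmanager.url=http://alertmanager:9093 \
  --duration=2h --comment="DB maintenance" --author=ops

# 2) The same silence in Grafana's embedded Alertmanager
curl -X POST http://grafana:3000/api/alertmanager/grafana/api/v2/silences \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"matchers":[{"name":"service","value":"payments","isRegex":false}],
       "startsAt":"2026-05-05T10:00:00Z","endsAt":"2026-05-05T12:00:00Z",
       "createdBy":"ops","comment":"DB maintenance"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;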

&lt;h3&gt;
  
  
  Grafana as a Viewer for Alertmanager
&lt;/h3&gt;

&lt;p&gt;Even if you use Alertmanager exclusively for routing and notification, Grafana can serve as a read-only viewer. Grafana natively supports connecting to an external Alertmanager datasource, allowing you to see firing alerts, active silences, and alert groups in the Grafana UI. This gives you the operational visibility of Grafana without moving your alerting logic into it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Grafana datasource provisioning for external Alertmanager&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Alertmanager&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alertmanager&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://alertmanager.monitoring.svc:9093&lt;/span&gt;
    &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
    &lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;implementation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Migration Considerations
&lt;/h2&gt;

&lt;p&gt;If you are moving from one system to the other, here are the practical considerations to plan for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migrating from Alertmanager to Grafana Alerting
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rule conversion.&lt;/strong&gt; Your PromQL-based recording and alerting rules defined in Prometheus rule files need to be recreated as Grafana alert rules. Grafana provides a migration tool that can import Prometheus-format rules, but complex expressions may need manual adjustment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing tree translation.&lt;/strong&gt; Alertmanager's hierarchical routing tree maps to Grafana's notification policies, but the semantics are not identical. Test the notification routing thoroughly — the &lt;code&gt;continue&lt;/code&gt; flag behavior and default routes may differ.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silence and inhibition migration.&lt;/strong&gt; Active silences are ephemeral and do not need migration. Inhibition rules need to be recreated in Grafana's format. Recurring maintenance windows should be converted to mute timings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run in parallel first.&lt;/strong&gt; The safest migration strategy is to run both systems in parallel for two to four weeks, sending notifications from both, then cutting over when you have confidence in the Grafana setup. Accept the temporary noise of duplicate alerts — it is far cheaper than missing a critical page during migration.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Migrating from Grafana Alerting to Alertmanager
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Datasource limitation.&lt;/strong&gt; You can only migrate alerts that are based on Prometheus-compatible datasources. Alerts querying Loki, Elasticsearch, or SQL datasources have no equivalent in Alertmanager — you will need an alternative solution for those.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule export.&lt;/strong&gt; Export Grafana alert rules and convert them to Prometheus-format rule files. The Grafana API (&lt;code&gt;GET /api/v1/provisioning/alert-rules&lt;/code&gt;) provides structured output that can be transformed with a script; a sketch follows this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contact point mapping.&lt;/strong&gt; Map Grafana contact points to Alertmanager receivers. The configuration format is different, but the concepts are equivalent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State loss.&lt;/strong&gt; Alertmanager does not carry over Grafana's alert evaluation history. You start fresh. Plan for a brief period where alerts may re-fire as Prometheus evaluates rules that were previously managed by Grafana.&lt;/li&gt;
&lt;/ul&gt;
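
&lt;p&gt;A sketch of the export step (hostname, token, and the &lt;code&gt;jq&lt;/code&gt; field selection are illustrative; converting the JSON to Prometheus rule-file syntax is still up to your script):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Dump all Grafana-managed alert rules, then list title + first query's datasource
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  http://grafana:3000/api/v1/provisioning/alert-rules \
  | jq -r '.[] | [.title, .data[0].datasourceUid] | @tsv'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;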

&lt;h2&gt;
  
  
  Decision Framework
&lt;/h2&gt;

&lt;p&gt;If you want a quick decision path, use this framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Start here:
│
├── Do you alert on non-Prometheus datasources (Loki, ES, SQL, CloudWatch)?
│   ├── YES → Grafana Alerting (at least for those datasources)
│   └── NO ↓
│
├── Is GitOps/declarative config a hard requirement?
│   ├── YES → Alertmanager
│   └── NO ↓
│
├── Do you need multi-tenancy with RBAC?
│   ├── YES → Grafana Alerting (or Mimir Alertmanager)
│   └── NO ↓
│
├── Are you on Grafana Cloud?
│   ├── YES → Grafana Alerting (path of least resistance)
│   └── NO ↓
│
└── Default → Alertmanager (simpler, lighter, well-proven)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For many teams, the honest answer is "both" — Alertmanager for the Prometheus-native metric pipeline, Grafana Alerting for everything else. That is a valid architecture as long as the ownership boundaries are documented and the on-call team knows where to look.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between Alertmanager and Grafana Alerting?
&lt;/h3&gt;

&lt;p&gt;Prometheus Alertmanager is a standalone notification routing engine that receives pre-evaluated alerts from Prometheus and delivers them to receivers like Slack, PagerDuty, or email. Grafana Alerting is an integrated alerting platform embedded in Grafana that both evaluates alert rules and handles notification routing. Alertmanager is configured entirely via YAML, while Grafana Alerting offers a UI, API, and file-based provisioning. The fundamental difference is scope: Alertmanager handles only the routing and notification phase, while Grafana Alerting handles the full lifecycle from query evaluation to notification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can Grafana Alerting replace Prometheus Alertmanager?
&lt;/h3&gt;

&lt;p&gt;Yes, for many use cases. Grafana Alerting can evaluate PromQL rules directly against your Prometheus datasource, so you do not strictly need a separate Alertmanager instance. However, there are scenarios where Alertmanager remains the better choice: heavily GitOps-driven environments, teams that need Alertmanager's mature inhibition rules, or architectures where Prometheus rule evaluation happens externally (Thanos Ruler, Mimir Ruler) and a dedicated Alertmanager is already in the pipeline. If your only datasource is Prometheus and you value declarative configuration, Alertmanager is still simpler and lighter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Grafana Alertmanager the same as Prometheus Alertmanager?
&lt;/h3&gt;

&lt;p&gt;Not exactly. Grafana Alerting uses a fork of the Prometheus Alertmanager code internally for its notification routing engine, but it is not the same product. The Grafana "Alertmanager" visible in the UI is a managed, embedded component with a different configuration interface (notification policies, contact points, mute timings) compared to the standalone Prometheus Alertmanager (routing tree, receivers, inhibition rules in YAML). Grafana can also connect to an external Prometheus Alertmanager as a datasource, which adds to the confusion.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the best alternatives to Prometheus Alertmanager?
&lt;/h3&gt;

&lt;p&gt;The most direct alternative is Grafana Alerting, which can receive and route Prometheus alerts while also supporting other datasources. Beyond that: &lt;strong&gt;Grafana OnCall&lt;/strong&gt; for on-call management and escalation, &lt;strong&gt;PagerDuty&lt;/strong&gt; or &lt;strong&gt;Opsgenie&lt;/strong&gt; as managed incident response platforms, &lt;strong&gt;Keep&lt;/strong&gt; as an open-source AIOps alert management platform, and &lt;strong&gt;Mimir Alertmanager&lt;/strong&gt; for multi-tenant environments running Grafana Mimir.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I use Prometheus alerts or Grafana alerts for Kubernetes monitoring?
&lt;/h3&gt;

&lt;p&gt;For Kubernetes monitoring specifically, the kube-prometheus-stack (which includes Prometheus, Alertmanager, and a comprehensive set of pre-built alerting rules) remains the industry standard. These rules are PromQL-based and are designed to work with Alertmanager. If you are deploying kube-prometheus-stack, using Alertmanager for metric-based alerts is the straightforward choice. Add Grafana Alerting on top if you also need to alert on logs (via Loki) or non-metric datasources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The Alertmanager vs Grafana Alerting debate is not really about which tool is better — it is about which tool fits your operational context. Alertmanager is simpler, lighter, and more GitOps-friendly. Grafana Alerting is more versatile, more accessible to UI-oriented teams, and the only option if you need multi-datasource alerting. Running both is perfectly valid when the boundaries are clear.&lt;/p&gt;

&lt;p&gt;The worst outcome is not picking the "wrong" tool. The worst outcome is running both accidentally, with overlapping coverage, duplicated notifications, and no clear ownership. Whatever you choose, document the decision, define the ownership boundaries, and make sure your on-call team knows exactly where to go when they need to silence an alert at 3 AM.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexandre-vazquez.com/alertmanager-vs-grafana-alerting/" rel="noopener noreferrer"&gt;alexandre-vazquez.com/alertmanager-vs-grafana-alerting&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>grafana</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Prometheus Alertmanager Vs Grafana Alerting (2026): Architecture, Features, And When To Use Each</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Tue, 05 May 2026 10:00:00 +0000</pubDate>
      <link>https://forem.com/alexandrev/prometheus-alertmanager-vs-grafana-alerting-2026-architecture-features-and-when-to-use-each-18pi</link>
      <guid>https://forem.com/alexandrev/prometheus-alertmanager-vs-grafana-alerting-2026-architecture-features-and-when-to-use-each-18pi</guid>
      <description>&lt;h1&gt;
  
  
  Prometheus Alertmanager Vs Grafana Alerting (2026): Architecture, Features, And When To Use Each
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexandre-vazquez.com/alertmanager-vs-grafana-alerting/" rel="noopener noreferrer"&gt;alexandre-vazquez.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Read the full article on my blog: &lt;a href="https://alexandre-vazquez.com/alertmanager-vs-grafana-alerting/" rel="noopener noreferrer"&gt;https://alexandre-vazquez.com/alertmanager-vs-grafana-alerting/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Debugging Distroless Containers: kubectl debug, Ephemeral Containers, and When to Use Each</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Tue, 05 May 2026 08:00:01 +0000</pubDate>
      <link>https://forem.com/alexandrev/debugging-distroless-containers-kubectl-debug-ephemeral-containers-and-when-to-use-each-5enb</link>
      <guid>https://forem.com/alexandrev/debugging-distroless-containers-kubectl-debug-ephemeral-containers-and-when-to-use-each-5enb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexandre-vazquez.com/debugging-distroless-containers/" rel="noopener noreferrer"&gt;alexandre-vazquez.com/debugging-distroless-containers/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The container works fine in CI. It deploys successfully to staging. Then something goes wrong in production and you type the command you always type: &lt;code&gt;kubectl exec -it my-pod -- /bin/bash&lt;/code&gt;. The response is immediate: &lt;code&gt;OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/bash": stat /bin/bash: no such file or directory&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You try &lt;code&gt;/bin/sh&lt;/code&gt;. Same error. You try &lt;code&gt;ls&lt;/code&gt;. Same error. The container image is distroless — it ships only your application binary and its runtime dependencies, with no shell, no package manager, no debugging tools of any kind. This is intentional and correct from a security standpoint. It is also a significant operational challenge the first time you face it in production.&lt;/p&gt;

&lt;p&gt;This article covers every practical technique for debugging distroless containers in Kubernetes: &lt;strong&gt;kubectl debug with ephemeral containers&lt;/strong&gt; (the standard approach), &lt;strong&gt;pod copy strategy&lt;/strong&gt; (for Kubernetes versions without ephemeral container support, or when you need to modify the running pod spec), &lt;strong&gt;debug image variants&lt;/strong&gt; (the pragmatic developer shortcut), &lt;strong&gt;cdebug&lt;/strong&gt; (a purpose-built tool that simplifies the process), and &lt;strong&gt;node-level debugging&lt;/strong&gt; (the last resort with the most power). For each technique I will explain what it can and cannot do, what Kubernetes version or RBAC permissions it requires, and in which scenario — developer in local, platform engineer in staging, ops in production — it is the appropriate choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Distroless Breaks the Normal Debugging Workflow
&lt;/h2&gt;

&lt;p&gt;Traditional container debugging assumes you can exec into the container and use shell tools: &lt;code&gt;ps&lt;/code&gt;, &lt;code&gt;netstat&lt;/code&gt;, &lt;code&gt;strace&lt;/code&gt;, &lt;code&gt;curl&lt;/code&gt;, a text editor. Distroless images remove all of this by design. The Google distroless project, Chainguard's Wolfi-based images, and the broader minimal image ecosystem deliberately exclude everything that is not required to run the application. The result is a dramatically smaller attack surface: no shell means no RCE via shell injection, no package manager means no easy escalation path, fewer binaries means fewer CVEs in the image scan.&lt;/p&gt;

&lt;p&gt;The tradeoff is operational: when something goes wrong, you cannot use the tools that the process itself is not allowed to run. A Java application in &lt;code&gt;gcr.io/distroless/java17-debian12&lt;/code&gt; has the JRE and nothing else. A Go binary compiled with CGO disabled and shipped in &lt;code&gt;gcr.io/distroless/static-debian12&lt;/code&gt; has literally only the binary and the necessary CA certificates and timezone data. There is no &lt;code&gt;wget&lt;/code&gt; to download a debug binary, no &lt;code&gt;apt&lt;/code&gt; to install one, no &lt;code&gt;bash&lt;/code&gt; to run a script.&lt;/p&gt;

&lt;p&gt;Kubernetes solves this at the platform level with &lt;strong&gt;ephemeral containers&lt;/strong&gt;, added as stable in Kubernetes 1.25. The principle is that a debug container — which can have a full shell and any tools you want — can be injected into a running pod and share its process namespace, network namespace, and filesystem mounts without modifying the original container or restarting the pod.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 1: kubectl debug with Ephemeral Containers
&lt;/h2&gt;

&lt;p&gt;Ephemeral containers are the canonical solution. Since Kubernetes 1.25 (stable), &lt;code&gt;kubectl debug&lt;/code&gt; can inject a temporary container into a running pod. The container shares the target pod's network namespace by default, and with &lt;code&gt;--target&lt;/code&gt; it can also share the process namespace of a specific container, allowing you to inspect its running processes and open file descriptors.&lt;/p&gt;

&lt;p&gt;The basic invocation is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl debug -it my-pod \
  --image=busybox:latest \
  --target=my-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;--target&lt;/code&gt; flag is the critical piece. Without it, the ephemeral container gets its own process namespace. With it, it shares the process namespace of the specified container — meaning you can run &lt;code&gt;ps aux&lt;/code&gt; and see the application's processes, use &lt;code&gt;ls -la /proc/&amp;lt;pid&amp;gt;/fd&lt;/code&gt; to inspect open file descriptors, and read the application's environment via &lt;code&gt;cat /proc/&amp;lt;pid&amp;gt;/environ&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For a more capable debug environment, replace &lt;code&gt;busybox&lt;/code&gt; with a richer image:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl debug -it my-pod \
  --image=nicolaka/netshoot \
  --target=my-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;nicolaka/netshoot&lt;/code&gt; includes &lt;code&gt;tcpdump&lt;/code&gt;, &lt;code&gt;curl&lt;/code&gt;, &lt;code&gt;dig&lt;/code&gt;, &lt;code&gt;nmap&lt;/code&gt;, &lt;code&gt;ss&lt;/code&gt;, &lt;code&gt;iperf3&lt;/code&gt;, and dozens of other network diagnostic tools, making it the standard choice for network debugging scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Can and Cannot Do
&lt;/h3&gt;

&lt;p&gt;Ephemeral containers share the pod's network namespace and, when &lt;code&gt;--target&lt;/code&gt; is used, the process namespace. This gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full visibility into the application's network traffic from inside the pod (tcpdump, ss, netstat)&lt;/li&gt;
&lt;li&gt;Process inspection via &lt;code&gt;/proc/&amp;lt;pid&amp;gt;&lt;/code&gt; — open files, memory maps, environment variables, CPU/memory usage&lt;/li&gt;
&lt;li&gt;Access to the pod's DNS resolution context — exactly the same &lt;code&gt;/etc/resolv.conf&lt;/code&gt; the application sees&lt;/li&gt;
&lt;li&gt;Ability to make outbound network calls from the same network namespace (testing service endpoints, DNS resolution)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What you do &lt;em&gt;not&lt;/em&gt; get with ephemeral containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Access to the application container's filesystem.&lt;/strong&gt; The ephemeral container has its own root filesystem. You cannot &lt;code&gt;cat /app/config.yaml&lt;/code&gt; from the application container's filesystem unless you access it via &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/root/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ability to remove the container once added.&lt;/strong&gt; Ephemeral containers are permanent until the pod is deleted. This is by design — the Kubernetes API does not allow removing them after creation (see the spec fragment after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume mount modifications via CLI.&lt;/strong&gt; You cannot add volume mounts to an ephemeral container via &lt;code&gt;kubectl debug&lt;/code&gt; (though the API spec supports it, the CLI does not expose this).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource limits.&lt;/strong&gt; Ephemeral containers do not support resource requests and limits in the &lt;code&gt;kubectl debug&lt;/code&gt; CLI, though this is evolving.&lt;/li&gt;
&lt;/ul&gt;
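
&lt;p&gt;To make that permanence concrete: after a debug session, the ephemeral container remains in the pod spec until the pod itself is deleted. A sketch of the resulting fragment (the container names here are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pod spec fragment after a kubectl debug session
# The entry cannot be removed; only deleting the pod clears it
spec:
  ephemeralContainers:
  - name: debugger-x7k2p
    image: busybox:latest
    targetContainerName: my-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;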

&lt;h3&gt;
  
  
  Accessing the Application Filesystem
&lt;/h3&gt;

&lt;p&gt;The most common surprise for developers new to ephemeral containers is that they cannot directly browse the application container's filesystem. The workaround is the &lt;code&gt;/proc&lt;/code&gt; filesystem:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Find the application's PID
ps aux

# Browse its filesystem via /proc
ls /proc/1/root/app/
cat /proc/1/root/etc/config.yaml

# Or set the root to the application's root
chroot /proc/1/root /bin/sh  # only if /bin/sh exists in the app image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/root&lt;/code&gt; path is a symlink to the container's root filesystem as seen from the process namespace. Because the ephemeral container shares the process namespace with &lt;code&gt;--target&lt;/code&gt;, the application's PID is typically 1, and &lt;code&gt;/proc/1/root&lt;/code&gt; gives you full read access to its filesystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  RBAC Requirements
&lt;/h3&gt;

&lt;p&gt;Ephemeral containers require the &lt;code&gt;pods/ephemeralcontainers&lt;/code&gt; subresource permission. This is separate from &lt;code&gt;pods/exec&lt;/code&gt;, which controls &lt;code&gt;kubectl exec&lt;/code&gt;. A common mistake is to grant &lt;code&gt;pods/exec&lt;/code&gt; for debugging purposes without realizing that ephemeral containers require an additional grant:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ephemeral-debugger
rules:
- apiGroups: [""]
  resources: ["pods/ephemeralcontainers"]
  verbs: ["update", "patch"]
- apiGroups: [""]
  resources: ["pods/attach"]
  verbs: ["create", "get"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In production environments, this permission should be tightly scoped: time-limited via &lt;code&gt;RoleBinding&lt;/code&gt; rather than permanent &lt;code&gt;ClusterRoleBinding&lt;/code&gt;, restricted to specific namespaces, and ideally gated behind an approval workflow. The debug container runs as root by default, which can create privilege escalation paths if the application container runs as a non-root user with shared process namespace — the debug container can attach to the application's processes with higher privileges.&lt;/p&gt;
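
&lt;p&gt;A time-limited grant can be as simple as a namespaced &lt;code&gt;RoleBinding&lt;/code&gt; that is deleted when the session ends. A minimal sketch, reusing the &lt;code&gt;ephemeral-debugger&lt;/code&gt; ClusterRole above (the subject, namespace, and binding name are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Scopes the ClusterRole to one namespace and one user
# Delete this binding as soon as the debugging session is over
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: debug-session-incident-4711
  namespace: payments
subjects:
- kind: User
  name: jane.doe@example.com
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: ephemeral-debugger
  apiGroup: rbac.authorization.k8s.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;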

&lt;h2&gt;
  
  
  Option 2: kubectl debug --copy-to (Pod Copy Strategy)
&lt;/h2&gt;

&lt;p&gt;When you need to modify the pod's container spec — replace the image, change environment variables, add a sidecar with a shared filesystem — the &lt;code&gt;--copy-to&lt;/code&gt; flag creates a full copy of the pod with your modifications applied:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl debug my-pod \
  -it \
  --copy-to=my-pod-debug \
  --image=my-app:debug \
  --share-processes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This creates a new pod named &lt;code&gt;my-pod-debug&lt;/code&gt; that is a copy of &lt;code&gt;my-pod&lt;/code&gt; but with the container image replaced by &lt;code&gt;my-app:debug&lt;/code&gt;. If &lt;code&gt;my-app:debug&lt;/code&gt; is your application image built with debug tooling included (or a debug variant from your registry), this lets you interact with the exact same binary in the exact same configuration as the original pod.&lt;/p&gt;

&lt;p&gt;A more common use of &lt;code&gt;--copy-to&lt;/code&gt; is to attach a debug container alongside the existing application container while keeping the original image unchanged:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl debug my-pod \
  -it \
  --copy-to=my-pod-debug \
  --image=busybox \
  --share-processes \
  --container=debugger
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This creates the copy-pod with both the original containers and a new &lt;code&gt;debugger&lt;/code&gt; container sharing the process namespace. Unlike ephemeral containers, this approach supports volume mounts and resource limits, and the debug pod can be deleted cleanly when you are done.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations of the Copy Strategy
&lt;/h3&gt;

&lt;p&gt;The pod copy approach has a critical limitation: &lt;strong&gt;it is not debugging the original pod&lt;/strong&gt;. It creates a new pod that may behave differently because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It does not share the original pod's &lt;strong&gt;in-memory state&lt;/strong&gt; — if the issue is a goroutine leak or heap corruption that has been accumulating for hours, the fresh copy will not exhibit it immediately&lt;/li&gt;
&lt;li&gt;It creates a new Pod UID, which means any admission webhooks, network policies, or pod-level security contexts that depend on pod identity may apply differently&lt;/li&gt;
&lt;li&gt;If the original pod is crashing (&lt;code&gt;CrashLoopBackOff&lt;/code&gt;), the copy will also crash — this technique does not help for crash debugging unless you also change the entrypoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For crash debugging specifically, combine &lt;code&gt;--copy-to&lt;/code&gt; with a modified entrypoint to keep the container alive:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl debug my-crashing-pod \
  -it \
  --copy-to=my-pod-debug \
  --image=busybox \
  --share-processes \
  -- sleep 3600
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Option 3: Debug Image Variants
&lt;/h2&gt;

&lt;p&gt;The most pragmatic approach — and the one most appropriate for developer workflows — is to maintain a debug variant of your application image that includes shell tooling. Both the Google distroless project and Chainguard provide this pattern officially.&lt;/p&gt;

&lt;p&gt;Google distroless images have a &lt;code&gt;:debug&lt;/code&gt; tag that adds BusyBox to the image:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Production image
FROM gcr.io/distroless/java17-debian12

# Debug variant — identical but with BusyBox shell
FROM gcr.io/distroless/java17-debian12:debug
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Chainguard images follow a similar convention with &lt;code&gt;:latest-dev&lt;/code&gt; variants that include apk, a shell, and common utilities:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Production (zero shell, minimal footprint)
FROM cgr.dev/chainguard/go:latest

# Development/debug variant
FROM cgr.dev/chainguard/go:latest-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you build your own base images, the recommended approach is to use multi-stage builds and maintain separate build targets:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM golang:1.22 AS builder
WORKDIR /app
COPY . .
# CGO disabled so the binary runs on distroless/static (no libc)
RUN CGO_ENABLED=0 go build -o myapp .

# Production: static distroless image
FROM gcr.io/distroless/static-debian12 AS production
COPY --from=builder /app/myapp /myapp
ENTRYPOINT ["/myapp"]

# Debug variant: same binary, with shell tools
FROM gcr.io/distroless/static-debian12:debug AS debug
COPY --from=builder /app/myapp /myapp
ENTRYPOINT ["/myapp"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In your CI/CD pipeline, build both targets and push &lt;code&gt;my-app:${VERSION}&lt;/code&gt; (production) and &lt;code&gt;my-app:${VERSION}-debug&lt;/code&gt; (debug variant) to your registry. The debug image is never deployed to production by default, but it exists and is ready to be used with &lt;code&gt;kubectl debug --copy-to&lt;/code&gt; when needed.&lt;/p&gt;
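
&lt;p&gt;Wiring this into CI is one extra build step per target. A minimal sketch, assuming GitHub Actions with &lt;code&gt;docker/build-push-action&lt;/code&gt; (the registry, image name, and tag expression are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Build both targets from the same Dockerfile and push both tags
- uses: docker/build-push-action@v5
  with:
    context: .
    target: production
    tags: registry.example.com/my-app:${{ github.ref_name }}
    push: true

- uses: docker/build-push-action@v5
  with:
    context: .
    target: debug
    tags: registry.example.com/my-app:${{ github.ref_name }}-debug
    push: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;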

&lt;h3&gt;
  
  
  Security Considerations for Debug Variants
&lt;/h3&gt;

&lt;p&gt;Debug image variants defeat much of the security benefit of distroless if they are used in production, even temporarily. Track usage carefully: log when debug images are deployed, require explicit approval, and ensure they are removed after the debugging session. In regulated environments, consider whether deploying a debug variant to production namespaces is permitted by your security policy — in many cases it is not, and you must use ephemeral containers (which add a debug process to the pod without modifying the application image) instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 4: cdebug
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;cdebug&lt;/code&gt; is an open-source CLI tool that simplifies distroless debugging by wrapping &lt;code&gt;kubectl debug&lt;/code&gt; with more ergonomic defaults and additional capabilities. Its primary value is in making ephemeral container debugging feel like a native shell experience:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install
brew install cdebug
# or: go install github.com/iximiuz/cdebug@latest

# Debug a running pod
cdebug exec -it my-pod

# Specify a namespace and container
cdebug exec -it -n production my-pod -c my-container

# Use a specific debug image
cdebug exec -it my-pod --image=nicolaka/netshoot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;What &lt;code&gt;cdebug&lt;/code&gt; adds over raw &lt;code&gt;kubectl debug&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic filesystem chroot.&lt;/strong&gt; &lt;code&gt;cdebug exec&lt;/code&gt; automatically sets the filesystem root of the debug container to the target container's filesystem, so you browse &lt;code&gt;/&lt;/code&gt; and see the application's files — not the debug image's files. This addresses the most common friction point with &lt;code&gt;kubectl debug&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker integration.&lt;/strong&gt; &lt;code&gt;cdebug exec&lt;/code&gt; works identically for Docker containers (&lt;code&gt;cdebug exec -it &amp;lt;container&amp;gt;&lt;/code&gt;), making it the same muscle memory for local and cluster debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No RBAC complications&lt;/strong&gt; for Docker-based local development — useful for developer workflows before the code reaches Kubernetes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff: &lt;code&gt;cdebug&lt;/code&gt; is a third-party dependency and requires installation. In environments with strict tooling policies (regulated industries, air-gapped clusters), it may not be an option. In those cases, the raw &lt;code&gt;kubectl debug&lt;/code&gt; workflow with &lt;code&gt;/proc/1/root&lt;/code&gt; filesystem navigation is the baseline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 5: Node-Level Debugging
&lt;/h2&gt;

&lt;p&gt;When everything else fails — the pod is in &lt;code&gt;CrashLoopBackOff&lt;/code&gt; too fast to attach to, the issue is a kernel-level problem, or you need tools like &lt;code&gt;strace&lt;/code&gt; that require elevated privileges — node-level debugging gives you direct access to the container's processes from the host node.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl debug node/&amp;lt;node-name&amp;gt;&lt;/code&gt; creates a privileged pod on the target node that mounts the node's root filesystem under &lt;code&gt;/host&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl debug node/my-node-name \
  -it \
  --image=nicolaka/netshoot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From this privileged pod, you can use &lt;code&gt;nsenter&lt;/code&gt; to enter the namespaces of any container running on the node:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Find the container's PID on the node
# (from within the node debug pod)
crictl ps | grep my-container
crictl inspect  | grep pid

# Enter the container's namespaces
nsenter -t  -m -u -i -n -p -- /bin/sh

# Or just the network namespace (for network debugging)
nsenter -t  -n -- ip a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;nsenter&lt;/code&gt; approach lets you run tools from the node's or debug container's toolset while operating in the namespaces of the target container. This is how you run &lt;code&gt;strace&lt;/code&gt; against a distroless process: &lt;code&gt;strace&lt;/code&gt; is not in the application container, but you can run it from the node level while targeting the application's PID.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Trace all syscalls from the application process
nsenter -t  -- strace -p  -f -e trace=network
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  RBAC and Security for Node Debugging
&lt;/h3&gt;

&lt;p&gt;Node-level debugging requires &lt;code&gt;nodes/proxy&lt;/code&gt; and the ability to create privileged pods, which in most production clusters is restricted to cluster administrators. The debug pod runs with &lt;code&gt;hostPID: true&lt;/code&gt; and &lt;code&gt;hostNetwork: true&lt;/code&gt;, giving it visibility into all processes and network traffic on the node — not just the target container. This is significant: every process running on the node, including those in other tenants' namespaces, is visible.&lt;/p&gt;

&lt;p&gt;This technique should be treated as a break-glass procedure: log the access, require dual approval in production environments, and clean up immediately after the debugging session by deleting the &lt;code&gt;node-debugger-*&lt;/code&gt; pod that &lt;code&gt;kubectl debug&lt;/code&gt; creates.&lt;/p&gt;
&lt;h2&gt;
  
  
  Choosing the Right Approach: Access Profile and Environment Matrix
&lt;/h2&gt;

&lt;p&gt;The technique you should use depends on two axes: &lt;strong&gt;who you are&lt;/strong&gt; (developer, platform engineer, ops/SRE) and &lt;strong&gt;where the issue is&lt;/strong&gt; (local development, staging, production). The requirements and constraints differ significantly across these combinations.&lt;/p&gt;
&lt;h3&gt;
  
  
  Developer — Local or Development Cluster
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Reproduce and understand a bug, inspect configuration, verify network connectivity to services.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Constraints:&lt;/strong&gt; None material — full cluster admin on local or personal dev namespace.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Recommended approach:&lt;/strong&gt; Debug image variants or cdebug.&lt;/p&gt;

&lt;p&gt;In local development (Minikube, Kind, Docker Desktop), the fastest path is to build the debug variant of your image and deploy it directly. If you are working with another team's service, &lt;code&gt;cdebug exec&lt;/code&gt; gives you a shell in the container with automatic filesystem root without any special RBAC. The goal is speed and iteration — reserve the more structured approaches for higher environments.&lt;/p&gt;
&lt;h3&gt;
  
  
  Developer — Staging Cluster
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Debug integration issues, inspect live configuration, verify environment-specific behavior.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Constraints:&lt;/strong&gt; Shared cluster — cannot deploy arbitrary workloads to other teams' namespaces, but has &lt;code&gt;pods/ephemeralcontainers&lt;/code&gt; in own namespace.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Recommended approach:&lt;/strong&gt; kubectl debug with ephemeral containers (&lt;code&gt;--target&lt;/code&gt;), scoped to own namespace.&lt;/p&gt;

&lt;p&gt;Staging is where ephemeral containers earn their keep. You can attach to a running pod without restarting it, without modifying the deployment spec, and without affecting other users of the same cluster. Grant developers &lt;code&gt;pods/ephemeralcontainers&lt;/code&gt; in their team's namespaces and they can self-service debug without needing ops involvement.&lt;/p&gt;
&lt;h3&gt;
  
  
  Platform Engineer / SRE — Production
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Diagnose a live production incident. The pod is behaving unexpectedly — high latency, memory growth, unexpected connections, incorrect responses.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Constraints:&lt;/strong&gt; Changes to running pods are high-risk. Any debug image deployment must be gated. The issue is live and affecting users.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Recommended approach:&lt;/strong&gt; kubectl debug with ephemeral containers (ephemeral containers do not restart the pod, do not modify the deployment, and are auditable via API audit logs).&lt;/p&gt;

&lt;p&gt;The key production requirements are auditability and minimal blast radius. Ephemeral containers satisfy both: they are recorded in the Kubernetes API audit log (who attached, when, to which pod), they do not modify the running application container, and they are limited to the pod's own network and process namespaces. Document the debug session in your incident ticket: pod name, time, what was observed, who ran the debug container.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--copy-to&lt;/code&gt; strategy is generally inappropriate for production incident response: it creates a new pod that may or may not exhibit the issue, it adds load to the cluster during an incident, and if it is attached to the same services (databases, downstream APIs), it produces additional traffic that complicates forensics.&lt;/p&gt;
&lt;h3&gt;
  
  
  Platform Engineer — Production, Node-Level Issue
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Diagnose a kernel-level issue, a container runtime problem, a networking issue that spans multiple pods, or a situation where the pod is crashing too fast to attach to.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Constraints:&lt;/strong&gt; Maximum privilege required. High operational risk.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Recommended approach:&lt;/strong&gt; Node-level debug pod with &lt;code&gt;nsenter&lt;/code&gt;. Treat as break-glass.&lt;/p&gt;

&lt;p&gt;For this scenario, create a dedicated RBAC role that grants &lt;code&gt;nodes/proxy&lt;/code&gt; access and the ability to create pods with &lt;code&gt;hostPID: true&lt;/code&gt; in a dedicated debug namespace. Bind it only to specific users, require a separate authentication step (e.g., &lt;code&gt;kubectl auth can-i&lt;/code&gt; check against a time-limited binding), and log all access. This level of access should generate a PagerDuty-style alert so that the security team knows a privileged debug session is active in production.&lt;/p&gt;
&lt;h2&gt;
  
  
  Common Errors and Solutions
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Error: "ephemeral containers are disabled for this cluster"
&lt;/h3&gt;

&lt;p&gt;Ephemeral containers require Kubernetes 1.16+ (alpha, behind feature gate) and are stable from 1.25. If you are on 1.16–1.22, you need to enable the &lt;code&gt;EphemeralContainers&lt;/code&gt; feature gate on the API server and kubelet. From 1.23 it was beta and enabled by default. From 1.25 it is stable and always on. On managed Kubernetes services (EKS, GKE, AKS), check the cluster version — versions older than 1.25 may still have it disabled depending on your configuration.&lt;/p&gt;
&lt;h3&gt;
  
  
  Error: "cannot update ephemeralcontainers" (RBAC)
&lt;/h3&gt;

&lt;p&gt;You have &lt;code&gt;pods/exec&lt;/code&gt; but not &lt;code&gt;pods/ephemeralcontainers&lt;/code&gt;. Add the grant shown in the RBAC section above. Note that &lt;code&gt;pods/exec&lt;/code&gt; and &lt;code&gt;pods/ephemeralcontainers&lt;/code&gt; are separate subresources — having one does not imply the other.&lt;/p&gt;
&lt;h3&gt;
  
  
  Error: "container not found" with -target
&lt;/h3&gt;

&lt;p&gt;The container name in &lt;code&gt;--target&lt;/code&gt; must match exactly the container name as defined in the Pod spec — not the image name. Check with &lt;code&gt;kubectl get pod my-pod -o jsonpath='{.spec.containers[*].name}'&lt;/code&gt; to get the exact container names.&lt;/p&gt;
&lt;h3&gt;
  
  
  Error: Can see processes but cannot read /proc/1/root
&lt;/h3&gt;

&lt;p&gt;The application container runs as a non-root user (e.g., UID 1000) and the ephemeral container runs as root. The application's filesystem may have files owned by UID 1000 that are not readable by other UIDs depending on permissions. The &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/root&lt;/code&gt; path itself requires the &lt;code&gt;CAP_SYS_PTRACE&lt;/code&gt; capability. If your cluster's Pod Security Standards (PSS) are set to &lt;code&gt;restricted&lt;/code&gt;, the debug container may not have this capability. Use the &lt;code&gt;baseline&lt;/code&gt; PSS profile for debug namespaces or explicitly add &lt;code&gt;SYS_PTRACE&lt;/code&gt; to the ephemeral container's &lt;code&gt;securityContext&lt;/code&gt;.&lt;/p&gt;
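
&lt;p&gt;If the &lt;code&gt;kubectl debug&lt;/code&gt; defaults do not grant the capability, one option is to patch the &lt;code&gt;pods/ephemeralcontainers&lt;/code&gt; subresource with an explicit &lt;code&gt;securityContext&lt;/code&gt;. A sketch of the relevant fragment (the container names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Ephemeral container with CAP_SYS_PTRACE explicitly granted
spec:
  ephemeralContainers:
  - name: debugger
    image: busybox:latest
    stdin: true
    tty: true
    targetContainerName: my-container
    securityContext:
      capabilities:
        add: ["SYS_PTRACE"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;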
&lt;h3&gt;
  
  
  Error: tcpdump shows no traffic
&lt;/h3&gt;

&lt;p&gt;When using &lt;code&gt;nicolaka/netshoot&lt;/code&gt; for network debugging, remember that the network namespace is shared at the pod level regardless of &lt;code&gt;--target&lt;/code&gt; — the flag only controls process namespace sharing, so it does not change what the capture can see. If the capture is empty, the usual culprit is interface selection: run &lt;code&gt;tcpdump -i any&lt;/code&gt; to capture on all interfaces including loopback, which is where inter-container traffic within a pod travels.&lt;/p&gt;
&lt;h2&gt;
  
  
  Decision Framework
&lt;/h2&gt;

&lt;p&gt;Use this as a starting point to select the right technique for your situation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Active production incident, pod running&lt;/td&gt;
&lt;td&gt;kubectl debug + ephemeral container&lt;/td&gt;
&lt;td&gt;pods/ephemeralcontainers RBAC, k8s 1.25+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod crashing too fast to attach&lt;/td&gt;
&lt;td&gt;kubectl debug --copy-to + modified entrypoint&lt;/td&gt;
&lt;td&gt;Ability to create pods in namespace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer debugging in dev/staging&lt;/td&gt;
&lt;td&gt;cdebug exec or kubectl debug&lt;/td&gt;
&lt;td&gt;pods/ephemeralcontainers or pod create&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need full filesystem access&lt;/td&gt;
&lt;td&gt;kubectl debug --copy-to + debug image variant&lt;/td&gt;
&lt;td&gt;Debug image in registry, pod create&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need strace or kernel tracing&lt;/td&gt;
&lt;td&gt;Node-level debug with nsenter&lt;/td&gt;
&lt;td&gt;nodes/proxy, cluster admin equivalent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network packet capture&lt;/td&gt;
&lt;td&gt;kubectl debug + nicolaka/netshoot&lt;/td&gt;
&lt;td&gt;pods/ephemeralcontainers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local Docker debugging&lt;/td&gt;
&lt;td&gt;cdebug exec&lt;/td&gt;
&lt;td&gt;Docker socket access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI-reproducible debug environment&lt;/td&gt;
&lt;td&gt;Debug image variant in separate build target&lt;/td&gt;
&lt;td&gt;Separate image tag in registry&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Production RBAC Design
&lt;/h2&gt;

&lt;p&gt;A clean RBAC design for production distroless debugging separates three roles with different privilege levels:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Tier 1: Developer self-service in team namespaces
# Allows attaching ephemeral containers, no node access
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: distroless-debugger
  namespace: team-namespace
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/ephemeralcontainers"]
  verbs: ["update", "patch"]
- apiGroups: [""]
  resources: ["pods/attach"]
  verbs: ["create", "get"]
---
# Tier 2: SRE production incident access
# Ephemeral containers across all namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-distroless-debugger
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/ephemeralcontainers"]
  verbs: ["update", "patch"]
- apiGroups: [""]
  resources: ["pods/attach"]
  verbs: ["create", "get"]
---
# Tier 3: Break-glass node access
# Only for platform team, time-limited binding recommended
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-debugger
rules:
- apiGroups: [""]
  resources: ["nodes/proxy"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "get", "list", "delete"]
  # Restrict to debug namespace via RoleBinding, not ClusterRoleBinding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Bind Tier 1 permanently to your developers. Bind Tier 2 to SREs permanently but with audit alerts on use. Bind Tier 3 only on-demand (via a Kubernetes operator that creates time-limited RoleBindings) and never as a permanent ClusterRoleBinding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Distroless containers are the correct choice for production workloads. They reduce attack surface, eliminate unnecessary CVEs, and force a cleaner separation between application and tooling. The operational cost is that your traditional debugging workflow — exec into the container, run some commands — no longer works by default.&lt;/p&gt;

&lt;p&gt;Kubernetes provides a clean answer with ephemeral containers and &lt;code&gt;kubectl debug&lt;/code&gt;: inject a debug container with whatever tools you need into the running pod, sharing its network and process namespaces, without restarting or modifying the application. For scenarios where ephemeral containers are insufficient — filesystem access, crash debugging, kernel-level investigation — the copy strategy and node-level debug fill the remaining gaps.&lt;/p&gt;

&lt;p&gt;The key to making this work at scale is not the technique itself but the &lt;strong&gt;access model&lt;/strong&gt;: developers get self-service ephemeral container access in their own namespaces, SREs get cluster-wide ephemeral container access for production incidents, and node-level access is a break-glass procedure with audit trail and time limits. With that model in place, distroless becomes an operational non-issue rather than an obstacle.&lt;/p&gt;

</description>
      <category>containers</category>
      <category>docker</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>XSLT 3.0 new features: what changed from 2.0</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Mon, 04 May 2026 09:00:01 +0000</pubDate>
      <link>https://forem.com/alexandrev/xslt-30-new-features-what-changed-from-20-4eea</link>
      <guid>https://forem.com/alexandrev/xslt-30-new-features-what-changed-from-20-4eea</guid>
      <description>&lt;p&gt;XSLT 3.0 is a significant step beyond 2.0. If you are running Saxon on the backend — as XSLT Playground does — you have access to the full 3.0 feature set today. This post covers the additions that matter most in practice and shows how to try each one directly in the playground.&lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming
&lt;/h2&gt;

&lt;p&gt;The most impactful change in 3.0 is streaming. In earlier versions, the processor loads the entire source document into memory before any template can run. With 3.0 streaming, selected templates can consume the document as a stream, which drastically reduces memory usage for large inputs.&lt;/p&gt;

&lt;p&gt;To enable streaming, declare it on the stylesheet and on the mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;








&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not every expression is streamable. Saxon will tell you at compile time if a pattern is not compatible. The key restriction is that you can only visit each node once — no backward axes, no variables that hold nodes for later inspection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maps and arrays
&lt;/h2&gt;

&lt;p&gt;XSLT 3.0 adds maps and arrays as first-class values, borrowed from XPath 3.1. A map is a collection of key-value pairs; an array is an ordered sequence that can hold any value including other maps or arrays.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Arrays work similarly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Arrays use 1-based indexing. Use &lt;code&gt;array:size()&lt;/code&gt;, &lt;code&gt;array:get()&lt;/code&gt;, and &lt;code&gt;array:append()&lt;/code&gt; from the array namespace for common operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  JSON input and output
&lt;/h2&gt;

&lt;p&gt;XSLT 3.0 can parse and produce JSON natively via &lt;code&gt;json-to-xml()&lt;/code&gt; and &lt;code&gt;xml-to-json()&lt;/code&gt;. This eliminates the need for a pre-processing step when your source is JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function converts JSON into a predictable XML representation defined by the W3C. Objects become &lt;code&gt;map&lt;/code&gt; elements, arrays become &lt;code&gt;array&lt;/code&gt; elements, and primitives become typed &lt;code&gt;string&lt;/code&gt;, &lt;code&gt;number&lt;/code&gt;, or &lt;code&gt;boolean&lt;/code&gt; elements, all in the &lt;code&gt;http://www.w3.org/2005/xpath-functions&lt;/code&gt; namespace. You transform this intermediate XML normally and then serialize it back with &lt;code&gt;xml-to-json()&lt;/code&gt; if needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Higher-order functions
&lt;/h2&gt;

&lt;p&gt;You can now pass functions as arguments using &lt;code&gt;xsl:function&lt;/code&gt; and the &lt;code&gt;function()&lt;/code&gt; type. This enables patterns like map, filter, and fold over sequences without recursion.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;!-- Assumes xmlns:my is declared on the stylesheet --&amp;gt;
&amp;lt;xsl:function name="my:double" as="xs:integer"&amp;gt;
  &amp;lt;xsl:param name="n" as="xs:integer"/&amp;gt;
  &amp;lt;xsl:sequence select="$n * 2"/&amp;gt;
&amp;lt;/xsl:function&amp;gt;

&amp;lt;!-- Pass the function itself to fn:for-each: returns 2 4 6 8 10 --&amp;gt;
&amp;lt;xsl:value-of select="for-each(1 to 5, my:double#1)"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;#1&lt;/code&gt; notation creates a function reference with arity 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Packages
&lt;/h2&gt;

&lt;p&gt;XSLT 3.0 introduces packages, which let you split a large stylesheet into independently compiled units that expose explicit interfaces. This is the equivalent of modules or libraries in other languages.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;xsl:package name="http://example.com/xslt/common"
    package-version="1.0" version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&amp;gt;

  ...

  &amp;lt;xsl:expose component="function" names="*" visibility="public"/&amp;gt;

&amp;lt;/xsl:package&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Packages reduce coupling and enable reuse across projects without copy-paste.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it in XSLT Playground
&lt;/h2&gt;

&lt;p&gt;All the examples above run in &lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; with version set to 3.0. Maps and JSON support are the quickest to explore. Paste the &lt;code&gt;json-to-xml()&lt;/code&gt; example, provide a JSON string as input, and see the intermediate representation immediately.&lt;/p&gt;

&lt;p&gt;XSLT 3.0 is available today. If your integration still targets 2.0, the features above are the best reasons to upgrade.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>performance</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>XSL online tester: run XSL and XSLT transforms in your browser</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Thu, 30 Apr 2026 09:00:00 +0000</pubDate>
      <link>https://forem.com/alexandrev/xsl-online-tester-run-xsl-and-xslt-transforms-in-your-browser-4njl</link>
      <guid>https://forem.com/alexandrev/xsl-online-tester-run-xsl-and-xslt-transforms-in-your-browser-4njl</guid>
      <description>&lt;p&gt;XSL (Extensible Stylesheet Language) is an umbrella term that covers three related specifications: XSLT for transformations, XPath for node selection, and XSL-FO for formatting objects. When developers search for an "XSL tester" or "XSL online editor", they are usually looking for a way to run XSLT stylesheets against XML input without a local install. &lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; does exactly that.&lt;/p&gt;

&lt;h2&gt;
  
  
  XSL vs XSLT — what is the difference?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;XSL&lt;/strong&gt; is the full family of W3C specifications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;XSLT&lt;/strong&gt; (XSL Transformations) — transforms XML documents into other formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XPath&lt;/strong&gt; — the path language used inside XSLT to navigate XML trees&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XSL-FO&lt;/strong&gt; (XSL Formatting Objects) — describes page layout for print and PDF output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, when people say "XSL" in an integration or development context, they almost always mean &lt;strong&gt;XSLT&lt;/strong&gt;. The stylesheet file extensions &lt;code&gt;.xsl&lt;/code&gt; and &lt;code&gt;.xslt&lt;/code&gt; are interchangeable — Saxon and most processors accept both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running XSL transforms online
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; supports XSLT 1.0, 2.0, and 3.0 via the Saxon processor. To run an XSL transform:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Paste your XML source document in the input panel&lt;/li&gt;
&lt;li&gt;Paste your XSL stylesheet in the stylesheet panel&lt;/li&gt;
&lt;li&gt;Select the XSLT version (1.0, 2.0 or 3.0)&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Run&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The output appears immediately. If the stylesheet has errors, the error panel shows the exact line and message from Saxon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common XSL use cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;XML to HTML&lt;/strong&gt; — the most common use. An XSL stylesheet walks an XML document tree and emits HTML tags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;











&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;XML to XML&lt;/strong&gt; — reshaping or filtering a document structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;







&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;XML to plain text or CSV&lt;/strong&gt; — using &lt;code&gt;xsl:output method="text"&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;









&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  XSL file extensions: .xsl vs .xslt
&lt;/h2&gt;

&lt;p&gt;Both &lt;code&gt;.xsl&lt;/code&gt; and &lt;code&gt;.xslt&lt;/code&gt; are valid. The &lt;code&gt;.xsl&lt;/code&gt; extension is older and more common in enterprise systems (SAP, Oracle, IBM DataPower). The &lt;code&gt;.xslt&lt;/code&gt; extension is more explicit. Saxon accepts either. XSLT Playground accepts any content regardless of what you call it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing XSL stylesheets online
&lt;/h2&gt;

&lt;p&gt;The main advantage of an online XSL tester is speed of iteration. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paste a real XML payload from a production system and see what the stylesheet produces&lt;/li&gt;
&lt;li&gt;Add parameters and test different code paths&lt;/li&gt;
&lt;li&gt;Enable trace mode to see which templates fired and in what order&lt;/li&gt;
&lt;li&gt;Export the entire test case (input + stylesheet + parameters) as a JSON workspace to share with a colleague&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this is available at &lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; without creating an account or installing anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  XSL transform online vs local Saxon
&lt;/h2&gt;

&lt;p&gt;For most development and debugging tasks, the online tester is faster than running Saxon locally. Use local Saxon when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are processing confidential data that cannot leave your network&lt;/li&gt;
&lt;li&gt;Your input files are very large (several MB or more)&lt;/li&gt;
&lt;li&gt;You need to integrate the transform into a build pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everything else — prototyping, debugging, sharing test cases — the online XSL tester is quicker.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>programming</category>
      <category>tooling</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Istio ServiceEntry Explained: External Services, DNS, and Traffic Control</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Tue, 28 Apr 2026 11:00:01 +0000</pubDate>
      <link>https://forem.com/alexandrev/istio-serviceentry-explained-external-services-dns-and-traffic-control-3cn</link>
      <guid>https://forem.com/alexandrev/istio-serviceentry-explained-external-services-dns-and-traffic-control-3cn</guid>
      <description>&lt;h1&gt;
  
  
  Istio ServiceEntry Explained: External Services, DNS, and Traffic Control
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What Is a ServiceEntry
&lt;/h2&gt;

&lt;p&gt;Istio maintains an internal service registry that merges Kubernetes Services with additional entries you declare. When a sidecar proxy needs to route a request, it consults this registry. Services inside the mesh are automatically registered, but external services require a &lt;strong&gt;ServiceEntry&lt;/strong&gt; to be added to the registry.&lt;/p&gt;

&lt;p&gt;A ServiceEntry is a custom resource that registers external services in the mesh's service registry. Once registered, external services become first-class citizens with access to Istio features including metrics, access logs, distributed traces, mTLS origination, retries, timeouts, and circuit breaking.&lt;/p&gt;

&lt;h2&gt;
  
  
  ServiceEntry Anatomy: All Fields Explained
&lt;/h2&gt;

&lt;h3&gt;
  
  
  hosts
&lt;/h3&gt;

&lt;p&gt;A list of hostnames associated with the service. For external services, this is typically the DNS name your application uses (e.g., &lt;code&gt;api.stripe.com&lt;/code&gt;). For HTTP protocols, the &lt;code&gt;hosts&lt;/code&gt; field is matched against the HTTP Host header. For non-HTTP protocols, you can use synthetic hostnames paired with &lt;code&gt;addresses&lt;/code&gt; or static &lt;code&gt;endpoints&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  addresses
&lt;/h3&gt;

&lt;p&gt;Optional virtual IP addresses associated with the service. Useful for TCP services where you want to assign a VIP that the sidecar will intercept. Not required for HTTP/HTTPS services that use hostname-based routing.&lt;/p&gt;
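
&lt;p&gt;For example, a raw TCP service can be given a synthetic hostname plus a virtual IP that the sidecar matches on (a sketch; the hostname and VIP are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  hosts:
    - legacy-tcp.internal   # synthetic name used by the application
  addresses:
    - 240.0.0.1             # VIP the sidecar intercepts for this service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;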

&lt;h3&gt;
  
  
  ports
&lt;/h3&gt;

&lt;p&gt;The ports on which the external service is exposed. Each port needs a &lt;code&gt;number&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, and &lt;code&gt;protocol&lt;/code&gt;. The protocol setting determines how Envoy handles the connection—&lt;code&gt;TLS&lt;/code&gt; for pass-through without termination, &lt;code&gt;HTTPS&lt;/code&gt; for HTTP over TLS, and &lt;code&gt;TCP&lt;/code&gt; for database connections.&lt;/p&gt;

&lt;h3&gt;
  
  
  location
&lt;/h3&gt;

&lt;p&gt;Either &lt;code&gt;MESH_EXTERNAL&lt;/code&gt; or &lt;code&gt;MESH_INTERNAL&lt;/code&gt;. Use &lt;code&gt;MESH_EXTERNAL&lt;/code&gt; for services outside your cluster (third-party APIs, managed databases). Use &lt;code&gt;MESH_INTERNAL&lt;/code&gt; for services inside your infrastructure without a sidecar, such as VMs in the same VPC or unmeshed Kubernetes Services. This affects mTLS application and metrics labeling.&lt;/p&gt;

&lt;h3&gt;
  
  
  resolution
&lt;/h3&gt;

&lt;p&gt;How the sidecar resolves endpoint addresses. Options include &lt;code&gt;NONE&lt;/code&gt;, &lt;code&gt;STATIC&lt;/code&gt;, &lt;code&gt;DNS&lt;/code&gt;, and &lt;code&gt;DNS_ROUND_ROBIN&lt;/code&gt;. This is the most critical field for ServiceEntry configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  endpoints
&lt;/h3&gt;

&lt;p&gt;An explicit list of network endpoints. Required when resolution is &lt;code&gt;STATIC&lt;/code&gt;. Each endpoint can have an &lt;code&gt;address&lt;/code&gt;, &lt;code&gt;ports&lt;/code&gt;, &lt;code&gt;labels&lt;/code&gt;, &lt;code&gt;network&lt;/code&gt;, &lt;code&gt;locality&lt;/code&gt;, and &lt;code&gt;weight&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  exportTo
&lt;/h3&gt;

&lt;p&gt;Controls visibility across namespaces. Use &lt;code&gt;"."&lt;/code&gt; for the current namespace only, &lt;code&gt;"*"&lt;/code&gt; for all namespaces. In multi-team clusters, restrict exports to avoid namespace pollution.&lt;/p&gt;
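
&lt;p&gt;For instance, to keep an entry private to the namespace that owns it (a sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  hosts:
    - api.stripe.com
  exportTo:
    - "."   # visible only in the ServiceEntry's own namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;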

&lt;h2&gt;
  
  
  Resolution Types: NONE vs STATIC vs DNS vs DNS_ROUND_ROBIN
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;resolution&lt;/code&gt; field determines how Envoy discovers IP addresses behind the service.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resolution&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NONE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Envoy uses the original destination IP from the connection. No DNS lookup by the proxy.&lt;/td&gt;
&lt;td&gt;Wildcard entries, pass-through scenarios, services where the application already resolved the IP.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;STATIC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Envoy routes to the IPs listed in the &lt;code&gt;endpoints&lt;/code&gt; field. No DNS involved.&lt;/td&gt;
&lt;td&gt;Services with stable, known IPs (e.g., on-prem databases, VMs with fixed IPs).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DNS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Envoy resolves the hostname at connection time and creates an endpoint per returned IP. Uses async DNS with health checking per IP.&lt;/td&gt;
&lt;td&gt;External APIs behind load balancers, managed databases with DNS endpoints (RDS, CloudSQL).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DNS_ROUND_ROBIN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Envoy resolves the hostname and uses a single logical endpoint, rotating across returned IPs. No per-IP health checking.&lt;/td&gt;
&lt;td&gt;Simple external services, services where you do not need per-endpoint circuit breaking.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When to Use NONE
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;NONE&lt;/code&gt; when registering a range of external IPs or wildcard hosts without Envoy performing address resolution. This is common for broad egress policies like "allow traffic to &lt;code&gt;*.googleapis.com&lt;/code&gt; on port 443." The downside is that Envoy has limited ability to apply per-endpoint policies.&lt;/p&gt;
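
&lt;p&gt;A typical wildcard entry for that policy might look like this (a sketch; the metadata name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Broad egress allowance: any *.googleapis.com host on 443,
# with no address resolution performed by the proxy
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: google-apis
spec:
  hosts:
    - "*.googleapis.com"
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: NONE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;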

&lt;h3&gt;
  
  
  When to Use STATIC
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;STATIC&lt;/code&gt; when the external service has known, stable IP addresses that rarely change. This avoids DNS dependencies entirely. Classic use case: a legacy Oracle database on a fixed IP in your data center.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use DNS
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;DNS&lt;/code&gt; for most external API integrations. Envoy performs asynchronous DNS resolution and creates a cluster endpoint for each returned IP address. This enables per-endpoint health checking and circuit breaking—critical for production reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use DNS_ROUND_ROBIN
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;DNS_ROUND_ROBIN&lt;/code&gt; when the external hostname returns many IPs and you do not need per-IP circuit breaking. Envoy treats all resolved IPs as a single logical endpoint and round-robins across them, which is lighter weight than &lt;code&gt;DNS&lt;/code&gt; mode.&lt;/p&gt;
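
&lt;p&gt;The manifest differs from a &lt;code&gt;DNS&lt;/code&gt; entry only in the &lt;code&gt;resolution&lt;/code&gt; field (a sketch; the hostname is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: cdn-origin
spec:
  hosts:
    - origin.cdn.example.com
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS_ROUND_ROBIN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;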

&lt;h2&gt;
  
  
  Practical Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: External HTTP API (api.stripe.com)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceEntry&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stripe-api&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;api.stripe.com&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MESH_EXTERNAL&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tls&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TLS&lt;/span&gt;
  &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DNS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The protocol is &lt;code&gt;TLS&lt;/code&gt;, not &lt;code&gt;HTTPS&lt;/code&gt;, because the application initiates the TLS handshake directly. Envoy handles this as opaque TLS using SNI-based routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: External Managed Database (RDS / CloudSQL)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceEntry&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders-database&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;orders-db.abc123.us-east-1.rds.amazonaws.com&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MESH_EXTERNAL&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
  &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DNS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For TCP services, the &lt;code&gt;DNS&lt;/code&gt; resolution mode ensures Envoy periodically re-resolves the hostname and updates its endpoint list, which is critical for RDS multi-AZ failover scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Legacy Internal Service Not in the Mesh
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceEntry&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;legacy-monitoring&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;observability&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;legacy-monitoring.internal&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MESH_INTERNAL&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTP&lt;/span&gt;
  &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STATIC&lt;/span&gt;
  &lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.5.10&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.5.11&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.5.12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;location&lt;/code&gt; is &lt;code&gt;MESH_INTERNAL&lt;/code&gt; because the service lives inside your network, and &lt;code&gt;resolution&lt;/code&gt; is &lt;code&gt;STATIC&lt;/code&gt; because the IPs are known. The hostname is synthetic—your application uses it, and Istio's DNS proxy resolves it to one of the listed endpoints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 4: TCP Services with Multiple Ports
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceEntry&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-elasticsearch&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;search&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;es.example.com&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MESH_EXTERNAL&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9200&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9300&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transport&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
  &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DNS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each port gets its own Envoy listener configuration. The HTTP port benefits from full Layer 7 telemetry, while the TCP port gets Layer 4 metrics and connection-level policies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Combining ServiceEntry with DestinationRule
&lt;/h2&gt;

&lt;p&gt;A ServiceEntry alone registers the external service. To apply traffic policies—connection pooling, circuit breaking, TLS origination, load balancing—pair it with a &lt;strong&gt;DestinationRule&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connection Pooling and Circuit Breaking
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceEntry&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stripe-api&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;api.stripe.com&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MESH_EXTERNAL&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tls&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TLS&lt;/span&gt;
  &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DNS&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DestinationRule&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stripe-api-dr&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api.stripe.com&lt;/span&gt;
  &lt;span class="na"&gt;trafficPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;connectionPool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tcp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;maxConnections&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
        &lt;span class="na"&gt;connectTimeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;h2UpgradePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DO_NOT_UPGRADE&lt;/span&gt;
        &lt;span class="na"&gt;maxRequestsPerConnection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
    &lt;span class="na"&gt;outlierDetection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;consecutive5xxErrors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;baseEjectionTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
      &lt;span class="na"&gt;maxEjectionPercent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration caps outbound connections at 50, sets a 5-second connection timeout, and ejects endpoints that return 3 consecutive 5xx errors, preventing a degraded external API from consuming all connection slots.&lt;/p&gt;

&lt;h3&gt;
  
  
  TLS Origination
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceEntry&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-api&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;api.external-service.com&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MESH_EXTERNAL&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TLS&lt;/span&gt;
  &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DNS&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DestinationRule&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-api-tls&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api.external-service.com&lt;/span&gt;
  &lt;span class="na"&gt;trafficPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;portLevelSettings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
        &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SIMPLE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The application sends HTTP to port 80. A VirtualService redirects that to port 443. The DestinationRule initiates TLS to the external endpoint. The application never knows TLS happened.&lt;/p&gt;
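
&lt;p&gt;That VirtualService is not shown above; a minimal sketch of the port redirect it performs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: external-api-redirect
  namespace: default
spec:
  hosts:
    - api.external-service.com
  http:
    - match:
        - port: 80          # plain HTTP from the application
      route:
        - destination:
            host: api.external-service.com
            port:
              number: 443   # the port the DestinationRule originates TLS on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;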

&lt;h2&gt;
  
  
  Combining ServiceEntry with VirtualService
&lt;/h2&gt;

&lt;p&gt;VirtualService provides Layer 7 traffic management for external services: retries, timeouts, fault injection, header-based routing, and traffic shifting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retries and Timeouts
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VirtualService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stripe-api-vs&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;api.stripe.com&lt;/span&gt;
  &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api.stripe.com&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;attempts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
        &lt;span class="na"&gt;perTryTimeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3s&lt;/span&gt;
        &lt;span class="na"&gt;retryOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes&lt;/span&gt;
        &lt;span class="na"&gt;retryRemoteLocalities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This applies a 10-second overall timeout with up to 3 retry attempts (3 seconds each) for specific failure conditions. Note that retries and timeouts at this level only work for HTTP-protocol ServiceEntries; for TLS-protocol entries, you are limited to TCP-level connection settings in the DestinationRule.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traffic Shifting Between External Providers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceEntry&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;geocoding-primary&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;geo&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;geocoding.internal&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MESH_EXTERNAL&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tls&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TLS&lt;/span&gt;
  &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STATIC&lt;/span&gt;
  &lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api.old-geocoding-provider.com&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;old&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api.new-geocoding-provider.com&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;new&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DestinationRule&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;geocoding-dr&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;geo&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;geocoding.internal&lt;/span&gt;
  &lt;span class="na"&gt;trafficPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SIMPLE&lt;/span&gt;
  &lt;span class="na"&gt;subsets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;old-provider&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;old&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;new-provider&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;new&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VirtualService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;geocoding-vs&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;geo&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;geocoding.internal&lt;/span&gt;
  &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;geocoding.internal&lt;/span&gt;
            &lt;span class="na"&gt;subset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;old-provider&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;geocoding.internal&lt;/span&gt;
            &lt;span class="na"&gt;subset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;new-provider&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sends 80% of geocoding traffic to the old provider and 20% to the new one. Adjust weights as you gain confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  DNS Resolution Patterns: Istio DNS Proxy vs kube-dns
&lt;/h2&gt;

&lt;p&gt;Istio DNS resolution involves two layers: how your application resolves the hostname (kube-dns / CoreDNS) and how the sidecar resolves the hostname (Envoy's async DNS or Istio's DNS proxy).&lt;/p&gt;

&lt;h3&gt;
  
  
  Default Flow (Without Istio DNS Proxy)
&lt;/h3&gt;

&lt;p&gt;Your application calls &lt;code&gt;api.stripe.com&lt;/code&gt;. kube-dns resolves it to an IP. The application opens a connection to that IP. The sidecar intercepts the connection and—if the ServiceEntry uses &lt;code&gt;DNS&lt;/code&gt; resolution—Envoy independently resolves &lt;code&gt;api.stripe.com&lt;/code&gt;. Two separate DNS lookups happen, which can lead to inconsistencies if DNS records change between the two resolutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  With the Istio DNS Proxy (ISTIO_META_DNS_CAPTURE)
&lt;/h3&gt;

&lt;p&gt;Istio's sidecar includes a DNS proxy that intercepts DNS queries from the application. When enabled, the proxy can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-allocate virtual IPs for ServiceEntry hosts that do not have addresses defined, which is critical for TCP ServiceEntries.&lt;/li&gt;
&lt;li&gt;Resolve ServiceEntry hosts directly, avoiding the round-trip to kube-dns for known mesh services.&lt;/li&gt;
&lt;li&gt;Ensure consistency between the application's DNS resolution and the sidecar's endpoint resolution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DNS proxying has been available since Istio 1.8, but it remains opt-in for sidecars: you enable it through the &lt;code&gt;ISTIO_META_DNS_CAPTURE&lt;/code&gt; proxy metadata flag.&lt;/p&gt;
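
&lt;p&gt;A sketch of enabling it mesh-wide via the &lt;code&gt;IstioOperator&lt;/code&gt; API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"        # sidecar intercepts the app's DNS queries
        ISTIO_META_DNS_AUTO_ALLOCATE: "true"  # auto-assign VIPs to address-less ServiceEntries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;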

&lt;h3&gt;
  
  
  When DNS Proxy Matters Most
&lt;/h3&gt;

&lt;p&gt;The DNS proxy is especially important for &lt;strong&gt;TCP ServiceEntries without an explicit &lt;code&gt;addresses&lt;/code&gt; field&lt;/strong&gt;. Without a VIP, Envoy cannot match an incoming TCP connection to the correct ServiceEntry. The DNS proxy solves this by auto-allocating a VIP from the &lt;code&gt;240.240.0.0/16&lt;/code&gt; range and returning that VIP when the application resolves the hostname.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sticky Sessions with ServiceEntry
&lt;/h2&gt;

&lt;p&gt;Some external services require session affinity. Istio supports sticky sessions for external services through consistent hashing in a DestinationRule.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceEntry&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;legacy-session-service&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;legacy-session.internal&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MESH_INTERNAL&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTP&lt;/span&gt;
  &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STATIC&lt;/span&gt;
  &lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.1.10&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.1.11&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.1.12&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DestinationRule&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;legacy-session-dr&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;legacy-session.internal&lt;/span&gt;
  &lt;span class="na"&gt;trafficPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;loadBalancer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;consistentHash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;httpCookie&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SERVERID&lt;/span&gt;
          &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3600s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration hashes on an HTTP cookie named &lt;code&gt;SERVERID&lt;/code&gt;. If the cookie does not exist, Envoy generates one and sets it on the response. You can also hash on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTTP header&lt;/strong&gt;: &lt;code&gt;consistentHash.httpHeaderName: "x-user-id"&lt;/code&gt; — useful when your application sends a user identifier in every request (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source IP&lt;/strong&gt;: &lt;code&gt;consistentHash.useSourceIp: true&lt;/code&gt; — simplest option but breaks in environments with NAT or shared egress IPs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query parameter&lt;/strong&gt;: &lt;code&gt;consistentHash.httpQueryParameterName: "session_id"&lt;/code&gt; — for REST APIs that include a session identifier in the URL.&lt;/li&gt;
&lt;/ul&gt;
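
&lt;p&gt;A minimal sketch of the header variant (resource name hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: legacy-session-header-dr
  namespace: default
spec:
  host: legacy-session.internal
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: x-user-id   # requests carrying the same value land on the same endpoint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;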

&lt;p&gt;The ServiceEntry must use &lt;code&gt;STATIC&lt;/code&gt; or &lt;code&gt;DNS&lt;/code&gt; resolution for sticky sessions to work. With &lt;code&gt;DNS_ROUND_ROBIN&lt;/code&gt;, there is only one logical endpoint, so consistent hashing has no effect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting Common Issues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  503 Errors When Calling External Services
&lt;/h3&gt;

&lt;p&gt;Start with this diagnostic sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if the ServiceEntry is applied and visible to the proxy&lt;/span&gt;
istioctl proxy-config cluster  &lt;span class="nt"&gt;-n&lt;/span&gt;  | &lt;span class="nb"&gt;grep&lt;/span&gt; 

&lt;span class="c"&gt;# Check the listeners&lt;/span&gt;
istioctl proxy-config listener  &lt;span class="nt"&gt;-n&lt;/span&gt;  &lt;span class="nt"&gt;--port&lt;/span&gt; 

&lt;span class="c"&gt;# Look at Envoy access logs for the specific request&lt;/span&gt;
kubectl logs  &lt;span class="nt"&gt;-n&lt;/span&gt;  &lt;span class="nt"&gt;-c&lt;/span&gt; istio-proxy | &lt;span class="nb"&gt;grep&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common causes of 503 errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wrong protocol&lt;/strong&gt;: Setting &lt;code&gt;protocol: HTTPS&lt;/code&gt; when your application initiates TLS. Use &lt;code&gt;TLS&lt;/code&gt; for pass-through.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing ServiceEntry in REGISTRY_ONLY mode&lt;/strong&gt;: Any host without a ServiceEntry is blocked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;exportTo restriction&lt;/strong&gt;: The ServiceEntry is in namespace A, exported only to &lt;code&gt;"."&lt;/code&gt;, and the calling pod is in namespace B (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DNS resolution failure&lt;/strong&gt;: Envoy cannot resolve the hostname. Check that DNS servers are reachable from the pod.&lt;/li&gt;
&lt;/ul&gt;
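
&lt;p&gt;A sketch of the &lt;code&gt;exportTo&lt;/code&gt; case (names hypothetical): only workloads in &lt;code&gt;team-a&lt;/code&gt; can see this entry.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: partner-api
  namespace: team-a
spec:
  exportTo:
    - "."                    # visible only inside team-a
  hosts:
    - api.partner.example.com
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;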

&lt;h3&gt;
  
  
  DNS Resolution Failures
&lt;/h3&gt;

&lt;p&gt;When Envoy's async DNS resolver fails, you will see &lt;code&gt;UH&lt;/code&gt; (upstream unhealthy) or &lt;code&gt;UF&lt;/code&gt; (upstream connection failure) flags in access logs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify DNS works from inside the sidecar&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt;  &lt;span class="nt"&gt;-n&lt;/span&gt;  &lt;span class="nt"&gt;-c&lt;/span&gt; istio-proxy &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  pilot-agent request GET /dns_resolve?proxyID&lt;span class="o"&gt;=&lt;/span&gt;.&amp;amp;host&lt;span class="o"&gt;=&lt;/span&gt;api.stripe.com

&lt;span class="c"&gt;# Check Envoy cluster health&lt;/span&gt;
istioctl proxy-config endpoint  &lt;span class="nt"&gt;-n&lt;/span&gt;  | &lt;span class="nb"&gt;grep&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the endpoint shows &lt;code&gt;UNHEALTHY&lt;/code&gt;, Envoy resolved the DNS but outlier detection ejected the host. If no endpoint appears, DNS resolution is failing. Ensure your pods can reach an external DNS server, or that CoreDNS is configured to forward queries for the external domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  TLS Origination Not Working
&lt;/h3&gt;

&lt;p&gt;If you configured TLS origination via a DestinationRule but traffic still fails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure the ServiceEntry port protocol is &lt;code&gt;HTTP&lt;/code&gt;, not &lt;code&gt;TLS&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Verify the DestinationRule's &lt;code&gt;host&lt;/code&gt; field exactly matches the ServiceEntry's &lt;code&gt;hosts&lt;/code&gt; entry.&lt;/li&gt;
&lt;li&gt;Check that the VirtualService routes to the correct port number.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  TCP ServiceEntry Not Intercepting Traffic
&lt;/h3&gt;

&lt;p&gt;For TCP-protocol ServiceEntries without the DNS proxy, Envoy cannot match traffic by hostname. You must either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set an explicit &lt;code&gt;addresses&lt;/code&gt; field with a VIP that your application targets.&lt;/li&gt;
&lt;li&gt;Enable Istio's DNS proxy to auto-allocate VIPs.&lt;/li&gt;
&lt;li&gt;Ensure the destination IP matches what the ServiceEntry resolves to.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without one of these, TCP traffic goes through the &lt;code&gt;PassthroughCluster&lt;/code&gt; and bypasses your ServiceEntry entirely.&lt;/p&gt;
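
&lt;p&gt;A sketch of the first option (hostname and VIP hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: external-db
  namespace: default
spec:
  hosts:
    - db.external.example.com
  addresses:
    - 240.240.10.4          # VIP; connections to this IP are matched to this entry
  location: MESH_EXTERNAL
  ports:
    - number: 5432
      name: tcp-db
      protocol: TCP
  resolution: DNS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;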

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Do I need a ServiceEntry if outboundTrafficPolicy is set to ALLOW_ANY?
&lt;/h3&gt;

&lt;p&gt;You do not &lt;em&gt;need&lt;/em&gt; one for connectivity. But you &lt;strong&gt;should&lt;/strong&gt; create ServiceEntries anyway. Without them, outbound traffic goes through the &lt;code&gt;PassthroughCluster&lt;/code&gt;, which means no detailed metrics per destination, no access logging with the external hostname, no circuit breaking, no retries, and no timeout policies.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between protocol TLS and HTTPS in a ServiceEntry port?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;TLS&lt;/code&gt; tells Envoy to treat the connection as opaque TLS. Envoy reads the SNI header to determine routing but does not decrypt the payload. Use this when your application initiates TLS directly. &lt;code&gt;HTTPS&lt;/code&gt; tells Envoy the protocol is HTTP over TLS. In practice, for external services where the application manages its own TLS, use &lt;code&gt;TLS&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use wildcards in ServiceEntry hosts?
&lt;/h3&gt;

&lt;p&gt;Yes, but with limitations. You can use &lt;code&gt;*.example.com&lt;/code&gt; to match any subdomain of &lt;code&gt;example.com&lt;/code&gt;. However, wildcard entries only work with &lt;code&gt;resolution: NONE&lt;/code&gt; because Envoy cannot perform DNS lookups for wildcard hostnames. Wildcard ServiceEntries are best used for broad egress access control rather than fine-grained traffic management.&lt;/p&gt;
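
&lt;p&gt;A minimal wildcard entry used for broad egress visibility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: wildcard-example-com
spec:
  hosts:
    - "*.example.com"
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: NONE   # Envoy routes each connection on its SNI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;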

&lt;h3&gt;
  
  
  How do I configure sticky sessions for an external service behind a ServiceEntry?
&lt;/h3&gt;

&lt;p&gt;Create a ServiceEntry with &lt;code&gt;STATIC&lt;/code&gt; or &lt;code&gt;DNS&lt;/code&gt; resolution so Envoy has multiple endpoints. Pair it with a DestinationRule that configures &lt;code&gt;consistentHash&lt;/code&gt; under &lt;code&gt;trafficPolicy.loadBalancer&lt;/code&gt;. You can hash on an HTTP cookie, header, source IP, or query parameter.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does ServiceEntry interact with NetworkPolicy and Istio AuthorizationPolicy?
&lt;/h3&gt;

&lt;p&gt;A ServiceEntry does not bypass Kubernetes NetworkPolicy. If a NetworkPolicy blocks egress to the external IP, traffic will be dropped at the CNI level before Envoy can route it. Istio AuthorizationPolicy can also restrict which workloads are allowed to call specific ServiceEntry hosts. Use ServiceEntry for traffic management and observability, AuthorizationPolicy for workload-level access control, and NetworkPolicy for network-level enforcement.&lt;/p&gt;
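
&lt;p&gt;For instance, a NetworkPolicy has to permit the egress before any ServiceEntry policy applies; a sketch (labels hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-external-https
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: checkout
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0   # narrow to the provider's published ranges in practice
      ports:
        - protocol: TCP
          port: 443
    - ports:                  # without DNS egress, name resolution fails first
        - protocol: UDP
          port: 53
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;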

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;ServiceEntry transforms opaque outbound connections into managed, observable, policy-controlled traffic without requiring changes to your application code. Start with the basics: create a ServiceEntry for each external dependency, set the correct resolution type, and pair it with a DestinationRule for connection limits and circuit breaking. As you mature, add VirtualServices for retries and timeouts, configure sticky sessions where needed, and enable the DNS proxy for seamless TCP service integration. Every external dependency you formalize with a ServiceEntry is one fewer blind spot in your production mesh.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexandre-vazquez.com/istio-serviceentry-explained/" rel="noopener noreferrer"&gt;alexandre-vazquez.com/istio-serviceentry-explained&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>istio</category>
      <category>servicemesh</category>
      <category>devops</category>
    </item>
    <item>
      <title>Istio ServiceEntry Explained: External Services, DNS, and Traffic Control</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Tue, 28 Apr 2026 10:00:00 +0000</pubDate>
      <link>https://forem.com/alexandrev/istio-serviceentry-explained-external-services-dns-and-traffic-control-35li</link>
      <guid>https://forem.com/alexandrev/istio-serviceentry-explained-external-services-dns-and-traffic-control-35li</guid>
      <description>&lt;h1&gt;
  
  
  Istio ServiceEntry Explained: External Services, DNS, and Traffic Control
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexandre-vazquez.com/istio-serviceentry-explained/" rel="noopener noreferrer"&gt;alexandre-vazquez.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Read the full article on my blog: &lt;a href="https://alexandre-vazquez.com/istio-serviceentry-explained/" rel="noopener noreferrer"&gt;https://alexandre-vazquez.com/istio-serviceentry-explained/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>XSLT validator online: catch errors before running your transform</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Mon, 27 Apr 2026 09:00:02 +0000</pubDate>
      <link>https://forem.com/alexandrev/xslt-validator-online-catch-errors-before-running-your-transform-21n7</link>
      <guid>https://forem.com/alexandrev/xslt-validator-online-catch-errors-before-running-your-transform-21n7</guid>
      <description>&lt;p&gt;A broken XSLT stylesheet can fail in several ways: a syntax error stops the processor immediately, a namespace mismatch silently produces empty output, or an undefined variable causes a runtime error that only appears with specific inputs. Catching these issues early, before the stylesheet reaches a test environment, saves significant debugging time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What XSLT validation actually checks
&lt;/h2&gt;

&lt;p&gt;XSLT validation happens at two levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compile-time checks&lt;/strong&gt; happen when the processor parses the stylesheet. These catch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Malformed XML in the stylesheet itself&lt;/li&gt;
&lt;li&gt;References to undefined named templates or functions&lt;/li&gt;
&lt;li&gt;Type errors in static expressions&lt;/li&gt;
&lt;li&gt;Invalid XSLT element usage (wrong attributes, missing required children)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Runtime errors&lt;/strong&gt; only appear when the stylesheet runs against actual input:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing nodes that are assumed to exist&lt;/li&gt;
&lt;li&gt;Type errors in dynamic expressions&lt;/li&gt;
&lt;li&gt;Namespace mismatches between the stylesheet and the input document&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good validator runs both levels.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using XSLT Playground as a validator
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; runs Saxon, which is one of the most thorough XSLT processors available. When you paste a stylesheet and click Run, Saxon compiles it first and reports compile errors with exact line numbers before attempting execution.&lt;/p&gt;

&lt;p&gt;For runtime errors, the trace mode shows you exactly which template fired, which node was being processed, and where the failure occurred. This is more useful than a bare error message because it gives you the execution context.&lt;/p&gt;

&lt;p&gt;To validate a stylesheet quickly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Paste the stylesheet into the editor&lt;/li&gt;
&lt;li&gt;Provide a minimal XML input; even an empty root element such as &lt;code&gt;&amp;lt;root/&amp;gt;&lt;/code&gt; catches most compile errors&lt;/li&gt;
&lt;li&gt;Run with trace enabled&lt;/li&gt;
&lt;li&gt;Check the error panel for compile-time issues and the trace panel for runtime behaviour&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Common XSLT errors and how to spot them
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Namespace mismatch&lt;/strong&gt;&lt;br&gt;
Your input uses &lt;code&gt;xmlns="http://example.com/ns"&lt;/code&gt; but your stylesheet matches &lt;code&gt;element-name&lt;/code&gt; without the namespace. The match never fires and the output is empty.&lt;/p&gt;

&lt;p&gt;Fix: declare the namespace in the stylesheet and use the prefix in match patterns. A minimal sketch (the element name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ex="http://example.com/ns"&amp;gt;

  &amp;lt;!-- the prefixed pattern now matches the namespaced input --&amp;gt;
  &amp;lt;xsl:template match="ex:element-name"&amp;gt;
    ...
  &amp;lt;/xsl:template&amp;gt;

&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Using &lt;code&gt;xpath-default-namespace&lt;/code&gt; (XSLT 2.0+) avoids having to prefix every element name in your XPath expressions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Undefined variable&lt;/strong&gt;&lt;br&gt;
You reference &lt;code&gt;$config&lt;/code&gt; but it is only defined inside a conditional branch that did not execute for this input. Saxon reports: &lt;em&gt;Variable $config has not been assigned a value&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Fix: move variable declarations to the template root or provide a default. A minimal sketch (the path is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;xsl:template match="/"&amp;gt;
  &amp;lt;!-- declared once at the template root with a fallback,
       so every branch can reference $config safely --&amp;gt;
  &amp;lt;xsl:variable name="config" select="(/settings/config, 'default')[1]"/&amp;gt;
  ...
&amp;lt;/xsl:template&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Wrong output method&lt;/strong&gt;&lt;br&gt;
You are generating HTML but the processor serialises as XML, adding self-closing tags that browsers reject. Declare the output method explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;xsl:output method="html" indent="yes"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Template priority conflict&lt;/strong&gt;&lt;br&gt;
Two templates match the same node with equal priority. Saxon signals an error rather than silently picking one. Assign explicit &lt;code&gt;priority&lt;/code&gt; attributes to resolve the conflict; a sketch with illustrative patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;!-- an item carrying both attributes matched both templates at equal
     default priority; explicit priorities break the tie --&amp;gt;
&amp;lt;xsl:template match="item[@draft]" priority="2"&amp;gt;...&amp;lt;/xsl:template&amp;gt;
&amp;lt;xsl:template match="item[@urgent]" priority="1"&amp;gt;...&amp;lt;/xsl:template&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Validating before deploying to production
&lt;/h2&gt;

&lt;p&gt;If you run XSLT as part of an integration pipeline (MuleSoft, TIBCO, IBM DataPower, or a custom backend), test the stylesheet in &lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; against representative inputs before deploying. The playground runs Saxon, which may well be the same processor your backend uses, so errors caught here are errors caught before production.&lt;/p&gt;

&lt;p&gt;Export the workspace as JSON and keep it as a regression test artifact. If a future change breaks the transform, you have the original inputs and expected output to compare against.&lt;/p&gt;

</description>
      <category>codequality</category>
      <category>programming</category>
      <category>testing</category>
      <category>tooling</category>
    </item>
    <item>
      <title>XSLT online editor: how to test transformations without installing anything</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Thu, 23 Apr 2026 09:00:00 +0000</pubDate>
      <link>https://forem.com/alexandrev/xslt-online-editor-how-to-test-transformations-without-installing-anything-1p88</link>
      <guid>https://forem.com/alexandrev/xslt-online-editor-how-to-test-transformations-without-installing-anything-1p88</guid>
      <description>&lt;p&gt;Testing XSLT locally means installing a processor, configuring classpaths, and running command-line tools every time you want to check a change. For most day-to-day work — writing a new transform, debugging an output, or verifying a colleague's stylesheet — that overhead is unnecessary. A browser-based XSLT editor removes all of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to look for in an online XSLT editor
&lt;/h2&gt;

&lt;p&gt;Not all browser-based tools are equal. The things that matter in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;XSLT version support.&lt;/strong&gt; Many older tools only support XSLT 1.0. If your integration targets 2.0 or 3.0 (grouping, functions, maps, JSON support), you need a tool that runs a proper processor, not a JavaScript port. &lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; uses Saxon on the backend, which gives you full XSLT 2.0 and 3.0 support including extension functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple inputs and parameters.&lt;/strong&gt; Real transforms rarely take a single XML document. You often need a main input plus a reference document, or you need to pass runtime parameters to control output. A good editor lets you define as many inputs and parameters as your stylesheet needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace output.&lt;/strong&gt; When a transform produces wrong output, you need to see what the processor did. Trace mode shows template firings and variable values step by step, which is far more useful than reading the final output and guessing what went wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workspace persistence.&lt;/strong&gt; If you close the browser and come back later, your inputs and stylesheet should still be there. Saving to localStorage means you can pick up where you left off without copying everything into a text file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using XSLT Playground
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; covers all of the above. Here is the basic workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Paste your XML source into the input panel.&lt;/li&gt;
&lt;li&gt;Paste your XSLT stylesheet into the stylesheet panel.&lt;/li&gt;
&lt;li&gt;Set the XSLT version (1.0, 2.0, or 3.0) in the toolbar.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Run&lt;/strong&gt;. The result appears in the output panel within a second or two.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the transform fails, error messages appear immediately with line references. Enable trace to see the execution log.&lt;/p&gt;

&lt;p&gt;For transforms with parameters, open the parameters panel, add key-value pairs, and they are passed to the stylesheet as external parameters on each run. No need to hardcode them in the stylesheet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sharing and exporting setups
&lt;/h2&gt;

&lt;p&gt;Each workspace in XSLT Playground can be exported as a JSON file. The export includes the stylesheet, input document, parameters, and any trace output. You can send this file to a colleague, and they import it directly — no copy-pasting required.&lt;/p&gt;

&lt;p&gt;This is useful for bug reports: instead of describing what went wrong, export the workspace and share the file. The recipient can reproduce the exact input and output in one click.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use an online editor vs a local setup
&lt;/h2&gt;

&lt;p&gt;Use the online editor when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are exploring a new XSLT feature or syntax&lt;/li&gt;
&lt;li&gt;You need to reproduce or share a specific transform issue&lt;/li&gt;
&lt;li&gt;You are working away from your main machine&lt;/li&gt;
&lt;li&gt;You want to quickly verify a change before committing it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use a local setup when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are processing files that are sensitive or cannot leave your network&lt;/li&gt;
&lt;li&gt;You need to transform very large documents (megabytes or more)&lt;/li&gt;
&lt;li&gt;You are integrating XSLT into a CI pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everything else, the browser editor is faster and easier to use.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>resources</category>
      <category>testing</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Kubernetes HPA Best Practices: When CPU Works, Why Memory Almost Never Does</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Tue, 21 Apr 2026 11:00:00 +0000</pubDate>
      <link>https://forem.com/alexandrev/kubernetes-hpa-best-practices-when-cpu-works-why-memory-almost-never-does-54a1</link>
      <guid>https://forem.com/alexandrev/kubernetes-hpa-best-practices-when-cpu-works-why-memory-almost-never-does-54a1</guid>
      <description>&lt;h1&gt;
  
  
  Kubernetes HPA Best Practices: When CPU Works, Why Memory Almost Never Does
&lt;/h1&gt;

&lt;h2&gt;
  
  
  How HPA Actually Decides to Scale
&lt;/h2&gt;

&lt;p&gt;The HPA controller uses a simple formula to determine desired replicas: &lt;code&gt;desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue))&lt;/code&gt;. A critical detail is that the metric value is expressed relative to the resource &lt;em&gt;request&lt;/em&gt;, not the resource limit. This distinction explains many HPA failures.&lt;/p&gt;
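
&lt;p&gt;For example, 4 replicas averaging 90% utilization against a 60% target yield &lt;code&gt;ceil(4 × 90 / 60) = 6&lt;/code&gt; desired replicas.&lt;/p&gt;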

&lt;p&gt;HPA polls metrics every 15 seconds by default, scaling up within one to three polling cycles when thresholds are exceeded. Scale-down is deliberately slow, waiting 5 minutes by default to prevent oscillation.&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU-Based HPA: When It Works and When It Doesn't
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Where CPU HPA Works Well
&lt;/h3&gt;

&lt;p&gt;CPU-based HPA succeeds with stateless request-processing workloads where CPU consumption correlates with request volume. Prerequisites include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accurate CPU requests&lt;/strong&gt; set to actual sustained consumption, not placeholders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasonable request-to-limit ratios&lt;/strong&gt; (1:4 or less)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU consumption that tracks user load linearly&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where CPU HPA Fails
&lt;/h3&gt;

&lt;p&gt;CPU HPA struggles with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency-sensitive services with sharp spikes&lt;/strong&gt; — by the time HPA detects and reacts to peaks, the burst may be over&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O-bound workloads&lt;/strong&gt; — showing low CPU even under heavy load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workloads with cold-start costs&lt;/strong&gt; — requiring earlier scaling decisions than CPU metrics can trigger&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Memory-Based HPA: Why It Almost Always Breaks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem
&lt;/h3&gt;

&lt;p&gt;Memory is incompressible; exhausting it causes OOM termination. And unlike CPU, memory consumption is relatively stable for well-architected services: a Go service or JVM application maintains a roughly constant footprint whether it serves 10 or 10,000 requests per second.&lt;/p&gt;

&lt;p&gt;This creates two outcomes: memory HPA either never triggers (useless) or always triggers (permanently scaled out).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Request Misconfiguration Trap
&lt;/h3&gt;

&lt;p&gt;A Java service needing 512Mi heap but configured with a 256Mi request will immediately consume 200% of its request. An HPA with 70% memory threshold will scale such workloads to maximum replicas permanently. The solution is right-sizing requests, not adjusting thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  JVM and Go Runtime Memory Behavior
&lt;/h3&gt;

&lt;p&gt;The JVM allocates heap up to its configured maximum and does not release it aggressively, even after garbage collection. Go's garbage collector prioritizes low latency over minimal memory use and may retain memory well beyond what the live heap needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Memory HPA Is Actually Appropriate
&lt;/h3&gt;

&lt;p&gt;Memory-based HPA is defensible only in narrow cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workloads where memory consumption tracks load linearly&lt;/li&gt;
&lt;li&gt;As a secondary safety valve (not primary) at 85-90% threshold for protecting against memory leaks&lt;/li&gt;
&lt;li&gt;Caching services where avoiding eviction before scaling out is critical&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Right-Sizing Requests Before Adding HPA
&lt;/h2&gt;

&lt;p&gt;No HPA strategy works without accurate resource requests. Run workloads under representative load and measure actual consumption. VPA in recommendation mode provides data-driven baselines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VerticalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service-vpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;targetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;updatePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;updateMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Off"&lt;/span&gt;   &lt;span class="c1"&gt;# Recommendation only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical note:&lt;/strong&gt; VPA and HPA cannot both auto-manage the same resource metric simultaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better Signals: What to Scale On Instead
&lt;/h2&gt;

&lt;p&gt;Shift from resource consumption metrics (describing the past) to demand metrics (describing current needs).&lt;/p&gt;

&lt;h3&gt;
  
  
  Requests Per Second (RPS)
&lt;/h3&gt;

&lt;p&gt;For HTTP services, requests per second per replica is usually the most accurate proxy for load. RPS measures demand directly and works equally well for CPU-bound, memory-bound, and I/O-bound services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pods&lt;/span&gt;
    &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http_requests_per_second&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AverageValue&lt;/span&gt;
        &lt;span class="na"&gt;averageValue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Queue Depth and Lag
&lt;/h3&gt;

&lt;p&gt;For consumer workloads reading from message queues, consumer lag (the number of messages waiting to be processed) is the right scaling signal. KEDA was built for this use case, reading consumer group lag directly.&lt;/p&gt;
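
&lt;p&gt;A minimal KEDA sketch (broker, group, and topic names hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-consumer-scaler
spec:
  scaleTargetRef:
    name: order-consumer        # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.messaging:9092
        consumerGroup: order-consumer
        topic: orders
        lagThreshold: "100"     # target lag per replica
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;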

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;P99 latency per replica is an excellent signal for latency-sensitive services, though it requires custom metrics from a service mesh or APM tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scheduled and Predictive Scaling
&lt;/h3&gt;

&lt;p&gt;For predictable traffic patterns, proactive scaling outperforms reactive scaling. KEDA's Cron scaler enables time-based scaling rules.&lt;/p&gt;
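
&lt;p&gt;A minimal sketch of a time-based rule (the timezone and schedule values are illustrative) that pre-warms capacity ahead of a known weekday peak:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-service-cron-scaler
spec:
  scaleTargetRef:
    name: my-service
  triggers:
  - type: cron
    metadata:
      timezone: Europe/Madrid    # illustrative timezone
      start: 0 8 * * 1-5         # hold capacity from 08:00 on weekdays
      end: 0 20 * * 1-5          # release it at 20:00
      desiredReplicas: "10"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;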

&lt;h2&gt;
  
  
  HPA Configuration Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Always Set minReplicas ≥ 2 for Production
&lt;/h3&gt;

&lt;p&gt;An HPA with &lt;code&gt;minReplicas: 1&lt;/code&gt; can scale the service in to a single pod, leaving a single point of failure exactly when a node failure or eviction would hurt most.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tune Stabilization Windows
&lt;/h3&gt;

&lt;p&gt;The default 5-minute scale-down stabilization window is often too short for workloads with cyclical traffic, removing replicas just before the next peak. Increase it to match your workload's natural cycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Resource&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cpu&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Utilization&lt;/span&gt;
        &lt;span class="na"&gt;averageUtilization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
  &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scaleDown&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;
      &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Percent&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;25&lt;/span&gt;
        &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="na"&gt;scaleUp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
      &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Percent&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
        &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;behavior&lt;/code&gt; block (available in HPA v2) enables independent control over scale-up and scale-down.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a Lower CPU Threshold Than You Think
&lt;/h3&gt;

&lt;p&gt;New capacity is not instant: if scale-up takes 45 seconds (scheduling, image pull, startup, readiness), rising traffic must be absorbed by the existing pods during that window, and a 70% threshold leaves only 30% headroom before they start throttling. Set CPU targets at 50-60% for services where scaling latency matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Combine HPA with PodDisruptionBudgets
&lt;/h3&gt;

&lt;p&gt;HPA keeps replica counts in motion while node maintenance evicts pods. Without a PodDisruptionBudget, a node drain can evict multiple replicas at once just after HPA has scaled the deployment down, briefly leaving too little capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service-pdb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50%"&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Don't Mix VPA Auto-Update with HPA on the Same Metric
&lt;/h3&gt;

&lt;p&gt;VPA auto-updating requests while HPA scales on utilization of those same resources creates conflicting control loops: HPA's utilization percentage is computed against the request, so every VPA change shifts the very baseline HPA is reacting to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Framework: Which Autoscaler for Which Workload
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload type&lt;/th&gt;
&lt;th&gt;Recommended signal&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stateless HTTP API, CPU-bound&lt;/td&gt;
&lt;td&gt;CPU utilization at 50-60%&lt;/td&gt;
&lt;td&gt;HPA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stateless HTTP API, I/O-bound&lt;/td&gt;
&lt;td&gt;RPS per replica or P99 latency&lt;/td&gt;
&lt;td&gt;HPA + custom metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message queue consumer&lt;/td&gt;
&lt;td&gt;Consumer lag / queue depth&lt;/td&gt;
&lt;td&gt;KEDA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event-driven / Kafka / SQS&lt;/td&gt;
&lt;td&gt;Event rate or lag&lt;/td&gt;
&lt;td&gt;KEDA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Predictable traffic pattern&lt;/td&gt;
&lt;td&gt;Schedule (time-based)&lt;/td&gt;
&lt;td&gt;KEDA Cron scaler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workload with memory leak risk&lt;/td&gt;
&lt;td&gt;CPU primary + memory at 85% secondary&lt;/td&gt;
&lt;td&gt;HPA (v2 multi-metric)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Right-sizing before HPA&lt;/td&gt;
&lt;td&gt;Historical CPU/memory recommendations&lt;/td&gt;
&lt;td&gt;VPA recommendation mode&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Going Beyond HPA: KEDA and Custom Metrics
&lt;/h2&gt;

&lt;p&gt;KEDA provides a Kubernetes-native autoscaling framework with over 60 built-in scalers. The key architectural point: KEDA does not replace HPA; it feeds it. KEDA creates and manages HPA resources on your behalf while supplying signals HPA cannot access natively.&lt;/p&gt;
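
&lt;p&gt;To illustrate feeding HPA a signal it cannot read natively, here is a hedged sketch using KEDA's Prometheus scaler (the server address and query are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-service-prom-scaler
spec:
  scaleTargetRef:
    name: my-service
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090               # placeholder
      query: sum(rate(http_requests_total{app="my-service"}[2m]))    # placeholder query
      threshold: "500"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Behind the scenes, KEDA materializes this as an HPA backed by an External metric, which is exactly the sense in which it feeds HPA rather than replacing it.&lt;/p&gt;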

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can I use both CPU and memory in the same HPA?
&lt;/h3&gt;

&lt;p&gt;Yes. HPA v2 supports multiple metrics simultaneously and scales to satisfy whichever metric demands the most replicas. A common pattern, sketched below, is CPU at a 60% threshold plus memory at 85%, so the memory metric only triggers scaling during genuine overconsumption.&lt;/p&gt;
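
&lt;p&gt;A minimal sketch of that combination, reusing the same Deployment names as the earlier examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60    # primary scaling signal
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 85    # safety valve for genuine overconsumption
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;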

&lt;h3&gt;
  
  
  Why does my workload scale up immediately after deployment?
&lt;/h3&gt;

&lt;p&gt;Almost always a resource request misconfiguration. Compare actual consumption against requests with &lt;code&gt;kubectl top pods&lt;/code&gt;; if a pod consumes 200% of its request just by running, raise the request to match real usage before enabling HPA.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does HPA scale down too aggressively and cause latency spikes?
&lt;/h3&gt;

&lt;p&gt;Increase &lt;code&gt;scaleDown.stabilizationWindowSeconds&lt;/code&gt; in the HPA &lt;code&gt;behavior&lt;/code&gt; block. Also add a &lt;code&gt;Percent&lt;/code&gt; policy limiting scale-down to 25% of replicas per minute.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I set HPA on every deployment?
&lt;/h3&gt;

&lt;p&gt;No. HPA fits stateless services, queue consumers, and request handlers. It is a poor fit for stateful workloads where scaling requires more than adding replicas, for singleton controllers, and for batch jobs that should simply run to completion.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the minimum CPU request for reliable HPA?
&lt;/h3&gt;

&lt;p&gt;There is no absolute minimum, but requests below 100m make percentage thresholds coarse-grained: at a 50m request and a 70% target, scaling triggers at just 35m of consumption, where measurement noise dominates. For workloads that small, scale on RPS or another custom metric instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I debug HPA scaling decisions?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;kubectl describe hpa&lt;/code&gt; to see current metric values and the last scaling events. Check HPA events with &lt;code&gt;kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler&lt;/code&gt;. For custom metrics, verify that the metrics adapter (not just metrics-server) is returning the values you expect.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexandre-vazquez.com/kubernetes-hpa-best-practices/" rel="noopener noreferrer"&gt;alexandre-vazquez.com/kubernetes-hpa-best-practices&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>scalability</category>
    </item>
  </channel>
</rss>
