<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mateen Anjum</title>
    <description>The latest articles on Forem by Mateen Anjum (@mateenali66).</description>
    <link>https://forem.com/mateenali66</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3644604%2F79a9c96c-74eb-4675-9e33-f32d208b4d1b.jpg</url>
      <title>Forem: Mateen Anjum</title>
      <link>https://forem.com/mateenali66</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mateenali66"/>
    <language>en</language>
    <item>
      <title>Kubernetes v1.36 Drops April 22: What Platform Engineers Actually Need to Know</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 18 Apr 2026 04:54:58 +0000</pubDate>
      <link>https://forem.com/mateenali66/kubernetes-v136-drops-april-22-what-platform-engineers-actually-need-to-know-3l81</link>
      <guid>https://forem.com/mateenali66/kubernetes-v136-drops-april-22-what-platform-engineers-actually-need-to-know-3l81</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Kubernetes v1.36 releases April 22, 2026. The headline features are DRA GPU partitioning, workload-aware preemption for AI/ML jobs, and the permanent removal of the gitRepo volume plugin. Ingress-nginx is also officially retired. If you run AI inference workloads or care about cluster security, this release is not optional reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Release Matters More Than Most
&lt;/h2&gt;

&lt;p&gt;The CNCF's 2025 annual survey dropped a number that stopped a lot of people mid-scroll: 66% of organizations hosting generative AI models now use Kubernetes for some or all of their inference workloads. That's not a trend, that's a fait accompli. Kubernetes is the AI compute substrate whether you planned for it or not.&lt;/p&gt;

&lt;p&gt;v1.36 is the release that leans into that reality. The bulk of the new work is in Dynamic Resource Allocation (DRA), gang scheduling, and topology-aware placement, all of which exist because running distributed AI/ML jobs on Kubernetes has historically been painful. This release makes it less painful.&lt;/p&gt;

&lt;p&gt;But there are also breaking changes and security fixes that affect everyone, not just the ML crowd. Let me walk through what actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Breaking Changes First
&lt;/h2&gt;

&lt;h3&gt;
  
  
  gitRepo Volume Plugin: Gone for Good
&lt;/h3&gt;

&lt;p&gt;If you're still using &lt;code&gt;gitRepo&lt;/code&gt; volumes, stop reading and go fix that right now. The plugin has been deprecated since v1.11 and is now permanently disabled in v1.36. No feature flag, no workaround.&lt;/p&gt;

&lt;p&gt;The reason it's gone is serious: gitRepo allowed attackers to run code as root on the node. It was a known attack vector for years. The right replacement is an init container running &lt;code&gt;git clone&lt;/code&gt;, or a git-sync sidecar. Both are well-documented and production-proven.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (broken in v1.36)&lt;/span&gt;
&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;code&lt;/span&gt;
    &lt;span class="na"&gt;gitRepo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/example/repo"&lt;/span&gt;
      &lt;span class="na"&gt;revision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main"&lt;/span&gt;

&lt;span class="c1"&gt;# After: use an init container&lt;/span&gt;
&lt;span class="na"&gt;initContainers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;git-sync&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/git-sync/git-sync:v4.2.1&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--repo=https://github.com/example/repo&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--branch=main&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--root=/git&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--one-time&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;code&lt;/span&gt;
        &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/git&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Ingress-NGINX Is Retired
&lt;/h3&gt;

&lt;p&gt;SIG Network and the Security Response Committee retired ingress-nginx on March 24, 2026. No more releases, no more security patches. Existing deployments keep running, but you're on your own for CVEs from here.&lt;/p&gt;

&lt;p&gt;The community's recommended alternatives are Envoy Gateway (CNCF graduated), Cilium Gateway API, and Traefik. If you're on ingress-nginx in production, this is your migration window. Don't wait for the next CVE to force your hand.&lt;/p&gt;

&lt;h3&gt;
  
  
  service.spec.externalIPs Deprecated
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;externalIPs&lt;/code&gt; field in Service specs is being deprecated (full removal planned for v1.43). It's been a known vector for man-in-the-middle attacks since CVE-2020-8554. You'll see deprecation warnings starting in v1.36. Migrate to LoadBalancer services, NodePort, or Gateway API.&lt;/p&gt;
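
&lt;p&gt;For reference, here's the shape of that migration. This is a minimal sketch with a hypothetical &lt;code&gt;my-app&lt;/code&gt; Service; swap in Gateway API instead if that's your direction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Deprecated pattern: warnings start in v1.36, removal planned for v1.43
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ports:
    - port: 80
  externalIPs:
    - 203.0.113.10

---
# One replacement: let the cloud provider assign the external address
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;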

&lt;h2&gt;
  
  
  The AI/ML Features That Actually Change How You Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DRA: Partitionable Devices (Beta)
&lt;/h3&gt;

&lt;p&gt;This is the one I'm most excited about. v1.36 promotes DRA support for partitionable devices to beta, meaning it's enabled by default. A single GPU can now be split into multiple logical units and allocated to different workloads.&lt;/p&gt;

&lt;p&gt;Before this, if you had an H100 and a workload that only needed 20% of it, you either wasted 80% or ran a separate MIG configuration outside Kubernetes. Now the scheduler handles it natively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;resource.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ResourceClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;partial-gpu&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-slice&lt;/span&gt;
      &lt;span class="na"&gt;deviceClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia.com/gpu&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="c1"&gt;# Request a partition, not the whole device&lt;/span&gt;
      &lt;span class="na"&gt;selectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;device.attributes["nvidia.com/gpu"].partitionable == &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For platform teams running shared GPU clusters, this is a significant cost lever. You can pack more inference workloads onto the same hardware without sacrificing isolation.&lt;/p&gt;
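
&lt;p&gt;Wiring a claim into a workload follows the standard DRA pattern. This sketch references the &lt;code&gt;partial-gpu&lt;/code&gt; claim from the example above; the image name is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  resourceClaims:
    - name: gpu
      resourceClaimName: partial-gpu  # the ResourceClaim defined above
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest  # illustrative image
      resources:
        claims:
          - name: gpu  # this container consumes the GPU slice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;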

&lt;h3&gt;
  
  
  Workload-Aware Preemption (Alpha)
&lt;/h3&gt;

&lt;p&gt;Standard Kubernetes preemption works pod-by-pod. For distributed AI/ML jobs, that's a disaster: preempt one pod from a training job and the whole job stalls, wasting all the resources it's still holding.&lt;/p&gt;

&lt;p&gt;v1.36 introduces workload-aware preemption via &lt;code&gt;PodGroups&lt;/code&gt;. The scheduler now treats a group of related pods as a single entity. When it needs to make room for a high-priority job, it preempts entire groups rather than individual pods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scheduling.k8s.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodGroup&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;training-job-a&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minMember&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;priorityClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high-priority&lt;/span&gt;
  &lt;span class="na"&gt;gangSchedulingPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;disruptionMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodGroup&lt;/span&gt;  &lt;span class="c1"&gt;# preempt the whole group, not individual pods&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is alpha, so it's off by default. But if you're running Kueue or JobSet for batch AI workloads, this is worth enabling in a test cluster now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pod-Level Resource Managers (Alpha)
&lt;/h3&gt;

&lt;p&gt;For HPC and AI/ML workloads, NUMA alignment matters. Previously, the Topology Manager only worked at the container level. If you had a training container plus logging and monitoring sidecars in the same pod, you couldn't guarantee they all landed on the same NUMA node.&lt;/p&gt;

&lt;p&gt;v1.36 adds pod-scope resource management: you can now set &lt;code&gt;pod.spec.resources&lt;/code&gt; and have the Topology Manager treat the entire pod as a single scheduling unit. All containers get resources from the same NUMA node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16"&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64Gi"&lt;/span&gt;
  &lt;span class="na"&gt;topologySpreadConstraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;maxSkew&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;topology.kubernetes.io/numa-node&lt;/span&gt;
      &lt;span class="na"&gt;whenUnsatisfiable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DoNotSchedule&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  DRA Resource Availability Visibility (Alpha)
&lt;/h3&gt;

&lt;p&gt;Finally, a native way to answer "how many GPUs are free in this cluster?" without writing custom tooling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: resource.k8s.io/v1alpha1
kind: ResourcePoolStatusRequest
metadata:
  name: check-gpus
spec:
  driver: nvidia.com/gpu
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;kubectl get rpsr/check-gpus &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
&lt;span class="c"&gt;# Returns: totalDevices, allocatedDevices, availableDevices per node&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is alpha, but it's the kind of operational visibility that platform teams have been hacking around for years.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stability Improvements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SELinux Volume Labeling: Now GA
&lt;/h3&gt;

&lt;p&gt;The payoff is faster pod startup on SELinux-enforcing systems. The feature replaces recursive file relabeling with a single mount-time label, which can cut pod startup time significantly on large volumes. It's been in beta since v1.28 and is now stable and on by default.&lt;/p&gt;

&lt;p&gt;If you're running RHEL or any SELinux-enforcing OS, you'll notice this immediately.&lt;/p&gt;
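
&lt;p&gt;The mount-time labeling applies when a pod declares its SELinux context explicitly. A minimal sketch, with illustrative level, image, and claim names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: selinux-demo
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"  # volume mounted with this label; no recursive relabel
  containers:
    - name: app
      image: registry.example.com/app:latest
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;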

&lt;h3&gt;
  
  
  External ServiceAccount Token Signing: GA
&lt;/h3&gt;

&lt;p&gt;The kube-apiserver can now delegate token signing to external KMS or HSM systems. For clusters with strict key management requirements (financial services, healthcare, government), this removes a significant compliance gap.&lt;/p&gt;
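
&lt;p&gt;A rough sketch of the wiring, based on the upstream external-signer work: the API server talks to a signer process over a local socket instead of holding the private key itself. The exact flag names may differ, so verify against the v1.36 docs before relying on this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# kube-apiserver delegates token signing to an external signer (paths illustrative)
kube-apiserver \
  --service-account-issuer=https://kubernetes.default.svc \
  --service-account-signing-endpoint=/var/run/signer/signer.sock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;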

&lt;h3&gt;
  
  
  Graceful Leader Transition (Alpha)
&lt;/h3&gt;

&lt;p&gt;Control plane components (kube-controller-manager, kube-scheduler) used to call &lt;code&gt;os.Exit()&lt;/code&gt; when losing leader election, forcing a full restart. v1.36 introduces graceful transitions: the component moves to follower state and re-enters the election without restarting. Faster failover, less noise in your control plane logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stale Controller Mitigation (Alpha)
&lt;/h3&gt;

&lt;p&gt;Large clusters with high churn have always had a subtle bug: a controller creates a resource, its cache hasn't updated yet, and it tries to create the same resource again. v1.36 adds cache freshness tracking so controllers check whether their local state is current before reconciling. Fewer duplicate creates, fewer spurious errors in busy clusters.&lt;/p&gt;

&lt;h3&gt;
  
  
  HPA Scale-to-Zero (Alpha)
&lt;/h3&gt;

&lt;p&gt;The Horizontal Pod Autoscaler can now scale deployments to zero replicas based on external metrics (queue depth, custom metrics). When the queue is empty, the deployment goes to zero. When work arrives, it scales back up. This is the missing piece for event-driven workloads that don't need to run 24/7.&lt;/p&gt;
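
&lt;p&gt;A minimal sketch of what that looks like, assuming a hypothetical &lt;code&gt;queue_depth&lt;/code&gt; external metric is already exposed via a metrics adapter; check the release notes for the exact alpha feature gate before enabling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 0   # alpha: requires the scale-to-zero feature gate
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: queue_depth  # hypothetical external metric
        target:
          type: AverageValue
          averageValue: "5"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;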

&lt;h2&gt;
  
  
  What to Do Before April 22
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit gitRepo volumes.&lt;/strong&gt; Run &lt;code&gt;kubectl get pods -A -o json | jq '.items[].spec.volumes[]? | select(.gitRepo != null)'&lt;/code&gt;. If you get output, you have work to do.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plan your ingress-nginx migration.&lt;/strong&gt; Check &lt;code&gt;kubectl get ingressclass&lt;/code&gt; and &lt;code&gt;kubectl get pods -A | grep ingress-nginx&lt;/code&gt;. If you're running it, pick a replacement and start testing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check for externalIPs usage.&lt;/strong&gt; &lt;code&gt;kubectl get svc -A -o json | jq '.items[] | select(.spec.externalIPs != null) | .metadata.name'&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable DRA partitionable devices in staging.&lt;/strong&gt; If you run GPU workloads, this is worth testing before it becomes the default everywhere.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read the full changelog.&lt;/strong&gt; The &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.36.md" rel="noopener noreferrer"&gt;CHANGELOG-1.36.md&lt;/a&gt; is dense but worth scanning for anything specific to your stack.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;v1.36 isn't a flashy release. There's no single feature that rewrites how Kubernetes works. Instead, it's a release that takes the AI/ML workload story seriously at the scheduler and resource allocation level, while cleaning up years of accumulated security debt.&lt;/p&gt;

&lt;p&gt;The gitRepo removal and ingress-nginx retirement are overdue. The DRA work is genuinely new capability. And the gang scheduling improvements are the kind of thing that makes distributed training jobs actually reliable on Kubernetes instead of just theoretically possible.&lt;/p&gt;

&lt;p&gt;If you're running AI inference at scale, v1.36 is the release you've been waiting for. If you're running anything else, it's a solid maintenance release with a few security items you can't ignore.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/blog/2026/03/30/kubernetes-v1-36-sneak-peek/" rel="noopener noreferrer"&gt;Kubernetes v1.36 Sneak Peek&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://palark.com/blog/kubernetes-1-36-release-features/" rel="noopener noreferrer"&gt;Palark: Deep Dive into v1.36 Alpha Features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/announcements/2026/01/20/kubernetes-established-as-the-de-facto-operating-system-for-ai-as-production-use-hits-82-in-2025-cncf-annual-cloud-native-survey/" rel="noopener noreferrer"&gt;CNCF 2025 Annual Survey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/blog/2025/11/11/ingress-nginx-retirement/" rel="noopener noreferrer"&gt;Ingress-NGINX Retirement Announcement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/" rel="noopener noreferrer"&gt;DRA Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudcomputing</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>ingress-nginx Is Dead: How I Migrated to Gateway API Before It Became a Liability</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Tue, 07 Apr 2026 18:15:05 +0000</pubDate>
      <link>https://forem.com/mateenali66/ingress-nginx-is-dead-how-i-migrated-to-gateway-api-before-it-became-a-liability-2815</link>
      <guid>https://forem.com/mateenali66/ingress-nginx-is-dead-how-i-migrated-to-gateway-api-before-it-became-a-liability-2815</guid>
      <description>&lt;p&gt;ingress-nginx was archived on March 24, 2026 after a string of critical CVEs including a 9.8 CVSS unauthenticated RCE. Gateway API v1.4 is the CNCF-graduated replacement. I used ingress2gateway 1.0 to convert 40+ Ingress resources to HTTPRoutes, validated the output, and cut over with zero downtime. Here's exactly how I did it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Happened
&lt;/h2&gt;

&lt;p&gt;In March 2025, CVE-2025-1974 (dubbed "IngressNightmare") dropped: a CVSS 9.8 unauthenticated remote code execution vulnerability in ingress-nginx's admission webhook. Any attacker with network access to the webhook could execute arbitrary code inside the controller pod, which typically has broad cluster permissions. That was bad enough on its own.&lt;/p&gt;

&lt;p&gt;Then came 2026, and four more HIGH-severity CVEs landed in quick succession. Here's the full list, original critical included:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CVE&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2025-1974&lt;/td&gt;
&lt;td&gt;CRITICAL 9.8&lt;/td&gt;
&lt;td&gt;Unauthenticated RCE via admission webhook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-1580&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Config injection leading to privilege escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-24512&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Path injection through nginx config manipulation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-24513&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Authentication bypass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-24514&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Annotation abuse for unauthorized access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On March 24, 2026, the ingress-nginx repository was officially archived. Read-only. No more patches. No more CVE fixes. If you're still running it, you're running unpatched software with known critical vulnerabilities.&lt;/p&gt;

&lt;p&gt;This wasn't a surprise deprecation. The Kubernetes community had been building Gateway API for years as the successor to the Ingress resource. But the CVE storm turned "migrate when convenient" into "migrate now."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75nc6whgvdkzbi6s9rja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75nc6whgvdkzbi6s9rja.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gateway API: What Actually Changed
&lt;/h2&gt;

&lt;p&gt;Gateway API isn't just "Ingress v2." It fundamentally changes how traffic routing is modeled in Kubernetes by splitting responsibilities across three layers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff104yqs3037jx07ljf8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff104yqs3037jx07ljf8j.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: GatewayClass (Infrastructure Admin)
&lt;/h3&gt;

&lt;p&gt;The infrastructure team defines what gateway implementation is available. Think of it as the "which load balancer technology" decision.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GatewayClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-gateway&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;controllerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.envoyproxy.io/gatewayclass-controller&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 2: Gateway (Cluster Operator)
&lt;/h3&gt;

&lt;p&gt;The platform team creates Gateway resources that bind to a GatewayClass. This is where you define listeners, ports, TLS certificates, and which namespaces can attach routes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Gateway&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main-gateway&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-infra&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;gatewayClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-gateway&lt;/span&gt;
  &lt;span class="na"&gt;listeners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPS&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
      &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Terminate&lt;/span&gt;
        &lt;span class="na"&gt;certificateRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;wildcard-tls&lt;/span&gt;
      &lt;span class="na"&gt;allowedRoutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Selector&lt;/span&gt;
          &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;gateway-access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 3: HTTPRoute (Application Developer)
&lt;/h3&gt;

&lt;p&gt;Application teams define their own routing rules without touching the gateway configuration. They just reference the Gateway they want to attach to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parentRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main-gateway&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-infra&lt;/span&gt;
  &lt;span class="na"&gt;hostnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api.example.com"&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PathPrefix&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/v1&lt;/span&gt;
      &lt;span class="na"&gt;backendRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-service&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation matters because it maps to how teams actually operate. Infrastructure admins pick the implementation. Platform engineers configure the gateway. App developers define their routes. Nobody steps on each other's toes, and RBAC enforces the boundaries.&lt;/p&gt;
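
&lt;p&gt;A minimal RBAC sketch of that boundary: app teams get full control of &lt;code&gt;HTTPRoutes&lt;/code&gt; in their own namespace, and nothing else. The role name and namespace are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: route-editor
  namespace: my-api
rules:
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["httproutes"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Gateways and GatewayClasses are deliberately absent:
# those stay with the platform and infrastructure teams.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;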

&lt;h3&gt;
  
  
  Why This Is Better Than Annotations
&lt;/h3&gt;

&lt;p&gt;With ingress-nginx, everything was shoved into annotations. Rate limiting, CORS, timeouts, rewrites, all of it crammed into &lt;code&gt;nginx.ingress.kubernetes.io/*&lt;/code&gt; strings that were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-standard&lt;/strong&gt;: Every controller had its own annotation format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unvalidated&lt;/strong&gt;: Typo an annotation name? Silent failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured&lt;/strong&gt;: Complex configs as string values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-portable&lt;/strong&gt;: Locked to one implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gateway API uses typed CRD fields. Your IDE autocompletes them. The API server validates them. They work across implementations.&lt;/p&gt;
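
&lt;p&gt;A concrete before/after: a prefix rewrite that used to live in a stringly-typed annotation becomes a validated filter on the HTTPRoute rule (both shown here as fragments, not complete manifests):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# ingress-nginx: an opaque string the API server can't validate
metadata:
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /

---
# Gateway API: a typed filter on an HTTPRoute rule, validated at admission time
filters:
  - type: URLRewrite
    urlRewrite:
      path:
        type: ReplacePrefixMatch
        replacePrefixMatch: /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;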

&lt;h2&gt;
  
  
  The Migration: Using ingress2gateway 1.0
&lt;/h2&gt;

&lt;p&gt;On March 20, 2026, ingress2gateway 1.0 shipped with support for 30+ ingress-nginx annotations. This was the tool that made bulk migration practical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ingress2gateway
&lt;span class="c"&gt;# or&lt;/span&gt;
go &lt;span class="nb"&gt;install &lt;/span&gt;github.com/kubernetes-sigs/ingress2gateway@v1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Scan and Convert
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Convert everything cluster-wide&lt;/span&gt;
ingress2gateway print &lt;span class="nt"&gt;--providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ingress-nginx &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gwapi.yaml

&lt;span class="c"&gt;# Or target a specific namespace&lt;/span&gt;
ingress2gateway print &lt;span class="nt"&gt;--namespace&lt;/span&gt; my-api &lt;span class="nt"&gt;--providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ingress-nginx &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gwapi.yaml

&lt;span class="c"&gt;# If you've chosen your implementation, use emitter flags&lt;/span&gt;
ingress2gateway print &lt;span class="nt"&gt;--emitter&lt;/span&gt; envoy-gateway &lt;span class="nt"&gt;--providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ingress-nginx &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gwapi.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Review the Output
&lt;/h3&gt;

&lt;p&gt;Here's what a typical translation looks like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (Ingress with ingress-nginx annotations):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-allow-origin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://app.example.com"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-allow-methods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;POST,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OPTIONS"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-enable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-read-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/use-regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;api.example.com&lt;/span&gt;
      &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-tls&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api.example.com&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api/v[0-9]+/users&lt;/span&gt;
            &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ImplementationSpecific&lt;/span&gt;
            &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users-service&lt;/span&gt;
                &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After (Gateway API HTTPRoute):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parentRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main-gateway&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-infra&lt;/span&gt;
  &lt;span class="na"&gt;hostnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api.example.com"&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RegularExpression&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/v[0-9]+/users"&lt;/span&gt;
      &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ResponseHeaderModifier&lt;/span&gt;
          &lt;span class="na"&gt;responseHeaderModifier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Access-Control-Allow-Origin&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://app.example.com"&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Access-Control-Allow-Methods&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;POST,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OPTIONS"&lt;/span&gt;
      &lt;span class="na"&gt;timeouts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;backendRequest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
      &lt;span class="na"&gt;backendRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users-service&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The structure is cleaner. CORS headers are explicit. The regex path type is a first-class field instead of being toggled by an annotation. Timeouts are typed durations, not string-encoded integers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcsmgeji89sk2o5v22s0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcsmgeji89sk2o5v22s0g.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What ingress2gateway Cannot Translate
&lt;/h2&gt;

&lt;p&gt;The tool is good, but it's not magic. Watch for these:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom nginx and Lua snippets.&lt;/strong&gt; If you used &lt;code&gt;nginx.ingress.kubernetes.io/server-snippet&lt;/code&gt; or &lt;code&gt;configuration-snippet&lt;/code&gt; with custom Lua or raw nginx config, those have no Gateway API equivalent. You'll need to reimplement that logic in your application or use implementation-specific policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting.&lt;/strong&gt; ingress-nginx rate limiting annotations don't map to standard Gateway API fields. Most implementations offer their own rate limiting CRDs (like Envoy Gateway's &lt;code&gt;BackendTrafficPolicy&lt;/code&gt;).&lt;/p&gt;
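&lt;p&gt;For a rough idea of the shape, here's a hedged sketch of a local rate limit using Envoy Gateway's &lt;code&gt;BackendTrafficPolicy&lt;/code&gt; (field names vary between Envoy Gateway versions, so treat this as a starting point and check the docs for yours):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: my-api-ratelimit
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: my-api
  rateLimit:
    type: Local
    local:
      rules:
        - limit:
            requests: 100
            unit: Second
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;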

&lt;p&gt;&lt;strong&gt;ModSecurity / WAF rules.&lt;/strong&gt; If you had ModSecurity enabled via annotations, you'll need a separate WAF solution or an implementation that supports it natively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session affinity.&lt;/strong&gt; Cookie-based session affinity annotations need implementation-specific configuration in Gateway API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom error pages.&lt;/strong&gt; These were nginx-specific and need to be handled at the application level or through implementation extensions.&lt;/p&gt;

&lt;p&gt;ingress2gateway prints warnings for annotations it can't convert. Read every one. In our migration, three services would have silently lost their rate limiting configs, exactly the kind of gap that surfaces later as a production incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing a Gateway API Implementation
&lt;/h2&gt;

&lt;p&gt;Gateway API is a spec. You need an implementation. Here's how I evaluated the main options:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Backed By&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Envoy Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Envoy Proxy / CNCF&lt;/td&gt;
&lt;td&gt;General purpose, feature-rich&lt;/td&gt;
&lt;td&gt;Strong community, good docs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;kgateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Solo.io&lt;/td&gt;
&lt;td&gt;Advanced traffic management&lt;/td&gt;
&lt;td&gt;Commercial support available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cilium Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Isovalent/Cisco&lt;/td&gt;
&lt;td&gt;eBPF-native networking&lt;/td&gt;
&lt;td&gt;Great if you already run Cilium CNI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NGINX Gateway Fabric&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;F5/NGINX&lt;/td&gt;
&lt;td&gt;Familiar nginx users&lt;/td&gt;
&lt;td&gt;Uses nginx under the hood&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Istio Waypoint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google/Solo.io&lt;/td&gt;
&lt;td&gt;Service mesh integration&lt;/td&gt;
&lt;td&gt;If you're already on Istio&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I went with Envoy Gateway. It's CNCF-backed, has broad feature coverage, and doesn't require buying into a service mesh. The &lt;code&gt;--emitter envoy-gateway&lt;/code&gt; flag in ingress2gateway generates implementation-specific extensions where needed, which saved manual work.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Migration Checklist
&lt;/h2&gt;

&lt;p&gt;Here's the checklist I followed. Steal it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pre-migration:
[ ] Inventory all Ingress resources: kubectl get ingress --all-namespaces
[ ] Document custom annotations per Ingress
[ ] Identify any custom nginx configs (ConfigMap, snippets)
[ ] Install Gateway API CRDs: kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml
[ ] Deploy chosen Gateway API implementation

Conversion:
[ ] Run ingress2gateway print and capture output
[ ] Review ALL warnings from ingress2gateway
[ ] Manually handle untranslatable annotations
[ ] Create GatewayClass and Gateway resources
[ ] Create ReferenceGrant resources for cross-namespace refs

Validation:
[ ] Apply HTTPRoutes to staging cluster
[ ] Test every endpoint (automated: curl + expected status codes)
[ ] Verify TLS termination works
[ ] Check CORS headers in browser dev tools
[ ] Validate regex paths match correctly
[ ] Load test to confirm no performance regression

Cutover:
[ ] Update DNS or switch load balancer target
[ ] Monitor error rates for 30 minutes
[ ] Keep old Ingress resources (don't delete yet)
[ ] After 48 hours stable: remove old Ingress resources
[ ] Uninstall ingress-nginx controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
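&lt;p&gt;For the "test every endpoint" validation step, a small helper like this makes the check scriptable (the function name and example URL are mine, not from any tool):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Verify that an endpoint returns the expected HTTP status code.
check_endpoint() {
  local url="$1" expected="$2" status
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  if [ "$status" != "$expected" ]; then
    echo "FAIL $url: got $status, want $expected"
    return 1
  fi
  echo "OK $url ($status)"
}

# Example: run against the new Gateway before cutover
# check_endpoint "https://api.example.com/api/v1/users" 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Loop it over every hostname and path pair from your HTTPRoutes and fail the pipeline on any mismatch.&lt;/p&gt;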



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After migrating 40+ Ingress resources across 12 namespaces:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Known CVEs&lt;/td&gt;
&lt;td&gt;5 (1 critical)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annotation sprawl&lt;/td&gt;
&lt;td&gt;180+ annotations&lt;/td&gt;
&lt;td&gt;0 (typed fields)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-namespace routing&lt;/td&gt;
&lt;td&gt;Manual workarounds&lt;/td&gt;
&lt;td&gt;Native ReferenceGrant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Downtime during migration&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to complete&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;3 days (including validation)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Don't wait for the archive notice.&lt;/strong&gt; Gateway API has been stable since v1.0 (October 2023). I should have started earlier. The CVE pressure made this more stressful than it needed to be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ingress2gateway is a starting point, not a finish line.&lt;/strong&gt; It handled about 85% of our config automatically. The remaining 15% required understanding both the old nginx annotations and the new Gateway API model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three-layer model pays off immediately.&lt;/strong&gt; Within a week of the migration, our app teams were creating their own HTTPRoutes without filing tickets to the platform team. That alone justified the effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test regex paths carefully.&lt;/strong&gt; The regex syntax between nginx and Gateway API implementations can differ subtly. I caught two path patterns that matched differently under Envoy than they did under nginx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep the old Ingress resources around.&lt;/strong&gt; Don't delete them the moment Gateway API routes are working. Give yourself a rollback window. I kept ours for 48 hours before cleanup.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gateway-api.sigs.k8s.io/" rel="noopener noreferrer"&gt;Gateway API Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes-sigs/ingress2gateway" rel="noopener noreferrer"&gt;ingress2gateway GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2025-1974" rel="noopener noreferrer"&gt;CVE-2025-1974 Advisory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gateway-api.sigs.k8s.io/blog/" rel="noopener noreferrer"&gt;Gateway API v1.4 Release Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gateway.envoyproxy.io/docs/" rel="noopener noreferrer"&gt;Envoy Gateway Quickstart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes/ingress-nginx" rel="noopener noreferrer"&gt;ingress-nginx Archive Notice&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Your Security Scanner Was the Weapon: Inside the Trivy Supply Chain Attack</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 28 Mar 2026 17:40:45 +0000</pubDate>
      <link>https://forem.com/mateenali66/your-security-scanner-was-the-weapon-inside-the-trivy-supply-chain-attack-2gc</link>
      <guid>https://forem.com/mateenali66/your-security-scanner-was-the-weapon-inside-the-trivy-supply-chain-attack-2gc</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Trivy, the most widely used container scanning action in GitHub Actions, was compromised on March 19, 2026. A threat actor poisoned 76 of its 77 version tags. Every pipeline that ran a scan silently handed over SSH keys, cloud credentials, Kubernetes tokens, and more. The scan appeared to succeed. You'd never know.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I've had Trivy in my pipelines for years. Container scanning on every PR, every merge, every deploy. It's one of those things you set up once and stop thinking about, which is exactly what makes this attack so effective.&lt;/p&gt;

&lt;p&gt;On March 19, 2026, a threat actor group called TeamPCP force-pushed malicious commits to 76 of the 77 version tags in the &lt;code&gt;aquasecurity/trivy-action&lt;/code&gt; GitHub repository. All 7 tags in &lt;code&gt;aquasecurity/setup-trivy&lt;/code&gt; were also compromised. If your workflow referenced Trivy by a tag (which is how basically everyone references GitHub Actions), you were running their code.&lt;/p&gt;

&lt;p&gt;The scanner still ran. Your pipeline still went green. You had no idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Actually Happened
&lt;/h2&gt;

&lt;p&gt;This attack didn't start on March 19. It started weeks earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gg7vrokeqnj1aspiraw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gg7vrokeqnj1aspiraw.png" alt=" " width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Late February 2026:&lt;/strong&gt; An automated bot called "hackerbot-claw" exploited a misconfigured GitHub Actions workflow and stole a privileged Personal Access Token from Aqua Security's CI environment. The attacker used this to push malware to the Trivy VS Code extension on Open VSX.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 1:&lt;/strong&gt; Aqua Security disclosed the incident publicly via a GitHub discussion and rotated credentials. Except the rotation was incomplete. One service account, one PAT, one residual access path, still live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 19, 17:43 UTC:&lt;/strong&gt; Using the still-valid credentials, TeamPCP force-pushed malicious commits to 76 of 77 tags in &lt;code&gt;trivy-action&lt;/code&gt; and all 7 tags in &lt;code&gt;setup-trivy&lt;/code&gt;. The compromised commits spoofed legitimate maintainer identities. GitHub itself flagged them with "This commit does not belong to any branch on this repository" but that warning is easy to miss in a workflow log.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 19, 18:22 UTC:&lt;/strong&gt; A rogue commit published a malicious Trivy binary as &lt;code&gt;v0.69.4&lt;/code&gt; across every distribution channel simultaneously: GitHub Releases, GHCR, Docker Hub, ECR Public, deb/rpm repositories, and get.trivy.dev.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 20, 05:40 UTC:&lt;/strong&gt; Aqua remediated the trivy-action tags. The window was roughly 12 hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 22:&lt;/strong&gt; The attacker pushed additional malicious Docker Hub images (&lt;code&gt;v0.69.5&lt;/code&gt;, &lt;code&gt;v0.69.6&lt;/code&gt;, &lt;code&gt;latest&lt;/code&gt;) using separately compromised Docker Hub credentials, bypassing all GitHub controls. Same day, 44 repositories in Aqua's &lt;code&gt;aquasec-com&lt;/code&gt; GitHub org were defaced using a stolen service account token that bridged both orgs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 24:&lt;/strong&gt; The campaign expanded to Checkmarx KICS and LiteLLM PyPI packages (&lt;code&gt;1.82.7&lt;/code&gt;, &lt;code&gt;1.82.8&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The takeaway here is not just that a tool got compromised. It's that incomplete remediation turned a single breach into a three-week campaign.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Payload Did
&lt;/h2&gt;

&lt;p&gt;This is the part that should make you uncomfortable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfedlzk2zryp9rtt623m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfedlzk2zryp9rtt623m.png" alt=" " width="800" height="2042"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The malicious &lt;code&gt;entrypoint.sh&lt;/code&gt; prepended about 105 lines of attack code to the legitimate Trivy scanner logic. The scan completed normally. Your logs looked fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Process enumeration.&lt;/strong&gt; The script scanned &lt;code&gt;/proc/*/environ&lt;/code&gt; across all runner processes, extracting environment-level secrets and filtering for anything with &lt;code&gt;env&lt;/code&gt; or &lt;code&gt;ssh&lt;/code&gt; in the name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Memory scraping.&lt;/strong&gt; Here's where it gets clever. On GitHub-hosted runners, a base64-encoded Python script located the &lt;code&gt;Runner.Worker&lt;/code&gt; process, read its memory maps via &lt;code&gt;/proc/{PID}/maps&lt;/code&gt;, and scraped raw process memory via &lt;code&gt;/proc/{PID}/mem&lt;/code&gt;. It was targeting GitHub Actions secrets specifically, looking for JSON structures matching &lt;code&gt;"name":{"value":"...","isSecret":true}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This matters because it bypassed GitHub Actions' secret masking. Values that would show as &lt;code&gt;***&lt;/code&gt; in your logs were extracted in plaintext from process memory. There's no log artifact. Nothing looked wrong.&lt;/p&gt;

&lt;p&gt;On self-hosted runners, a different payload did broad filesystem harvesting across 50+ target paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSH private keys and configs&lt;/li&gt;
&lt;li&gt;Cloud credentials (AWS, GCP, Azure)&lt;/li&gt;
&lt;li&gt;Kubernetes configs and service account tokens&lt;/li&gt;
&lt;li&gt;Docker registry credentials&lt;/li&gt;
&lt;li&gt;Database credentials (MySQL, PostgreSQL, MongoDB, Redis)&lt;/li&gt;
&lt;li&gt;Terraform state files and variables&lt;/li&gt;
&lt;li&gt;Ansible vaults and Helm values&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.env&lt;/code&gt; files and API keys&lt;/li&gt;
&lt;li&gt;Git credentials and shell history&lt;/li&gt;
&lt;li&gt;Cryptocurrency wallet keys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Encryption.&lt;/strong&gt; Collected data was encrypted with AES-256-CBC using a randomly generated session key, then the session key was wrapped with a hardcoded 4096-bit RSA public key. Only the attacker's server could decrypt it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 4: Exfiltration.&lt;/strong&gt; Primary channel was an HTTPS POST to &lt;code&gt;scan.aquasecurtiy[.]org&lt;/code&gt;, a typosquatted domain designed to look like Aqua Security's legitimate domain (note the transposed 'ti' in 'security'). Resolves to &lt;code&gt;45.148.10.212&lt;/code&gt; in Amsterdam.&lt;/p&gt;

&lt;p&gt;Fallback: if the C2 was unreachable and a GitHub PAT was in the environment, the malware created a public repository named &lt;code&gt;tpcp-docs&lt;/code&gt; and uploaded stolen data as release assets, turning GitHub itself into the exfiltration channel.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are You Affected?
&lt;/h2&gt;

&lt;p&gt;Check these specific exposure windows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Affected Versions&lt;/th&gt;
&lt;th&gt;Exposure Window&lt;/th&gt;
&lt;th&gt;Safe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;trivy binary&lt;/td&gt;
&lt;td&gt;v0.69.4&lt;/td&gt;
&lt;td&gt;~3h (Mar 19)&lt;/td&gt;
&lt;td&gt;v0.69.3 or earlier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;trivy Docker Hub&lt;/td&gt;
&lt;td&gt;v0.69.5, v0.69.6, latest&lt;/td&gt;
&lt;td&gt;~10h (Mar 22–24)&lt;/td&gt;
&lt;td&gt;v0.69.3 or earlier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;trivy-action&lt;/td&gt;
&lt;td&gt;Tags 0.0.1–0.34.2&lt;/td&gt;
&lt;td&gt;~12h (Mar 19–20)&lt;/td&gt;
&lt;td&gt;v0.35.0+ or SHA-pinned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;setup-trivy&lt;/td&gt;
&lt;td&gt;All 7 tags&lt;/td&gt;
&lt;td&gt;~12h (Mar 19–20)&lt;/td&gt;
&lt;td&gt;SHA-pinned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM PyPI&lt;/td&gt;
&lt;td&gt;1.82.7, 1.82.8&lt;/td&gt;
&lt;td&gt;Mar 24+&lt;/td&gt;
&lt;td&gt;1.82.6 or earlier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you ran Trivy in any pipeline during those windows and weren't pinning to a commit SHA, you have to assume secrets were stolen. All of them. Every secret accessible from that runner environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Need to Change
&lt;/h2&gt;

&lt;p&gt;This is the remediation checklist, ordered by priority.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Rotate first, investigate second
&lt;/h3&gt;

&lt;p&gt;If you were in the exposure window, rotate everything the runner could have touched. Don't wait for confirmation. Treat every secret as compromised:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS access keys and IAM roles&lt;/li&gt;
&lt;li&gt;GCP service account keys&lt;/li&gt;
&lt;li&gt;Azure service principals&lt;/li&gt;
&lt;li&gt;Kubernetes service account tokens&lt;/li&gt;
&lt;li&gt;Docker registry credentials&lt;/li&gt;
&lt;li&gt;SSH keys&lt;/li&gt;
&lt;li&gt;Database credentials&lt;/li&gt;
&lt;li&gt;GitHub PATs and tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Pin actions to commit SHAs
&lt;/h3&gt;

&lt;p&gt;This is the single most effective structural change. Tags are mutable. Commit SHAs are not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad — this is what everyone does, and what got compromised&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@0.24.0&lt;/span&gt;

&lt;span class="c1"&gt;# Good — SHA-pinned, immutable&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@57a97c7843d7da7a7b4f8ce2a0c4e3b7f0c2e1d&lt;/span&gt;  &lt;span class="c1"&gt;# 0.35.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, it's more work to update. That's the point. Renovate or Dependabot can automate SHA updates if you configure them for GitHub Actions.&lt;/p&gt;
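&lt;p&gt;With Dependabot, that automation is a single file, a minimal &lt;code&gt;.github/dependabot.yml&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Dependabot will open PRs that bump the pinned SHA and keep the trailing version comment in sync.&lt;/p&gt;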

&lt;h3&gt;
  
  
  3. Switch to OIDC for cloud authentication
&lt;/h3&gt;

&lt;p&gt;Long-lived cloud credentials in CI are a liability. OIDC lets your runner authenticate to AWS, GCP, or Azure without storing static keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AWS example&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::ACCOUNT:role/github-actions-role&lt;/span&gt;
    &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing to steal if there's nothing stored. The credentials are ephemeral and scoped to the job.&lt;/p&gt;
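&lt;p&gt;One prerequisite: the job has to be allowed to request an OIDC token, which means adding &lt;code&gt;id-token: write&lt;/code&gt; to its permissions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;permissions:
  id-token: write   # lets the job request an OIDC token
  contents: read
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;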

&lt;h3&gt;
  
  
  4. Restrict runner permissions
&lt;/h3&gt;

&lt;p&gt;GitHub Actions runners get &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; by default. Scope it down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
  &lt;span class="na"&gt;security-events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
  &lt;span class="c1"&gt;# Nothing else&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most workflows need far less than the default. Less permission means smaller blast radius.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Audit non-human identities
&lt;/h3&gt;

&lt;p&gt;The Trivy attack persisted because one service account credential wasn't rotated. Audit all machine identities in your org:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub PATs: Who issued them? When do they expire? Are they scoped minimally?&lt;/li&gt;
&lt;li&gt;Service accounts: Which ones have write access to release infrastructure?&lt;/li&gt;
&lt;li&gt;Bot accounts: Are any shared across orgs or repositories?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Long-lived, over-privileged service accounts are how a one-time breach becomes a three-week campaign.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Use secret scanning
&lt;/h3&gt;

&lt;p&gt;GitGuardian, GitHub's native secret scanning, or both. The Trivy attacker used GitHub as a fallback exfiltration channel. If your credentials ever end up in a public repo, you want to know in minutes, not days.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Verify binaries before running them
&lt;/h3&gt;

&lt;p&gt;For direct binary downloads (not GitHub Actions), verify checksums:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download the official checksums&lt;/span&gt;
curl &lt;span class="nt"&gt;-sSL&lt;/span&gt; https://github.com/aquasecurity/trivy/releases/download/v0.69.3/trivy_0.69.3_checksums.txt &lt;span class="nt"&gt;-o&lt;/span&gt; checksums.txt

&lt;span class="c"&gt;# Verify your binary&lt;/span&gt;
&lt;span class="nb"&gt;sha256sum&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; checksums.txt &lt;span class="nt"&gt;--ignore-missing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your pipeline downloads and runs binaries from the internet, add checksum verification as a step.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;The Trivy attack was technically sophisticated, but the root cause is unglamorous: incomplete credential rotation.&lt;/p&gt;

&lt;p&gt;Aqua disclosed the initial breach on March 1 and rotated credentials. One PAT, one service account, one residual access path stayed active. That's what TeamPCP used on March 19. The March 22 Docker Hub compromise used yet another credential that fell outside the scope of the original remediation.&lt;/p&gt;

&lt;p&gt;When you rotate secrets after a breach, you need to be exhaustive. Enumerate every credential that could have been exposed, every service account that had access, every integration that used a compromised token. Rotation is not a task you do until it feels complete. It's a task you do until you've verified every access path is severed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz4n6nsa5ydv92kuv24i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz4n6nsa5ydv92kuv24i.png" alt=" " width="671" height="2678"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other lesson: the attack surface for CI/CD is enormous. Your pipeline runs with access to secrets, cloud credentials, and internal infrastructure. When you add a third-party action, you're trusting that maintainer's entire security posture, including their CI, their service accounts, and their credential management practices. SHA pinning doesn't eliminate that trust, but it gives you a stable, auditable point you can reason about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Immediate Checklist
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ ] Check pipeline logs for trivy-action usage between March 19–20
[ ] Check pipeline logs for trivy binary v0.69.4 usage on March 19
[ ] Check for Docker image usage of v0.69.5, v0.69.6, or latest between Mar 22–24
[ ] Rotate all secrets accessible from affected runners
[ ] Update trivy-action to v0.35.0 or pin to SHA
[ ] Check for LiteLLM usage of 1.82.7 or 1.82.8
[ ] Switch cloud auth to OIDC
[ ] Pin all third-party actions to commit SHAs
[ ] Restrict workflow permissions to minimum required
[ ] Audit service accounts and PATs for expiry and scope
[ ] Enable secret scanning on your org
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.crowdstrike.com/en-us/blog/from-scanner-to-stealer-inside-the-trivy-action-supply-chain-compromise/" rel="noopener noreferrer"&gt;CrowdStrike: From Scanner to Stealer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.gitguardian.com/trivys-march-supply-chain-attack-shows-where-secret-exposure-hurts-most/" rel="noopener noreferrer"&gt;GitGuardian: Trivy's March Supply Chain Attack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.legitsecurity.com/blog/the-trivy-supply-chain-compromise-what-happened-and-playbooks-to-respond" rel="noopener noreferrer"&gt;Legit Security: Playbooks to Respond&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/security/blog/2026/03/24/detecting-investigating-defending-against-trivy-supply-chain-compromise/" rel="noopener noreferrer"&gt;Microsoft Security Blog: Detecting and Defending&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arcticwolf.com/resources/blog/teampcp-supply-chain-attack-campaign-targets-trivy-checkmarx-kics-and-litellm-potential-downstream-impact-to-additional-projects/" rel="noopener noreferrer"&gt;Arctic Wolf: TeamPCP Campaign Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/aquasecurity/trivy/discussions/10425" rel="noopener noreferrer"&gt;Aqua Security: Official Disclosure&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>security</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>cicd</category>
    </item>
    <item>
      <title>GitHub Actions costs are leaking, and most teams don't notice until it's too late</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Mon, 16 Mar 2026 05:24:12 +0000</pubDate>
      <link>https://forem.com/mateenali66/github-actions-costs-are-leaking-and-most-teams-dont-notice-until-its-too-late-27d1</link>
      <guid>https://forem.com/mateenali66/github-actions-costs-are-leaking-and-most-teams-dont-notice-until-its-too-late-27d1</guid>
      <description>&lt;p&gt;Two years ago I was working on a connected vehicles platform running 40+ microservices on Kubernetes. CI was healthy, tests were passing, and nobody was paying attention to the GitHub Actions bill until it hit $4,200 in a single month.&lt;/p&gt;

&lt;p&gt;The culprit was a matrix build that someone had extended to cover six Node versions. Nobody noticed because the cost didn't show up anywhere obvious. It wasn't flagged in any alert. The engineers who added the matrix jobs weren't thinking about cost. By the time finance asked the question, the pattern had been running for three months.&lt;/p&gt;

&lt;p&gt;I started looking for a tool that could give us per-workflow cost visibility. Something that would let us answer "which workflows cost the most" and "did this PR make CI more expensive." I didn't find anything that fit, so I built CICosts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;CICosts installs as a GitHub App and receives a webhook event every time a workflow run completes. It multiplies the runner minutes by GitHub's published pricing for that runner type (Linux, Windows, macOS, self-hosted) and stores the result.&lt;/p&gt;

&lt;p&gt;From there you get a dashboard showing cost by workflow, by repository, by branch, and over time. You can set alerts when a workflow exceeds a threshold. You can see trends, spot regressions after PRs merge, and compare costs across environments.&lt;/p&gt;

&lt;p&gt;The math is straightforward. GitHub charges $0.008/minute for Linux runners, $0.016 for Windows, $0.08 for macOS. If a workflow runs for 12 minutes on Linux, that's $0.096. Not much in isolation. Run it 500 times a day across 30 repositories and it adds up fast.&lt;/p&gt;
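&lt;p&gt;The arithmetic is simple enough to sanity-check yourself. A minimal sketch, with rates hardcoded from GitHub's published per-minute pricing:&lt;/p&gt;

```python
# Per-minute rates for GitHub-hosted runners (USD)
RATES = {"linux": 0.008, "windows": 0.016, "macos": 0.08}

def workflow_cost(minutes: float, runner: str = "linux",
                  runs_per_day: int = 1, days: int = 30) -> float:
    """Rough monthly cost of one workflow at a given run frequency."""
    return round(RATES[runner] * minutes * runs_per_day * days, 2)

print(workflow_cost(12, "linux"))        # one run a day for a month
print(workflow_cost(12, "linux", 500))   # 500 runs a day: the bill people notice
```

&lt;p&gt;One caveat: GitHub rounds each job up to the nearest minute, so real bills skew slightly higher than this back-of-envelope number.&lt;/p&gt;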

&lt;h2&gt;
  
  
  The common patterns I see
&lt;/h2&gt;

&lt;p&gt;After watching enough CI pipelines, a few patterns account for most of the waste:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matrix explosions.&lt;/strong&gt; A workflow that tests across 3 OS versions and 4 runtime versions runs 12 times per push. If the matrix was added incrementally over time, nobody may have thought through the cumulative cost.&lt;/p&gt;
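&lt;p&gt;The fan-out is easy to underestimate. A hypothetical matrix like this runs 12 jobs on every push, and &lt;code&gt;exclude&lt;/code&gt; is the cheapest lever for trimming combinations you don't actually ship:&lt;/p&gt;

```yaml
strategy:
  matrix:
    os: [ubuntu-latest, windows-latest, macos-latest]
    node: [18, 20, 22, 24]     # 3 x 4 = 12 jobs per push
    exclude:
      # Drop expensive macOS jobs for Node versions nobody ships on macOS
      - os: macos-latest
        node: 18
      - os: macos-latest
        node: 20
```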

&lt;p&gt;&lt;strong&gt;macOS runners for non-macOS work.&lt;/strong&gt; macOS runners cost 10x more than Linux. They're necessary for iOS builds and sometimes for Homebrew. They're not necessary for most backend services, but they show up there sometimes because someone copied a workflow template.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test parallelism without caching.&lt;/strong&gt; Running tests in parallel is good. Running them in parallel while re-downloading 200MB of dependencies on every run because the cache key is wrong is expensive.&lt;/p&gt;
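&lt;p&gt;The usual culprit is a cache key that never hits. For an npm project, something like this (paths and key format are illustrative) keeps the 200MB download off every parallel job:&lt;/p&gt;

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.npm
    # Key changes only when the lockfile changes, so parallel jobs share one cache
    key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      npm-${{ runner.os }}-
```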

&lt;p&gt;&lt;strong&gt;Nightly builds that nobody needs.&lt;/strong&gt; Workflows scheduled to run nightly that were set up to catch a specific class of bug that was fixed 18 months ago. The schedule never got cleaned up.&lt;/p&gt;

&lt;p&gt;None of these are difficult to fix once you can see them. The problem is visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's now open source and free
&lt;/h2&gt;

&lt;p&gt;I built this as a paid SaaS originally. The paywall was too much friction for a product without an established reputation. If you're asking engineers to add a GitHub App to their organization and trust it with their CI data, "trust us, it's $29/month" is a hard sell when nobody's heard of you.&lt;/p&gt;

&lt;p&gt;The honest version: the product was good and nobody knew about it. That's a distribution problem, not a product problem.&lt;/p&gt;

&lt;p&gt;So the model is now simple. CICosts is MIT licensed, the code is on GitHub, and the hosted version at app.cicosts.dev is free with no usage limits. If your organization needs an SLA or wants a private deployment, that's the enterprise tier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;Install it from GitHub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://github.com/phonotechnologies/cicosts-app
https://github.com/phonotechnologies/cicosts-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the hosted version directly at &lt;a href="https://app.cicosts.dev" rel="noopener noreferrer"&gt;app.cicosts.dev&lt;/a&gt;. Add the GitHub App to your organization, and cost data starts flowing within a few minutes of your next workflow run.&lt;/p&gt;

&lt;p&gt;The setup takes about five minutes. There's no code change required in your repos. The GitHub App receives webhook events automatically once installed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;If I were starting from zero, I'd make it open source from day one and focus entirely on getting the GitHub App installation experience right. The hardest part of a tool like this isn't the cost calculation. It's getting someone to trust it enough to install it.&lt;/p&gt;

&lt;p&gt;Open source makes that easier. You can read the code. You can see exactly what data is being stored and what isn't. That matters when you're asking someone to add an app to their GitHub organization.&lt;/p&gt;




&lt;p&gt;The code is on GitHub under the phonotechnologies organization. PRs welcome, especially around runner pricing updates and new alert types. If you run into something, open an issue.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>github</category>
      <category>cicd</category>
      <category>opensource</category>
    </item>
    <item>
      <title>GitOps for ML in 2026: Treat Your AI Models Like Microservices (Or Watch Them Drift Into Production Chaos)</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 14 Mar 2026 21:46:50 +0000</pubDate>
      <link>https://forem.com/mateenali66/gitops-for-ml-in-2026-treat-your-ai-models-like-microservices-or-watch-them-drift-into-production-40m2</link>
      <guid>https://forem.com/mateenali66/gitops-for-ml-in-2026-treat-your-ai-models-like-microservices-or-watch-them-drift-into-production-40m2</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Apply the same GitOps discipline you use for application code to ML model deployments, and you get version history, rollback, and promotion gates that actually work, instead of the SSH-and-pray workflow most teams are still running.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;There's a model running in production right now that nobody on your team can explain. It was trained six weeks ago, deployed by someone who's since moved to a different team, and the only record of what version it is lives in a Slack message that's been buried under 4,000 other messages.&lt;/p&gt;

&lt;p&gt;When it starts making bad predictions, what's your rollback plan? If your answer involves SSHing into a server, editing a config file by hand, and hoping the right weights get loaded, you're in the majority. That doesn't make it less of a disaster.&lt;/p&gt;

&lt;p&gt;I spent the better part of last year helping platform teams get their ML deployment story straight. The pattern I kept seeing: teams had decent model training pipelines, reasonable experiment tracking in MLflow, and then a complete gap between "model registered" and "model serving traffic." The gap got filled with shell scripts, manual steps, and a whole lot of tribal knowledge.&lt;/p&gt;

&lt;p&gt;The fix isn't a new tool. It's applying discipline you already have from application deployments to the model deployment layer.&lt;/p&gt;

&lt;p&gt;Before we moved to GitOps for model deployments, a typical promotion cycle looked like this. A data scientist trains a new version, registers it in MLflow, then files a ticket. A platform engineer picks up the ticket, SSHes into the model server, updates the model path, restarts the serving process, and manually validates that predictions look reasonable. Start to finish: 4 to 6 hours on a good day, longer when the engineer is in meetings or the server is being weird.&lt;/p&gt;

&lt;p&gt;Rollback? There was no rollback. The best-case scenario was that someone remembered what the previous model path was.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Most Teams Try First (And Why It Fails)
&lt;/h2&gt;

&lt;p&gt;The first instinct is usually scripts. Someone writes a deploy.sh that takes a model version as an argument, connects to the serving infrastructure, and handles the update. This is better than pure manual steps, but it fails in a few predictable ways.&lt;/p&gt;

&lt;p&gt;First, scripts don't have memory. You can run deploy.sh with model version 47, then run it again with version 51, and there's no audit trail of who ran what or why. When something goes wrong, you're back to grepping through logs and asking around.&lt;/p&gt;

&lt;p&gt;Second, scripts don't handle promotion gates. You can't encode "this model can only go to production if it passed staging validation for 24 hours" in a shell script without it becoming a sprawling mess that nobody wants to maintain.&lt;/p&gt;

&lt;p&gt;Third, and this one bites hardest: scripts assume the current state. If someone manually changes something on the serving infrastructure, your script has no way of detecting that drift. The next run might succeed or fail unpredictably depending on what changed and when.&lt;/p&gt;

&lt;p&gt;MLflow solves the experiment tracking and model registry side well. You get version numbers, artifact storage in S3, stage transitions (Staging, Production), and a clean API. What MLflow doesn't give you is a Kubernetes-native way to declare "this cluster should be running model version 47 right now" and enforce that continuously.&lt;/p&gt;

&lt;p&gt;That's where KServe and ArgoCD come in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The full stack has five layers working together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsla8yxyreadpi4g0h7zr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsla8yxyreadpi4g0h7zr.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MLflow + S3&lt;/strong&gt; handle model artifacts. Every trained model version gets registered with MLflow, which stores the artifact URI pointing to a path in S3. The URI looks something like &lt;code&gt;s3://ml-models-prod/fraud-detector/v47/model.pkl&lt;/code&gt;. MLflow's registry gives you a version number and stage metadata. The actual weights live in S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KServe InferenceService&lt;/strong&gt; is the Kubernetes abstraction for serving. Instead of managing a Pod or Deployment by hand, you define an InferenceService custom resource that describes what model to load, from where, and how to scale. KServe handles the rest: downloading the artifact from S3, loading it into the serving framework (Triton, TorchServe, SKLearn Server), and exposing an HTTP endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Git&lt;/strong&gt; holds the desired state. A &lt;code&gt;values.yaml&lt;/code&gt; file in your repository specifies which model version each environment should run. Promoting from staging to production is a PR that bumps a version number. The PR is the change review, the approval gate, and the audit trail all at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArgoCD&lt;/strong&gt; reconciles the cluster to match what's in Git. When the PR merges, ArgoCD detects the change and applies the updated KServe InferenceService. If someone manually changes the InferenceService on the cluster, ArgoCD detects the drift and reverts it.&lt;/p&gt;
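&lt;p&gt;The drift-reverting behavior comes from the Application's sync policy. A sketch of what that Application might look like (repo URL and chart path are placeholders):&lt;/p&gt;

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fraud-detector-prod
  namespace: argocd
spec:
  project: ml-serving
  source:
    repoURL: https://github.com/example-org/ml-deployments   # placeholder repo
    targetRevision: main
    path: charts/fraud-detector
    helm:
      valueFiles:
        - values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving-prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # selfHeal is what reverts manual changes on the cluster
```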

&lt;p&gt;&lt;strong&gt;Istio&lt;/strong&gt; manages traffic splitting. During canary promotion, a VirtualService routes 10% of traffic to the new model version while 90% continues to the stable version. If metrics look good after a soak period, you update the weights and do a full cutover.&lt;/p&gt;
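&lt;p&gt;The 90/10 split is just a pair of weighted routes. A sketch (the destination host names depend on how KServe names the generated services in your cluster):&lt;/p&gt;

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-detector-split
  namespace: ml-serving-prod
spec:
  hosts:
    - fraud-detector.ml-serving-prod.svc.cluster.local
  http:
    - route:
        - destination:
            host: fraud-detector-predictor          # stable model version
          weight: 90
        - destination:
            host: fraud-detector-canary-predictor   # canary model version
          weight: 10
```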

&lt;p&gt;&lt;strong&gt;Prometheus&lt;/strong&gt; collects serving metrics. Latency (p99 in particular), throughput, and prediction distribution histograms give you the signals needed to decide whether a canary is healthy or needs to be rolled back.&lt;/p&gt;
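&lt;p&gt;For the p99 signal, a recording rule keeps the canary comparison cheap to query. A sketch (the histogram metric name is illustrative; use whatever your serving runtime actually exports):&lt;/p&gt;

```yaml
groups:
  - name: model-serving
    rules:
      - record: model:request_latency_seconds:p99
        # p99 per service, computed from the serving runtime's latency histogram
        expr: histogram_quantile(0.99, sum(rate(request_latency_seconds_bucket[5m])) by (le, service))
```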

&lt;h2&gt;
  
  
  The Workflow
&lt;/h2&gt;

&lt;p&gt;Here's how a model promotion actually works end to end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwb59lufwlvzqmev2ya1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwb59lufwlvzqmev2ya1.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A data scientist trains a new model, evaluates it against the validation set, and if it passes threshold, registers it in MLflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sklearn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metrics&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;f1_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.97&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;run_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;active_run&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tracking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MlflowClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;runs:/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;mv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_model_version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fraud-detector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# mv.version == "47"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That registration triggers a CI pipeline (GitHub Actions or Tekton, depending on your setup) that opens a pull request bumping the version in the dev environment's values file.&lt;/p&gt;
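&lt;p&gt;The PR-opening step doesn't need anything exotic. One common option is the &lt;code&gt;create-pull-request&lt;/code&gt; action; the bump script and file layout here are hypothetical:&lt;/p&gt;

```yaml
- name: Bump model version in dev
  run: |
    # Hypothetical helper that edits environments.dev.model.version in place
    ./scripts/bump-model-version.sh dev fraud-detector "$NEW_VERSION"

- uses: peter-evans/create-pull-request@v6
  with:
    branch: bump/fraud-detector-${{ env.NEW_VERSION }}
    title: "Promote fraud-detector v${{ env.NEW_VERSION }} to dev"
    commit-message: "chore: bump fraud-detector to v${{ env.NEW_VERSION }}"
```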

&lt;p&gt;&lt;strong&gt;values.yaml structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;environments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;47"&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v47"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
      &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

  &lt;span class="na"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;45"&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v45"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
      &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;

  &lt;span class="na"&gt;prod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;43"&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v43"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
      &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;KServe InferenceService (stable):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
    &lt;span class="na"&gt;sklearn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v43"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
    &lt;span class="na"&gt;scaleTarget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="na"&gt;scaleMetric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;concurrency&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;KServe InferenceService (canary variant):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
    &lt;span class="na"&gt;sklearn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v47"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
    &lt;span class="na"&gt;canaryTrafficPercent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ArgoCD ApplicationSet for multi-environment management:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ApplicationSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-serving&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;generators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev&lt;/span&gt;
            &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev-cluster&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-dev&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
            &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging-cluster&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-staging&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
            &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-cluster&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fraud-detector-{{env}}"&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/org/ml-gitops&lt;/span&gt;
        &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environments/{{env}}"&lt;/span&gt;
        &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;valueFiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;values.yaml&lt;/span&gt;
      &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{cluster}}"&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{namespace}}"&lt;/span&gt;
      &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;RespectIgnoreDifferences=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Istio VirtualService for canary traffic split:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VirtualService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-vs&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;fraud-detector.ml-serving-prod.svc.cluster.local&lt;/span&gt;
  &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;x-canary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;exact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
      &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-predictor-canary&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-predictor-default&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-predictor-canary&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the PR merges to dev, ArgoCD picks up the change within 3 minutes (its default repository polling interval) and applies the updated InferenceService. The model downloads from S3, the serving pod comes up, and the endpoint starts responding. At this point you can run your automated evaluation suite against the dev endpoint.&lt;/p&gt;

&lt;p&gt;Promoting to staging is another PR. A human reviews it, checks the dev evaluation results, and approves. Merge, ArgoCD syncs, done. Production promotion follows the same pattern but includes an additional step: the canary InferenceService gets deployed first with 10% traffic, and a GitHub Actions workflow monitors Prometheus metrics for a configured soak period (we use 2 hours for most models) before opening the full-cutover PR automatically.&lt;/p&gt;
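&lt;p&gt;The soak gate can be sketched as a workflow that waits out the window, queries Prometheus, and only then opens the cutover PR. This is an illustrative sketch, not our exact workflow: the &lt;code&gt;PROM_URL&lt;/code&gt; variable, metric labels, and threshold are hypothetical placeholders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical GitHub Actions sketch of the canary soak gate.
name: canary-soak-gate
on:
  workflow_dispatch: {}
jobs:
  soak:
    runs-on: ubuntu-latest
    steps:
      - name: Wait out the soak period
        run: sleep 7200   # 2 hours for most models
      - name: Check canary error rate in Prometheus
        run: |
          # PROM_URL and the metric/label names are illustrative.
          RATE=$(curl -s "$PROM_URL/api/v1/query" \
            --data-urlencode 'query=sum(rate(fraud_detector_request_total{status_code=~"5..",revision="canary"}[2h]))' \
            | jq -r '.data.result[0].value[1] // "0"')
          # Fail the job (and block the cutover PR) if the rate breaches 1%.
          awk -v r="$RATE" 'BEGIN { exit (r &amp;gt; 0.01) }'
      - name: Open the full-cutover PR
        run: gh pr create --title "Promote fraud-detector canary to 100%" --body "Soak passed"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;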

&lt;h2&gt;
  
  
  Drift Detection
&lt;/h2&gt;

&lt;p&gt;Prediction drift is the sneaky failure mode. The model is technically serving, latency looks fine, but the distribution of predictions has shifted because the input data changed. You won't catch this with a liveness probe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya0whws7chpxtyg3q0me.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya0whws7chpxtyg3q0me.png" alt=" " width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;KServe's sklearn server exposes prediction histograms as Prometheus metrics out of the box. You define alerting rules that fire when the distribution deviates beyond a threshold from the baseline captured at deployment time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prometheus PrometheusRule for drift alerting:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring.coreos.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PrometheusRule&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-drift&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-prometheus&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alert-rules&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector.drift&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
      &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PredictionDriftDetected&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;abs(&lt;/span&gt;
              &lt;span class="s"&gt;avg_over_time(fraud_detector_prediction_mean[10m])&lt;/span&gt;
              &lt;span class="s"&gt;- avg_over_time(fraud_detector_prediction_mean[60m] offset 1d)&lt;/span&gt;
            &lt;span class="s"&gt;) &amp;gt; 0.15&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
            &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prediction&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;distribution&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;shift&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;detected&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fraud-detector"&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mean&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;shifted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;humanizePercentage&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;yesterday's&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;baseline.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Check&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;changes."&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ModelLatencyHigh&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;histogram_quantile(0.99,&lt;/span&gt;
              &lt;span class="s"&gt;sum(rate(fraud_detector_request_duration_seconds_bucket[5m])) by (le)&lt;/span&gt;
            &lt;span class="s"&gt;) &amp;gt; 0.5&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
            &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;500ms&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fraud-detector"&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}s.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SLA&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;500ms."&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ModelErrorRateHigh&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;rate(fraud_detector_request_total{status_code=~"5.."}[5m])&lt;/span&gt;
            &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="s"&gt;rate(fraud_detector_request_total[5m]) &amp;gt; 0.01&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
            &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fraud-detector"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this alert fires, Alertmanager routes it to PagerDuty (or whatever notification channel you've configured). The on-call engineer's first action is to check whether a canary is active. If it is, rolling back is a single &lt;code&gt;git revert&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git revert HEAD~1
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ArgoCD detects the revert within 3 minutes and redeploys the previous InferenceService version. In practice, our rollbacks averaged 4 minutes from decision to stable serving.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to deploy new model version&lt;/td&gt;
&lt;td&gt;4 to 6 hours&lt;/td&gt;
&lt;td&gt;8 minutes to production canary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback capability&lt;/td&gt;
&lt;td&gt;None (manual rebuild)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;git revert&lt;/code&gt;, avg 4 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Drift detection time&lt;/td&gt;
&lt;td&gt;6 hours (user reports)&lt;/td&gt;
&lt;td&gt;15 minutes (automated alert)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment audit trail&lt;/td&gt;
&lt;td&gt;Slack messages&lt;/td&gt;
&lt;td&gt;Full Git history with PR reviews&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment parity&lt;/td&gt;
&lt;td&gt;Best effort&lt;/td&gt;
&lt;td&gt;Enforced via ApplicationSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config drift prevention&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;ArgoCD selfHeal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The number that surprised me most was the drift detection improvement. We caught a data schema change within 15 minutes on the new system. The same type of change previously went undetected for 6 hours before a user complaint surfaced it. That's not a monitoring win, it's a business outcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the values.yaml contract.&lt;/strong&gt; The shape of that file is the most important design decision you'll make. Get the team to agree on it before writing any ArgoCD config. Everything else follows from it.&lt;/p&gt;
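&lt;p&gt;To make that concrete, here is the kind of shape such a contract might take. The field names below are illustrative, not a standard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# environments/prod/values.yaml (hypothetical contract)
model:
  name: fraud-detector
  storageUri: s3://ml-models-prod/fraud-detector/v47   # always a pinned version
serving:
  minReplicas: 5
  maxReplicas: 20
  scaleMetric: concurrency
  scaleTarget: 80
canary:
  enabled: true
  trafficPercent: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;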

&lt;p&gt;&lt;strong&gt;S3 artifact URIs in the InferenceService spec, not model names.&lt;/strong&gt; MLflow stage names ("Production", "Staging") are mutable. If you reference a stage name in your InferenceService spec, two different model versions could map to the same stage name over time, and your Git history loses meaning. Reference the explicit S3 URI with the version number baked in.&lt;/p&gt;
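&lt;p&gt;The difference is easiest to see side by side. The &lt;code&gt;models:/&lt;/code&gt; form below is MLflow's stage-alias URI (assuming your serving runtime resolves MLflow URIs at all); the point is that the alias is mutable while the S3 URI is not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Avoid: a mutable stage alias. What "Production" resolves to changes over
# time, so the same Git commit can mean two different models.
storageUri: "models:/fraud-detector/Production"

# Prefer: an immutable, version-pinned artifact URI.
storageUri: "s3://ml-models-prod/fraud-detector/v47"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;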

&lt;p&gt;&lt;strong&gt;selfHeal is non-negotiable.&lt;/strong&gt; Turn it on in your ArgoCD sync policy. Without selfHeal, a manual kubectl edit on the InferenceService will drift silently and nobody will notice until it matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Canary soak time depends on your traffic volume.&lt;/strong&gt; For a high-volume fraud model processing 50k requests per minute, 30 minutes of canary is enough to get a statistically significant signal. For a low-volume model processing 100 requests per day, even two full days of canary at 10% traffic puts only about 20 requests through the new version. Adjust accordingly, or route specific customers to the canary instead of using a random percentage split.&lt;/p&gt;
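&lt;p&gt;The back-of-the-envelope arithmetic is worth scripting when you pick soak windows. A tiny sketch, with all traffic numbers hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/sh
# Requests that actually reach the canary = hourly volume x soak hours x canary share.
canary_samples() {
  reqs_per_hour=$1; soak_hours=$2; canary_pct=$3
  echo $(( reqs_per_hour * soak_hours * canary_pct / 100 ))
}

canary_samples 3000000 1 10   # 50k req/min model, 1h soak   -&amp;gt; 300000
canary_samples 4 48 10        # ~100 req/day model, 48h soak -&amp;gt; 19
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;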

&lt;p&gt;&lt;strong&gt;Model cold start affects canary rollouts.&lt;/strong&gt; Large models take time to download from S3 and load into memory. A 2GB model on a cold node might take 3 to 4 minutes before it's ready to serve. Account for this in your readiness probe timeouts and don't let your monitoring system flag the canary as failing during the startup window.&lt;/p&gt;
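&lt;p&gt;One way to buy that startup window is to loosen the predictor's readiness probe. KServe's predictor spec embeds the standard container fields, so an override along these lines should work, though which fields you can set varies by KServe version; treat this as a sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;spec:
  predictor:
    sklearn:
      storageUri: "s3://ml-models-prod/fraud-detector/v47"
      readinessProbe:
        httpGet:
          path: /v1/models/fraud-detector
          port: 8080
        initialDelaySeconds: 60    # model download + load can take minutes
        periodSeconds: 15
        failureThreshold: 20       # ~5 more minutes of grace before marking failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;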

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The repository structure I've described looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ml-gitops/
├── environments/
│   ├── dev/
│   │   ├── values.yaml
│   │   └── templates/
│   │       ├── inference-service.yaml
│   │       └── virtual-service.yaml
│   ├── staging/
│   │   ├── values.yaml
│   │   └── templates/
│   └── prod/
│       ├── values.yaml
│       └── templates/
├── base/
│   ├── inference-service-template.yaml
│   └── prometheus-rules.yaml
└── applicationset.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prerequisites before you start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes cluster (1.28 or newer)&lt;/li&gt;
&lt;li&gt;KServe 0.12 or newer installed&lt;/li&gt;
&lt;li&gt;ArgoCD 2.9 or newer installed&lt;/li&gt;
&lt;li&gt;Istio 1.20 or newer installed&lt;/li&gt;
&lt;li&gt;MLflow tracking server accessible from the cluster&lt;/li&gt;
&lt;li&gt;S3 bucket with appropriate IRSA or Workload Identity configured for KServe pods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ArgoCD ApplicationSet in this post assumes a Helm-based templating approach where each environment folder contains a values.yaml and a templates directory with the InferenceService and VirtualService manifests. You could also use Kustomize overlays. The concepts are identical.&lt;/p&gt;
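&lt;p&gt;If you go the Kustomize route, each environment folder holds an overlay instead of a values file. A minimal sketch, with paths illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# environments/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ml-serving-prod
resources:
  - ../../base
patches:
  - path: inference-service-patch.yaml   # bumps storageUri and replica counts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;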

&lt;p&gt;Start with dev only. Get one model version deploying cleanly through ArgoCD before adding staging and prod. Add the canary workflow only after the basic promotion gate is working reliably.&lt;/p&gt;

&lt;p&gt;The jump from "it works in dev" to "it's reliable in prod" is mostly about the Prometheus alerting and the canary soak automation. Those two pieces are what make the system trustworthy enough for the team to stop second-guessing every deployment.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kserve.github.io/website/" rel="noopener noreferrer"&gt;KServe Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://argo-cd.readthedocs.io/en/stable/user-guide/application-set/" rel="noopener noreferrer"&gt;ArgoCD ApplicationSets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlflow.org/docs/latest/model-registry.html" rel="noopener noreferrer"&gt;MLflow Model Registry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://istio.io/latest/docs/concepts/traffic-management/" rel="noopener noreferrer"&gt;Istio Traffic Management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus-operator.dev/docs/operator/api/" rel="noopener noreferrer"&gt;Prometheus Operator API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>GitOps for ML Model Deployment: A Real Pipeline, Not a Toy Demo</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sun, 08 Mar 2026 06:27:15 +0000</pubDate>
      <link>https://forem.com/mateenali66/gitops-for-ml-model-deployment-a-real-pipeline-not-a-toy-demo-1lk8</link>
      <guid>https://forem.com/mateenali66/gitops-for-ml-model-deployment-a-real-pipeline-not-a-toy-demo-1lk8</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I replaced ad-hoc model deployments with a fully declarative GitOps pipeline using KServe and ArgoCD. Every model version lives in Git, every change goes through a PR, and rollbacks take one &lt;code&gt;git revert&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Every ML team I've worked with has the same dirty secret: their model deployments are snowflakes.&lt;/p&gt;

&lt;p&gt;The Python script that "works on the data scientist's machine." The Slack message that says "hey can you deploy the new model." The SSH session into the GPU node that nobody documented. Meanwhile, the same team's microservices are humming along with ArgoCD, automated rollbacks, PR-gated deploys, full audit trails.&lt;/p&gt;

&lt;p&gt;That gap is embarrassing, and it's completely unnecessary.&lt;/p&gt;

&lt;p&gt;KServe got accepted into CNCF as an Incubating project in September 2025. The tooling to close this gap is mature enough for production. Here's what the actual problem looks like in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Someone manually SSHes into a node and runs a deployment script. No record of what version went live.&lt;/li&gt;
&lt;li&gt;A model update silently replaces the previous one. There's no rollback path.&lt;/li&gt;
&lt;li&gt;Two data scientists think different model versions are running in staging. Both are right, sort of.&lt;/li&gt;
&lt;li&gt;An incident happens. Nobody can tell what changed or when.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've lived through all of these. The fix isn't a better runbook or more Slack discipline. It's treating model deployments the same way we treat application deployments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6a0tytpn5lm76ukafy5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6a0tytpn5lm76ukafy5i.png" alt=" " width="800" height="979"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried First (And Why It Failed)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Attempt 1: Wrapping deployments in shell scripts
&lt;/h3&gt;

&lt;p&gt;The first instinct was to write a &lt;code&gt;deploy_model.sh&lt;/code&gt; that calls &lt;code&gt;kubectl apply&lt;/code&gt; with the right image tag. This is better than nothing, but it's not GitOps. The script lives somewhere, gets edited ad-hoc, and there's still no PR-gated workflow. The script is the new snowflake.&lt;/p&gt;
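&lt;p&gt;A minimal sketch of what such a script tends to look like (the file name, image tag, and manifest path are invented for illustration; the stand-in manifest just makes the in-place edit visible):&lt;/p&gt;

```shell
# deploy_model.sh -- the antipattern, sketched. Imperative, unreviewed, unrecorded.
MODEL_TAG="${1:-v2.5.0}"
# Stand-in manifest so the edit below is visible end to end:
printf 'image: registry.local/fraud-detector:v2.4.1\n' | tee inference-service.yaml
# Mutate the manifest directly -- no branch, no PR, no record of who bumped what:
sed -i "s|fraud-detector:v[0-9.]*|fraud-detector:${MODEL_TAG}|" inference-service.yaml
cat inference-service.yaml
# The real script would end with: kubectl apply -f inference-service.yaml
```

&lt;p&gt;It works, which is exactly why it survives. Nothing about it is reviewable or revertible.&lt;/p&gt;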

&lt;h3&gt;
  
  
  Attempt 2: Baking models into Docker images
&lt;/h3&gt;

&lt;p&gt;The idea: train the model, package the weights into a Docker image, deploy the image via a normal &lt;code&gt;Deployment&lt;/code&gt;. This works surprisingly well for small models under a few hundred MB. It breaks down fast once the model hits 2GB, let alone 14GB. Your Docker build times blow up, your registry costs climb, and now your CI pipeline is bottlenecked on model artifact size.&lt;/p&gt;

&lt;p&gt;More importantly, you lose the semantic layer. Your Git history shows &lt;code&gt;model:sha256-abc123&lt;/code&gt; instead of &lt;code&gt;fraud-detector/v2.5.0 sklearn 2 replicas 50 RPS target&lt;/code&gt;. The config and the artifact are fused. That's hard to review and harder to reason about.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 3: What actually worked
&lt;/h3&gt;

&lt;p&gt;Separate the artifact from the config. The model weights live in S3, content-addressed and immutable. Git holds the pointer and all the serving configuration. A Kubernetes controller keeps the cluster in sync with what Git says. That's it.&lt;/p&gt;
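&lt;p&gt;Sketched concretely (the bucket and file names are illustrative; the upload itself is left as a comment since it needs AWS credentials):&lt;/p&gt;

```shell
# One immutable prefix per version. A prefix is written once and never reused.
MODEL=fraud-detector
VERSION=v2.5.0
STORAGE_URI="s3://prod-ml-models/${MODEL}/${VERSION}"
echo "${STORAGE_URI}"
# Publish once:  aws s3 cp ./model.joblib "${STORAGE_URI}/model.joblib"
# Git commits only the pointer string above; the weights never enter the repo.
```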




&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The stack I use and recommend:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model serving&lt;/td&gt;
&lt;td&gt;KServe v0.14+&lt;/td&gt;
&lt;td&gt;Kubernetes-native CRD, multi-framework, built-in canary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitOps controller&lt;/td&gt;
&lt;td&gt;ArgoCD&lt;/td&gt;
&lt;td&gt;Declarative sync, health checks, rollback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model storage&lt;/td&gt;
&lt;td&gt;S3&lt;/td&gt;
&lt;td&gt;Content-addressable, versioned, immutable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model versioning&lt;/td&gt;
&lt;td&gt;MLflow&lt;/td&gt;
&lt;td&gt;Tracks lineage from training to deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ingress&lt;/td&gt;
&lt;td&gt;Istio&lt;/td&gt;
&lt;td&gt;Traffic splitting for canary rollouts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets&lt;/td&gt;
&lt;td&gt;AWS IRSA&lt;/td&gt;
&lt;td&gt;No credentials in Git, ever&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;KServe is the linchpin. It exposes a single &lt;code&gt;InferenceService&lt;/code&gt; CRD that ArgoCD manages like any other Kubernetes resource.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install KServe
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# cert-manager is a prerequisite&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/cert-manager/cert-manager/releases/download/v1.17.0/cert-manager.yaml

kubectl create ns kserve

helm &lt;span class="nb"&gt;install &lt;/span&gt;kserve-crd oci://ghcr.io/kserve/charts/kserve-crd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; v0.14.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kserve

helm &lt;span class="nb"&gt;install &lt;/span&gt;kserve oci://ghcr.io/kserve/charts/kserve &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; v0.14.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kserve &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; kserve.controller.deploymentMode&lt;span class="o"&gt;=&lt;/span&gt;RawDeployment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I use &lt;code&gt;RawDeployment&lt;/code&gt; mode. It uses standard Kubernetes Deployments and Services instead of Knative, which means fewer moving parts, better compatibility with existing Prometheus and HPA setups, and no cold-start complexity on the critical path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Structure your Git repo
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;models/
├── base/
│   └── kustomization.yaml
├── fraud-detector/
│   ├── kustomization.yaml
│   ├── inference-service.yaml
│   └── service-account.yaml
├── image-classifier/
│   ├── kustomization.yaml
│   └── inference-service.yaml
└── overlays/
    ├── staging/
    │   └── kustomization.yaml
    └── production/
        └── kustomization.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kustomize overlays let you parameterize resource limits, replica counts, and model URIs per environment without duplicating YAML.&lt;/p&gt;
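&lt;p&gt;A staging overlay, for example, can lower the replica floor without duplicating the base manifest (a sketch; it patches the &lt;code&gt;fraud-detector&lt;/code&gt; service defined in Step 3):&lt;/p&gt;

```yaml
# models/overlays/staging/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../fraud-detector
patches:
  - target:
      kind: InferenceService
      name: fraud-detector
    patch: |-
      - op: replace
        path: /spec/predictor/minReplicas
        value: 1
```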

&lt;h3&gt;
  
  
  Step 3: Define the InferenceService
&lt;/h3&gt;

&lt;p&gt;This is the core resource. Here's a real example for a scikit-learn fraud detection model stored in S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/fraud-detector/inference-service.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
    &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-platform&lt;/span&gt;
    &lt;span class="na"&gt;model-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.4.1"&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serving.kserve.io/deploymentMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RawDeployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;scaleTarget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
    &lt;span class="na"&gt;scaleMetric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rps&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;modelFormat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sklearn&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://prod-ml-models/fraud-detector/v2.4.1"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
      &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SKLEARN_SERVER_WORKERS&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;storageUri&lt;/code&gt; is the version pointer. Bumping &lt;code&gt;v2.4.1&lt;/code&gt; to &lt;code&gt;v2.5.0&lt;/code&gt; and raising a PR is your deploy-new-model workflow.&lt;/p&gt;
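&lt;p&gt;The mechanics are worth seeing once. Here's the bump sketched against a scratch repo standing in for the manifests repo (the manifest is abbreviated to the one line that changes):&lt;/p&gt;

```shell
# Scratch repo standing in for the Step 2 layout.
git init -q bump-demo
cd bump-demo
mkdir -p models/fraud-detector
printf 'storageUri: "s3://prod-ml-models/fraud-detector/v2.4.1"\n' \
  | tee models/fraud-detector/inference-service.yaml
git add -A
git -c user.email=ci@example.com -c user.name=ci commit -qm "fraud-detector v2.4.1"
# The deploy: bump the pointer on a branch, then merge it via PR.
git checkout -q -b bump-fraud-detector-v2.5.0
sed -i 's|v2\.4\.1|v2.5.0|' models/fraud-detector/inference-service.yaml
git -c user.email=ci@example.com -c user.name=ci commit -qam "fraud-detector: v2.4.1 to v2.5.0"
```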

&lt;p&gt;For GPU workloads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/image-classifier/inference-service.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;image-classifier&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.3.0"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;modelFormat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pytorch&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://prod-ml-models/image-classifier/v1.3.0"&lt;/span&gt;
      &lt;span class="na"&gt;runtimeVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;23.08-py3"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
          &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
          &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;accelerator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-a10g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Wire up the S3 service account
&lt;/h3&gt;

&lt;p&gt;Don't put AWS credentials in manifests. Use IRSA on EKS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/fraud-detector/service-account.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/kserve-model-reader&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The IAM role needs &lt;code&gt;s3:GetObject&lt;/code&gt; and &lt;code&gt;s3:ListBucket&lt;/code&gt; on your model bucket. KServe's storage initializer picks up the IRSA token automatically.&lt;/p&gt;
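&lt;p&gt;The permissions policy is two statements (a sketch, scoped to the example bucket; &lt;code&gt;ListBucket&lt;/code&gt; applies to the bucket ARN, &lt;code&gt;GetObject&lt;/code&gt; to the objects under it):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadModelObjects",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::prod-ml-models/*"
    },
    {
      "Sid": "ListModelBucket",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::prod-ml-models"
    }
  ]
}
```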

&lt;h3&gt;
  
  
  Step 5: Create the ArgoCD Application
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# argocd/apps/ml-models.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-models&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
  &lt;span class="na"&gt;finalizers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;resources-finalizer.argocd.argoproj.io&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-platform&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/phonotech/ml-manifests&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;models/overlays/production&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;RespectIgnoreDifferences=true&lt;/span&gt;
    &lt;span class="na"&gt;retry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;backoff&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
        &lt;span class="na"&gt;factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
        &lt;span class="na"&gt;maxDuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3m&lt;/span&gt;
  &lt;span class="na"&gt;ignoreDifferences&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
      &lt;span class="na"&gt;jsonPointers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/status&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/metadata/annotations/serving.kserve.io~1deploymentMode&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ignoreDifferences&lt;/code&gt; block is critical. KServe's controller writes back to the &lt;code&gt;InferenceService&lt;/code&gt; status and some annotations. Without it, ArgoCD will perpetually detect drift and attempt to re-sync, creating a noisy feedback loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: The deployment workflow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfpkxi8khe6ktal86qg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfpkxi8khe6ktal86qg1.png" alt=" " width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what a model update looks like end to end:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data scientist trains a new model, registers the artifact in MLflow, uploads weights to &lt;code&gt;s3://prod-ml-models/fraud-detector/v2.5.0/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;They open a PR updating &lt;code&gt;storageUri&lt;/code&gt; and the &lt;code&gt;model-version&lt;/code&gt; label in &lt;code&gt;inference-service.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;PR gets reviewed and merged to &lt;code&gt;main&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;ArgoCD detects the diff within 3 minutes (or immediately with webhooks), syncs the new &lt;code&gt;InferenceService&lt;/code&gt; spec&lt;/li&gt;
&lt;li&gt;KServe's storage initializer pulls the new weights into the pod&lt;/li&gt;
&lt;li&gt;New revision comes up healthy, traffic cuts over&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The model version is in Git history. You can &lt;code&gt;git revert&lt;/code&gt; it. You can see exactly what changed between &lt;code&gt;v2.4.1&lt;/code&gt; and &lt;code&gt;v2.5.0&lt;/code&gt; in the PR diff.&lt;/p&gt;
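&lt;p&gt;The revert is worth demonstrating. In a scratch repo (standing in for the manifests repo, with the manifest abbreviated to the pointer line), one command restores the old version, and ArgoCD syncs it like any other commit:&lt;/p&gt;

```shell
git init -q revert-demo
cd revert-demo
printf 'storageUri: "s3://prod-ml-models/fraud-detector/v2.4.1"\n' | tee isvc.yaml
git add isvc.yaml
git -c user.email=ci@example.com -c user.name=ci commit -qm "v2.4.1"
sed -i 's|v2\.4\.1|v2.5.0|' isvc.yaml
git -c user.email=ci@example.com -c user.name=ci commit -qam "bump to v2.5.0"
# The entire rollback:
git -c user.email=ci@example.com -c user.name=ci revert --no-edit HEAD
grep storageUri isvc.yaml
```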

&lt;p&gt;To trigger ArgoCD immediately via webhook from GitHub Actions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/sync-models.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Notify ArgoCD on model manifest change&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;models/**'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;sync&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trigger ArgoCD sync&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -s -X POST \&lt;/span&gt;
            &lt;span class="s"&gt;-H "Authorization: Bearer ${{ secrets.ARGOCD_TOKEN }}" \&lt;/span&gt;
            &lt;span class="s"&gt;https://argocd.internal.ca/api/v1/applications/ml-models/sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Canary rollouts
&lt;/h3&gt;

&lt;p&gt;KServe's built-in canary support is where this pattern earns its keep.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: Deploy canary at 10% traffic&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;canaryTrafficPercent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;modelFormat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sklearn&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://prod-ml-models/fraud-detector/v2.5.0"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;KServe automatically routes 90% to the last stable revision and 10% to v2.5.0. If the new model performs well, merge another PR bumping &lt;code&gt;canaryTrafficPercent&lt;/code&gt; to 50, then promote to 100 by removing the field. If the canary is bad, set &lt;code&gt;canaryTrafficPercent: 0&lt;/code&gt; to pin back to stable immediately.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;RawDeployment&lt;/code&gt; mode, KServe's built-in traffic split isn't available, so you handle the canary at the Istio level instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# istio/virtualservice-fraud-detector.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VirtualService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;fraud-detector.ml-serving.svc.cluster.local&lt;/span&gt;
  &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-v2-4-1-predictor&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-v2-5-0-predictor&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both the &lt;code&gt;InferenceService&lt;/code&gt; and the &lt;code&gt;VirtualService&lt;/code&gt; are in Git. The traffic split is in Git. Everything is auditable and revertible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt50j7pwdqb5nzjsnm57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt50j7pwdqb5nzjsnm57.png" alt=" " width="800" height="1459"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;I won't pretend I have clean before/after numbers from a single project because this pattern spans multiple engagements. Here's what consistently holds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model deployment method&lt;/td&gt;
&lt;td&gt;Manual SSH or ad-hoc scripts&lt;/td&gt;
&lt;td&gt;PR-gated, Git-backed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;None or Slack history&lt;/td&gt;
&lt;td&gt;Full Git history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback time&lt;/td&gt;
&lt;td&gt;30 minutes to hours&lt;/td&gt;
&lt;td&gt;One &lt;code&gt;git revert&lt;/code&gt;, seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Canary traffic split&lt;/td&gt;
&lt;td&gt;Not possible without Istio knowledge&lt;/td&gt;
&lt;td&gt;Config field in YAML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to detect config drift&lt;/td&gt;
&lt;td&gt;Never (no baseline)&lt;/td&gt;
&lt;td&gt;Continuous, ArgoCD UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secret management&lt;/td&gt;
&lt;td&gt;Often hard-coded or in &lt;code&gt;.env&lt;/code&gt; files&lt;/td&gt;
&lt;td&gt;IRSA, no credentials in Git&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The operational improvement that surprises people most: the on-call burden drops significantly when you can answer "what version is running, what changed, who approved it" in under 30 seconds by looking at Git.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The &lt;code&gt;ignoreDifferences&lt;/code&gt; config is not optional.&lt;/strong&gt; Skip it and you'll spend a weekend wondering why ArgoCD is perpetually out of sync when nothing real has changed. KServe mutates its own resources, so you have to tell ArgoCD which fields to ignore.&lt;/p&gt;
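&lt;p&gt;A minimal sketch of that fragment, which goes under &lt;code&gt;spec:&lt;/code&gt; in your ArgoCD &lt;code&gt;Application&lt;/code&gt;. The exact paths KServe mutates vary by version, so treat the ones below as placeholders and confirm them against a real &lt;code&gt;argocd app diff&lt;/code&gt; first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;ignoreDifferences:
  - group: serving.kserve.io
    kind: InferenceService
    # Illustrative paths only; KServe's defaulting webhook
    # decides which fields actually get rewritten.
    jqPathExpressions:
      - .metadata.annotations
      - .spec.predictor.containers[].resources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;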

&lt;p&gt;&lt;strong&gt;2. Model size determines your storage strategy.&lt;/strong&gt; Under 500MB, the default S3 init container approach is fine. Over a few GB, you need a shared model cache PVC or a pre-baked image. Planning this up front saves a painful migration later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Always set &lt;code&gt;nodeSelector&lt;/code&gt; for GPU workloads.&lt;/strong&gt; Without it, your &lt;code&gt;InferenceService&lt;/code&gt; might land on a CPU node and silently fall back to CPU inference. Set the affinity, set the tolerations, pin it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Start with &lt;code&gt;RawDeployment&lt;/code&gt; mode.&lt;/strong&gt; Knative is powerful but it adds complexity. Get the core pattern working first, then add Knative if you genuinely need scale-to-zero economics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. GitOps creates friction on purpose.&lt;/strong&gt; The PR workflow adds a step that direct &lt;code&gt;kubectl apply&lt;/code&gt; doesn't. That step is the point. If your team resents the friction, they haven't lived through the 2am incident where nobody knows what changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The five things you actually need to get started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;KServe installed (Helm, RawDeployment mode, cert-manager prerequisite)&lt;/li&gt;
&lt;li&gt;A models-manifests repo with &lt;code&gt;InferenceService&lt;/code&gt; YAML per model, Kustomize overlays for environments&lt;/li&gt;
&lt;li&gt;ArgoCD Application pointing at &lt;code&gt;overlays/production&lt;/code&gt;, &lt;code&gt;selfHeal: true&lt;/code&gt;, with &lt;code&gt;ignoreDifferences&lt;/code&gt; on KServe status fields&lt;/li&gt;
&lt;li&gt;IRSA or Workload Identity for S3 access&lt;/li&gt;
&lt;li&gt;Branch protection on &lt;code&gt;main&lt;/code&gt; so model version bumps require PR review&lt;/li&gt;
&lt;/ol&gt;
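&lt;p&gt;Item 3 above can be sketched as a single manifest. The repo URL and namespaces here are placeholders for your own setup, and you'd add the &lt;code&gt;ignoreDifferences&lt;/code&gt; entries for whatever fields your KServe version mutates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: models-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/models-manifests  # placeholder
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: models  # placeholder
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;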

&lt;p&gt;The canary rollout and GitHub Actions webhook are enhancements. Get the core working first.&lt;/p&gt;




</description>
      <category>kubernetes</category>
      <category>mlops</category>
      <category>gitops</category>
      <category>argocd</category>
    </item>
    <item>
      <title>I Migrated a Real Production Codebase from Terraform to OpenTofu (Here's What Broke)</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sun, 08 Mar 2026 06:25:03 +0000</pubDate>
      <link>https://forem.com/mateenali66/i-migrated-a-real-production-codebase-from-terraform-to-opentofu-heres-what-broke-4j1b</link>
      <guid>https://forem.com/mateenali66/i-migrated-a-real-production-codebase-from-terraform-to-opentofu-heres-what-broke-4j1b</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Migrating a standard AWS Terraform codebase to OpenTofu took half a day, most of which was CI pipeline updates. The S3 native locking alone made it worth it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I've been writing Terraform since version 0.8. Watched it grow from a scrappy infrastructure tool into the de facto standard for cloud automation. I've migrated teams from CloudFormation to Terraform, written custom providers, debugged state corruption at 2 AM. Terraform is baked into how I think about infrastructure.&lt;/p&gt;

&lt;p&gt;So when HashiCorp switched to the Business Source License in August 2023, I did what most practitioners did: I shrugged, bookmarked the OpenTofu repo, and went back to building.&lt;/p&gt;

&lt;p&gt;That bookmark sat there for two years.&lt;/p&gt;

&lt;p&gt;The BSL doesn't prevent you from using Terraform. It prevents you from building a product or service that's "substantially similar" to Terraform Cloud or Terraform Enterprise. For most teams running internal infrastructure, the risk is low. But once you're building a platform team that exposes self-service infrastructure to internal customers, or packaging IaC automation as part of a managed service, your legal team might want a conversation. And once "get legal sign-off on our IaC toolchain" is on the agenda, you've already lost an afternoon you'll never get back.&lt;/p&gt;

&lt;p&gt;For a Phono Technologies project, we were building a lightweight CI/CD orchestration layer for client infrastructure. The moment I tried to describe it, I realized I was describing exactly what the BSL restricts. The ambiguity was real enough that I wanted it gone.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried First (And Why It Failed)
&lt;/h2&gt;

&lt;p&gt;My first instinct was to just drop in the &lt;code&gt;tofu&lt;/code&gt; binary and run &lt;code&gt;tofu init&lt;/code&gt;. Simple enough.&lt;/p&gt;

&lt;p&gt;It almost worked. Until I checked where providers were being pulled from.&lt;/p&gt;

&lt;p&gt;OpenTofu fetches providers from &lt;code&gt;registry.opentofu.org&lt;/code&gt;, not &lt;code&gt;registry.terraform.io&lt;/code&gt;. The registries mirror each other for HashiCorp providers, but your existing &lt;code&gt;.terraform.lock.hcl&lt;/code&gt; was generated against Terraform's registry. The provider hashes don't match.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Failed to install provider

To install this provider, OpenTofu needs to verify that the checksums in
.terraform.lock.hcl match the provider packages downloaded from the registry.
The following packages are required but the checksums don't match:
  registry.opentofu.org/hashicorp/aws v5.82.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also ran into teammates who still had the old Terraform-generated lock files. Some ran &lt;code&gt;tofu plan&lt;/code&gt; on their local branches and got hash mismatches in the other direction. The lesson: this has to be a coordinated team migration, not a quiet swap on your own laptop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The codebase: a mid-sized AWS platform for a SaaS client. Around 8,000 lines of Terraform across 12 modules. Standard providers: &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;kubernetes&lt;/code&gt;, &lt;code&gt;helm&lt;/code&gt;, &lt;code&gt;random&lt;/code&gt;, &lt;code&gt;tls&lt;/code&gt;. S3 backend for state, one workspace per environment. CI via GitHub Actions. No Terraform Cloud, no HCP.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon677jiral0tagg9875q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon677jiral0tagg9875q.png" alt=" " width="800" height="1043"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Back up everything
&lt;/h3&gt;

&lt;p&gt;Before touching anything, tag the current state in git and pull a snapshot of your state file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git tag pre-opentofu-migration

terraform state pull &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; terraform.tfstate.backup-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're on S3, enable versioning before you start. You want a timestamped rollback point. Non-negotiable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Install tofu alongside terraform
&lt;/h3&gt;

&lt;p&gt;The two binaries coexist without conflict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;opentofu
tofu &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# OpenTofu v1.11.4&lt;/span&gt;
&lt;span class="c"&gt;# on darwin_arm64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep &lt;code&gt;terraform&lt;/code&gt; installed until you're confident the migration is complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Delete the lock file and re-init
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; .terraform.lock.hcl
tofu init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;tofu init&lt;/code&gt; regenerates the lock file with hashes for both &lt;code&gt;registry.opentofu.org&lt;/code&gt; and &lt;code&gt;registry.terraform.io&lt;/code&gt; providers, signed by OpenTofu's key infrastructure. Commit the new lock file and tell your team to re-run &lt;code&gt;tofu init&lt;/code&gt; on their local copies.&lt;/p&gt;

&lt;p&gt;Once you commit the new lock file, treat the repo as an OpenTofu project. Don't run &lt;code&gt;terraform init&lt;/code&gt; on the same directory afterward. The two binaries will fight over hashes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Check your &lt;code&gt;terraform {}&lt;/code&gt; block
&lt;/h3&gt;

&lt;p&gt;You don't have to rename it. OpenTofu still accepts the &lt;code&gt;terraform {}&lt;/code&gt; block. Your existing HCL works without modification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This works fine in OpenTofu, no changes needed&lt;/span&gt;
&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;required_version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&amp;gt;= 1.5.0"&lt;/span&gt;

  &lt;span class="nx"&gt;required_providers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;aws&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hashicorp/aws"&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 5.0"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-terraform-state"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production/terraform.tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
    &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-state-locks"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can leave it as &lt;code&gt;terraform {}&lt;/code&gt; or rename it to &lt;code&gt;tofu {}&lt;/code&gt;. Both work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Verify with &lt;code&gt;tofu plan&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tofu plan &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;migration-test.tfplan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected result: no changes. If you see changes, do not apply. Investigate first. It usually means a provider version difference or a schema update.&lt;/p&gt;

&lt;p&gt;I got zero changes across all three environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Drop DynamoDB for S3 native locking
&lt;/h3&gt;

&lt;p&gt;This is where OpenTofu pulls ahead. OpenTofu 1.10.0 added native conditional writes for S3 state locking. No DynamoDB table required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-state-bucket"&lt;/span&gt;
  &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod/terraform.tfstate"&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
  &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-locks"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-state-bucket"&lt;/span&gt;
  &lt;span class="nx"&gt;key&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod/terraform.tfstate"&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
  &lt;span class="nx"&gt;encrypt&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;use_lockfile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fewer moving parts. One less AWS service to manage. Simpler IAM permissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujqnfnrt73mfc7mow595.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujqnfnrt73mfc7mow595.png" alt=" " width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Update your CI pipeline
&lt;/h3&gt;

&lt;p&gt;Every place your pipeline runs &lt;code&gt;terraform&lt;/code&gt;, you need &lt;code&gt;tofu&lt;/code&gt;. In GitHub Actions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hashicorp/setup-terraform@v3&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;terraform_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.9.5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;opentofu/setup-opentofu@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tofu_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.11.4"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;opentofu/setup-opentofu&lt;/code&gt; action is the official GitHub Action. Clean swap.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;State locking dependencies&lt;/td&gt;
&lt;td&gt;S3 + DynamoDB&lt;/td&gt;
&lt;td&gt;S3 only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB tables&lt;/td&gt;
&lt;td&gt;3 (one per environment)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Migration time&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;4 hours (including CI updates)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plan output differences&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sensitive values in state&lt;/td&gt;
&lt;td&gt;Persisted&lt;/td&gt;
&lt;td&gt;Ephemeral (with 1.11 features)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The operational simplicity of dropping DynamoDB is hard to quantify in a table. It's one less service in IAM policies, one less resource to manage in the state backend module, one less thing that can drift or get misconfigured.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Coordinate the lock file migration as a team.&lt;/strong&gt; If half your team is still running &lt;code&gt;terraform init&lt;/code&gt;, you'll get hash conflicts. Announce the cutover date, have everyone delete and regenerate their lock files on the same day.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pin your OpenTofu version in CI.&lt;/strong&gt; The 1.11 series shipped a notable regression in 1.11.0 that wasn't fixed until 1.11.2. The team moves fast. Pin to a specific patch version in CI and upgrade deliberately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;terraform {}&lt;/code&gt; block is fine.&lt;/strong&gt; Don't waste time renaming it. The binary changed; the HCL didn't.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The point of no return is &lt;code&gt;tofu apply&lt;/code&gt;.&lt;/strong&gt; After you run apply, the state metadata reflects OpenTofu's version. You can still read the state with Terraform, but you'll get warnings. Decide before you apply whether you're committed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ephemeral values are worth understanding.&lt;/strong&gt; OpenTofu 1.11.0 introduced ephemeral resources and write-only attributes. Sensitive credentials can be used without ever landing in state. If you've been papering over this with Vault workarounds, it's worth reading the docs before you finish the migration.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;ephemeral&lt;/span&gt; &lt;span class="s2"&gt;"aws_secretsmanager_secret_version"&lt;/span&gt; &lt;span class="s2"&gt;"db_password"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;secret_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_secretsmanager_secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"kubernetes_secret_v1"&lt;/span&gt; &lt;span class="s2"&gt;"db_credentials"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"db-credentials"&lt;/span&gt;
    &lt;span class="nx"&gt;namespace&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;data_wo&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;password&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ephemeral&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_secretsmanager_secret_version&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db_password&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;secret_string&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;data_wo_revision&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenTofu Migration Guide:&lt;/strong&gt; &lt;a href="https://opentofu.org/docs/intro/migration/migration-guide/" rel="noopener noreferrer"&gt;opentofu.org/docs/intro/migration&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>opensource</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Drift Detection in Air-Gapped Workloads: What Nobody Tells You</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 21 Feb 2026 06:32:18 +0000</pubDate>
      <link>https://forem.com/mateenali66/drift-detection-in-air-gapped-workloads-what-nobody-tells-you-3eb9</link>
      <guid>https://forem.com/mateenali66/drift-detection-in-air-gapped-workloads-what-nobody-tells-you-3eb9</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Standard drift detection breaks in air-gapped environments because every major tool assumes cloud API access. The fix is decentralized reconciliation with local state management, not trying to force connected tools into disconnected networks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Assumption That Breaks Everything
&lt;/h2&gt;

&lt;p&gt;Every popular drift detection tool makes the same assumption: your infrastructure can reach the internet.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;terraform plan&lt;/code&gt; calls AWS APIs. Argo CD pulls from remote Git repos. Spacelift runs scans from a SaaS control plane. These tools work brilliantly in connected environments. The moment you drop them into an air-gapped network, they go silent.&lt;/p&gt;

&lt;p&gt;I've spent the better part of a decade building infrastructure for organizations where connectivity isn't optional; it's forbidden. Government agencies, defense contractors, healthcare systems, financial trading floors. These environments are disconnected by design, not by accident. And drift detection in these networks is a fundamentally different problem than what most DevOps engineers encounter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Air-Gapped Workloads Drift Differently
&lt;/h2&gt;

&lt;p&gt;In a connected environment, drift happens and gets caught relatively fast. Someone clicks through the console, Terraform Cloud flags it on the next scan, you fix it. The feedback loop is tight.&lt;/p&gt;

&lt;p&gt;In air-gapped environments, drift accumulates silently.&lt;/p&gt;

&lt;p&gt;A sysadmin patches a node manually because the automated pipeline can't reach the package mirror. A developer tweaks a ConfigMap directly because the GitOps controller lost sync with the local Git server. An operator scales a deployment by hand during an incident and forgets to commit the change.&lt;/p&gt;

&lt;p&gt;These changes compound. By the time anyone runs a manual audit, the gap between declared state and actual state can be enormous.&lt;/p&gt;

&lt;p&gt;The core problem: &lt;strong&gt;connected drift detection is continuous and automated. Disconnected drift detection is episodic and manual.&lt;/strong&gt; That gap is where compliance violations, security incidents, and late night pages live.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Doesn't Work (And Why Teams Keep Trying)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Terraform Plan Over VPN
&lt;/h3&gt;

&lt;p&gt;The most common first attempt: tunnel &lt;code&gt;terraform plan&lt;/code&gt; through a VPN into the air-gapped network.&lt;/p&gt;

&lt;p&gt;Problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency kills the feedback loop.&lt;/strong&gt; Provider API calls that take milliseconds on the internet take seconds over a restricted VPN. A plan that runs in 30 seconds now takes 15 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial connectivity isn't air-gapped.&lt;/strong&gt; If your "air-gapped" network has a VPN tunnel to SaaS tooling, your security team has questions. Valid ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State file synchronization becomes a bottleneck.&lt;/strong&gt; Remote state backends (S3, Consul) need connectivity. Local state files create merge conflicts when multiple operators work simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GitOps Controllers Pointed at External Repos
&lt;/h3&gt;

&lt;p&gt;Flux CD and Argo CD are excellent GitOps tools. But pointing them at a GitHub repo from an air-gapped cluster means... you don't have an air-gapped cluster anymore.&lt;/p&gt;

&lt;p&gt;Running a local Git server (Gitea, GitLab) inside the perimeter fixes the connectivity problem but creates a new one: keeping the local repo in sync with the source of truth requires a deliberate, auditable transfer process. USB drives, data diodes, or scheduled one-way syncs all introduce delay. That delay is where drift happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Periodic Manual Audits
&lt;/h3&gt;

&lt;p&gt;The fallback everyone hates: someone SSHes in, runs a bunch of comparison scripts, and writes a report.&lt;/p&gt;

&lt;p&gt;This catches drift after the fact. In regulated environments, "we check quarterly" doesn't satisfy auditors who want continuous compliance evidence. And manual audits miss things. Every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;After iterating through the failures above across multiple engagements, three patterns consistently work in production air-gapped environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Decentralized Policy Agents
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5e6so0bjwpy9ojikbd5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5e6so0bjwpy9ojikbd5a.png" alt=" " width="800" height="1225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of a central control plane that reaches into clusters, deploy autonomous policy agents inside each air-gapped cluster.&lt;/p&gt;

&lt;p&gt;Each agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores the desired state locally (pulled in during the last approved sync window)&lt;/li&gt;
&lt;li&gt;Runs a continuous reconciliation loop comparing desired vs. actual state&lt;/li&gt;
&lt;li&gt;Logs every deviation to a local audit store&lt;/li&gt;
&lt;li&gt;Remediates automatically when configured to do so, or raises alerts for manual review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the pattern that Spectro Cloud Palette uses, and it's the right mental model. The cluster enforces its own policy. It doesn't need to phone home.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: OPA Gatekeeper constraint running locally&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;constraints.gatekeeper.sh/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;K8sRequiredLabels&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-team-label&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Namespace"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost-center"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gatekeeper runs entirely inside the cluster. No external connectivity needed. Violations are logged locally and can be exported during sync windows.&lt;/p&gt;
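&lt;p&gt;One thing the snippet above glosses over: a constraint is only half of the pair. The matching &lt;code&gt;ConstraintTemplate&lt;/code&gt; has to be applied inside the perimeter too, before any &lt;code&gt;K8sRequiredLabels&lt;/code&gt; resource will validate. A sketch adapted from the Gatekeeper policy library, which you should verify against the Gatekeeper version you ship in through your sync window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: object
                properties:
                  key:
                    type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # Flag any matched object that lacks one of the required label keys.
        violation[{"msg": msg}] {
          required := {key | key := input.parameters.labels[_].key}
          provided := {label | input.review.object.metadata.labels[label]}
          missing := required - provided
          count(missing) &amp;gt; 0
          msg := sprintf("missing required labels: %v", [missing])
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;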

&lt;h3&gt;
  
  
  Pattern 2: Local State Snapshots with Diff-on-Sync
&lt;/h3&gt;

&lt;p&gt;For Terraform managed infrastructure, maintain state snapshots inside the air-gapped environment.&lt;/p&gt;

&lt;p&gt;The workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Declare state&lt;/strong&gt; in your IaC repo outside the air gap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer the repo&lt;/strong&gt; into the environment through your approved media (data diode, approved USB, one-way sync)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run &lt;code&gt;terraform plan&lt;/code&gt;&lt;/strong&gt; inside the air gap against local provider endpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot the actual state&lt;/strong&gt; after each apply&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diff the snapshot&lt;/strong&gt; against the expected state on a cron schedule&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export the diff report&lt;/strong&gt; during the next sync window&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight: the state file and the provider APIs both live inside the perimeter. &lt;code&gt;terraform plan&lt;/code&gt; works fine when everything it needs to reach is local.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# drift_check.sh - runs inside the air-gapped environment&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/log/drift-reports"&lt;/span&gt;

terraform plan &lt;span class="nt"&gt;-detailed-exitcode&lt;/span&gt; &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/plan_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.tfplan"&lt;/span&gt; 2&amp;gt;&amp;amp;1 | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;tee&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/drift_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.log"&lt;/span&gt;

&lt;span class="nv"&gt;EXIT_CODE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PIPESTATUS&lt;/span&gt;&lt;span class="p"&gt;[0]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$EXIT_CODE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 2 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"DRIFT_DETECTED"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/status_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="c"&gt;# Alert local monitoring&lt;/span&gt;
  curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://alertmanager.local:9093/api/v1/alerts &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'[{"labels":{"alertname":"InfrastructureDrift","severity":"warning"}}]'&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pattern 3: Immutable Baselines with Checksum Verification
&lt;/h3&gt;

&lt;p&gt;For the most sensitive environments (defense, critical infrastructure), treat infrastructure state like a software artifact.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build a golden baseline&lt;/strong&gt; of every resource's expected configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate checksums&lt;/strong&gt; (SHA-256) for each configuration artifact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy a lightweight agent&lt;/strong&gt; that periodically recalculates checksums on live resources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Any mismatch triggers an immediate alert&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is coarser than Terraform drift detection, but it works without any provider APIs. It's closer to file integrity monitoring (think AIDE or OSSEC) applied to infrastructure configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# baseline_check.py - infrastructure checksum verification
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_resource_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Capture current state of a Kubernetes resource.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Strip volatile fields that change on every read
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resourceVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creationTimestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;managedFields&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;checksum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate deterministic checksum of resource state.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;verify_baseline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compare live state against stored baseline checksums.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;drift_detected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_resource_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;drift_detected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MISSING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;checksum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;drift_detected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MODIFIED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current_hash&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;drift_detected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Choosing the Right Pattern
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;Audit Trail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decentralized Agents&lt;/td&gt;
&lt;td&gt;Kubernetes clusters&lt;/td&gt;
&lt;td&gt;Real-time&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local State Snapshots&lt;/td&gt;
&lt;td&gt;Terraform/IaC resources&lt;/td&gt;
&lt;td&gt;Minutes (cron)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Checksum Baselines&lt;/td&gt;
&lt;td&gt;High-security environments&lt;/td&gt;
&lt;td&gt;Minutes (cron)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practice, most air-gapped environments use a combination. Gatekeeper handles Kubernetes policy enforcement in real time. Terraform drift checks run on a cron inside the perimeter. Checksum baselines provide an additional layer for the security team.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compliance Angle
&lt;/h2&gt;

&lt;p&gt;Auditors care about three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Can you prove your infrastructure matches the declared state?&lt;/strong&gt; Drift reports with timestamps answer this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How quickly do you detect deviations?&lt;/strong&gt; "Within minutes" beats "at the next quarterly audit."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens when drift is detected?&lt;/strong&gt; You need a defined response: automated remediation or a documented manual review process.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Air-gapped environments often have stricter compliance requirements than connected ones. The irony is that their tooling for meeting those requirements is worse. Building local drift detection infrastructure closes that gap.&lt;/p&gt;
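&lt;p&gt;The first two auditor questions reduce to producing a timestamped artifact. A minimal sketch of what such a report could look like; the field names and the &lt;code&gt;environment&lt;/code&gt; value are illustrative, not any compliance standard:&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

def build_drift_report(drift_items, environment="airgap-prod"):
    """Assemble a timestamped drift report suitable for an audit trail."""
    return {
        "report_type": "infrastructure-drift",
        "environment": environment,
        # UTC timestamp proves *when* the deviation was detected
        "detected_at": datetime.now(timezone.utc).isoformat(),
        "drift_count": len(drift_items),
        "status": "DRIFT_DETECTED" if drift_items else "CLEAN",
        "items": drift_items,
    }

def write_report(drift_items, path):
    """Persist the report for export during the next sync window."""
    with open(path, "w") as f:
        json.dump(build_drift_report(drift_items), f, indent=2, sort_keys=True)
```

&lt;p&gt;Feeding it the output of a checker like &lt;code&gt;verify_baseline&lt;/code&gt; gives you a file per run that can be exported during sync windows and handed to auditors as-is.&lt;/p&gt;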

&lt;h2&gt;
  
  
  Lessons From the Field
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Treat sync windows as deployment events.&lt;/strong&gt; When new policy or desired state enters the air-gapped environment, that transfer should go through the same review process as a production deployment. Because it is one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Log everything locally, export periodically.&lt;/strong&gt; Build a local ELK or Loki stack inside the perimeter. Drift events, remediation actions, audit logs. Export summaries during sync windows for central visibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Test your drift detection in staging first.&lt;/strong&gt; Introduce intentional drift in a staging cluster and verify your agents catch it. I've seen teams deploy Gatekeeper and assume it works, only to discover six months later that their constraints had a typo that prevented enforcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Don't fight the air gap.&lt;/strong&gt; The biggest mistake is trying to poke holes in the network boundary to make connected tools work. Every hole is an attack surface. Build for disconnection. It's simpler in the long run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Version your baselines.&lt;/strong&gt; When the approved state changes (through a sync window), update the baseline checksums and keep the old ones. This gives you a historical record of what the environment should have looked like at any point in time.&lt;/p&gt;
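&lt;p&gt;Versioning can be as simple as never overwriting: each approved baseline gets its own timestamped file, and a pointer file marks the current one. A sketch under those assumptions; the directory layout and the &lt;code&gt;CURRENT&lt;/code&gt; pointer convention are hypothetical, not a standard:&lt;/p&gt;

```python
import json
import os
from datetime import datetime, timezone

def save_baseline_version(baseline, directory="/var/lib/drift/baselines"):
    """Write a newly approved baseline without overwriting history."""
    os.makedirs(directory, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = os.path.join(directory, f"baseline_{stamp}.json")
    with open(path, "w") as f:
        json.dump(baseline, f, sort_keys=True, indent=2)
    # CURRENT is a pointer file, not a copy: old versions stay immutable
    with open(os.path.join(directory, "CURRENT"), "w") as f:
        f.write(path + "\n")
    return path

def load_current_baseline(directory="/var/lib/drift/baselines"):
    """Resolve the pointer and load the active baseline."""
    with open(os.path.join(directory, "CURRENT")) as f:
        path = f.read().strip()
    with open(path) as f:
        return json.load(f)
```

&lt;p&gt;Because old files are never touched, answering "what should the environment have looked like on March 3rd?" is a directory listing, not an archaeology project.&lt;/p&gt;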

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>terraform</category>
      <category>security</category>
    </item>
    <item>
      <title>OpenClaw for SRE: Self-Hosted AI Agents That Actually Respond to Incidents</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 21 Feb 2026 06:27:44 +0000</pubDate>
      <link>https://forem.com/mateenali66/openclaw-for-sre-self-hosted-ai-agents-that-actually-respond-to-incidents-5279</link>
      <guid>https://forem.com/mateenali66/openclaw-for-sre-self-hosted-ai-agents-that-actually-respond-to-incidents-5279</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; OpenClaw is a self-hosted AI agent framework that connects to Slack, Teams, and other channels. For SRE teams, it's a way to build incident response automation that runs entirely on your infrastructure, with custom skills for runbook execution, alert triage, and operational context.&lt;/p&gt;




&lt;h2&gt;
  
  
  The SRE Automation Gap
&lt;/h2&gt;

&lt;p&gt;Every SRE team I've worked with has the same problem: too many alerts, not enough context, and runbooks that exist but don't get followed at 3 AM.&lt;/p&gt;

&lt;p&gt;The typical incident response flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;PagerDuty fires an alert&lt;/li&gt;
&lt;li&gt;On-call engineer wakes up, opens laptop&lt;/li&gt;
&lt;li&gt;Checks Slack for context (is anyone else awake?)&lt;/li&gt;
&lt;li&gt;Opens Grafana, tries to find the relevant dashboard&lt;/li&gt;
&lt;li&gt;Searches Confluence for the runbook&lt;/li&gt;
&lt;li&gt;Realizes the runbook is outdated&lt;/li&gt;
&lt;li&gt;Starts troubleshooting from scratch&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 2 through 6 consume 15 to 30 minutes before any real diagnosis begins. For a P1 incident at scale, that's the difference between a blip and an outage that hits the status page.&lt;/p&gt;

&lt;p&gt;SaaS tools like PagerDuty's AIOps and Rootly have started addressing this with AI-powered incident assistants. They work well, but they require sending your operational data to third-party services. For organizations with strict data residency requirements, that's a non-starter.&lt;/p&gt;

&lt;p&gt;OpenClaw fills that gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  What OpenClaw Actually Is
&lt;/h2&gt;

&lt;p&gt;OpenClaw is an open-source, self-hosted framework for running AI agents across messaging platforms. It launched in late 2025 as a personal AI assistant project and has rapidly grown into something more interesting: a platform for building operational automation.&lt;/p&gt;

&lt;p&gt;The core architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-channel gateway&lt;/strong&gt;: Connects to Slack, Microsoft Teams, Discord, WhatsApp, Telegram. Messages from any channel get normalized into a unified format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM provider abstraction&lt;/strong&gt;: Works with multiple model providers. You bring your own API keys. Switch providers without changing your skills or workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent memory&lt;/strong&gt;: Maintains conversational context across interactions. The agent remembers what happened in the last incident, what commands were run, what the outcome was.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills framework&lt;/strong&gt;: A plugin system that lets you extend the agent with custom capabilities. This is where the SRE value lives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything runs on your infrastructure. Docker Compose for simple setups, Kubernetes for production. Your data stays on your servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why SRE Teams Should Care
&lt;/h2&gt;

&lt;p&gt;The skills framework is what makes OpenClaw interesting for operations work. A "skill" in OpenClaw is essentially a structured capability with defined inputs, outputs, and permissions.&lt;/p&gt;

&lt;p&gt;For SRE, that means you can build skills like:&lt;/p&gt;

&lt;h3&gt;
  
  
  Incident Triage
&lt;/h3&gt;

&lt;p&gt;An agent that automatically pulls context when an alert fires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SKILL.md: incident-triage

Inputs: alert_name, service, severity
Actions:
  1. Query Prometheus for related metrics (last 30 min)
  2. Check recent deployments from deploy tracker
  3. Pull relevant runbook from internal wiki
  4. Summarize findings in incident channel

Permissions: read-only access to Prometheus API, deploy API, wiki API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When PagerDuty fires an alert and posts to Slack, the OpenClaw agent picks it up, runs the triage skill, and drops a summary into the incident channel before the on-call engineer has finished logging in.&lt;/p&gt;
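&lt;p&gt;The &lt;code&gt;prometheus_query.py&lt;/code&gt; helper behind such a skill can stay small. A standard-library sketch; the Prometheus endpoint is an assumption, and the 30-minute window matches the skill definition above:&lt;/p&gt;

```python
import json
import time
import urllib.parse
import urllib.request

def build_range_query(base_url, promql, minutes=30, step="30s"):
    """Construct a Prometheus /api/v1/query_range URL for the last N minutes."""
    end = int(time.time())
    start = end - minutes * 60
    params = urllib.parse.urlencode({
        "query": promql,
        "start": start,
        "end": end,
        "step": step,
    })
    return f"{base_url}/api/v1/query_range?{params}"

def fetch_metrics(base_url, promql, minutes=30):
    """Run the query and return the decoded series (read-only access only)."""
    with urllib.request.urlopen(build_range_query(base_url, promql, minutes)) as resp:
        return json.load(resp)["data"]["result"]
```

&lt;p&gt;The triage skill would call this with queries like &lt;code&gt;rate(http_requests_total{service="checkout"}[5m])&lt;/code&gt; and summarize the result for the incident channel.&lt;/p&gt;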

&lt;h3&gt;
  
  
  Runbook Execution
&lt;/h3&gt;

&lt;p&gt;Instead of linking to a Confluence page that may or may not be current, encode runbooks as executable skills:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SKILL.md: restart-service

Inputs: service_name, environment
Actions:
  1. Verify service exists in target environment
  2. Check current health status
  3. Execute rolling restart via Kubernetes API
  4. Monitor health checks for 5 minutes
  5. Report success/failure to incident channel

Permissions: kubernetes API (limited to restart operations)
Guardrails: requires confirmation for production, auto-approve for staging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The on-call engineer says "restart the payment service in staging" in Slack, and the agent executes the runbook step by step, reporting progress as it goes. No SSH-ing into bastion hosts. No copy-pasting commands from a wiki.&lt;/p&gt;
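&lt;p&gt;The &lt;code&gt;k8s_restart.py&lt;/code&gt; behind that skill doesn't need anything exotic: a rolling restart is just a patch that bumps a pod-template annotation, which is what &lt;code&gt;kubectl rollout restart&lt;/code&gt; does under the hood. A sketch, with the kubectl invocation as an assumed integration point:&lt;/p&gt;

```python
import json
import subprocess
from datetime import datetime, timezone

def build_restart_patch():
    """Patch body that bumps the pod template, triggering a rolling update."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "spec": {
            "template": {
                "metadata": {
                    # Changing this annotation forces new pods, old ones drain
                    "annotations": {"kubectl.kubernetes.io/restartedAt": now}
                }
            }
        }
    }

def rolling_restart(deployment, namespace):
    """Apply the patch via kubectl; needs RBAC scoped to patching deployments."""
    patch = json.dumps(build_restart_patch())
    subprocess.run(
        ["kubectl", "-n", namespace, "patch", "deployment", deployment,
         "--type", "strategic", "-p", patch],
        check=True,
    )
```

&lt;p&gt;Keeping the patch construction separate from the kubectl call makes the dangerous half easy to wrap in the skill's production-confirmation guardrail.&lt;/p&gt;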

&lt;h3&gt;
  
  
  Alert Correlation
&lt;/h3&gt;

&lt;p&gt;Connect the agent to your monitoring stack and let it correlate across signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SKILL.md: correlate-alerts

Inputs: primary_alert
Actions:
  1. Query AlertManager for alerts fired within +/- 5 minutes
  2. Query deployment tracker for recent changes
  3. Check dependent service health
  4. Identify common root cause patterns
  5. Suggest investigation path

Permissions: read-only AlertManager API, deploy tracker, service catalog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of an engineer manually checking five dashboards to figure out why the checkout service is slow, the agent correlates: "Three alerts fired in the last 10 minutes: high latency on checkout, connection pool exhaustion on payments DB, and a deployment to the payments service 12 minutes ago. Likely cause: the payments deploy."&lt;/p&gt;
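&lt;p&gt;The time-window half of that correlation is plain set logic. A minimal sketch; the alert and deploy dicts stand in for whatever your AlertManager and deploy-tracker APIs actually return:&lt;/p&gt;

```python
from datetime import timedelta

def correlate(primary, alerts, deploys, window_minutes=5):
    """Group alerts fired near the primary and flag deploys shortly before it."""
    t0 = primary["fired_at"]
    window = timedelta(minutes=window_minutes)
    related = [a for a in alerts
               if a is not primary and abs(a["fired_at"] - t0) <= window]
    # Deploys in the 15 minutes before the primary alert are prime suspects
    suspects = [d for d in deploys
                if timedelta(0) <= t0 - d["deployed_at"] <= timedelta(minutes=15)]
    return {"related_alerts": related, "suspect_deploys": suspects}
```

&lt;p&gt;The LLM's job is then narration, not arithmetic: it turns this structured output into the "likely cause: the payments deploy" summary.&lt;/p&gt;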

&lt;h2&gt;
  
  
  Setting It Up for SRE
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8911qxmaysp2hcwsyv6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8911qxmaysp2hcwsyv6i.png" alt=" " width="800" height="695"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Deploy the Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml (simplified)&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.8"&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openclaw&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openclaw/openclaw:latest&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./config:/home/openclaw/.openclaw&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./skills:/home/openclaw/skills&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Configure Messaging Channels
&lt;/h3&gt;

&lt;p&gt;Point it at your Slack workspace. The agent appears as a bot user in your incident channels. Teams that use Microsoft Teams or Discord can connect those instead: same agent, different channel.&lt;/p&gt;


&lt;h3&gt;
  
  
  Step 3: Build SRE Skills
&lt;/h3&gt;

&lt;p&gt;Each skill is a directory with a &lt;code&gt;SKILL.md&lt;/code&gt; that defines its behavior and a set of supporting scripts or API integrations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;skills/
├── incident-triage/
│   ├── SKILL.md
│   ├── prometheus_query.py
│   └── deploy_check.py
├── restart-service/
│   ├── SKILL.md
│   └── k8s_restart.py
├── correlate-alerts/
│   ├── SKILL.md
│   └── alertmanager_client.py
└── status-page-update/
    ├── SKILL.md
    └── statuspage_api.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Connect to Your Monitoring Stack
&lt;/h3&gt;

&lt;p&gt;The agent needs read access to your observability tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Integration&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Access Level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus/VictoriaMetrics&lt;/td&gt;
&lt;td&gt;Metrics queries&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AlertManager&lt;/td&gt;
&lt;td&gt;Alert correlation&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes API&lt;/td&gt;
&lt;td&gt;Service health, restarts&lt;/td&gt;
&lt;td&gt;Scoped RBAC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy tracker&lt;/td&gt;
&lt;td&gt;Recent changes&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal wiki&lt;/td&gt;
&lt;td&gt;Runbooks&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StatusPage&lt;/td&gt;
&lt;td&gt;Incident communication&lt;/td&gt;
&lt;td&gt;Write&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Principle of least privilege applies. The agent should have the minimum permissions needed for each skill.&lt;/p&gt;
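&lt;p&gt;On the Kubernetes side, "Scoped RBAC" can be as narrow as a single namespaced Role. A sketch (the name and namespace are illustrative):&lt;/p&gt;

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: openclaw-agent
  namespace: production
rules:
  # Read-only visibility for triage
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["get", "list", "watch"]
  # Just enough write access for the restart-service skill:
  # a rollout restart is a patch on the Deployment
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "patch"]
```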

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;Here's a realistic incident timeline with OpenClaw:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;00:00&lt;/strong&gt; - AlertManager fires: "Checkout latency &amp;gt; 2s for 5 minutes"&lt;br&gt;
&lt;strong&gt;00:01&lt;/strong&gt; - PagerDuty pages on-call, posts to #incident-checkout in Slack&lt;br&gt;
&lt;strong&gt;00:01&lt;/strong&gt; - OpenClaw agent detects the alert, runs incident-triage skill&lt;br&gt;
&lt;strong&gt;00:02&lt;/strong&gt; - Agent posts triage summary:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Incident Triage: checkout-latency-high&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related Alerts (last 10 min):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;payments-db-connection-pool-exhaustion (fired 00:00)&lt;/li&gt;
&lt;li&gt;payments-service-error-rate-high (fired 00:01)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recent Deployments:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;payments-service v2.14.3 deployed 12 min ago by &lt;a class="mentioned-user" href="https://dev.to/sarah"&gt;@sarah&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Relevant Runbook:&lt;/strong&gt; Payments DB Connection Pool&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suggested Action:&lt;/strong&gt; The payments deploy correlates with connection pool exhaustion. Consider rolling back payments-service to v2.14.2.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;00:03&lt;/strong&gt; - On-call engineer logs in, sees the full context already assembled&lt;br&gt;
&lt;strong&gt;00:04&lt;/strong&gt; - Engineer: "rollback payments-service to v2.14.2 in production"&lt;br&gt;
&lt;strong&gt;00:04&lt;/strong&gt; - Agent: "Rolling back payments-service to v2.14.2 in production. This will trigger a rolling update. Confirm? (yes/no)"&lt;br&gt;
&lt;strong&gt;00:04&lt;/strong&gt; - Engineer: "yes"&lt;br&gt;
&lt;strong&gt;00:05&lt;/strong&gt; - Agent executes rollback, monitors health checks&lt;br&gt;
&lt;strong&gt;00:08&lt;/strong&gt; - Agent: "Rollback complete. Checkout latency back to normal (avg 180ms). Payments DB connection pool utilization dropped from 98% to 45%."&lt;/p&gt;

&lt;p&gt;Total time from alert to resolution: 8 minutes. Without the agent, the same incident typically takes 25 to 40 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Guardrails Matter
&lt;/h2&gt;

&lt;p&gt;Letting an AI agent interact with production infrastructure requires guardrails. OpenClaw's skill framework supports this through permission scoping and confirmation gates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production safeguards:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skills that modify production require explicit confirmation&lt;/li&gt;
&lt;li&gt;Read-only skills execute automatically (triage, correlation)&lt;/li&gt;
&lt;li&gt;Write operations go through a confirmation flow in the messaging channel&lt;/li&gt;
&lt;li&gt;All actions are logged with who triggered them and what the agent did&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scope limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each skill declares its required permissions&lt;/li&gt;
&lt;li&gt;Kubernetes RBAC limits what the agent can actually do&lt;/li&gt;
&lt;li&gt;API keys are scoped to specific operations&lt;/li&gt;
&lt;li&gt;No "do anything" root access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a replacement for your incident commander or your on-call engineers. It's a tool that handles the first 5 minutes of context gathering so humans can focus on the hard parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Falls Short
&lt;/h2&gt;

&lt;p&gt;OpenClaw is still young. A few things to be aware of:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill development is manual.&lt;/strong&gt; There's no marketplace or library of pre-built SRE skills. You're building integrations from scratch. If you've built Slack bots or PagerDuty integrations before, the effort is similar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM costs add up.&lt;/strong&gt; Every incident interaction consumes API tokens. For high-alert-volume environments, the cost of LLM calls during incidents needs to be factored into the budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering is real work.&lt;/strong&gt; The quality of the agent's triage and correlation depends heavily on how well the skills are designed. Poorly defined skills produce noisy, unhelpful outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not a replacement for observability.&lt;/strong&gt; The agent is only as good as the data it can access. If your monitoring has gaps, the agent inherits those gaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use It
&lt;/h2&gt;

&lt;p&gt;OpenClaw for SRE makes sense when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your organization has data residency or security requirements that rule out SaaS incident tools&lt;/li&gt;
&lt;li&gt;You already have a solid observability stack (Prometheus, Grafana, AlertManager) and want to add an intelligence layer on top&lt;/li&gt;
&lt;li&gt;Your team has the engineering capacity to build and maintain custom skills&lt;/li&gt;
&lt;li&gt;Incident response time is a critical metric you're trying to improve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn't make sense when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're a small team that can handle alerts manually&lt;/li&gt;
&lt;li&gt;You don't have a mature observability foundation yet (fix that first)&lt;/li&gt;
&lt;li&gt;You want a turnkey solution with no custom development&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>opensource</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>The Hidden Tax on Your Cloud Bill: How Data Transfer Costs Are Silently Draining Your Budget</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Fri, 02 Jan 2026 14:36:49 +0000</pubDate>
      <link>https://forem.com/mateenali66/the-hidden-tax-on-your-cloud-bill-how-data-transfer-costs-are-silently-draining-your-budget-mc1</link>
      <guid>https://forem.com/mateenali66/the-hidden-tax-on-your-cloud-bill-how-data-transfer-costs-are-silently-draining-your-budget-mc1</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Cloud data transfer costs can account for 10-30% of your cloud bill, yet most teams don't understand the pricing until they get shocked by a massive invoice. I break down exactly where these costs hide, compare AWS, GCP, and Azure pricing, and show you how to potentially save 60-80% on egress fees.&lt;/p&gt;




&lt;h2&gt;
  
  
  The $2,657 Overnight Surprise
&lt;/h2&gt;

&lt;p&gt;A developer shared a 13.7 GB file that went viral. His AWS bill jumped from $23 to $2,657 overnight. Every download by every user worldwide was charged at $0.09/GB. No warning, no cap, just a bill.&lt;/p&gt;

&lt;p&gt;This story is more common than you think. And it is why I spent the last month researching cloud data transfer pricing across AWS, GCP, and Azure.&lt;/p&gt;

&lt;p&gt;What I found was eye opening.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Basics: Why Cloud Providers Love Egress
&lt;/h2&gt;

&lt;p&gt;Here is the fundamental asymmetry of cloud pricing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data IN (ingress):&lt;/strong&gt; FREE across all major providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data OUT (egress):&lt;/strong&gt; $0.05 to $0.23 per GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not accidental. Cloud providers want your data to flow in freely. Getting it out? That will cost you. It is often called the "Hotel California" model of cloud computing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current Pricing Comparison (2025)
&lt;/h2&gt;

&lt;p&gt;I verified these numbers across official documentation and third party sources:&lt;/p&gt;

&lt;h3&gt;
  
  
  Egress to Internet (US Regions)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;AWS&lt;/th&gt;
&lt;th&gt;GCP Premium&lt;/th&gt;
&lt;th&gt;Azure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;100 GB/month&lt;/td&gt;
&lt;td&gt;1 GiB&lt;/td&gt;
&lt;td&gt;100 GB/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First 10 TB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.09/GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.12/GiB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.087/GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10-50 TB&lt;/td&gt;
&lt;td&gt;$0.085/GB&lt;/td&gt;
&lt;td&gt;$0.11/GiB&lt;/td&gt;
&lt;td&gt;$0.083/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50-150 TB&lt;/td&gt;
&lt;td&gt;$0.07/GB&lt;/td&gt;
&lt;td&gt;$0.08/GiB&lt;/td&gt;
&lt;td&gt;$0.07/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;150+ TB&lt;/td&gt;
&lt;td&gt;$0.05/GB&lt;/td&gt;
&lt;td&gt;$0.08/GiB&lt;/td&gt;
&lt;td&gt;$0.05/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Quick math:&lt;/strong&gt; 10 TB of monthly egress costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS: $900&lt;/li&gt;
&lt;li&gt;GCP Premium: $1,100&lt;/li&gt;
&lt;li&gt;Azure: $870&lt;/li&gt;
&lt;li&gt;Cloudflare R2: $0 (yes, zero)&lt;/li&gt;
&lt;/ul&gt;
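&lt;p&gt;The tier table translates directly into a small calculator. A sketch using the AWS rates above (note the quick-math figures ignore the 100 GB free tier):&lt;/p&gt;

```python
def aws_egress_cost(gb: float) -> float:
    """Approximate AWS internet egress cost, using the tiered rates above."""
    free_gb = 100  # monthly free tier
    tiers = [      # (tier size in GB, $/GB)
        (10_000, 0.09),    # first 10 TB
        (40_000, 0.085),   # next 40 TB (10-50 TB)
        (100_000, 0.07),   # next 100 TB (50-150 TB)
    ]
    remaining = max(0.0, gb - free_gb)
    cost = 0.0
    for size, rate in tiers:
        billed = min(remaining, size)
        cost += billed * rate
        remaining -= billed
    return cost + remaining * 0.05  # everything past 150 TB

print(round(aws_egress_cost(10_000), 2))  # 891.0 -- close to the article's $900
```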

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcieqn8758pjpkjyzzwz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcieqn8758pjpkjyzzwz.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hidden Costs Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Standard egress is just the tip of the iceberg. Here is where the real money disappears.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. NAT Gateway: The Silent Budget Killer
&lt;/h3&gt;

&lt;p&gt;If you run workloads in private subnets (which you should for security), traffic to the internet goes through a NAT Gateway. The cost?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hourly charge:&lt;/strong&gt; $0.045/hour ($32.85/month per gateway)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data processing:&lt;/strong&gt; $0.045/GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let me break down a real scenario. You have 100 GB going to S3 through a NAT Gateway:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NAT processing&lt;/td&gt;
&lt;td&gt;100 GB x $0.045 = $4.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internet egress&lt;/td&gt;
&lt;td&gt;100 GB x $0.09 = $9.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$13.50&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But here is the thing: &lt;strong&gt;S3 traffic through a VPC Gateway Endpoint is FREE.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One developer at Geocodio documented a "$1,000 AWS mistake" where traffic to AWS services in the same region was routed through NAT Gateway. All of that was avoidable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56f1wjn8tx5ps3ry6slf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56f1wjn8tx5ps3ry6slf.png" alt=" " width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cross-AZ Traffic: Death by a Thousand Cuts
&lt;/h3&gt;

&lt;p&gt;Every time data crosses between Availability Zones, you pay $0.01/GB in each direction. That is $0.02/GB round trip.&lt;/p&gt;

&lt;p&gt;Seems small? Consider this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your app server is in AZ-1&lt;/li&gt;
&lt;li&gt;Your database (Multi-AZ RDS) is in AZ-2&lt;/li&gt;
&lt;li&gt;Every query response crosses zones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A database doing 10 TB of response traffic monthly costs an extra $200 just in cross-AZ fees. Multiply that across all your services.&lt;/p&gt;
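&lt;p&gt;The arithmetic, spelled out (cross-AZ transfer is billed on both the sending and the receiving side):&lt;/p&gt;

```python
RATE_OUT = 0.01  # $/GB charged in the sending AZ
RATE_IN = 0.01   # $/GB charged in the receiving AZ

# 10 TB of DB response traffic (1 TB = 1,000 GB, matching the article's round numbers)
gb = 10 * 1_000

monthly_cost = gb * RATE_OUT + gb * RATE_IN
print(f"${monthly_cost:.2f}/month in cross-AZ fees")  # $200.00/month
```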

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71tgf32b6dw4fk2io9ct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71tgf32b6dw4fk2io9ct.png" alt=" " width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Load Balancer Data Processing
&lt;/h3&gt;

&lt;p&gt;Your Application Load Balancer processes all that traffic. When requests come in on one AZ and targets live in another, you pay twice.&lt;/p&gt;

&lt;p&gt;GCP Load Balancing charges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inbound data processed: $0.008/GiB&lt;/li&gt;
&lt;li&gt;Outbound data processed: $0.008/GiB&lt;/li&gt;
&lt;li&gt;Plus forwarding rules: $0.025/hour (first 5), $0.01/hour each additional&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Public IPv4 Addresses (AWS, 2024)
&lt;/h3&gt;

&lt;p&gt;As of February 2024, AWS charges $0.005/hour for every public IPv4 address. That is $3.60/month per IP, in use or idle.&lt;/p&gt;

&lt;p&gt;10 public IPs sitting there? That is $36/month before you transfer any data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Gets Expensive: Multi-Cloud and Hybrid
&lt;/h2&gt;

&lt;p&gt;Moving data between clouds or to on-premises is where costs really add up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: Over the Internet (Expensive)
&lt;/h3&gt;

&lt;p&gt;AWS to GCP via public internet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS egress: $0.09/GB&lt;/li&gt;
&lt;li&gt;GCP ingress: FREE&lt;/li&gt;
&lt;li&gt;Total: $0.09/GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For 50 TB monthly: &lt;strong&gt;$4,500&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: Dedicated Interconnect (Better Economics)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Port Fee (10 Gbps)&lt;/th&gt;
&lt;th&gt;Data Transfer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS Direct Connect&lt;/td&gt;
&lt;td&gt;$2.25/hour (~$1,643/mo)&lt;/td&gt;
&lt;td&gt;$0.02/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure ExpressRoute&lt;/td&gt;
&lt;td&gt;$3,400/month&lt;/td&gt;
&lt;td&gt;$0.025/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCP Cross-Cloud Interconnect&lt;/td&gt;
&lt;td&gt;$5.60/hour (~$4,032/mo)&lt;/td&gt;
&lt;td&gt;Same as inter-region&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For high-volume transfers, dedicated connections pay for themselves quickly. At 50 TB monthly, Direct Connect's $0.02/GB rate saves ~$3,500 in transfer fees compared to internet egress, more than covering the port fee.&lt;/p&gt;
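&lt;p&gt;A quick break-even check using the table's numbers (Direct Connect port fee plus $0.02/GB, versus $0.09/GB internet egress):&lt;/p&gt;

```python
gb = 50_000  # 50 TB monthly, using 1 TB = 1,000 GB

internet = gb * 0.09                 # plain internet egress
direct_connect = 1_643 + gb * 0.02   # ~monthly 10 Gbps port fee + per-GB transfer

print(round(internet), round(direct_connect))  # 4500 2643
```

Transfer fees alone drop by ~$3,500; net of the port fee, the saving is closer to $1,850/month, and it grows with volume.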

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvncxswfaih2vfiouhh6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvncxswfaih2vfiouhh6u.png" alt=" " width="800" height="1560"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Companies That Solved This
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dropbox: $75 Million Saved
&lt;/h3&gt;

&lt;p&gt;Dropbox was one of S3's largest customers. In 2015-2016, they built their own storage infrastructure called "Magic Pocket" and migrated off AWS.&lt;/p&gt;

&lt;p&gt;The result: &lt;strong&gt;$74.6 million in savings over two years.&lt;/strong&gt; First year alone saved $39.5 million.&lt;/p&gt;

&lt;p&gt;At their scale, owning infrastructure beats renting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basecamp/37signals: $10 Million Over Five Years
&lt;/h3&gt;

&lt;p&gt;In 2023, Basecamp left AWS and Google Cloud. Their results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total projected savings: &lt;strong&gt;$10+ million&lt;/strong&gt; over five years&lt;/li&gt;
&lt;li&gt;Already saving: &lt;strong&gt;$1 million/year&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;S3 exit alone: &lt;strong&gt;$5,000/day&lt;/strong&gt; ($150K/month)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DHH (their founder) wrote extensively about this. They bought ~$600K in hardware and added no new staff. The payback period was less than a year.&lt;/p&gt;

&lt;h3&gt;
  
  
  Netflix: Built Their Own CDN
&lt;/h3&gt;

&lt;p&gt;Netflix does not stream videos out of AWS. The egress costs would be astronomical. Instead, they built Open Connect, their own CDN with appliances placed directly in ISP networks.&lt;/p&gt;

&lt;p&gt;Quote from an industry analyst: "The underlying economics of data transfer does not reflect how the cloud providers price for it. We are still paying 1990s prices for bandwidth when we are in the cloud."&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Reduce Your Egress Costs
&lt;/h2&gt;

&lt;p&gt;Based on my research, here are the highest impact optimizations:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. VPC Gateway Endpoints for S3/DynamoDB (100% Savings)
&lt;/h3&gt;

&lt;p&gt;These are free and route traffic directly to S3/DynamoDB without touching NAT Gateway or the internet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Terraform example&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;service_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"com.amazonaws.${var.region}.s3"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_endpoint_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Gateway"&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. CloudFront for Content Delivery (60-80% Reduction)
&lt;/h3&gt;

&lt;p&gt;Origin fetches from CloudFront to S3 are free, so you only pay CloudFront egress, which is cheaper than direct S3 egress. Request pricing drops too:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Cost per 10K requests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;S3 direct&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudFront&lt;/td&gt;
&lt;td&gt;$0.0075&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plus caching means you serve from edge instead of origin.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Compression Before Transfer (50-80% Reduction)
&lt;/h3&gt;

&lt;p&gt;Compress everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gzip for general purpose&lt;/li&gt;
&lt;li&gt;Brotli for text content (better ratio than Gzip)&lt;/li&gt;
&lt;li&gt;Delta encoding for incremental updates&lt;/li&gt;
&lt;/ul&gt;
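&lt;p&gt;Compression gains are easy to verify locally. A minimal check with Python's standard-library gzip on a repetitive JSON payload:&lt;/p&gt;

```python
import gzip

# Repetitive structured text (logs, JSON API responses) compresses extremely well
payload = ('{"event": "page_view", "user": 12345, "ts": 1700000000}\n' * 1_000).encode()

compressed = gzip.compress(payload)
saved = 1 - len(compressed) / len(payload)
print(f"{len(payload):,} bytes -> {len(compressed):,} bytes ({saved:.0%} saved)")
```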

&lt;h3&gt;
  
  
  4. Same-AZ Deployment (Eliminate Cross-AZ)
&lt;/h3&gt;

&lt;p&gt;If high availability is not critical for a workload, keep everything in one AZ. Same-AZ traffic is free.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Consider Cloudflare R2 for Storage Heavy Workloads
&lt;/h3&gt;

&lt;p&gt;R2 has zero egress fees. For a workload with 10 TB storage and 50 TB monthly egress:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS S3&lt;/td&gt;
&lt;td&gt;$230 (storage) + $4,500 (egress) = $4,730&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare R2&lt;/td&gt;
&lt;td&gt;$150 (storage only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is a 97% reduction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvcryfhot09djid1agno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvcryfhot09djid1agno.png" alt=" " width="800" height="1003"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Monitoring and Alerting
&lt;/h2&gt;

&lt;p&gt;You cannot optimize what you do not measure. Set up:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AWS Cost Explorer&lt;/strong&gt; with daily data transfer breakdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch alarms&lt;/strong&gt; on NAT Gateway bytes processed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget alerts&lt;/strong&gt; specifically for data transfer line items&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPC Flow Logs&lt;/strong&gt; to understand traffic patterns (but watch the logging costs)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check your data transfer costs (AWS CLI)&lt;/span&gt;
aws ce get-cost-and-usage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--time-period&lt;/span&gt; &lt;span class="nv"&gt;Start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2024-01-01,End&lt;span class="o"&gt;=&lt;/span&gt;2024-01-31 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--granularity&lt;/span&gt; MONTHLY &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metrics&lt;/span&gt; &lt;span class="s2"&gt;"UnblendedCost"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s1"&gt;'{"Dimensions":{"Key":"USAGE_TYPE_GROUP","Values":["EC2: Data Transfer - Internet (Out)","EC2: Data Transfer - Region to Region (Out)"]}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Regulatory Push
&lt;/h2&gt;

&lt;p&gt;The European Data Act (effective September 2025) is forcing cloud providers toward transparent pricing and easier data portability. All three major providers now offer egress fee waivers for complete cloud departures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS:&lt;/strong&gt; Waiver upon account team approval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GCP:&lt;/strong&gt; Waiver for full migration off platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure:&lt;/strong&gt; 100GB credits for 60 days&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The catch? These only apply to complete departures, not ongoing multi-cloud operations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingress is free, egress is not.&lt;/strong&gt; Plan your architecture with data gravity in mind.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;NAT Gateway is the biggest hidden cost.&lt;/strong&gt; Use VPC Gateway Endpoints for S3/DynamoDB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-AZ traffic adds up.&lt;/strong&gt; $0.01/GB each way on every hop.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;At scale, consider alternatives.&lt;/strong&gt; Dropbox saved $75M, Basecamp saves $1M/year.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CDN and compression are low-hanging fruit.&lt;/strong&gt; 60-80% reduction for content delivery.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloudflare R2 has zero egress.&lt;/strong&gt; Seriously consider it for storage-heavy workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor proactively.&lt;/strong&gt; One viral file can turn a $23 bill into $2,657 overnight.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/ec2/pricing/on-demand/" rel="noopener noreferrer"&gt;AWS Data Transfer Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vpc/network-pricing" rel="noopener noreferrer"&gt;GCP Network Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://azure.microsoft.com/en-us/pricing/details/bandwidth/" rel="noopener noreferrer"&gt;Azure Bandwidth Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://basecamp.com/cloud-exit" rel="noopener noreferrer"&gt;Basecamp Cloud Exit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.cloudflare.com/r2/pricing/" rel="noopener noreferrer"&gt;Cloudflare R2 Pricing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;What is the worst data transfer bill you have received? I would love to hear your horror stories and optimization wins in the comments.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>gcp</category>
      <category>cloud</category>
      <category>azure</category>
    </item>
    <item>
      <title>Never Commit Secrets Again: Generate .env Files from AWS Secrets Manager</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Fri, 12 Dec 2025 17:49:13 +0000</pubDate>
      <link>https://forem.com/mateenali66/never-commit-secrets-again-generate-env-files-from-aws-secrets-manager-46f4</link>
      <guid>https://forem.com/mateenali66/never-commit-secrets-again-generate-env-files-from-aws-secrets-manager-46f4</guid>
<description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Store secrets in AWS Secrets Manager. Generate .env files on demand with a Python script. Never commit credentials again.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Every team commits secrets eventually. GitHub detected over 12 million exposed credentials last year through their secret scanning.&lt;/p&gt;

&lt;p&gt;The usual approaches all have failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;.gitignore&lt;/strong&gt; fails when developers forget to add it, or clone fresh and ask for the file via Slack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOPS encryption&lt;/strong&gt; still puts files in git, adds key management overhead, and creates merge conflict nightmares&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;.env.example&lt;/strong&gt; templates get stale and require manual copying&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed something better: secrets that live outside the repository entirely, with a frictionless developer experience.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;Secrets live in AWS Secrets Manager. Developers run one command to generate their .env file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make &lt;span class="nb"&gt;env&lt;/span&gt;
&lt;span class="c"&gt;# .env is generated locally, ready to use&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The file is gitignored. It never touches version control. When secrets change in AWS, developers regenerate and get the latest values.&lt;/p&gt;
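&lt;p&gt;The &lt;code&gt;make env&lt;/code&gt; target is just a thin wrapper. A minimal sketch, assuming the script is saved as &lt;code&gt;generate_env.py&lt;/code&gt;:&lt;/p&gt;

```makefile
.PHONY: env env-prod

env:
	python3 generate_env.py dev

env-prod:
	python3 generate_env.py prod --force
```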

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hw0h6s61fxaz0puyn5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hw0h6s61fxaz0puyn5i.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Organize Secrets in AWS
&lt;/h3&gt;

&lt;p&gt;Structure your secrets by application and environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/myapp/dev/database      → {"DB_HOST": "...", "DB_PASSWORD": "..."}
/myapp/dev/api-keys      → {"STRIPE_KEY": "...", "SENDGRID_KEY": "..."}
/myapp/prod/database     → {"DB_HOST": "...", "DB_PASSWORD": "..."}
/myapp/prod/api-keys     → {"STRIPE_KEY": "...", "SENDGRID_KEY": "..."}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create secrets using AWS CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws secretsmanager create-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; /myapp/dev/database &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-string&lt;/span&gt; &lt;span class="s1"&gt;'{"DB_HOST":"localhost","DB_PASSWORD":"devpass123"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. The Python Script
&lt;/h3&gt;

&lt;p&gt;Here's the full script that generates .env files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Generate .env file from AWS Secrets Manager.

Usage:
    python generate_env.py dev
    python generate_env.py prod --force
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NoCredentialsError&lt;/span&gt;

&lt;span class="c1"&gt;# Configuration
&lt;/span&gt;&lt;span class="n"&gt;APP_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myapp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;AWS_REGION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENV_FILE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;SECRET_KEYS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api-keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;third-party&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_secret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch a secret from AWS Secrets Manager.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secretsmanager&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_secret_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SecretId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecretString&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ResourceNotFoundException&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Warning: Secret &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_aws_credentials&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if AWS credentials are configured.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;identity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_caller_identity&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authenticated as: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Arn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;NoCredentialsError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: AWS credentials not found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Fix with one of:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  1. aws configure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  2. Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  3. Use IAM role (if on AWS)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_all_secrets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch all secrets for the environment.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;all_secrets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;SECRET_KEYS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;secret_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;APP_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Fetching: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;all_secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;get_secret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;all_secrets&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_env_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate .env content from secrets.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;# Auto-generated from AWS Secrets Manager&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;# DO NOT COMMIT THIS FILE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--force&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ENV_FILE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generating .env for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;validate_aws_credentials&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;secrets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_all_secrets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AWS_REGION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Error: No secrets found at /&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;APP_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; secret values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_env_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;force&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; exists. Overwrite? [y/N]: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generated: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
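&lt;p&gt;To sanity-check the quoting rule in isolation, &lt;code&gt;generate_env_content&lt;/code&gt; can be exercised without any AWS calls. Here it is reproduced standalone (same logic as above): values containing spaces get wrapped in double quotes so the file stays parseable by dotenv-style loaders.&lt;/p&gt;

```python
# Standalone copy of generate_env_content for a quick local check.
# Values containing spaces are wrapped in double quotes.
def generate_env_content(secrets):
    lines = [
        "# Auto-generated from AWS Secrets Manager",
        "# DO NOT COMMIT THIS FILE",
        "",
    ]
    for key, value in sorted(secrets.items()):
        if isinstance(value, str) and " " in value:
            value = f'"{value}"'
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"

content = generate_env_content({
    "DATABASE_URL": "postgres://localhost:5432/app",
    "WELCOME_MESSAGE": "hello world",
})
print(content)
```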



&lt;h3&gt;
  
  
  3. Shell Wrapper and Makefile
&lt;/h3&gt;

&lt;p&gt;Create a shell wrapper for convenience:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# generate-env.sh&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;span class="nv"&gt;ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;dev&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import boto3"&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;pip3 &lt;span class="nb"&gt;install &lt;/span&gt;boto3 &lt;span class="nt"&gt;--quiet&lt;/span&gt;
&lt;span class="k"&gt;fi

&lt;/span&gt;python3 &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;dirname&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/generate_env.py"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENV&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="p"&gt;@&lt;/span&gt;:2&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add Makefile targets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;.PHONY&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;env env-dev env-prod&lt;/span&gt;

&lt;span class="nl"&gt;env&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;./scripts/generate-env.sh dev

&lt;span class="nl"&gt;env-dev&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;./scripts/generate-env.sh dev

&lt;span class="nl"&gt;env-prod&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;./scripts/generate-env.sh prod

&lt;span class="nl"&gt;env-dry&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;./scripts/generate-env.sh dev &lt;span class="nt"&gt;--dry-run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
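&lt;p&gt;One caveat: the &lt;code&gt;env-dry&lt;/code&gt; target passes &lt;code&gt;--dry-run&lt;/code&gt;, which the script as written does not define. A minimal sketch of wiring it into the existing argparse setup (the flag and its behavior are an assumed extension, not part of the script above):&lt;/p&gt;

```python
import argparse

# Same parser as in main(), plus a hypothetical --dry-run flag.
parser = argparse.ArgumentParser()
parser.add_argument("environment", choices=["dev", "staging", "prod"])
parser.add_argument("-f", "--force", action="store_true")
parser.add_argument("-o", "--output", default=".env")
parser.add_argument(
    "--dry-run",
    action="store_true",
    help="Print generated content to stdout instead of writing the file",
)

# Simulate: python generate_env.py dev --dry-run
args = parser.parse_args(["dev", "--dry-run"])
print(args.dry_run)  # prints: True
```

&lt;p&gt;In &lt;code&gt;main()&lt;/code&gt;, check &lt;code&gt;args.dry_run&lt;/code&gt; right after generating the content and return before the overwrite prompt and &lt;code&gt;write_text&lt;/code&gt; call.&lt;/p&gt;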



&lt;h3&gt;
  
  
  4. GitHub Actions with OIDC
&lt;/h3&gt;

&lt;p&gt;No stored credentials needed. Use OIDC to assume an AWS role:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS Credentials&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/github-actions&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate .env&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;pip install boto3&lt;/span&gt;
          &lt;span class="s"&gt;python scripts/generate_env.py prod --force&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;# Your deployment commands&lt;/span&gt;
          &lt;span class="s"&gt;echo "Deploying..."&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cleanup&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always()&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rm -f .env&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
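&lt;p&gt;For the role assumption to work, the IAM role's trust policy must allow GitHub's OIDC provider. A sketch of that policy (the account ID, org, and repo are placeholders; scope the &lt;code&gt;sub&lt;/code&gt; condition as tightly as your workflows allow):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
        }
      }
    }
  ]
}
```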



&lt;h3&gt;
  
  
  5. GitLab CI
&lt;/h3&gt;

&lt;p&gt;Same pattern with GitLab's OIDC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pip install boto3&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;export $(printf "AWS_ACCESS_KEY_ID=%s AWS_SECRET_ACCESS_KEY=%s AWS_SESSION_TOKEN=%s"&lt;/span&gt;
      &lt;span class="s"&gt;$(aws sts assume-role-with-web-identity&lt;/span&gt;
      &lt;span class="s"&gt;--role-arn ${AWS_ROLE_ARN}&lt;/span&gt;
      &lt;span class="s"&gt;--role-session-name "gitlab-${CI_PIPELINE_ID}"&lt;/span&gt;
      &lt;span class="s"&gt;--web-identity-token ${CI_JOB_JWT_V2}&lt;/span&gt;
      &lt;span class="s"&gt;--query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]'&lt;/span&gt;
      &lt;span class="s"&gt;--output text))&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;python scripts/generate_env.py prod --force&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "Deploying..."&lt;/span&gt;
  &lt;span class="na"&gt;after_script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;rm -f .env&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
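&lt;p&gt;Note that &lt;code&gt;CI_JOB_JWT_V2&lt;/code&gt; is deprecated on recent GitLab releases; jobs should request an explicit ID token instead. A sketch of the replacement (the token name and audience are illustrative):&lt;/p&gt;

```yaml
deploy:
  id_tokens:
    AWS_OIDC_TOKEN:
      aud: https://sts.amazonaws.com
  script:
    # Pass the requested token in place of CI_JOB_JWT_V2
    - aws sts assume-role-with-web-identity --web-identity-token ${AWS_OIDC_TOKEN} --role-arn ${AWS_ROLE_ARN} --role-session-name "gitlab-${CI_PIPELINE_ID}"
```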



&lt;h2&gt;
  
  
  IAM Permissions
&lt;/h2&gt;

&lt;p&gt;Developers need read access to their environment's secrets. The trailing wildcard matters: it matches the random suffix Secrets Manager appends to every secret ARN.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"secretsmanager:GetSecretValue"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:secretsmanager:us-east-1:*:secret:/myapp/dev/*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
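&lt;p&gt;With that policy attached, a developer can sanity-check their access straight from the CLI. The secret name below is illustrative; substitute one that actually exists under your &lt;code&gt;/myapp/dev/&lt;/code&gt; prefix:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws secretsmanager get-secret-value \
  --secret-id /myapp/dev/DATABASE_URL \
  --query SecretString \
  --output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An &lt;code&gt;AccessDeniedException&lt;/code&gt; here against a &lt;code&gt;/myapp/prod/&lt;/code&gt; path is exactly what you want to see.&lt;/p&gt;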



&lt;p&gt;CI/CD roles, by contrast, are the only principals with access to prod secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"secretsmanager:GetSecretValue"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:secretsmanager:us-east-1:*:secret:/myapp/prod/*"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  OIDC Setup for GitHub Actions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Create the OIDC provider in AWS:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws iam create-open-id-connect-provider &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; https://token.actions.githubusercontent.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--client-id-list&lt;/span&gt; sts.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Create the trust policy:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Federated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:aud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringLike"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"repo:your-org/your-repo:*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
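&lt;p&gt;To finish the setup (not shown above), create the role using that trust policy and attach the prod secrets policy to it. The role, policy, and file names here are placeholders; adjust them to your naming convention:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws iam create-role \
  --role-name github-actions-deploy \
  --assume-role-policy-document file://trust-policy.json

aws iam put-role-policy \
  --role-name github-actions-deploy \
  --policy-name prod-secrets-read \
  --policy-document file://prod-secrets-policy.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;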



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After implementing this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Secrets in git&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rotation time&lt;/td&gt;
&lt;td&gt;2 hours&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;45 min&lt;/td&gt;
&lt;td&gt;10 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slack secret sharing&lt;/td&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;td&gt;Never&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Repository Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;myapp/
├── scripts/
│   ├── generate_env.py
│   └── generate-env.sh
├── .github/
│   └── workflows/
│       └── deploy.yml
├── .gitignore          # includes .env
├── .env.example        # dummy values for reference
├── Makefile
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The full code is available at &lt;a href="https://github.com/mateenali66/secrets-env-generator" rel="noopener noreferrer"&gt;github.com/mateenali66/secrets-env-generator&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Clone it, configure your AWS credentials, create some test secrets, and run &lt;code&gt;make env&lt;/code&gt;.&lt;/p&gt;
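&lt;p&gt;If you're wiring this into your own repo, the &lt;code&gt;make env&lt;/code&gt; target can be as simple as the sketch below. The target name and default environment are assumptions on my part; check the repo's actual Makefile:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENV ?= dev

.PHONY: env
env:
	python scripts/generate_env.py $(ENV)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then &lt;code&gt;make env ENV=prod&lt;/code&gt; targets a different environment without editing anything.&lt;/p&gt;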




&lt;p&gt;Questions? Contact me at &lt;a href="https://mateen.tech" rel="noopener noreferrer"&gt;mateen.tech&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>secret</category>
      <category>dotenv</category>
    </item>
    <item>
      <title>AWS DevOps Agent: What AWS Isn't Telling You (And Why Your Job Is Safe)</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Tue, 09 Dec 2025 07:25:49 +0000</pubDate>
      <link>https://forem.com/mateenali66/aws-devops-agent-what-aws-isnt-telling-you-and-why-your-job-is-safe-2kdi</link>
      <guid>https://forem.com/mateenali66/aws-devops-agent-what-aws-isnt-telling-you-and-why-your-job-is-safe-2kdi</guid>
      <description>&lt;p&gt;AWS announced DevOps Agent at re:Invent 2025, calling it a "frontier agent that acts as an experienced DevOps engineer." The marketing promises autonomous incident investigation, root cause analysis, and proactive prevention.&lt;/p&gt;

&lt;p&gt;I spent the past week digging into the documentation, testing the preview, and analyzing what AWS carefully avoided mentioning. Here's what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;AWS DevOps Agent is a powerful diagnostic assistant, not an autonomous operator. It can investigate incidents and suggest fixes, but it cannot execute them. The preview is free, but AWS hasn't disclosed GA pricing. Your job is safe because someone still needs to actually fix things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DevOps Agent Actually Does
&lt;/h2&gt;

&lt;p&gt;Think of it as a 24/7 on-call engineer that never sleeps, never gets tired, and never forgets to check the logs. When an alert fires at 2 AM, it immediately starts investigating.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2piw1zrau48a53g70tr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2piw1zrau48a53g70tr2.png" alt="Investigation Flow" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Incident Investigation&lt;/td&gt;
&lt;td&gt;Correlates metrics, logs, traces, and code changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Root Cause Analysis&lt;/td&gt;
&lt;td&gt;Identifies probable cause using topology understanding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mitigation Plans&lt;/td&gt;
&lt;td&gt;Suggests steps to fix with rollback guidance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prevention Analysis&lt;/td&gt;
&lt;td&gt;Analyzes historical incidents to prevent recurrence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stakeholder Updates&lt;/td&gt;
&lt;td&gt;Posts findings to Slack channels and tickets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What makes it "frontier":&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS calls this a frontier agent because it can run autonomously for hours or days. It doesn't need you to guide it step by step. Give it an alert, and it figures out what to investigate, which logs to pull, which deployments to check.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Integration Ecosystem
&lt;/h2&gt;

&lt;p&gt;This is where DevOps Agent gets interesting. It's not locked into AWS services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqv43hu6m8gcpn8m0jd03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqv43hu6m8gcpn8m0jd03.png" alt="Integration Ecosystem" width="800" height="823"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in integrations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observability: CloudWatch, Datadog, Dynatrace, New Relic, Splunk&lt;/li&gt;
&lt;li&gt;CI/CD: GitHub Actions, GitLab CI/CD&lt;/li&gt;
&lt;li&gt;Ticketing: ServiceNow (native), PagerDuty (webhook)&lt;/li&gt;
&lt;li&gt;Collaboration: Slack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The MCP wildcard:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Through Model Context Protocol servers, you can connect anything: Prometheus, Grafana, custom internal tools, proprietary systems. This is the underrated feature. While competitors lock you into their ecosystem, AWS lets you bring your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AWS Isn't Telling You: Pricing
&lt;/h2&gt;

&lt;p&gt;Here's where it gets murky.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwuuzqu3o7n5hhunfbppa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwuuzqu3o7n5hhunfbppa.png" alt="Pricing Knows vs Unknowns" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preview limits (documented):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;20 incident resolution hours per month&lt;/li&gt;
&lt;li&gt;10 incident prevention hours per month&lt;/li&gt;
&lt;li&gt;1,000 chat messages per month&lt;/li&gt;
&lt;li&gt;10 Agent Spaces maximum&lt;/li&gt;
&lt;li&gt;3 concurrent investigations&lt;/li&gt;
&lt;li&gt;1 concurrent prevention task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GA pricing (unknown):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per hour? Per investigation? Per seat? Per account?&lt;/li&gt;
&lt;li&gt;Third-party tool API costs passed through?&lt;/li&gt;
&lt;li&gt;Bedrock model usage fees?&lt;/li&gt;
&lt;li&gt;Multi-region pricing?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hidden costs during preview:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Queries and API calls made to other AWS and non-AWS services may generate charges from those services."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Translation: Your CloudWatch and X-Ray bills might increase. If Datadog charges per query, those costs are on you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My speculation on GA pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Based on Bedrock pricing and similar services, expect something like $50-150 per investigation hour. A complex incident taking 4 hours of agent time could cost $200-600. For organizations with frequent incidents, this adds up quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DevOps Agent Cannot Do
&lt;/h2&gt;

&lt;p&gt;This is the part AWS marketing glosses over.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w3qqw092ul5bvxhp9li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w3qqw092ul5bvxhp9li.png" alt="Can vs Cannot" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It cannot:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute fixes (only recommends)&lt;/li&gt;
&lt;li&gt;Deploy code changes&lt;/li&gt;
&lt;li&gt;Modify infrastructure&lt;/li&gt;
&lt;li&gt;Make policy decisions&lt;/li&gt;
&lt;li&gt;Handle unprecedented situations&lt;/li&gt;
&lt;li&gt;Operate autonomously in regulated industries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The regulatory reality:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For healthcare, finance, or any regulated industry, DevOps Agent is a diagnostic assistant. It cannot be an autonomous operator. Compliance requires human decision-making for changes. This alone disqualifies the "replacement" narrative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Your Job Is Safe
&lt;/h2&gt;

&lt;p&gt;I've seen the LinkedIn panic. "AI is coming for DevOps jobs!" Let me explain why that's wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib3p7k76n0f0zjatl3dg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib3p7k76n0f0zjatl3dg.png" alt="Human Agent Collaboration" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects and monitors&lt;/li&gt;
&lt;li&gt;Investigates and correlates&lt;/li&gt;
&lt;li&gt;Reports and recommends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you still do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement the actual fix&lt;/li&gt;
&lt;li&gt;Deploy changes to production&lt;/li&gt;
&lt;li&gt;Verify the fix worked&lt;/li&gt;
&lt;li&gt;Make architectural decisions&lt;/li&gt;
&lt;li&gt;Handle the weird edge cases&lt;/li&gt;
&lt;li&gt;Build new infrastructure&lt;/li&gt;
&lt;li&gt;Coordinate across teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Commonwealth Bank example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS cited that Commonwealth Bank found a root cause in under 15 minutes using DevOps Agent, versus hours manually. Notice what they didn't say: the agent fixed it. An engineer still had to implement the solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DevOps Agent doesn't reduce headcount. It reduces MTTR.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your value isn't in correlating logs. Your value is in knowing what to do with that information. The agent accelerates the boring parts so you can focus on the interesting ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Spaces: The Security Model
&lt;/h2&gt;

&lt;p&gt;One thing AWS got right is the security boundary model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each Agent Space:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has its own dedicated IAM role&lt;/li&gt;
&lt;li&gt;Defines exactly which accounts it can access&lt;/li&gt;
&lt;li&gt;Controls which tools are connected&lt;/li&gt;
&lt;li&gt;Isolates data from other Agent Spaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Access patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Admins configure via AWS Console&lt;/li&gt;
&lt;li&gt;Operators interact via standalone web app&lt;/li&gt;
&lt;li&gt;IAM Identity Center or direct IAM authentication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resource discovery:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudFormation stacks (including CDK) are auto-discovered&lt;/li&gt;
&lt;li&gt;Terraform and console resources need tags&lt;/li&gt;
&lt;li&gt;No tags = invisible to the agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your infrastructure is a mess of untagged resources, DevOps Agent won't help much. This is actually a feature: it forces infrastructure hygiene.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Limitations I Found
&lt;/h2&gt;

&lt;p&gt;Testing revealed issues AWS documentation doesn't highlight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigation accuracy varies:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One tester reported that when two related alarms fired roughly 40 minutes apart, the agent couldn't connect them to find the root cause and required a re-run. The agent isn't infallible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;English only:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No multilingual support. If your team operates in other languages, this limits adoption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;US East only (for now):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent runs in us-east-1, though it can monitor resources in any region. Multi-region redundancy isn't available during preview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context dependency:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent's effectiveness directly correlates to how well you've connected tools and tagged resources. Garbage in, garbage out still applies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigation gaps feature:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To its credit, DevOps Agent explicitly shows "Investigation Gaps": things it couldn't analyze due to missing logs, absent SSH access, or incomplete telemetry. This transparency is valuable, but it also confirms the limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Maximize Effectiveness
&lt;/h2&gt;

&lt;p&gt;If you're going to use this, do it right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Connect everything:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't just connect CloudWatch. Add your GitHub repos so it can correlate deployments. Connect Slack so it updates your incident channel. Add Datadog or whatever you use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Use MCP for custom tools:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Have internal tools? Build an MCP server. The protocol is open and documented. This is how you get real value.&lt;/p&gt;
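&lt;p&gt;As a rough sketch of what that looks like with the official MCP Python SDK, here's a minimal server exposing one tool. The tool name and the stubbed lookup logic are hypothetical stand-ins for whatever your internal system actually provides:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-tools")

@mcp.tool()
def get_service_status(service: str) -&gt; str:
    """Return the current status of an internal service (stub)."""
    # Replace with a real call into your inventory or status system.
    return f"{service}: healthy"

if __name__ == "__main__":
    mcp.run()  # serves MCP over stdio by default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;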

&lt;p&gt;&lt;strong&gt;3. Tag your resources:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If it's not in CloudFormation, tag it. Use consistent key-value pairs across your infrastructure.&lt;/p&gt;
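&lt;p&gt;For resources created outside CloudFormation, the Resource Groups Tagging API can apply tags in bulk. The ARN and tag keys below are examples; align them with whatever convention your team standardizes on:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws resourcegroupstaggingapi tag-resources \
  --resource-arn-list arn:aws:lambda:us-east-1:123456789012:function:checkout \
  --tags team=payments,application=checkout,environment=prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;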

&lt;p&gt;&lt;strong&gt;4. Create runbooks:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DevOps Agent supports runbooks as "pre-loaded guidance." Create them for your common incident patterns. This gives the agent hints about where to look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Start with one Agent Space:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't create 10 spaces immediately. Start with one team or application, learn the patterns, then expand.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;AWS DevOps Agent is genuinely useful. It's not a gimmick. The ability to have something correlating data across 5 different tools at 3 AM while you sleep is valuable.&lt;/p&gt;

&lt;p&gt;But it's not magic. It's not replacing anyone. It's a sophisticated diagnostic tool that still requires human judgment to act on its findings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have frequent incidents and high MTTR&lt;/li&gt;
&lt;li&gt;Your observability tools are already well-integrated&lt;/li&gt;
&lt;li&gt;You want to reduce on-call burden (not headcount)&lt;/li&gt;
&lt;li&gt;You're willing to invest in proper tagging and MCP setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip it if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're in a heavily regulated industry requiring human approval for all changes&lt;/li&gt;
&lt;li&gt;Your infrastructure is poorly documented&lt;/li&gt;
&lt;li&gt;You expect it to fix things, not just find them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The preview is free. Try it. But go in with realistic expectations.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your take on AI agents in DevOps? Have you tested the preview? Drop a comment below.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/" rel="noopener noreferrer"&gt;AWS DevOps Agent Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/aws-devops-agent-helps-you-accelerate-incident-response-and-improve-system-reliability-preview/" rel="noopener noreferrer"&gt;AWS Blog Announcement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/userguide-public-preview-pricing-and-limits.html" rel="noopener noreferrer"&gt;Preview Pricing &amp;amp; Limits&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>ai</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
