<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mateen Anjum</title>
    <description>The latest articles on Forem by Mateen Anjum (@mateenali66).</description>
    <link>https://forem.com/mateenali66</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3644604%2F79a9c96c-74eb-4675-9e33-f32d208b4d1b.jpg</url>
      <title>Forem: Mateen Anjum</title>
      <link>https://forem.com/mateenali66</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mateenali66"/>
    <language>en</language>
    <item>
      <title>Kubernetes v1.36 Drops April 22: What Platform Engineers Actually Need to Know</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 18 Apr 2026 04:54:58 +0000</pubDate>
      <link>https://forem.com/mateenali66/kubernetes-v136-drops-april-22-what-platform-engineers-actually-need-to-know-3l81</link>
      <guid>https://forem.com/mateenali66/kubernetes-v136-drops-april-22-what-platform-engineers-actually-need-to-know-3l81</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Kubernetes v1.36 releases April 22, 2026. The headline features are DRA GPU partitioning, workload-aware preemption for AI/ML jobs, and the permanent removal of the gitRepo volume plugin. Ingress-nginx is also officially retired. If you run AI inference workloads or care about cluster security, this release is not optional reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Release Matters More Than Most
&lt;/h2&gt;

&lt;p&gt;The CNCF's 2025 annual survey dropped a number that stopped a lot of people mid-scroll: 66% of organizations hosting generative AI models now use Kubernetes for some or all of their inference workloads. That's not a trend, that's a fait accompli. Kubernetes is the AI compute substrate whether you planned for it or not.&lt;/p&gt;

&lt;p&gt;v1.36 is the release that leans into that reality. The bulk of the new work is in Dynamic Resource Allocation (DRA), gang scheduling, and topology-aware placement, all of which exist because running distributed AI/ML jobs on Kubernetes has historically been painful. This release makes it less painful.&lt;/p&gt;

&lt;p&gt;But there are also breaking changes and security fixes that affect everyone, not just the ML crowd. Let me walk through what actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Breaking Changes First
&lt;/h2&gt;

&lt;h3&gt;
  
  
  gitRepo Volume Plugin: Gone for Good
&lt;/h3&gt;

&lt;p&gt;If you're still using &lt;code&gt;gitRepo&lt;/code&gt; volumes, stop reading and go fix that right now. The plugin has been deprecated since v1.11 and is now permanently disabled in v1.36. No feature flag, no workaround.&lt;/p&gt;

&lt;p&gt;The reason it's gone is serious: gitRepo allowed attackers to run code as root on the node. It was a known attack vector for years. The right replacement is an init container running &lt;code&gt;git clone&lt;/code&gt;, or a git-sync sidecar. Both are well-documented and production-proven.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (broken in v1.36)&lt;/span&gt;
&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;code&lt;/span&gt;
    &lt;span class="na"&gt;gitRepo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/example/repo"&lt;/span&gt;
      &lt;span class="na"&gt;revision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main"&lt;/span&gt;

&lt;span class="c1"&gt;# After: use an init container&lt;/span&gt;
&lt;span class="na"&gt;initContainers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;git-sync&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/git-sync/git-sync:v4.2.1&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--repo=https://github.com/example/repo&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--branch=main&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--root=/git&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--one-time&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;code&lt;/span&gt;
        &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/git&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Ingress-NGINX Is Retired
&lt;/h3&gt;

&lt;p&gt;SIG Network and the Security Response Committee retired ingress-nginx on March 24, 2026. No more releases, no more security patches. Existing deployments keep running, but you're on your own for CVEs from here.&lt;/p&gt;

&lt;p&gt;The community's recommended alternatives are Envoy Gateway (CNCF graduated), Cilium Gateway API, and Traefik. If you're on ingress-nginx in production, this is your migration window. Don't wait for the next CVE to force your hand.&lt;/p&gt;

&lt;h3&gt;
  
  
  service.spec.externalIPs Deprecated
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;externalIPs&lt;/code&gt; field in Service specs is being deprecated (full removal planned for v1.43). It's been a known vector for man-in-the-middle attacks since CVE-2020-8554. You'll see deprecation warnings starting in v1.36. Migrate to LoadBalancer services, NodePort, or Gateway API.&lt;/p&gt;
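
&lt;p&gt;For reference, here's the shape of that migration. This is a minimal sketch with a hypothetical &lt;code&gt;my-app&lt;/code&gt; Service; swap in Gateway API instead if that's your direction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Deprecated pattern: warnings start in v1.36, removal planned for v1.43
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ports:
    - port: 80
  externalIPs:
    - 203.0.113.10

---
# One replacement: let the cloud provider assign the external address
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;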

&lt;h2&gt;
  
  
  The AI/ML Features That Actually Change How You Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DRA: Partitionable Devices (Beta)
&lt;/h3&gt;

&lt;p&gt;This is the one I'm most excited about. v1.36 promotes DRA support for partitionable devices to beta, meaning it's enabled by default. A single GPU can now be split into multiple logical units and allocated to different workloads.&lt;/p&gt;

&lt;p&gt;Before this, if you had an H100 and a workload that only needed 20% of it, you either wasted 80% or ran a separate MIG configuration outside Kubernetes. Now the scheduler handles it natively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;resource.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ResourceClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;partial-gpu&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-slice&lt;/span&gt;
      &lt;span class="na"&gt;deviceClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia.com/gpu&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="c1"&gt;# Request a partition, not the whole device&lt;/span&gt;
      &lt;span class="na"&gt;selectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;device.attributes["nvidia.com/gpu"].partitionable == &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For platform teams running shared GPU clusters, this is a significant cost lever. You can pack more inference workloads onto the same hardware without sacrificing isolation.&lt;/p&gt;
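
&lt;p&gt;Wiring a claim into a workload follows the standard DRA pattern. This sketch references the &lt;code&gt;partial-gpu&lt;/code&gt; claim from the example above; the image name is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  resourceClaims:
    - name: gpu
      resourceClaimName: partial-gpu  # the ResourceClaim defined above
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest  # illustrative image
      resources:
        claims:
          - name: gpu  # this container consumes the GPU slice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;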

&lt;h3&gt;
  
  
  Workload-Aware Preemption (Alpha)
&lt;/h3&gt;

&lt;p&gt;Standard Kubernetes preemption works pod-by-pod. For distributed AI/ML jobs, that's a disaster: preempt one pod from a training job and the whole job stalls, wasting all the resources it's still holding.&lt;/p&gt;

&lt;p&gt;v1.36 introduces workload-aware preemption via &lt;code&gt;PodGroups&lt;/code&gt;. The scheduler now treats a group of related pods as a single entity. When it needs to make room for a high-priority job, it preempts entire groups rather than individual pods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scheduling.k8s.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodGroup&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;training-job-a&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minMember&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;priorityClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high-priority&lt;/span&gt;
  &lt;span class="na"&gt;gangSchedulingPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;disruptionMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodGroup&lt;/span&gt;  &lt;span class="c1"&gt;# preempt the whole group, not individual pods&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is alpha, so it's off by default. But if you're running Kueue or JobSet for batch AI workloads, this is worth enabling in a test cluster now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pod-Level Resource Managers (Alpha)
&lt;/h3&gt;

&lt;p&gt;For HPC and AI/ML workloads, NUMA alignment matters. Previously, the Topology Manager only worked at the container level. If you had a training container plus logging and monitoring sidecars in the same pod, you couldn't guarantee they all landed on the same NUMA node.&lt;/p&gt;

&lt;p&gt;v1.36 adds pod-scope resource management: you can now set &lt;code&gt;pod.spec.resources&lt;/code&gt; and have the Topology Manager treat the entire pod as a single scheduling unit. All containers get resources from the same NUMA node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16"&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64Gi"&lt;/span&gt;
  &lt;span class="na"&gt;topologySpreadConstraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;maxSkew&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;topology.kubernetes.io/numa-node&lt;/span&gt;
      &lt;span class="na"&gt;whenUnsatisfiable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DoNotSchedule&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  DRA Resource Availability Visibility (Alpha)
&lt;/h3&gt;

&lt;p&gt;Finally, a native way to answer "how many GPUs are free in this cluster?" without writing custom tooling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: resource.k8s.io/v1alpha1
kind: ResourcePoolStatusRequest
metadata:
  name: check-gpus
spec:
  driver: nvidia.com/gpu
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;kubectl get rpsr/check-gpus &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
&lt;span class="c"&gt;# Returns: totalDevices, allocatedDevices, availableDevices per node&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is alpha, but it's the kind of operational visibility that platform teams have been hacking around for years.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stability Improvements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SELinux Volume Labeling: Now GA
&lt;/h3&gt;

&lt;p&gt;The payoff is faster pod startup on SELinux-enforcing systems. The feature replaces recursive file relabeling with a single mount-time label, which can cut pod startup time significantly on large volumes. It's been in beta since v1.28 and is now stable and on by default.&lt;/p&gt;

&lt;p&gt;If you're running RHEL or any SELinux-enforcing OS, you'll notice this immediately.&lt;/p&gt;
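
&lt;p&gt;The mount-time labeling applies when a pod declares its SELinux context explicitly. A minimal sketch, with illustrative level, image, and claim names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: selinux-demo
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"  # volume mounted with this label; no recursive relabel
  containers:
    - name: app
      image: registry.example.com/app:latest
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;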

&lt;h3&gt;
  
  
  External ServiceAccount Token Signing: GA
&lt;/h3&gt;

&lt;p&gt;The kube-apiserver can now delegate token signing to external KMS or HSM systems. For clusters with strict key management requirements (financial services, healthcare, government), this removes a significant compliance gap.&lt;/p&gt;
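
&lt;p&gt;A rough sketch of the wiring, based on the upstream external-signer work: the API server talks to a signer process over a local socket instead of holding the private key itself. The exact flag names may differ, so verify against the v1.36 docs before relying on this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# kube-apiserver delegates token signing to an external signer (paths illustrative)
kube-apiserver \
  --service-account-issuer=https://kubernetes.default.svc \
  --service-account-signing-endpoint=/var/run/signer/signer.sock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;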

&lt;h3&gt;
  
  
  Graceful Leader Transition (Alpha)
&lt;/h3&gt;

&lt;p&gt;Control plane components (kube-controller-manager, kube-scheduler) used to call &lt;code&gt;os.Exit()&lt;/code&gt; when losing leader election, forcing a full restart. v1.36 introduces graceful transitions: the component moves to follower state and re-enters the election without restarting. Faster failover, less noise in your control plane logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stale Controller Mitigation (Alpha)
&lt;/h3&gt;

&lt;p&gt;Large clusters with high churn have always had a subtle bug: a controller creates a resource, its cache hasn't updated yet, and it tries to create the same resource again. v1.36 adds cache freshness tracking so controllers check whether their local state is current before reconciling. Fewer duplicate creates, fewer spurious errors in busy clusters.&lt;/p&gt;

&lt;h3&gt;
  
  
  HPA Scale-to-Zero (Alpha)
&lt;/h3&gt;

&lt;p&gt;The Horizontal Pod Autoscaler can now scale deployments to zero replicas based on external metrics (queue depth, custom metrics). When the queue is empty, the deployment goes to zero. When work arrives, it scales back up. This is the missing piece for event-driven workloads that don't need to run 24/7.&lt;/p&gt;
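
&lt;p&gt;A minimal sketch of what that looks like, assuming a hypothetical &lt;code&gt;queue_depth&lt;/code&gt; external metric is already exposed via a metrics adapter; check the release notes for the exact alpha feature gate before enabling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 0   # alpha: requires the scale-to-zero feature gate
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: queue_depth  # hypothetical external metric
        target:
          type: AverageValue
          averageValue: "5"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;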

&lt;h2&gt;
  
  
  What to Do Before April 22
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit gitRepo volumes.&lt;/strong&gt; Run &lt;code&gt;kubectl get pods -A -o json | jq '.items[].spec.volumes[]? | select(.gitRepo != null)'&lt;/code&gt;. If you get output, you have work to do.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plan your ingress-nginx migration.&lt;/strong&gt; Check &lt;code&gt;kubectl get ingressclass&lt;/code&gt; and &lt;code&gt;kubectl get pods -A | grep ingress-nginx&lt;/code&gt;. If you're running it, pick a replacement and start testing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check for externalIPs usage.&lt;/strong&gt; &lt;code&gt;kubectl get svc -A -o json | jq '.items[] | select(.spec.externalIPs != null) | .metadata.name'&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable DRA partitionable devices in staging.&lt;/strong&gt; If you run GPU workloads, this is worth testing before it becomes the default everywhere.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read the full changelog.&lt;/strong&gt; The &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.36.md" rel="noopener noreferrer"&gt;CHANGELOG-1.36.md&lt;/a&gt; is dense but worth scanning for anything specific to your stack.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;v1.36 isn't a flashy release. There's no single feature that rewrites how Kubernetes works. Instead, it's a release that takes the AI/ML workload story seriously at the scheduler and resource allocation level, while cleaning up years of accumulated security debt.&lt;/p&gt;

&lt;p&gt;The gitRepo removal and ingress-nginx retirement are overdue. The DRA work is genuinely new capability. And the gang scheduling improvements are the kind of thing that makes distributed training jobs actually reliable on Kubernetes instead of just theoretically possible.&lt;/p&gt;

&lt;p&gt;If you're running AI inference at scale, v1.36 is the release you've been waiting for. If you're running anything else, it's a solid maintenance release with a few security items you can't ignore.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/blog/2026/03/30/kubernetes-v1-36-sneak-peek/" rel="noopener noreferrer"&gt;Kubernetes v1.36 Sneak Peek&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://palark.com/blog/kubernetes-1-36-release-features/" rel="noopener noreferrer"&gt;Palark: Deep Dive into v1.36 Alpha Features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cncf.io/announcements/2026/01/20/kubernetes-established-as-the-de-facto-operating-system-for-ai-as-production-use-hits-82-in-2025-cncf-annual-cloud-native-survey/" rel="noopener noreferrer"&gt;CNCF 2025 Annual Survey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/blog/2025/11/11/ingress-nginx-retirement/" rel="noopener noreferrer"&gt;Ingress-NGINX Retirement Announcement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/" rel="noopener noreferrer"&gt;DRA Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudcomputing</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>ingress-nginx Is Dead: How I Migrated to Gateway API Before It Became a Liability</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Tue, 07 Apr 2026 18:15:05 +0000</pubDate>
      <link>https://forem.com/mateenali66/ingress-nginx-is-dead-how-i-migrated-to-gateway-api-before-it-became-a-liability-2815</link>
      <guid>https://forem.com/mateenali66/ingress-nginx-is-dead-how-i-migrated-to-gateway-api-before-it-became-a-liability-2815</guid>
      <description>&lt;p&gt;ingress-nginx was archived on March 24, 2026 after a string of critical CVEs including a 9.8 CVSS unauthenticated RCE. Gateway API v1.4 is the CNCF-graduated replacement. I used ingress2gateway 1.0 to convert 40+ Ingress resources to HTTPRoutes, validated the output, and cut over with zero downtime. Here's exactly how I did it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Happened
&lt;/h2&gt;

&lt;p&gt;In March 2025, CVE-2025-1974 (dubbed "IngressNightmare") dropped: a CVSS 9.8 unauthenticated remote code execution vulnerability in ingress-nginx's admission webhook. Any attacker with network access to the webhook could execute arbitrary code inside the controller pod, which typically has broad cluster permissions. That was bad enough on its own.&lt;/p&gt;

&lt;p&gt;Then came 2026, and four more HIGH-severity CVEs landed in quick succession. Here's the full list, original critical included:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CVE&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2025-1974&lt;/td&gt;
&lt;td&gt;CRITICAL 9.8&lt;/td&gt;
&lt;td&gt;Unauthenticated RCE via admission webhook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-1580&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Config injection leading to privilege escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-24512&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Path injection through nginx config manipulation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-24513&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Authentication bypass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-24514&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;Annotation abuse for unauthorized access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On March 24, 2026, the ingress-nginx repository was officially archived. Read-only. No more patches. No more CVE fixes. If you're still running it, you're running unpatched software with known critical vulnerabilities.&lt;/p&gt;

&lt;p&gt;This wasn't a surprise deprecation. The Kubernetes community had been building Gateway API for years as the successor to the Ingress resource. But the CVE storm turned "migrate when convenient" into "migrate now."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75nc6whgvdkzbi6s9rja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75nc6whgvdkzbi6s9rja.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gateway API: What Actually Changed
&lt;/h2&gt;

&lt;p&gt;Gateway API isn't just "Ingress v2." It fundamentally changes how traffic routing is modeled in Kubernetes by splitting responsibilities across three layers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff104yqs3037jx07ljf8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff104yqs3037jx07ljf8j.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: GatewayClass (Infrastructure Admin)
&lt;/h3&gt;

&lt;p&gt;The infrastructure team defines what gateway implementation is available. Think of it as the "which load balancer technology" decision.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GatewayClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-gateway&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;controllerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.envoyproxy.io/gatewayclass-controller&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 2: Gateway (Cluster Operator)
&lt;/h3&gt;

&lt;p&gt;The platform team creates Gateway resources that bind to a GatewayClass. This is where you define listeners, ports, TLS certificates, and which namespaces can attach routes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Gateway&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main-gateway&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-infra&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;gatewayClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-gateway&lt;/span&gt;
  &lt;span class="na"&gt;listeners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPS&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
      &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Terminate&lt;/span&gt;
        &lt;span class="na"&gt;certificateRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;wildcard-tls&lt;/span&gt;
      &lt;span class="na"&gt;allowedRoutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Selector&lt;/span&gt;
          &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;gateway-access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 3: HTTPRoute (Application Developer)
&lt;/h3&gt;

&lt;p&gt;Application teams define their own routing rules without touching the gateway configuration. They just reference the Gateway they want to attach to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parentRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main-gateway&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-infra&lt;/span&gt;
  &lt;span class="na"&gt;hostnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api.example.com"&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PathPrefix&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/v1&lt;/span&gt;
      &lt;span class="na"&gt;backendRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-service&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation matters because it maps to how teams actually operate. Infrastructure admins pick the implementation. Platform engineers configure the gateway. App developers define their routes. Nobody steps on each other's toes, and RBAC enforces the boundaries.&lt;/p&gt;
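
&lt;p&gt;A minimal RBAC sketch of that boundary: app teams get full control of &lt;code&gt;HTTPRoutes&lt;/code&gt; in their own namespace, and nothing else. The role name and namespace are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: route-editor
  namespace: my-api
rules:
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["httproutes"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Gateways and GatewayClasses are deliberately absent:
# those stay with the platform and infrastructure teams.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;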

&lt;h3&gt;
  
  
  Why This Is Better Than Annotations
&lt;/h3&gt;

&lt;p&gt;With ingress-nginx, everything was shoved into annotations. Rate limiting, CORS, timeouts, rewrites, all of it crammed into &lt;code&gt;nginx.ingress.kubernetes.io/*&lt;/code&gt; strings that were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-standard&lt;/strong&gt;: Every controller had its own annotation format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unvalidated&lt;/strong&gt;: Typo an annotation name? Silent failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured&lt;/strong&gt;: Complex configs as string values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-portable&lt;/strong&gt;: Locked to one implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gateway API uses typed CRD fields. Your IDE autocompletes them. The API server validates them. They work across implementations.&lt;/p&gt;
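
&lt;p&gt;A concrete before/after: a prefix rewrite that used to live in a stringly-typed annotation becomes a validated filter on the HTTPRoute rule (both shown here as fragments, not complete manifests):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# ingress-nginx: an opaque string the API server can't validate
metadata:
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /

---
# Gateway API: a typed filter on an HTTPRoute rule, validated at admission time
filters:
  - type: URLRewrite
    urlRewrite:
      path:
        type: ReplacePrefixMatch
        replacePrefixMatch: /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;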

&lt;h2&gt;
  
  
  The Migration: Using ingress2gateway 1.0
&lt;/h2&gt;

&lt;p&gt;On March 20, 2026, ingress2gateway 1.0 shipped with support for 30+ ingress-nginx annotations. This was the tool that made bulk migration practical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ingress2gateway
&lt;span class="c"&gt;# or&lt;/span&gt;
go &lt;span class="nb"&gt;install &lt;/span&gt;github.com/kubernetes-sigs/ingress2gateway@v1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Scan and Convert
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Convert everything cluster-wide&lt;/span&gt;
ingress2gateway print &lt;span class="nt"&gt;--providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ingress-nginx &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gwapi.yaml

&lt;span class="c"&gt;# Or target a specific namespace&lt;/span&gt;
ingress2gateway print &lt;span class="nt"&gt;--namespace&lt;/span&gt; my-api &lt;span class="nt"&gt;--providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ingress-nginx &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gwapi.yaml

&lt;span class="c"&gt;# If you've chosen your implementation, use emitter flags&lt;/span&gt;
ingress2gateway print &lt;span class="nt"&gt;--emitter&lt;/span&gt; envoy-gateway &lt;span class="nt"&gt;--providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ingress-nginx &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gwapi.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Review the Output
&lt;/h3&gt;

&lt;p&gt;Here's what a typical translation looks like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (Ingress with ingress-nginx annotations):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-allow-origin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://app.example.com"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-allow-methods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;POST,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OPTIONS"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/cors-enable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-read-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/use-regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;api.example.com&lt;/span&gt;
      &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-tls&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api.example.com&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api/v[0-9]+/users&lt;/span&gt;
            &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ImplementationSpecific&lt;/span&gt;
            &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users-service&lt;/span&gt;
                &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After (Gateway API HTTPRoute):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parentRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main-gateway&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway-infra&lt;/span&gt;
  &lt;span class="na"&gt;hostnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api.example.com"&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RegularExpression&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/v[0-9]+/users"&lt;/span&gt;
      &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ResponseHeaderModifier&lt;/span&gt;
          &lt;span class="na"&gt;responseHeaderModifier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Access-Control-Allow-Origin&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://app.example.com"&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Access-Control-Allow-Methods&lt;/span&gt;
                &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;POST,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OPTIONS"&lt;/span&gt;
      &lt;span class="na"&gt;timeouts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;backendRequest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
      &lt;span class="na"&gt;backendRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users-service&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The structure is cleaner. CORS headers are explicit. The regex path type is a first-class field instead of being toggled by an annotation. Timeouts are typed durations, not string-encoded integers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcsmgeji89sk2o5v22s0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcsmgeji89sk2o5v22s0g.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What ingress2gateway Cannot Translate
&lt;/h2&gt;

&lt;p&gt;The tool is good, but it's not magic. Watch for these:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom nginx and Lua snippets.&lt;/strong&gt; If you used &lt;code&gt;nginx.ingress.kubernetes.io/server-snippet&lt;/code&gt; or &lt;code&gt;configuration-snippet&lt;/code&gt; with custom Lua or raw nginx config, those have no Gateway API equivalent. You'll need to reimplement that logic in your application or use implementation-specific policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting.&lt;/strong&gt; ingress-nginx rate limiting annotations don't map to standard Gateway API fields. Most implementations offer their own rate limiting CRDs (like Envoy Gateway's &lt;code&gt;BackendTrafficPolicy&lt;/code&gt;).&lt;/p&gt;
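&lt;p&gt;For a rough idea of the shape, here's a hedged sketch of a local rate limit using Envoy Gateway's &lt;code&gt;BackendTrafficPolicy&lt;/code&gt; (field names vary between Envoy Gateway versions, so treat this as a starting point and check the docs for yours):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: my-api-ratelimit
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: my-api
  rateLimit:
    type: Local
    local:
      rules:
        - limit:
            requests: 100
            unit: Second
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;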

&lt;p&gt;&lt;strong&gt;ModSecurity / WAF rules.&lt;/strong&gt; If you had ModSecurity enabled via annotations, you'll need a separate WAF solution or an implementation that supports it natively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session affinity.&lt;/strong&gt; Cookie-based session affinity annotations need implementation-specific configuration in Gateway API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom error pages.&lt;/strong&gt; These were nginx-specific and need to be handled at the application level or through implementation extensions.&lt;/p&gt;

&lt;p&gt;ingress2gateway prints warnings for annotations it can't convert. Read every one. In our migration, three services would have silently lost their rate limiting configs, exactly the kind of gap that surfaces later as a production incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing a Gateway API Implementation
&lt;/h2&gt;

&lt;p&gt;Gateway API is a spec. You need an implementation. Here's how I evaluated the main options:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Backed By&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Envoy Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Envoy Proxy / CNCF&lt;/td&gt;
&lt;td&gt;General purpose, feature-rich&lt;/td&gt;
&lt;td&gt;Strong community, good docs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;kgateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Solo.io&lt;/td&gt;
&lt;td&gt;Advanced traffic management&lt;/td&gt;
&lt;td&gt;Commercial support available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cilium Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Isovalent/Cisco&lt;/td&gt;
&lt;td&gt;eBPF-native networking&lt;/td&gt;
&lt;td&gt;Great if you already run Cilium CNI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NGINX Gateway Fabric&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;F5/NGINX&lt;/td&gt;
&lt;td&gt;Familiar nginx users&lt;/td&gt;
&lt;td&gt;Uses nginx under the hood&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Istio Waypoint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google/Solo.io&lt;/td&gt;
&lt;td&gt;Service mesh integration&lt;/td&gt;
&lt;td&gt;If you're already on Istio&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I went with Envoy Gateway. It's CNCF-backed, has broad feature coverage, and doesn't require buying into a service mesh. The &lt;code&gt;--emitter envoy-gateway&lt;/code&gt; flag in ingress2gateway generates implementation-specific extensions where needed, which saved manual work.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Migration Checklist
&lt;/h2&gt;

&lt;p&gt;Here's the checklist I followed. Steal it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pre-migration:
[ ] Inventory all Ingress resources: kubectl get ingress --all-namespaces
[ ] Document custom annotations per Ingress
[ ] Identify any custom nginx configs (ConfigMap, snippets)
[ ] Install Gateway API CRDs: kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml
[ ] Deploy chosen Gateway API implementation

Conversion:
[ ] Run ingress2gateway print and capture output
[ ] Review ALL warnings from ingress2gateway
[ ] Manually handle untranslatable annotations
[ ] Create GatewayClass and Gateway resources
[ ] Create ReferenceGrant resources for cross-namespace refs

Validation:
[ ] Apply HTTPRoutes to staging cluster
[ ] Test every endpoint (automated: curl + expected status codes)
[ ] Verify TLS termination works
[ ] Check CORS headers in browser dev tools
[ ] Validate regex paths match correctly
[ ] Load test to confirm no performance regression

Cutover:
[ ] Update DNS or switch load balancer target
[ ] Monitor error rates for 30 minutes
[ ] Keep old Ingress resources (don't delete yet)
[ ] After 48 hours stable: remove old Ingress resources
[ ] Uninstall ingress-nginx controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
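&lt;p&gt;For the "test every endpoint" validation step, a small helper like this makes the check scriptable (the function name and example URL are mine, not from any tool):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Verify that an endpoint returns the expected HTTP status code.
check_endpoint() {
  local url="$1" expected="$2" status
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  if [ "$status" != "$expected" ]; then
    echo "FAIL $url: got $status, want $expected"
    return 1
  fi
  echo "OK $url ($status)"
}

# Example: run against the new Gateway before cutover
# check_endpoint "https://api.example.com/api/v1/users" 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Loop it over every hostname and path pair from your HTTPRoutes and fail the pipeline on any mismatch.&lt;/p&gt;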



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After migrating 40+ Ingress resources across 12 namespaces:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Known CVEs&lt;/td&gt;
&lt;td&gt;5 (1 critical)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annotation sprawl&lt;/td&gt;
&lt;td&gt;180+ annotations&lt;/td&gt;
&lt;td&gt;0 (typed fields)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-namespace routing&lt;/td&gt;
&lt;td&gt;Manual workarounds&lt;/td&gt;
&lt;td&gt;Native ReferenceGrant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Downtime during migration&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to complete&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;3 days (including validation)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Don't wait for the archive notice.&lt;/strong&gt; Gateway API has been stable since v1.0 (October 2023). I should have started earlier. The CVE pressure made this more stressful than it needed to be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ingress2gateway is a starting point, not a finish line.&lt;/strong&gt; It handled about 85% of our config automatically. The remaining 15% required understanding both the old nginx annotations and the new Gateway API model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three-layer model pays off immediately.&lt;/strong&gt; Within a week of the migration, our app teams were creating their own HTTPRoutes without filing tickets to the platform team. That alone justified the effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test regex paths carefully.&lt;/strong&gt; The regex syntax between nginx and Gateway API implementations can differ subtly. I caught two path patterns that matched differently under Envoy than they did under nginx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep the old Ingress resources around.&lt;/strong&gt; Don't delete them the moment Gateway API routes are working. Give yourself a rollback window. I kept ours for 48 hours before cleanup.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gateway-api.sigs.k8s.io/" rel="noopener noreferrer"&gt;Gateway API Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes-sigs/ingress2gateway" rel="noopener noreferrer"&gt;ingress2gateway GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2025-1974" rel="noopener noreferrer"&gt;CVE-2025-1974 Advisory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gateway-api.sigs.k8s.io/blog/" rel="noopener noreferrer"&gt;Gateway API v1.4 Release Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gateway.envoyproxy.io/docs/" rel="noopener noreferrer"&gt;Envoy Gateway Quickstart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes/ingress-nginx" rel="noopener noreferrer"&gt;ingress-nginx Archive Notice&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Your Security Scanner Was the Weapon: Inside the Trivy Supply Chain Attack</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 28 Mar 2026 17:40:45 +0000</pubDate>
      <link>https://forem.com/mateenali66/your-security-scanner-was-the-weapon-inside-the-trivy-supply-chain-attack-2gc</link>
      <guid>https://forem.com/mateenali66/your-security-scanner-was-the-weapon-inside-the-trivy-supply-chain-attack-2gc</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Trivy, the most widely used container scanning action in GitHub Actions, was compromised on March 19, 2026. A threat actor poisoned 76 of its 77 version tags. Every pipeline that ran a scan silently handed over SSH keys, cloud credentials, Kubernetes tokens, and more. The scan appeared to succeed. You'd never know.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I've had Trivy in my pipelines for years. Container scanning on every PR, every merge, every deploy. It's one of those things you set up once and stop thinking about, which is exactly what makes this attack so effective.&lt;/p&gt;

&lt;p&gt;On March 19, 2026, a threat actor group called TeamPCP force-pushed malicious commits to 76 of the 77 version tags in the &lt;code&gt;aquasecurity/trivy-action&lt;/code&gt; GitHub repository. All 7 tags in &lt;code&gt;aquasecurity/setup-trivy&lt;/code&gt; were also compromised. If your workflow referenced Trivy by a tag (which is how basically everyone references GitHub Actions), you were running their code.&lt;/p&gt;

&lt;p&gt;The scanner still ran. Your pipeline still went green. You had no idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Actually Happened
&lt;/h2&gt;

&lt;p&gt;This attack didn't start on March 19. It started weeks earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gg7vrokeqnj1aspiraw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gg7vrokeqnj1aspiraw.png" alt=" " width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Late February 2026:&lt;/strong&gt; An automated bot called "hackerbot-claw" exploited a misconfigured GitHub Actions workflow and stole a privileged Personal Access Token from Aqua Security's CI environment. The attacker used this to push malware to the Trivy VS Code extension on Open VSX.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 1:&lt;/strong&gt; Aqua Security disclosed the incident publicly via a GitHub discussion and rotated credentials. Except the rotation was incomplete. One service account, one PAT, one residual access path, still live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 19, 17:43 UTC:&lt;/strong&gt; Using the still-valid credentials, TeamPCP force-pushed malicious commits to 76 of 77 tags in &lt;code&gt;trivy-action&lt;/code&gt; and all 7 tags in &lt;code&gt;setup-trivy&lt;/code&gt;. The compromised commits spoofed legitimate maintainer identities. GitHub itself flagged them with "This commit does not belong to any branch on this repository" but that warning is easy to miss in a workflow log.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 19, 18:22 UTC:&lt;/strong&gt; A rogue commit published a malicious Trivy binary as &lt;code&gt;v0.69.4&lt;/code&gt; across every distribution channel simultaneously: GitHub Releases, GHCR, Docker Hub, ECR Public, deb/rpm repositories, and get.trivy.dev.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 20, 05:40 UTC:&lt;/strong&gt; Aqua remediated the trivy-action tags. The window was roughly 12 hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 22:&lt;/strong&gt; The attacker pushed additional malicious Docker Hub images (&lt;code&gt;v0.69.5&lt;/code&gt;, &lt;code&gt;v0.69.6&lt;/code&gt;, &lt;code&gt;latest&lt;/code&gt;) using separately compromised Docker Hub credentials, bypassing all GitHub controls. Same day, 44 repositories in Aqua's &lt;code&gt;aquasec-com&lt;/code&gt; GitHub org were defaced using a stolen service account token that bridged both orgs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 24:&lt;/strong&gt; The campaign expanded to Checkmarx KICS and LiteLLM PyPI packages (&lt;code&gt;1.82.7&lt;/code&gt;, &lt;code&gt;1.82.8&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The takeaway here is not just that a tool got compromised. It's that incomplete remediation turned a single breach into a three-week campaign.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Payload Did
&lt;/h2&gt;

&lt;p&gt;This is the part that should make you uncomfortable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfedlzk2zryp9rtt623m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfedlzk2zryp9rtt623m.png" alt=" " width="800" height="2042"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The malicious &lt;code&gt;entrypoint.sh&lt;/code&gt; prepended about 105 lines of attack code to the legitimate Trivy scanner logic. The scan completed normally. Your logs looked fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Process enumeration.&lt;/strong&gt; The script scanned &lt;code&gt;/proc/*/environ&lt;/code&gt; across all runner processes, extracting environment-level secrets and filtering for anything with &lt;code&gt;env&lt;/code&gt; or &lt;code&gt;ssh&lt;/code&gt; in the name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Memory scraping.&lt;/strong&gt; Here's where it gets clever. On GitHub-hosted runners, a base64-encoded Python script located the &lt;code&gt;Runner.Worker&lt;/code&gt; process, read its memory maps via &lt;code&gt;/proc/{PID}/maps&lt;/code&gt;, and scraped raw process memory via &lt;code&gt;/proc/{PID}/mem&lt;/code&gt;. It was targeting GitHub Actions secrets specifically, looking for JSON structures matching &lt;code&gt;"name":{"value":"...","isSecret":true}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This matters because it bypassed GitHub Actions' secret masking. Values that would show as &lt;code&gt;***&lt;/code&gt; in your logs were extracted in plaintext from process memory. There's no log artifact. Nothing looked wrong.&lt;/p&gt;

&lt;p&gt;On self-hosted runners, a different payload did broad filesystem harvesting across 50+ target paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSH private keys and configs&lt;/li&gt;
&lt;li&gt;Cloud credentials (AWS, GCP, Azure)&lt;/li&gt;
&lt;li&gt;Kubernetes configs and service account tokens&lt;/li&gt;
&lt;li&gt;Docker registry credentials&lt;/li&gt;
&lt;li&gt;Database credentials (MySQL, PostgreSQL, MongoDB, Redis)&lt;/li&gt;
&lt;li&gt;Terraform state files and variables&lt;/li&gt;
&lt;li&gt;Ansible vaults and Helm values&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.env&lt;/code&gt; files and API keys&lt;/li&gt;
&lt;li&gt;Git credentials and shell history&lt;/li&gt;
&lt;li&gt;Cryptocurrency wallet keys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Encryption.&lt;/strong&gt; Collected data was encrypted with AES-256-CBC using a randomly generated session key, then the session key was wrapped with a hardcoded 4096-bit RSA public key. Only the attacker's server could decrypt it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 4: Exfiltration.&lt;/strong&gt; Primary channel was an HTTPS POST to &lt;code&gt;scan.aquasecurtiy[.]org&lt;/code&gt;, a typosquatted domain designed to look like Aqua Security's legitimate domain (note the transposed 'ti' in 'security'). Resolves to &lt;code&gt;45.148.10.212&lt;/code&gt; in Amsterdam.&lt;/p&gt;

&lt;p&gt;Fallback: if the C2 was unreachable and a GitHub PAT was in the environment, the malware created a public repository named &lt;code&gt;tpcp-docs&lt;/code&gt; and uploaded stolen data as release assets, turning GitHub itself into the exfiltration channel.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are You Affected?
&lt;/h2&gt;

&lt;p&gt;Check these specific exposure windows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Affected Versions&lt;/th&gt;
&lt;th&gt;Exposure Window&lt;/th&gt;
&lt;th&gt;Safe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;trivy binary&lt;/td&gt;
&lt;td&gt;v0.69.4&lt;/td&gt;
&lt;td&gt;~3h (Mar 19)&lt;/td&gt;
&lt;td&gt;v0.69.3 or earlier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;trivy Docker Hub&lt;/td&gt;
&lt;td&gt;v0.69.5, v0.69.6, latest&lt;/td&gt;
&lt;td&gt;~10h (Mar 22–24)&lt;/td&gt;
&lt;td&gt;v0.69.3 or earlier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;trivy-action&lt;/td&gt;
&lt;td&gt;Tags 0.0.1–0.34.2&lt;/td&gt;
&lt;td&gt;~12h (Mar 19–20)&lt;/td&gt;
&lt;td&gt;v0.35.0+ or SHA-pinned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;setup-trivy&lt;/td&gt;
&lt;td&gt;All 7 tags&lt;/td&gt;
&lt;td&gt;~12h (Mar 19–20)&lt;/td&gt;
&lt;td&gt;SHA-pinned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM PyPI&lt;/td&gt;
&lt;td&gt;1.82.7, 1.82.8&lt;/td&gt;
&lt;td&gt;Mar 24+&lt;/td&gt;
&lt;td&gt;1.82.6 or earlier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you ran Trivy in any pipeline during those windows and weren't pinning to a commit SHA, you have to assume secrets were stolen. All of them. Every secret accessible from that runner environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Need to Change
&lt;/h2&gt;

&lt;p&gt;This is the remediation checklist, ordered by priority.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Rotate first, investigate second
&lt;/h3&gt;

&lt;p&gt;If you were in the exposure window, rotate everything the runner could have touched. Don't wait for confirmation. Treat every secret as compromised:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS access keys and IAM roles&lt;/li&gt;
&lt;li&gt;GCP service account keys&lt;/li&gt;
&lt;li&gt;Azure service principals&lt;/li&gt;
&lt;li&gt;Kubernetes service account tokens&lt;/li&gt;
&lt;li&gt;Docker registry credentials&lt;/li&gt;
&lt;li&gt;SSH keys&lt;/li&gt;
&lt;li&gt;Database credentials&lt;/li&gt;
&lt;li&gt;GitHub PATs and tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Pin actions to commit SHAs
&lt;/h3&gt;

&lt;p&gt;This is the single most effective structural change. Tags are mutable. Commit SHAs are not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad — this is what everyone does, and what got compromised&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@0.24.0&lt;/span&gt;

&lt;span class="c1"&gt;# Good — SHA-pinned, immutable&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@57a97c7843d7da7a7b4f8ce2a0c4e3b7f0c2e1d&lt;/span&gt;  &lt;span class="c1"&gt;# 0.35.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, it's more work to update. That's the point. Renovate or Dependabot can automate SHA updates if you configure them for GitHub Actions.&lt;/p&gt;
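&lt;p&gt;With Dependabot, that automation is a single file, a minimal &lt;code&gt;.github/dependabot.yml&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Dependabot will open PRs that bump the pinned SHA and keep the trailing version comment in sync.&lt;/p&gt;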

&lt;h3&gt;
  
  
  3. Switch to OIDC for cloud authentication
&lt;/h3&gt;

&lt;p&gt;Long-lived cloud credentials in CI are a liability. OIDC lets your runner authenticate to AWS, GCP, or Azure without storing static keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AWS example&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::ACCOUNT:role/github-actions-role&lt;/span&gt;
    &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing to steal if there's nothing stored. The credentials are ephemeral and scoped to the job.&lt;/p&gt;
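&lt;p&gt;One prerequisite: the job has to be allowed to request an OIDC token, which means adding &lt;code&gt;id-token: write&lt;/code&gt; to its permissions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;permissions:
  id-token: write   # lets the job request an OIDC token
  contents: read
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;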

&lt;h3&gt;
  
  
  4. Restrict runner permissions
&lt;/h3&gt;

&lt;p&gt;GitHub Actions runners get &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; by default. Scope it down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
  &lt;span class="na"&gt;security-events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
  &lt;span class="c1"&gt;# Nothing else&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most workflows need far less than the default. Less permission means smaller blast radius.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Audit non-human identities
&lt;/h3&gt;

&lt;p&gt;The Trivy attack persisted because one service account credential wasn't rotated. Audit all machine identities in your org:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub PATs: Who issued them? When do they expire? Are they scoped minimally?&lt;/li&gt;
&lt;li&gt;Service accounts: Which ones have write access to release infrastructure?&lt;/li&gt;
&lt;li&gt;Bot accounts: Are any shared across orgs or repositories?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Long-lived, over-privileged service accounts are how a one-time breach becomes a three-week campaign.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Use secret scanning
&lt;/h3&gt;

&lt;p&gt;GitGuardian, GitHub's native secret scanning, or both. The Trivy attacker used GitHub as a fallback exfiltration channel. If your credentials ever end up in a public repo, you want to know in minutes, not days.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Verify binaries before running them
&lt;/h3&gt;

&lt;p&gt;For direct binary downloads (not GitHub Actions), verify checksums:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download the official checksums&lt;/span&gt;
curl &lt;span class="nt"&gt;-sSL&lt;/span&gt; https://github.com/aquasecurity/trivy/releases/download/v0.69.3/trivy_0.69.3_checksums.txt &lt;span class="nt"&gt;-o&lt;/span&gt; checksums.txt

&lt;span class="c"&gt;# Verify your binary&lt;/span&gt;
&lt;span class="nb"&gt;sha256sum&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; checksums.txt &lt;span class="nt"&gt;--ignore-missing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your pipeline downloads and runs binaries from the internet, add checksum verification as a step.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;The Trivy attack was technically sophisticated, but the root cause is unglamorous: incomplete credential rotation.&lt;/p&gt;

&lt;p&gt;Aqua disclosed the initial breach on March 1 and rotated credentials. One PAT, one service account, one residual access path stayed active. That's what TeamPCP used on March 19. The March 22 Docker Hub compromise used yet another credential that fell outside the scope of the original remediation.&lt;/p&gt;

&lt;p&gt;When you rotate secrets after a breach, you need to be exhaustive. Enumerate every credential that could have been exposed, every service account that had access, every integration that used a compromised token. Rotation is not a task you do until it feels complete. It's a task you do until you've verified every access path is severed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz4n6nsa5ydv92kuv24i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz4n6nsa5ydv92kuv24i.png" alt=" " width="671" height="2678"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other lesson: the attack surface for CI/CD is enormous. Your pipeline runs with access to secrets, cloud credentials, and internal infrastructure. When you add a third-party action, you're trusting that maintainer's entire security posture, including their CI, their service accounts, and their credential management practices. SHA pinning doesn't eliminate that trust, but it gives you a stable, auditable point you can reason about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Immediate Checklist
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ ] Check pipeline logs for trivy-action usage between March 19–20
[ ] Check pipeline logs for trivy binary v0.69.4 usage on March 19
[ ] Check for Docker image usage of v0.69.5, v0.69.6, or latest between Mar 22–24
[ ] Rotate all secrets accessible from affected runners
[ ] Update trivy-action to v0.35.0 or pin to SHA
[ ] Check for LiteLLM usage of 1.82.7 or 1.82.8
[ ] Switch cloud auth to OIDC
[ ] Pin all third-party actions to commit SHAs
[ ] Restrict workflow permissions to minimum required
[ ] Audit service accounts and PATs for expiry and scope
[ ] Enable secret scanning on your org
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.crowdstrike.com/en-us/blog/from-scanner-to-stealer-inside-the-trivy-action-supply-chain-compromise/" rel="noopener noreferrer"&gt;CrowdStrike: From Scanner to Stealer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.gitguardian.com/trivys-march-supply-chain-attack-shows-where-secret-exposure-hurts-most/" rel="noopener noreferrer"&gt;GitGuardian: Trivy's March Supply Chain Attack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.legitsecurity.com/blog/the-trivy-supply-chain-compromise-what-happened-and-playbooks-to-respond" rel="noopener noreferrer"&gt;Legit Security: Playbooks to Respond&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/security/blog/2026/03/24/detecting-investigating-defending-against-trivy-supply-chain-compromise/" rel="noopener noreferrer"&gt;Microsoft Security Blog: Detecting and Defending&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arcticwolf.com/resources/blog/teampcp-supply-chain-attack-campaign-targets-trivy-checkmarx-kics-and-litellm-potential-downstream-impact-to-additional-projects/" rel="noopener noreferrer"&gt;Arctic Wolf: TeamPCP Campaign Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/aquasecurity/trivy/discussions/10425" rel="noopener noreferrer"&gt;Aqua Security: Official Disclosure&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>security</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>cicd</category>
    </item>
    <item>
      <title>GitHub Actions costs are leaking, and most teams don't notice until it's too late</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Mon, 16 Mar 2026 05:24:12 +0000</pubDate>
      <link>https://forem.com/mateenali66/github-actions-costs-are-leaking-and-most-teams-dont-notice-until-its-too-late-27d1</link>
      <guid>https://forem.com/mateenali66/github-actions-costs-are-leaking-and-most-teams-dont-notice-until-its-too-late-27d1</guid>
      <description>&lt;p&gt;Two years ago I was working on a connected vehicles platform running 40+ microservices on Kubernetes. CI was healthy, tests were passing, and nobody was paying attention to the GitHub Actions bill until it hit $4,200 in a single month.&lt;/p&gt;

&lt;p&gt;The culprit was a matrix build that someone had extended to cover six Node versions. Nobody noticed because the cost didn't show up anywhere obvious. It wasn't flagged in any alert. The engineers who added the matrix jobs weren't thinking about cost. By the time finance asked the question, the pattern had been running for three months.&lt;/p&gt;

&lt;p&gt;I started looking for a tool that could give us per-workflow cost visibility. Something that would let us answer "which workflows cost the most" and "did this PR make CI more expensive." I didn't find anything that fit, so I built CICosts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;CICosts installs as a GitHub App and receives a webhook event every time a workflow run completes. It multiplies the runner minutes by GitHub's published pricing for that runner type (Linux, Windows, macOS, self-hosted) and stores the result.&lt;/p&gt;

&lt;p&gt;From there you get a dashboard showing cost by workflow, by repository, by branch, and over time. You can set alerts when a workflow exceeds a threshold. You can see trends, spot regressions after PRs merge, and compare costs across environments.&lt;/p&gt;

&lt;p&gt;The math is straightforward. GitHub charges $0.008/minute for Linux runners, $0.016 for Windows, $0.08 for macOS. If a workflow runs for 12 minutes on Linux, that's $0.096. Not much in isolation. Run it 500 times a day across 30 repositories and it adds up fast.&lt;/p&gt;
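&lt;p&gt;The arithmetic is simple enough to sanity-check yourself. A minimal sketch, with rates hardcoded from GitHub's published per-minute pricing:&lt;/p&gt;

```python
# Per-minute rates for GitHub-hosted runners (USD)
RATES = {"linux": 0.008, "windows": 0.016, "macos": 0.08}

def workflow_cost(minutes: float, runner: str = "linux",
                  runs_per_day: int = 1, days: int = 30) -> float:
    """Rough monthly cost of one workflow at a given run frequency."""
    return round(RATES[runner] * minutes * runs_per_day * days, 2)

print(workflow_cost(12, "linux"))        # one run a day for a month
print(workflow_cost(12, "linux", 500))   # 500 runs a day: the bill people notice
```

&lt;p&gt;One caveat: GitHub rounds each job up to the nearest minute, so real bills skew slightly higher than this back-of-envelope number.&lt;/p&gt;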

&lt;h2&gt;
  
  
  The common patterns I see
&lt;/h2&gt;

&lt;p&gt;After watching enough CI pipelines, a few patterns account for most of the waste:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matrix explosions.&lt;/strong&gt; A workflow that tests across 3 OS versions and 4 runtime versions runs 12 times per push. If the matrix was added incrementally over time, nobody may have thought through the cumulative cost.&lt;/p&gt;
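&lt;p&gt;The fan-out is easy to underestimate. A hypothetical matrix like this runs 12 jobs on every push, and &lt;code&gt;exclude&lt;/code&gt; is the cheapest lever for trimming combinations you don't actually ship:&lt;/p&gt;

```yaml
strategy:
  matrix:
    os: [ubuntu-latest, windows-latest, macos-latest]
    node: [18, 20, 22, 24]     # 3 x 4 = 12 jobs per push
    exclude:
      # Drop expensive macOS jobs for Node versions nobody ships on macOS
      - os: macos-latest
        node: 18
      - os: macos-latest
        node: 20
```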

&lt;p&gt;&lt;strong&gt;macOS runners for non-macOS work.&lt;/strong&gt; macOS runners cost 10x more than Linux. They're necessary for iOS builds and sometimes for Homebrew. They're not necessary for most backend services, but they show up there sometimes because someone copied a workflow template.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test parallelism without caching.&lt;/strong&gt; Running tests in parallel is good. Running them in parallel while re-downloading 200MB of dependencies on every run because the cache key is wrong is expensive.&lt;/p&gt;
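&lt;p&gt;The usual culprit is a cache key that never hits. For an npm project, something like this (paths and key format are illustrative) keeps the 200MB download off every parallel job:&lt;/p&gt;

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.npm
    # Key changes only when the lockfile changes, so parallel jobs share one cache
    key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      npm-${{ runner.os }}-
```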

&lt;p&gt;&lt;strong&gt;Nightly builds that nobody needs.&lt;/strong&gt; Workflows scheduled to run nightly that were set up to catch a specific class of bug that was fixed 18 months ago. The schedule never got cleaned up.&lt;/p&gt;

&lt;p&gt;None of these are difficult to fix once you can see them. The problem is visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it's now open source and free
&lt;/h2&gt;

&lt;p&gt;I built this as a paid SaaS originally. The paywall was too much friction for a product without an established reputation. If you're asking engineers to add a GitHub App to their organization and trust it with their CI data, "trust us, it's $29/month" is a hard sell when nobody's heard of you.&lt;/p&gt;

&lt;p&gt;The honest version: the product was good and nobody knew about it. That's a distribution problem, not a product problem.&lt;/p&gt;

&lt;p&gt;So the model is now simple. CICosts is MIT licensed, the code is on GitHub, and the hosted version at app.cicosts.dev is free with no usage limits. If your organization needs an SLA or wants a private deployment, that's the enterprise tier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;Install it from GitHub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://github.com/phonotechnologies/cicosts-app
https://github.com/phonotechnologies/cicosts-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the hosted version directly at &lt;a href="https://app.cicosts.dev" rel="noopener noreferrer"&gt;app.cicosts.dev&lt;/a&gt;. Add the GitHub App to your organization, and cost data starts flowing within a few minutes of your next workflow run.&lt;/p&gt;

&lt;p&gt;The setup takes about five minutes. There's no code change required in your repos. The GitHub App receives webhook events automatically once installed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;If I were starting from zero, I'd make it open source from day one and focus entirely on getting the GitHub App installation experience right. The hardest part of a tool like this isn't the cost calculation. It's getting someone to trust it enough to install it.&lt;/p&gt;

&lt;p&gt;Open source makes that easier. You can read the code. You can see exactly what data is being stored and what isn't. That matters when you're asking someone to add an app to their GitHub organization.&lt;/p&gt;




&lt;p&gt;The code is on GitHub under the phonotechnologies organization. PRs welcome, especially around runner pricing updates and new alert types. If you run into something, open an issue.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>github</category>
      <category>cicd</category>
      <category>opensource</category>
    </item>
    <item>
      <title>GitOps for ML in 2026: Treat Your AI Models Like Microservices (Or Watch Them Drift Into Production Chaos)</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 14 Mar 2026 21:46:50 +0000</pubDate>
      <link>https://forem.com/mateenali66/gitops-for-ml-in-2026-treat-your-ai-models-like-microservices-or-watch-them-drift-into-production-40m2</link>
      <guid>https://forem.com/mateenali66/gitops-for-ml-in-2026-treat-your-ai-models-like-microservices-or-watch-them-drift-into-production-40m2</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Apply the same GitOps discipline you use for application code to ML model deployments, and you get version history, rollback, and promotion gates that actually work, instead of the SSH-and-pray workflow most teams are still running.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;There's a model running in production right now that nobody on your team can explain. It was trained six weeks ago, deployed by someone who's since moved to a different team, and the only record of what version it is lives in a Slack message that's been buried under 4,000 other messages.&lt;/p&gt;

&lt;p&gt;When it starts making bad predictions, what's your rollback plan? If your answer involves SSHing into a server, editing a config file by hand, and hoping the right weights get loaded, you're in the majority. That doesn't make it less of a disaster.&lt;/p&gt;

&lt;p&gt;I spent the better part of last year helping platform teams get their ML deployment story straight. The pattern I kept seeing: teams had decent model training pipelines, reasonable experiment tracking in MLflow, and then a complete gap between "model registered" and "model serving traffic." The gap got filled with shell scripts, manual steps, and a whole lot of tribal knowledge.&lt;/p&gt;

&lt;p&gt;The fix isn't a new tool. It's applying discipline you already have from application deployments to the model deployment layer.&lt;/p&gt;

&lt;p&gt;Before we moved to GitOps for model deployments, a typical promotion cycle looked like this. A data scientist trains a new version, registers it in MLflow, then files a ticket. A platform engineer picks up the ticket, SSHes into the model server, updates the model path, restarts the serving process, and manually validates that predictions look reasonable. Start to finish: 4 to 6 hours on a good day, longer when the engineer is in meetings or the server is being weird.&lt;/p&gt;

&lt;p&gt;Rollback? There was no rollback. The best-case scenario was that someone remembered what the previous model path was.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Most Teams Try First (And Why It Fails)
&lt;/h2&gt;

&lt;p&gt;The first instinct is usually scripts. Someone writes a deploy.sh that takes a model version as an argument, connects to the serving infrastructure, and handles the update. This is better than pure manual steps, but it fails in a few predictable ways.&lt;/p&gt;

&lt;p&gt;First, scripts don't have memory. You can run deploy.sh with model version 47, then run it again with version 51, and there's no audit trail of who ran what or why. When something goes wrong, you're back to grepping through logs and asking around.&lt;/p&gt;

&lt;p&gt;Second, scripts don't handle promotion gates. You can't encode "this model can only go to production if it passed staging validation for 24 hours" in a shell script without it becoming a sprawling mess that nobody wants to maintain.&lt;/p&gt;

&lt;p&gt;Third, and this one bites hardest: scripts assume the current state. If someone manually changes something on the serving infrastructure, your script has no way of detecting that drift. The next run might succeed or fail unpredictably depending on what changed and when.&lt;/p&gt;

&lt;p&gt;MLflow solves the experiment tracking and model registry side well. You get version numbers, artifact storage in S3, stage transitions (Staging, Production), and a clean API. What MLflow doesn't give you is a Kubernetes-native way to declare "this cluster should be running model version 47 right now" and enforce that continuously.&lt;/p&gt;

&lt;p&gt;That's where KServe and ArgoCD come in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The full stack has five layers working together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsla8yxyreadpi4g0h7zr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsla8yxyreadpi4g0h7zr.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MLflow + S3&lt;/strong&gt; handle model artifacts. Every trained model version gets registered with MLflow, which stores the artifact URI pointing to a path in S3. The URI looks something like &lt;code&gt;s3://ml-models-prod/fraud-detector/v47/model.pkl&lt;/code&gt;. MLflow's registry gives you a version number and stage metadata. The actual weights live in S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KServe InferenceService&lt;/strong&gt; is the Kubernetes abstraction for serving. Instead of managing a Pod or Deployment by hand, you define an InferenceService custom resource that describes what model to load, from where, and how to scale. KServe handles the rest: downloading the artifact from S3, loading it into the serving framework (Triton, TorchServe, SKLearn Server), and exposing an HTTP endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Git&lt;/strong&gt; holds the desired state. A &lt;code&gt;values.yaml&lt;/code&gt; file in your repository specifies which model version each environment should run. Promoting from staging to production is a PR that bumps a version number. The PR is the change review, the approval gate, and the audit trail all at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArgoCD&lt;/strong&gt; reconciles the cluster to match what's in Git. When the PR merges, ArgoCD detects the change and applies the updated KServe InferenceService. If someone manually changes the InferenceService on the cluster, ArgoCD detects the drift and reverts it.&lt;/p&gt;
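&lt;p&gt;The drift-reverting behavior comes from the Application's sync policy. A sketch of what that Application might look like (repo URL and chart path are placeholders):&lt;/p&gt;

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fraud-detector-prod
  namespace: argocd
spec:
  project: ml-serving
  source:
    repoURL: https://github.com/example-org/ml-deployments   # placeholder repo
    targetRevision: main
    path: charts/fraud-detector
    helm:
      valueFiles:
        - values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving-prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # selfHeal is what reverts manual changes on the cluster
```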

&lt;p&gt;&lt;strong&gt;Istio&lt;/strong&gt; manages traffic splitting. During canary promotion, a VirtualService routes 10% of traffic to the new model version while 90% continues to the stable version. If metrics look good after a soak period, you update the weights and do a full cutover.&lt;/p&gt;
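&lt;p&gt;The 90/10 split is just a pair of weighted routes. A sketch (the destination host names depend on how KServe names the generated services in your cluster):&lt;/p&gt;

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-detector-split
  namespace: ml-serving-prod
spec:
  hosts:
    - fraud-detector.ml-serving-prod.svc.cluster.local
  http:
    - route:
        - destination:
            host: fraud-detector-predictor          # stable model version
          weight: 90
        - destination:
            host: fraud-detector-canary-predictor   # canary model version
          weight: 10
```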

&lt;p&gt;&lt;strong&gt;Prometheus&lt;/strong&gt; collects serving metrics. Latency (p99 in particular), throughput, and prediction distribution histograms give you the signals needed to decide whether a canary is healthy or needs to be rolled back.&lt;/p&gt;
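&lt;p&gt;For the p99 signal, a recording rule keeps the canary comparison cheap to query. A sketch (the histogram metric name is illustrative; use whatever your serving runtime actually exports):&lt;/p&gt;

```yaml
groups:
  - name: model-serving
    rules:
      - record: model:request_latency_seconds:p99
        # p99 per service, computed from the serving runtime's latency histogram
        expr: histogram_quantile(0.99, sum(rate(request_latency_seconds_bucket[5m])) by (le, service))
```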

&lt;h2&gt;
  
  
  The Workflow
&lt;/h2&gt;

&lt;p&gt;Here's how a model promotion actually works end to end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwb59lufwlvzqmev2ya1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwb59lufwlvzqmev2ya1.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A data scientist trains a new model, evaluates it against the validation set, and if it passes threshold, registers it in MLflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sklearn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metrics&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;f1_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.97&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;run_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;active_run&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tracking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MlflowClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;runs:/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;mv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_model_version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fraud-detector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# mv.version == "47"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That registration triggers a CI pipeline (GitHub Actions or Tekton, depending on your setup) that opens a pull request bumping the version in the dev environment's values file.&lt;/p&gt;
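&lt;p&gt;The PR-opening step doesn't need anything exotic. One common option is the &lt;code&gt;create-pull-request&lt;/code&gt; action; the bump script and file layout here are hypothetical:&lt;/p&gt;

```yaml
- name: Bump model version in dev
  run: |
    # Hypothetical helper that edits environments.dev.model.version in place
    ./scripts/bump-model-version.sh dev fraud-detector "$NEW_VERSION"

- uses: peter-evans/create-pull-request@v6
  with:
    branch: bump/fraud-detector-${{ env.NEW_VERSION }}
    title: "Promote fraud-detector v${{ env.NEW_VERSION }} to dev"
    commit-message: "chore: bump fraud-detector to v${{ env.NEW_VERSION }}"
```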

&lt;p&gt;&lt;strong&gt;values.yaml structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;environments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;47"&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v47"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
      &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

  &lt;span class="na"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;45"&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v45"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
      &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;

  &lt;span class="na"&gt;prod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;43"&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v43"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
      &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;KServe InferenceService (stable):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
    &lt;span class="na"&gt;sklearn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v43"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
    &lt;span class="na"&gt;scaleTarget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="na"&gt;scaleMetric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;concurrency&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;KServe InferenceService (canary variant):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
    &lt;span class="na"&gt;sklearn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://ml-models-prod/fraud-detector/v47"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
    &lt;span class="na"&gt;canaryTrafficPercent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ArgoCD ApplicationSet for multi-environment management:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ApplicationSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-serving&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;generators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev&lt;/span&gt;
            &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev-cluster&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-dev&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
            &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging-cluster&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-staging&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
            &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-cluster&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fraud-detector-{{env}}"&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/org/ml-gitops&lt;/span&gt;
        &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environments/{{env}}"&lt;/span&gt;
        &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;valueFiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;values.yaml&lt;/span&gt;
      &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{cluster}}"&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{namespace}}"&lt;/span&gt;
      &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;RespectIgnoreDifferences=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Istio VirtualService for canary traffic split:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VirtualService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-vs&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;fraud-detector.ml-serving-prod.svc.cluster.local&lt;/span&gt;
  &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;x-canary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;exact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
      &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-predictor-canary&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-predictor-default&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-predictor-canary&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the PR merges to dev, ArgoCD picks up the change within 3 minutes (its default repository polling interval) and applies the updated InferenceService. The model downloads from S3, the serving pod comes up, and the endpoint starts responding. At this point you can run your automated evaluation suite against the dev endpoint.&lt;/p&gt;

&lt;p&gt;Promoting to staging is another PR. A human reviews it, checks the dev evaluation results, and approves. Merge, ArgoCD syncs, done. Production promotion follows the same pattern but includes an additional step: the canary InferenceService gets deployed first with 10% traffic, and a GitHub Actions workflow monitors Prometheus metrics for a configured soak period (we use 2 hours for most models) before opening the full-cutover PR automatically.&lt;/p&gt;
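&lt;p&gt;The soak gate can be sketched as a workflow that waits out the window, queries Prometheus, and only then opens the cutover PR. This is an illustrative sketch, not our exact workflow: the &lt;code&gt;PROM_URL&lt;/code&gt; variable, metric labels, and threshold are hypothetical placeholders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical GitHub Actions sketch of the canary soak gate.
name: canary-soak-gate
on:
  workflow_dispatch: {}
jobs:
  soak:
    runs-on: ubuntu-latest
    steps:
      - name: Wait out the soak period
        run: sleep 7200   # 2 hours for most models
      - name: Check canary error rate in Prometheus
        run: |
          # PROM_URL and the metric/label names are illustrative.
          RATE=$(curl -s "$PROM_URL/api/v1/query" \
            --data-urlencode 'query=sum(rate(fraud_detector_request_total{status_code=~"5..",revision="canary"}[2h]))' \
            | jq -r '.data.result[0].value[1] // "0"')
          # Fail the job (and block the cutover PR) if the rate breaches 1%.
          awk -v r="$RATE" 'BEGIN { exit (r &amp;gt; 0.01) }'
      - name: Open the full-cutover PR
        run: gh pr create --title "Promote fraud-detector canary to 100%" --body "Soak passed"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;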

&lt;h2&gt;
  
  
  Drift Detection
&lt;/h2&gt;

&lt;p&gt;Prediction drift is the sneaky failure mode. The model is technically serving, latency looks fine, but the distribution of predictions has shifted because the input data changed. You won't catch this with a liveness probe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya0whws7chpxtyg3q0me.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya0whws7chpxtyg3q0me.png" alt=" " width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;KServe's sklearn server exposes prediction histograms as Prometheus metrics out of the box. You define alerting rules that fire when the distribution deviates beyond a threshold from the baseline captured at deployment time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prometheus PrometheusRule for drift alerting:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring.coreos.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PrometheusRule&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-drift&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving-prod&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-prometheus&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alert-rules&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector.drift&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
      &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PredictionDriftDetected&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;abs(&lt;/span&gt;
              &lt;span class="s"&gt;avg_over_time(fraud_detector_prediction_mean[10m])&lt;/span&gt;
              &lt;span class="s"&gt;- avg_over_time(fraud_detector_prediction_mean[60m] offset 1d)&lt;/span&gt;
            &lt;span class="s"&gt;) &amp;gt; 0.15&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
            &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prediction&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;distribution&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;shift&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;detected&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fraud-detector"&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mean&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;shifted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;humanizePercentage&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;yesterday's&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;baseline.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Check&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;changes."&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ModelLatencyHigh&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;histogram_quantile(0.99,&lt;/span&gt;
              &lt;span class="s"&gt;sum(rate(fraud_detector_request_duration_seconds_bucket[5m])) by (le)&lt;/span&gt;
            &lt;span class="s"&gt;) &amp;gt; 0.5&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
            &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;500ms&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fraud-detector"&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}s.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SLA&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;500ms."&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ModelErrorRateHigh&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;rate(fraud_detector_request_total{status_code=~"5.."}[5m])&lt;/span&gt;
            &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="s"&gt;rate(fraud_detector_request_total[5m]) &amp;gt; 0.01&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
            &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fraud-detector"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this alert fires, Alertmanager routes it to PagerDuty (or whatever notification channel you've configured). The on-call engineer's first action is to check whether a canary is active. If it is, rolling back is a single &lt;code&gt;git revert&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git revert HEAD~1
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ArgoCD detects the revert within 3 minutes and redeploys the previous InferenceService version. In practice, our rollbacks averaged 4 minutes from decision to stable serving.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to deploy new model version&lt;/td&gt;
&lt;td&gt;4 to 6 hours&lt;/td&gt;
&lt;td&gt;8 minutes to production canary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback capability&lt;/td&gt;
&lt;td&gt;None (manual rebuild)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;git revert&lt;/code&gt;, avg 4 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Drift detection time&lt;/td&gt;
&lt;td&gt;6 hours (user reports)&lt;/td&gt;
&lt;td&gt;15 minutes (automated alert)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment audit trail&lt;/td&gt;
&lt;td&gt;Slack messages&lt;/td&gt;
&lt;td&gt;Full Git history with PR reviews&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment parity&lt;/td&gt;
&lt;td&gt;Best effort&lt;/td&gt;
&lt;td&gt;Enforced via ApplicationSet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config drift prevention&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;ArgoCD selfHeal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The number that surprised me most was the drift detection improvement. We caught a data schema change within 15 minutes on the new system. The same type of change previously went undetected for 6 hours before a user complaint surfaced it. That's not a monitoring win, it's a business outcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the values.yaml contract.&lt;/strong&gt; The shape of that file is the most important design decision you'll make. Get the team to agree on it before writing any ArgoCD config. Everything else follows from it.&lt;/p&gt;
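&lt;p&gt;To make that concrete, here is the kind of shape such a contract might take. The field names below are illustrative, not a standard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# environments/prod/values.yaml (hypothetical contract)
model:
  name: fraud-detector
  storageUri: s3://ml-models-prod/fraud-detector/v47   # always a pinned version
serving:
  minReplicas: 5
  maxReplicas: 20
  scaleMetric: concurrency
  scaleTarget: 80
canary:
  enabled: true
  trafficPercent: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;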

&lt;p&gt;&lt;strong&gt;S3 artifact URIs in the InferenceService spec, not model names.&lt;/strong&gt; MLflow stage names ("Production", "Staging") are mutable. If you reference a stage name in your InferenceService spec, two different model versions could map to the same stage name over time, and your Git history loses meaning. Reference the explicit S3 URI with the version number baked in.&lt;/p&gt;
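&lt;p&gt;The difference is easiest to see side by side. The &lt;code&gt;models:/&lt;/code&gt; form below is MLflow's stage-alias URI (assuming your serving runtime resolves MLflow URIs at all); the point is that the alias is mutable while the S3 URI is not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Avoid: a mutable stage alias. What "Production" resolves to changes over
# time, so the same Git commit can mean two different models.
storageUri: "models:/fraud-detector/Production"

# Prefer: an immutable, version-pinned artifact URI.
storageUri: "s3://ml-models-prod/fraud-detector/v47"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;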

&lt;p&gt;&lt;strong&gt;selfHeal is non-negotiable.&lt;/strong&gt; Turn it on in your ArgoCD sync policy. Without selfHeal, a manual kubectl edit on the InferenceService will drift silently and nobody will notice until it matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Canary soak time depends on your traffic volume.&lt;/strong&gt; For a high-volume fraud model processing 50k requests per minute, 30 minutes of canary is enough to get a statistically significant signal. For a low-volume model processing 100 requests per day, even two full days of canary at 10% traffic puts only about 20 requests through the new version. Adjust accordingly, or route specific customers to the canary instead of using a random percentage split.&lt;/p&gt;
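&lt;p&gt;The back-of-the-envelope arithmetic is worth scripting when you pick soak windows. A tiny sketch, with all traffic numbers hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/sh
# Requests that actually reach the canary = hourly volume x soak hours x canary share.
canary_samples() {
  reqs_per_hour=$1; soak_hours=$2; canary_pct=$3
  echo $(( reqs_per_hour * soak_hours * canary_pct / 100 ))
}

canary_samples 3000000 1 10   # 50k req/min model, 1h soak   -&amp;gt; 300000
canary_samples 4 48 10        # ~100 req/day model, 48h soak -&amp;gt; 19
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;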

&lt;p&gt;&lt;strong&gt;Model cold start affects canary rollouts.&lt;/strong&gt; Large models take time to download from S3 and load into memory. A 2GB model on a cold node might take 3 to 4 minutes before it's ready to serve. Account for this in your readiness probe timeouts and don't let your monitoring system flag the canary as failing during the startup window.&lt;/p&gt;
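&lt;p&gt;One way to buy that startup window is to loosen the predictor's readiness probe. KServe's predictor spec embeds the standard container fields, so an override along these lines should work, though which fields you can set varies by KServe version; treat this as a sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;spec:
  predictor:
    sklearn:
      storageUri: "s3://ml-models-prod/fraud-detector/v47"
      readinessProbe:
        httpGet:
          path: /v1/models/fraud-detector
          port: 8080
        initialDelaySeconds: 60    # model download + load can take minutes
        periodSeconds: 15
        failureThreshold: 20       # ~5 more minutes of grace before marking failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;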

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The repository structure I've described looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ml-gitops/
├── environments/
│   ├── dev/
│   │   ├── values.yaml
│   │   └── templates/
│   │       ├── inference-service.yaml
│   │       └── virtual-service.yaml
│   ├── staging/
│   │   ├── values.yaml
│   │   └── templates/
│   └── prod/
│       ├── values.yaml
│       └── templates/
├── base/
│   ├── inference-service-template.yaml
│   └── prometheus-rules.yaml
└── applicationset.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prerequisites before you start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes cluster (1.28 or newer)&lt;/li&gt;
&lt;li&gt;KServe 0.12 or newer installed&lt;/li&gt;
&lt;li&gt;ArgoCD 2.9 or newer installed&lt;/li&gt;
&lt;li&gt;Istio 1.20 or newer installed&lt;/li&gt;
&lt;li&gt;MLflow tracking server accessible from the cluster&lt;/li&gt;
&lt;li&gt;S3 bucket with appropriate IRSA or Workload Identity configured for KServe pods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ArgoCD ApplicationSet in this post assumes a Helm-based templating approach where each environment folder contains a values.yaml and a templates directory with the InferenceService and VirtualService manifests. You could also use Kustomize overlays. The concepts are identical.&lt;/p&gt;
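&lt;p&gt;If you go the Kustomize route, each environment folder holds an overlay instead of a values file. A minimal sketch, with paths illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# environments/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ml-serving-prod
resources:
  - ../../base
patches:
  - path: inference-service-patch.yaml   # bumps storageUri and replica counts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;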

&lt;p&gt;Start with dev only. Get one model version deploying cleanly through ArgoCD before adding staging and prod. Add the canary workflow only after the basic promotion gate is working reliably.&lt;/p&gt;

&lt;p&gt;The jump from "it works in dev" to "it's reliable in prod" is mostly about the Prometheus alerting and the canary soak automation. Those two pieces are what make the system trustworthy enough for the team to stop second-guessing every deployment.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kserve.github.io/website/" rel="noopener noreferrer"&gt;KServe Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://argo-cd.readthedocs.io/en/stable/user-guide/application-set/" rel="noopener noreferrer"&gt;ArgoCD ApplicationSets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlflow.org/docs/latest/model-registry.html" rel="noopener noreferrer"&gt;MLflow Model Registry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://istio.io/latest/docs/concepts/traffic-management/" rel="noopener noreferrer"&gt;Istio Traffic Management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus-operator.dev/docs/operator/api/" rel="noopener noreferrer"&gt;Prometheus Operator API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>GitOps for ML Model Deployment: A Real Pipeline, Not a Toy Demo</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sun, 08 Mar 2026 06:27:15 +0000</pubDate>
      <link>https://forem.com/mateenali66/gitops-for-ml-model-deployment-a-real-pipeline-not-a-toy-demo-1lk8</link>
      <guid>https://forem.com/mateenali66/gitops-for-ml-model-deployment-a-real-pipeline-not-a-toy-demo-1lk8</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I replaced ad-hoc model deployments with a fully declarative GitOps pipeline using KServe and ArgoCD. Every model version lives in Git, every change goes through a PR, and rollbacks take one &lt;code&gt;git revert&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Every ML team I've worked with has the same dirty secret: their model deployments are snowflakes.&lt;/p&gt;

&lt;p&gt;The Python script that "works on the data scientist's machine." The Slack message that says "hey can you deploy the new model." The SSH session into the GPU node that nobody documented. Meanwhile, the same team's microservices are humming along with ArgoCD, automated rollbacks, PR-gated deploys, full audit trails.&lt;/p&gt;

&lt;p&gt;That gap is embarrassing, and it's completely unnecessary.&lt;/p&gt;

&lt;p&gt;KServe got accepted into CNCF as an Incubating project in September 2025. The tooling to close this gap is mature enough for production. Here's what the actual problem looks like in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Someone manually SSHes into a node and runs a deployment script. No record of what version went live.&lt;/li&gt;
&lt;li&gt;A model update silently replaces the previous one. There's no rollback path.&lt;/li&gt;
&lt;li&gt;Two data scientists think different model versions are running in staging. Both are right, sort of.&lt;/li&gt;
&lt;li&gt;An incident happens. Nobody can tell what changed or when.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've lived through all of these. The fix isn't a better runbook or more Slack discipline. It's treating model deployments the same way we treat application deployments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6a0tytpn5lm76ukafy5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6a0tytpn5lm76ukafy5i.png" alt=" " width="800" height="979"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried First (And Why It Failed)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Attempt 1: Wrapping deployments in shell scripts
&lt;/h3&gt;

&lt;p&gt;The first instinct was to write a &lt;code&gt;deploy_model.sh&lt;/code&gt; that calls &lt;code&gt;kubectl apply&lt;/code&gt; with the right image tag. This is better than nothing, but it's not GitOps. The script lives somewhere, gets edited ad-hoc, and there's still no PR-gated workflow. The script is the new snowflake.&lt;/p&gt;
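&lt;p&gt;A minimal sketch of what such a script tends to look like (the file name, image tag, and manifest path are invented for illustration; the stand-in manifest just makes the in-place edit visible):&lt;/p&gt;

```shell
# deploy_model.sh -- the antipattern, sketched. Imperative, unreviewed, unrecorded.
MODEL_TAG="${1:-v2.5.0}"
# Stand-in manifest so the edit below is visible end to end:
printf 'image: registry.local/fraud-detector:v2.4.1\n' | tee inference-service.yaml
# Mutate the manifest directly -- no branch, no PR, no record of who bumped what:
sed -i "s|fraud-detector:v[0-9.]*|fraud-detector:${MODEL_TAG}|" inference-service.yaml
cat inference-service.yaml
# The real script would end with: kubectl apply -f inference-service.yaml
```

&lt;p&gt;It works, which is exactly why it survives. Nothing about it is reviewable or revertible.&lt;/p&gt;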

&lt;h3&gt;
  
  
  Attempt 2: Baking models into Docker images
&lt;/h3&gt;

&lt;p&gt;The idea: train the model, package the weights into a Docker image, deploy the image via a normal &lt;code&gt;Deployment&lt;/code&gt;. This works surprisingly well for small models under a few hundred MB. It breaks down fast once the model hits 2GB, let alone 14GB. Your Docker build times blow up, your registry costs climb, and now your CI pipeline is bottlenecked on model artifact size.&lt;/p&gt;

&lt;p&gt;More importantly, you lose the semantic layer. Your Git history shows &lt;code&gt;model:sha256-abc123&lt;/code&gt; instead of &lt;code&gt;fraud-detector/v2.5.0 sklearn 2 replicas 50 RPS target&lt;/code&gt;. The config and the artifact are fused. That's hard to review and harder to reason about.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 3: What actually worked
&lt;/h3&gt;

&lt;p&gt;Separate the artifact from the config. The model weights live in S3, content-addressed and immutable. Git holds the pointer and all the serving configuration. A Kubernetes controller keeps the cluster in sync with what Git says. That's it.&lt;/p&gt;
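&lt;p&gt;Sketched concretely (the bucket and file names are illustrative; the upload itself is left as a comment since it needs AWS credentials):&lt;/p&gt;

```shell
# One immutable prefix per version. A prefix is written once and never reused.
MODEL=fraud-detector
VERSION=v2.5.0
STORAGE_URI="s3://prod-ml-models/${MODEL}/${VERSION}"
echo "${STORAGE_URI}"
# Publish once:  aws s3 cp ./model.joblib "${STORAGE_URI}/model.joblib"
# Git commits only the pointer string above; the weights never enter the repo.
```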




&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The stack I use and recommend:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model serving&lt;/td&gt;
&lt;td&gt;KServe v0.14+&lt;/td&gt;
&lt;td&gt;Kubernetes-native CRD, multi-framework, built-in canary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitOps controller&lt;/td&gt;
&lt;td&gt;ArgoCD&lt;/td&gt;
&lt;td&gt;Declarative sync, health checks, rollback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model storage&lt;/td&gt;
&lt;td&gt;S3&lt;/td&gt;
&lt;td&gt;Content-addressable, versioned, immutable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model versioning&lt;/td&gt;
&lt;td&gt;MLflow&lt;/td&gt;
&lt;td&gt;Tracks lineage from training to deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ingress&lt;/td&gt;
&lt;td&gt;Istio&lt;/td&gt;
&lt;td&gt;Traffic splitting for canary rollouts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets&lt;/td&gt;
&lt;td&gt;AWS IRSA&lt;/td&gt;
&lt;td&gt;No credentials in Git, ever&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;KServe is the linchpin. It exposes a single &lt;code&gt;InferenceService&lt;/code&gt; CRD that ArgoCD manages like any other Kubernetes resource.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install KServe
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# cert-manager is a prerequisite&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/cert-manager/cert-manager/releases/download/v1.17.0/cert-manager.yaml

kubectl create ns kserve

helm &lt;span class="nb"&gt;install &lt;/span&gt;kserve-crd oci://ghcr.io/kserve/charts/kserve-crd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; v0.14.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kserve

helm &lt;span class="nb"&gt;install &lt;/span&gt;kserve oci://ghcr.io/kserve/charts/kserve &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; v0.14.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kserve &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; kserve.controller.deploymentMode&lt;span class="o"&gt;=&lt;/span&gt;RawDeployment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I use &lt;code&gt;RawDeployment&lt;/code&gt; mode. It uses standard Kubernetes Deployments and Services instead of Knative, which means fewer moving parts, better compatibility with existing Prometheus and HPA setups, and no cold-start complexity on the critical path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Structure your Git repo
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;models/
├── base/
│   └── kustomization.yaml
├── fraud-detector/
│   ├── kustomization.yaml
│   ├── inference-service.yaml
│   └── service-account.yaml
├── image-classifier/
│   ├── kustomization.yaml
│   └── inference-service.yaml
└── overlays/
    ├── staging/
    │   └── kustomization.yaml
    └── production/
        └── kustomization.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kustomize overlays let you parameterize resource limits, replica counts, and model URIs per environment without duplicating YAML.&lt;/p&gt;
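&lt;p&gt;A staging overlay, for example, can lower the replica floor without duplicating the base manifest (a sketch; it patches the &lt;code&gt;fraud-detector&lt;/code&gt; service defined in Step 3):&lt;/p&gt;

```yaml
# models/overlays/staging/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../fraud-detector
patches:
  - target:
      kind: InferenceService
      name: fraud-detector
    patch: |-
      - op: replace
        path: /spec/predictor/minReplicas
        value: 1
```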

&lt;h3&gt;
  
  
  Step 3: Define the InferenceService
&lt;/h3&gt;

&lt;p&gt;This is the core resource. Here's a real example for a scikit-learn fraud detection model stored in S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/fraud-detector/inference-service.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
    &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-platform&lt;/span&gt;
    &lt;span class="na"&gt;model-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.4.1"&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serving.kserve.io/deploymentMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RawDeployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;scaleTarget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
    &lt;span class="na"&gt;scaleMetric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rps&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;modelFormat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sklearn&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://prod-ml-models/fraud-detector/v2.4.1"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1Gi"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
      &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SKLEARN_SERVER_WORKERS&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;storageUri&lt;/code&gt; is the version pointer. Bumping &lt;code&gt;v2.4.1&lt;/code&gt; to &lt;code&gt;v2.5.0&lt;/code&gt; and raising a PR is your deploy-new-model workflow.&lt;/p&gt;
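&lt;p&gt;The mechanics are worth seeing once. Here's the bump sketched against a scratch repo standing in for the manifests repo (the manifest is abbreviated to the one line that changes):&lt;/p&gt;

```shell
# Scratch repo standing in for the Step 2 layout.
git init -q bump-demo
cd bump-demo
mkdir -p models/fraud-detector
printf 'storageUri: "s3://prod-ml-models/fraud-detector/v2.4.1"\n' \
  | tee models/fraud-detector/inference-service.yaml
git add -A
git -c user.email=ci@example.com -c user.name=ci commit -qm "fraud-detector v2.4.1"
# The deploy: bump the pointer on a branch, then merge it via PR.
git checkout -q -b bump-fraud-detector-v2.5.0
sed -i 's|v2\.4\.1|v2.5.0|' models/fraud-detector/inference-service.yaml
git -c user.email=ci@example.com -c user.name=ci commit -qam "fraud-detector: v2.4.1 to v2.5.0"
```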

&lt;p&gt;For GPU workloads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/image-classifier/inference-service.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;image-classifier&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.3.0"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;modelFormat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pytorch&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://prod-ml-models/image-classifier/v1.3.0"&lt;/span&gt;
      &lt;span class="na"&gt;runtimeVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;23.08-py3"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
          &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
          &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;accelerator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-a10g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Wire up the S3 service account
&lt;/h3&gt;

&lt;p&gt;Don't put AWS credentials in manifests. Use IRSA on EKS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/fraud-detector/service-account.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-s3-sa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/kserve-model-reader&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The IAM role needs &lt;code&gt;s3:GetObject&lt;/code&gt; and &lt;code&gt;s3:ListBucket&lt;/code&gt; on your model bucket. KServe's storage initializer picks up the IRSA token automatically.&lt;/p&gt;
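&lt;p&gt;The permissions policy is two statements (a sketch, scoped to the example bucket; &lt;code&gt;ListBucket&lt;/code&gt; applies to the bucket ARN, &lt;code&gt;GetObject&lt;/code&gt; to the objects under it):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadModelObjects",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::prod-ml-models/*"
    },
    {
      "Sid": "ListModelBucket",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::prod-ml-models"
    }
  ]
}
```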

&lt;h3&gt;
  
  
  Step 5: Create the ArgoCD Application
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# argocd/apps/ml-models.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-models&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
  &lt;span class="na"&gt;finalizers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;resources-finalizer.argocd.argoproj.io&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-platform&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/phonotech/ml-manifests&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;models/overlays/production&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;RespectIgnoreDifferences=true&lt;/span&gt;
    &lt;span class="na"&gt;retry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;backoff&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
        &lt;span class="na"&gt;factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
        &lt;span class="na"&gt;maxDuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3m&lt;/span&gt;
  &lt;span class="na"&gt;ignoreDifferences&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
      &lt;span class="na"&gt;jsonPointers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/status&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/metadata/annotations/serving.kserve.io~1deploymentMode&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ignoreDifferences&lt;/code&gt; block is critical. KServe's controller writes back to the &lt;code&gt;InferenceService&lt;/code&gt; status and some annotations. Without it, ArgoCD will perpetually detect drift and attempt to re-sync, creating a noisy feedback loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: The deployment workflow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfpkxi8khe6ktal86qg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfpkxi8khe6ktal86qg1.png" alt=" " width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what a model update looks like end to end:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data scientist trains a new model, registers the artifact in MLflow, uploads weights to &lt;code&gt;s3://prod-ml-models/fraud-detector/v2.5.0/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;They open a PR updating &lt;code&gt;storageUri&lt;/code&gt; and the &lt;code&gt;model-version&lt;/code&gt; label in &lt;code&gt;inference-service.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;PR gets reviewed and merged to &lt;code&gt;main&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;ArgoCD detects the diff within 3 minutes (or immediately with webhooks), syncs the new &lt;code&gt;InferenceService&lt;/code&gt; spec&lt;/li&gt;
&lt;li&gt;KServe's storage initializer pulls the new weights into the pod&lt;/li&gt;
&lt;li&gt;New revision comes up healthy, traffic cuts over&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The model version is in Git history. You can &lt;code&gt;git revert&lt;/code&gt; it. You can see exactly what changed between &lt;code&gt;v2.4.1&lt;/code&gt; and &lt;code&gt;v2.5.0&lt;/code&gt; in the PR diff.&lt;/p&gt;
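&lt;p&gt;The revert is worth demonstrating. In a scratch repo (standing in for the manifests repo, with the manifest abbreviated to the pointer line), one command restores the old version, and ArgoCD syncs it like any other commit:&lt;/p&gt;

```shell
git init -q revert-demo
cd revert-demo
printf 'storageUri: "s3://prod-ml-models/fraud-detector/v2.4.1"\n' | tee isvc.yaml
git add isvc.yaml
git -c user.email=ci@example.com -c user.name=ci commit -qm "v2.4.1"
sed -i 's|v2\.4\.1|v2.5.0|' isvc.yaml
git -c user.email=ci@example.com -c user.name=ci commit -qam "bump to v2.5.0"
# The entire rollback:
git -c user.email=ci@example.com -c user.name=ci revert --no-edit HEAD
grep storageUri isvc.yaml
```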

&lt;p&gt;To trigger ArgoCD immediately via webhook from GitHub Actions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/sync-models.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Notify ArgoCD on model manifest change&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;models/**'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;sync&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trigger ArgoCD sync&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -s -X POST \&lt;/span&gt;
            &lt;span class="s"&gt;-H "Authorization: Bearer ${{ secrets.ARGOCD_TOKEN }}" \&lt;/span&gt;
            &lt;span class="s"&gt;https://argocd.internal.ca/api/v1/applications/ml-models/sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Canary rollouts
&lt;/h3&gt;

&lt;p&gt;KServe's built-in canary support is where this pattern earns its keep.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: Deploy canary at 10% traffic&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;predictor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;canaryTrafficPercent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;modelFormat&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sklearn&lt;/span&gt;
      &lt;span class="na"&gt;storageUri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://prod-ml-models/fraud-detector/v2.5.0"&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;KServe automatically routes 90% to the last stable revision and 10% to v2.5.0. If the new model performs well, merge another PR bumping &lt;code&gt;canaryTrafficPercent&lt;/code&gt; to 50, then promote to 100 by removing the field. If the canary is bad, set &lt;code&gt;canaryTrafficPercent: 0&lt;/code&gt; to pin back to stable immediately.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;RawDeployment&lt;/code&gt; mode, KServe's built-in traffic split isn't available, so you handle the canary at the Istio level instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# istio/virtualservice-fraud-detector.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VirtualService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-serving&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;fraud-detector.ml-serving.svc.cluster.local&lt;/span&gt;
  &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-v2-4-1-predictor&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fraud-detector-v2-5-0-predictor&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both the &lt;code&gt;InferenceService&lt;/code&gt; and the &lt;code&gt;VirtualService&lt;/code&gt; are in Git. The traffic split is in Git. Everything is auditable and revertible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt50j7pwdqb5nzjsnm57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt50j7pwdqb5nzjsnm57.png" alt=" " width="800" height="1459"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;I won't pretend I have clean before/after numbers from a single project because this pattern spans multiple engagements. Here's what consistently holds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model deployment method&lt;/td&gt;
&lt;td&gt;Manual SSH or ad-hoc scripts&lt;/td&gt;
&lt;td&gt;PR-gated, Git-backed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;None or Slack history&lt;/td&gt;
&lt;td&gt;Full Git history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback time&lt;/td&gt;
&lt;td&gt;30 minutes to hours&lt;/td&gt;
&lt;td&gt;One &lt;code&gt;git revert&lt;/code&gt;, seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Canary traffic split&lt;/td&gt;
&lt;td&gt;Not possible without Istio knowledge&lt;/td&gt;
&lt;td&gt;Config field in YAML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to detect config drift&lt;/td&gt;
&lt;td&gt;Never (no baseline)&lt;/td&gt;
&lt;td&gt;Continuous, ArgoCD UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secret management&lt;/td&gt;
&lt;td&gt;Often hard-coded or in &lt;code&gt;.env&lt;/code&gt; files&lt;/td&gt;
&lt;td&gt;IRSA, no credentials in Git&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The operational improvement that surprises people most: the on-call burden drops significantly when you can answer "what version is running, what changed, who approved it" in under 30 seconds by looking at Git.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The &lt;code&gt;ignoreDifferences&lt;/code&gt; config is not optional.&lt;/strong&gt; Skip it and you'll spend a weekend wondering why ArgoCD is perpetually out of sync when nothing real has changed. KServe mutates its own resources, so you have to tell ArgoCD which fields to ignore.&lt;/p&gt;
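&lt;p&gt;A minimal sketch of that fragment, which goes under &lt;code&gt;spec:&lt;/code&gt; in your ArgoCD &lt;code&gt;Application&lt;/code&gt;. The exact paths KServe mutates vary by version, so treat the ones below as placeholders and confirm them against a real &lt;code&gt;argocd app diff&lt;/code&gt; first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;ignoreDifferences:
  - group: serving.kserve.io
    kind: InferenceService
    # Illustrative paths only; KServe's defaulting webhook
    # decides which fields actually get rewritten.
    jqPathExpressions:
      - .metadata.annotations
      - .spec.predictor.containers[].resources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;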

&lt;p&gt;&lt;strong&gt;2. Model size determines your storage strategy.&lt;/strong&gt; Under 500MB, the default S3 init container approach is fine. Over a few GB, you need a shared model cache PVC or a pre-baked image. Planning this up front saves a painful migration later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Always set &lt;code&gt;nodeSelector&lt;/code&gt; for GPU workloads.&lt;/strong&gt; Without it, your &lt;code&gt;InferenceService&lt;/code&gt; might land on a CPU node and silently fall back to CPU inference. Set the affinity, set the tolerations, pin it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Start with &lt;code&gt;RawDeployment&lt;/code&gt; mode.&lt;/strong&gt; Knative is powerful but it adds complexity. Get the core pattern working first, then add Knative if you genuinely need scale-to-zero economics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. GitOps creates friction on purpose.&lt;/strong&gt; The PR workflow adds a step that direct &lt;code&gt;kubectl apply&lt;/code&gt; doesn't. That step is the point. If your team resents the friction, they haven't lived through the 2am incident where nobody knows what changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The five things you actually need to get started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;KServe installed (Helm, RawDeployment mode, cert-manager prerequisite)&lt;/li&gt;
&lt;li&gt;A models-manifests repo with &lt;code&gt;InferenceService&lt;/code&gt; YAML per model, Kustomize overlays for environments&lt;/li&gt;
&lt;li&gt;ArgoCD Application pointing at &lt;code&gt;overlays/production&lt;/code&gt;, &lt;code&gt;selfHeal: true&lt;/code&gt;, with &lt;code&gt;ignoreDifferences&lt;/code&gt; on KServe status fields&lt;/li&gt;
&lt;li&gt;IRSA or Workload Identity for S3 access&lt;/li&gt;
&lt;li&gt;Branch protection on &lt;code&gt;main&lt;/code&gt; so model version bumps require PR review&lt;/li&gt;
&lt;/ol&gt;
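&lt;p&gt;Item 3 above can be sketched as a single manifest. The repo URL and namespaces here are placeholders for your own setup, and you'd add the &lt;code&gt;ignoreDifferences&lt;/code&gt; entries for whatever fields your KServe version mutates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: models-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/models-manifests  # placeholder
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: models  # placeholder
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;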

&lt;p&gt;The canary rollout and GitHub Actions webhook are enhancements. Get the core working first.&lt;/p&gt;




</description>
      <category>kubernetes</category>
      <category>mlops</category>
      <category>gitops</category>
      <category>argocd</category>
    </item>
    <item>
      <title>I Migrated a Real Production Codebase from Terraform to OpenTofu (Here's What Broke)</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sun, 08 Mar 2026 06:25:03 +0000</pubDate>
      <link>https://forem.com/mateenali66/i-migrated-a-real-production-codebase-from-terraform-to-opentofu-heres-what-broke-4j1b</link>
      <guid>https://forem.com/mateenali66/i-migrated-a-real-production-codebase-from-terraform-to-opentofu-heres-what-broke-4j1b</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Migrating a standard AWS Terraform codebase to OpenTofu took half a day, most of which was CI pipeline updates. The S3 native locking alone made it worth it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I've been writing Terraform since version 0.8. Watched it grow from a scrappy infrastructure tool into the de facto standard for cloud automation. I've migrated teams from CloudFormation to Terraform, written custom providers, debugged state corruption at 2 AM. Terraform is baked into how I think about infrastructure.&lt;/p&gt;

&lt;p&gt;So when HashiCorp switched to the Business Source License in August 2023, I did what most practitioners did: I shrugged, bookmarked the OpenTofu repo, and went back to building.&lt;/p&gt;

&lt;p&gt;That bookmark sat there for two years.&lt;/p&gt;

&lt;p&gt;The BSL doesn't prevent you from using Terraform. It prevents you from building a product or service that's "substantially similar" to Terraform Cloud or Terraform Enterprise. For most teams running internal infrastructure, the risk is low. But once you're building a platform team that exposes self-service infrastructure to internal customers, or packaging IaC automation as part of a managed service, your legal team might want a conversation. And once "get legal sign-off on our IaC toolchain" is on the agenda, you've already lost an afternoon you'll never get back.&lt;/p&gt;

&lt;p&gt;For a Phono Technologies project, we were building a lightweight CI/CD orchestration layer for client infrastructure. The moment I tried to describe it, I realized I was describing exactly what the BSL restricts. The ambiguity was real enough that I wanted it gone.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried First (And Why It Failed)
&lt;/h2&gt;

&lt;p&gt;My first instinct was to just drop in the &lt;code&gt;tofu&lt;/code&gt; binary and run &lt;code&gt;tofu init&lt;/code&gt;. Simple enough.&lt;/p&gt;

&lt;p&gt;It almost worked. Until I checked where providers were being pulled from.&lt;/p&gt;

&lt;p&gt;OpenTofu fetches providers from &lt;code&gt;registry.opentofu.org&lt;/code&gt;, not &lt;code&gt;registry.terraform.io&lt;/code&gt;. The registries mirror each other for HashiCorp providers, but your existing &lt;code&gt;.terraform.lock.hcl&lt;/code&gt; was generated against Terraform's registry. The provider hashes don't match.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Failed to install provider

To install this provider, OpenTofu needs to verify that the checksums in
.terraform.lock.hcl match the provider packages downloaded from the registry.
The following packages are required but the checksums don't match:
  registry.opentofu.org/hashicorp/aws v5.82.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also ran into teammates who still had the old Terraform-generated lock files. Some ran &lt;code&gt;tofu plan&lt;/code&gt; on their local branches and got hash mismatches in the other direction. The lesson: this has to be a coordinated team migration, not a quiet swap on your own laptop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The codebase: a mid-sized AWS platform for a SaaS client. Around 8,000 lines of Terraform across 12 modules. Standard providers: &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;kubernetes&lt;/code&gt;, &lt;code&gt;helm&lt;/code&gt;, &lt;code&gt;random&lt;/code&gt;, &lt;code&gt;tls&lt;/code&gt;. S3 backend for state, one workspace per environment. CI via GitHub Actions. No Terraform Cloud, no HCP.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon677jiral0tagg9875q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon677jiral0tagg9875q.png" alt=" " width="800" height="1043"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Back up everything
&lt;/h3&gt;

&lt;p&gt;Before touching anything, tag the current state in git and pull a snapshot of your state file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git tag pre-opentofu-migration

terraform state pull &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; terraform.tfstate.backup-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're on S3, enable versioning before you start. You want a timestamped rollback point. Non-negotiable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Install tofu alongside terraform
&lt;/h3&gt;

&lt;p&gt;The two binaries coexist without conflict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;opentofu
tofu &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# OpenTofu v1.11.4&lt;/span&gt;
&lt;span class="c"&gt;# on darwin_arm64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep &lt;code&gt;terraform&lt;/code&gt; installed until you're confident the migration is complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Delete the lock file and re-init
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; .terraform.lock.hcl
tofu init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;tofu init&lt;/code&gt; regenerates the lock file with hashes for both &lt;code&gt;registry.opentofu.org&lt;/code&gt; and &lt;code&gt;registry.terraform.io&lt;/code&gt; providers, signed by OpenTofu's key infrastructure. Commit the new lock file and tell your team to re-run &lt;code&gt;tofu init&lt;/code&gt; on their local copies.&lt;/p&gt;

&lt;p&gt;Once you commit the new lock file, treat the repo as an OpenTofu project. Don't run &lt;code&gt;terraform init&lt;/code&gt; on the same directory afterward. The two binaries will fight over hashes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Check your &lt;code&gt;terraform {}&lt;/code&gt; block
&lt;/h3&gt;

&lt;p&gt;You don't have to rename it. OpenTofu still accepts the &lt;code&gt;terraform {}&lt;/code&gt; block. Your existing HCL works without modification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This works fine in OpenTofu, no changes needed&lt;/span&gt;
&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;required_version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&amp;gt;= 1.5.0"&lt;/span&gt;

  &lt;span class="nx"&gt;required_providers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;aws&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hashicorp/aws"&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 5.0"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-terraform-state"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production/terraform.tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
    &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-state-locks"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can leave it as &lt;code&gt;terraform {}&lt;/code&gt; or rename it to &lt;code&gt;tofu {}&lt;/code&gt;. Both work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Verify with &lt;code&gt;tofu plan&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tofu plan &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;migration-test.tfplan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected result: no changes. If you see changes, do not apply. Investigate first. It usually means a provider version difference or a schema update.&lt;/p&gt;

&lt;p&gt;I got zero changes across all three environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Drop DynamoDB for S3 native locking
&lt;/h3&gt;

&lt;p&gt;This is where OpenTofu pulls ahead. OpenTofu 1.10.0 added native conditional writes for S3 state locking. No DynamoDB table required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-state-bucket"&lt;/span&gt;
  &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod/terraform.tfstate"&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
  &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-locks"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-state-bucket"&lt;/span&gt;
  &lt;span class="nx"&gt;key&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod/terraform.tfstate"&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
  &lt;span class="nx"&gt;encrypt&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;use_lockfile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fewer moving parts. One less AWS service to manage. Simpler IAM permissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujqnfnrt73mfc7mow595.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujqnfnrt73mfc7mow595.png" alt=" " width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Update your CI pipeline
&lt;/h3&gt;

&lt;p&gt;Every place your pipeline runs &lt;code&gt;terraform&lt;/code&gt;, you need &lt;code&gt;tofu&lt;/code&gt;. In GitHub Actions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hashicorp/setup-terraform@v3&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;terraform_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.9.5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;opentofu/setup-opentofu@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tofu_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.11.4"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;opentofu/setup-opentofu&lt;/code&gt; action is the official GitHub Action. Clean swap.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;State locking dependencies&lt;/td&gt;
&lt;td&gt;S3 + DynamoDB&lt;/td&gt;
&lt;td&gt;S3 only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB tables&lt;/td&gt;
&lt;td&gt;3 (one per environment)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Migration time&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;4 hours (including CI updates)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plan output differences&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sensitive values in state&lt;/td&gt;
&lt;td&gt;Persisted&lt;/td&gt;
&lt;td&gt;Ephemeral (with 1.11 features)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The operational simplicity of dropping DynamoDB is hard to quantify in a table. It's one less service in IAM policies, one less resource to manage in the state backend module, one less thing that can drift or get misconfigured.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Coordinate the lock file migration as a team.&lt;/strong&gt; If half your team is still running &lt;code&gt;terraform init&lt;/code&gt;, you'll get hash conflicts. Announce the cutover date, have everyone delete and regenerate their lock files on the same day.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pin your OpenTofu version in CI.&lt;/strong&gt; The 1.11 series shipped a notable regression in 1.11.0 that wasn't fixed until 1.11.2. The team moves fast. Pin to a specific patch version in CI and upgrade deliberately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;terraform {}&lt;/code&gt; block is fine.&lt;/strong&gt; Don't waste time renaming it. The binary changed; the HCL didn't.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The point of no return is &lt;code&gt;tofu apply&lt;/code&gt;.&lt;/strong&gt; After you run apply, the state metadata reflects OpenTofu's version. You can still read the state with Terraform, but you'll get warnings. Decide before you apply whether you're committed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ephemeral values are worth understanding.&lt;/strong&gt; OpenTofu 1.11.0 introduced ephemeral resources and write-only attributes. Sensitive credentials can be used without ever landing in state. If you've been papering over this with Vault workarounds, it's worth reading the docs before you finish the migration.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;ephemeral&lt;/span&gt; &lt;span class="s2"&gt;"aws_secretsmanager_secret_version"&lt;/span&gt; &lt;span class="s2"&gt;"db_password"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;secret_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_secretsmanager_secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"kubernetes_secret_v1"&lt;/span&gt; &lt;span class="s2"&gt;"db_credentials"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"db-credentials"&lt;/span&gt;
    &lt;span class="nx"&gt;namespace&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;data_wo&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;password&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ephemeral&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_secretsmanager_secret_version&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db_password&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;secret_string&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;data_wo_revision&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenTofu Migration Guide:&lt;/strong&gt; &lt;a href="https://opentofu.org/docs/intro/migration/migration-guide/" rel="noopener noreferrer"&gt;opentofu.org/docs/intro/migration&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>opensource</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Drift Detection in Air-Gapped Workloads: What Nobody Tells You</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 21 Feb 2026 06:32:18 +0000</pubDate>
      <link>https://forem.com/mateenali66/drift-detection-in-air-gapped-workloads-what-nobody-tells-you-3eb9</link>
      <guid>https://forem.com/mateenali66/drift-detection-in-air-gapped-workloads-what-nobody-tells-you-3eb9</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Standard drift detection breaks in air-gapped environments because every major tool assumes cloud API access. The fix is decentralized reconciliation with local state management, not trying to force connected tools into disconnected networks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Assumption That Breaks Everything
&lt;/h2&gt;

&lt;p&gt;Every popular drift detection tool makes the same assumption: your infrastructure can reach the internet.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;terraform plan&lt;/code&gt; calls AWS APIs. Argo CD pulls from remote Git repos. Spacelift runs scans from a SaaS control plane. These tools work brilliantly in connected environments. The moment you drop them into an air-gapped network, they go silent.&lt;/p&gt;

&lt;p&gt;I've spent the better part of a decade building infrastructure for organizations where connectivity isn't optional; it's forbidden. Government agencies, defense contractors, healthcare systems, financial trading floors. These environments are disconnected by design, not by accident. And drift detection in these networks is a fundamentally different problem than what most DevOps engineers encounter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Air-Gapped Workloads Drift Differently
&lt;/h2&gt;

&lt;p&gt;In a connected environment, drift happens and gets caught relatively fast. Someone clicks through the console, Terraform Cloud flags it on the next scan, you fix it. The feedback loop is tight.&lt;/p&gt;

&lt;p&gt;In air-gapped environments, drift accumulates silently.&lt;/p&gt;

&lt;p&gt;A sysadmin patches a node manually because the automated pipeline can't reach the package mirror. A developer tweaks a ConfigMap directly because the GitOps controller lost sync with the local Git server. An operator scales a deployment by hand during an incident and forgets to commit the change.&lt;/p&gt;

&lt;p&gt;These changes compound. By the time anyone runs a manual audit, the gap between declared state and actual state can be enormous.&lt;/p&gt;

&lt;p&gt;The core problem: &lt;strong&gt;connected drift detection is continuous and automated. Disconnected drift detection is episodic and manual.&lt;/strong&gt; That gap is where compliance violations, security incidents, and late night pages live.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Doesn't Work (And Why Teams Keep Trying)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Terraform Plan Over VPN
&lt;/h3&gt;

&lt;p&gt;The most common first attempt: tunnel &lt;code&gt;terraform plan&lt;/code&gt; through a VPN into the air-gapped network.&lt;/p&gt;

&lt;p&gt;Problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency kills the feedback loop.&lt;/strong&gt; Provider API calls that take milliseconds on the internet take seconds over a restricted VPN. A plan that runs in 30 seconds now takes 15 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial connectivity isn't air-gapped.&lt;/strong&gt; If your "air-gapped" network has a VPN tunnel to SaaS tooling, your security team has questions. Valid ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State file synchronization becomes a bottleneck.&lt;/strong&gt; Remote state backends (S3, Consul) need connectivity. Local state files create merge conflicts when multiple operators work simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GitOps Controllers Pointed at External Repos
&lt;/h3&gt;

&lt;p&gt;Flux CD and Argo CD are excellent GitOps tools. But pointing them at a GitHub repo from an air-gapped cluster means... you don't have an air-gapped cluster anymore.&lt;/p&gt;

&lt;p&gt;Running a local Git server (Gitea, GitLab) inside the perimeter fixes the connectivity problem but creates a new one: keeping the local repo in sync with the source of truth requires a deliberate, auditable transfer process. USB drives, data diodes, or scheduled one-way syncs all introduce delay. That delay is where drift happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Periodic Manual Audits
&lt;/h3&gt;

&lt;p&gt;The fallback everyone hates: someone SSHes in, runs a bunch of comparison scripts, and writes a report.&lt;/p&gt;

&lt;p&gt;This catches drift after the fact. In regulated environments, "we check quarterly" doesn't satisfy auditors who want continuous compliance evidence. And manual audits miss things. Every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;After iterating through the failures above across multiple engagements, three patterns consistently work in production air-gapped environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Decentralized Policy Agents
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5e6so0bjwpy9ojikbd5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5e6so0bjwpy9ojikbd5a.png" alt=" " width="800" height="1225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of a central control plane that reaches into clusters, deploy autonomous policy agents inside each air-gapped cluster.&lt;/p&gt;

&lt;p&gt;Each agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores the desired state locally (pulled in during the last approved sync window)&lt;/li&gt;
&lt;li&gt;Runs a continuous reconciliation loop comparing desired vs. actual state&lt;/li&gt;
&lt;li&gt;Logs every deviation to a local audit store&lt;/li&gt;
&lt;li&gt;Remediates automatically when configured to do so, or raises alerts for manual review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the pattern that Spectro Cloud Palette uses, and it's the right mental model. The cluster enforces its own policy. It doesn't need to phone home.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: OPA Gatekeeper constraint running locally&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;constraints.gatekeeper.sh/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;K8sRequiredLabels&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-team-label&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Namespace"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost-center"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gatekeeper runs entirely inside the cluster. No external connectivity needed. Violations are logged locally and can be exported during sync windows.&lt;/p&gt;
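&lt;p&gt;One thing the snippet above glosses over: a constraint is only half of the pair. The matching &lt;code&gt;ConstraintTemplate&lt;/code&gt; has to be applied inside the perimeter too, before any &lt;code&gt;K8sRequiredLabels&lt;/code&gt; resource will validate. A sketch adapted from the Gatekeeper policy library, which you should verify against the Gatekeeper version you ship in through your sync window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: object
                properties:
                  key:
                    type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # Flag any matched object that lacks one of the required label keys.
        violation[{"msg": msg}] {
          required := {key | key := input.parameters.labels[_].key}
          provided := {label | input.review.object.metadata.labels[label]}
          missing := required - provided
          count(missing) &amp;gt; 0
          msg := sprintf("missing required labels: %v", [missing])
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;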

&lt;h3&gt;
  
  
  Pattern 2: Local State Snapshots with Diff-on-Sync
&lt;/h3&gt;

&lt;p&gt;For Terraform managed infrastructure, maintain state snapshots inside the air-gapped environment.&lt;/p&gt;

&lt;p&gt;The workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Declare state&lt;/strong&gt; in your IaC repo outside the air gap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer the repo&lt;/strong&gt; into the environment through your approved media (data diode, approved USB, one-way sync)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run &lt;code&gt;terraform plan&lt;/code&gt;&lt;/strong&gt; inside the air gap against local provider endpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot the actual state&lt;/strong&gt; after each apply&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diff the snapshot&lt;/strong&gt; against the expected state on a cron schedule&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export the diff report&lt;/strong&gt; during the next sync window&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight: the state file and the provider APIs both live inside the perimeter. &lt;code&gt;terraform plan&lt;/code&gt; works fine when everything it needs to reach is local.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# drift_check.sh - runs inside the air-gapped environment&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/log/drift-reports"&lt;/span&gt;

terraform plan &lt;span class="nt"&gt;-detailed-exitcode&lt;/span&gt; &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/plan_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.tfplan"&lt;/span&gt; 2&amp;gt;&amp;amp;1 | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;tee&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/drift_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.log"&lt;/span&gt;

&lt;span class="nv"&gt;EXIT_CODE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PIPESTATUS&lt;/span&gt;&lt;span class="p"&gt;[0]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$EXIT_CODE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 2 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"DRIFT_DETECTED"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DRIFT_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/status_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="c"&gt;# Alert local monitoring&lt;/span&gt;
  curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://alertmanager.local:9093/api/v1/alerts &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'[{"labels":{"alertname":"InfrastructureDrift","severity":"warning"}}]'&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pattern 3: Immutable Baselines with Checksum Verification
&lt;/h3&gt;

&lt;p&gt;For the most sensitive environments (defense, critical infrastructure), treat infrastructure state like a software artifact.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build a golden baseline&lt;/strong&gt; of every resource's expected configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate checksums&lt;/strong&gt; (SHA-256) for each configuration artifact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy a lightweight agent&lt;/strong&gt; that periodically recalculates checksums on live resources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Any mismatch triggers an immediate alert&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is coarser than Terraform drift detection, but it works without any provider APIs. It's closer to file integrity monitoring (think AIDE or OSSEC) applied to infrastructure configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# baseline_check.py - infrastructure checksum verification
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_resource_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Capture current state of a Kubernetes resource.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Strip volatile fields that change on every read
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resourceVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creationTimestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;managedFields&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;checksum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate deterministic checksum of resource state.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;verify_baseline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compare live state against stored baseline checksums.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;drift_detected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_resource_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;drift_detected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MISSING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;checksum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;drift_detected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MODIFIED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current_hash&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;drift_detected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Choosing the Right Pattern
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;Audit Trail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decentralized Agents&lt;/td&gt;
&lt;td&gt;Kubernetes clusters&lt;/td&gt;
&lt;td&gt;Real-time&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local State Snapshots&lt;/td&gt;
&lt;td&gt;Terraform/IaC resources&lt;/td&gt;
&lt;td&gt;Minutes (cron)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Checksum Baselines&lt;/td&gt;
&lt;td&gt;High-security environments&lt;/td&gt;
&lt;td&gt;Minutes (cron)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practice, most air-gapped environments use a combination. Gatekeeper handles Kubernetes policy enforcement in real time. Terraform drift checks run on a cron inside the perimeter. Checksum baselines provide an additional layer for the security team.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compliance Angle
&lt;/h2&gt;

&lt;p&gt;Auditors care about three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Can you prove your infrastructure matches the declared state?&lt;/strong&gt; Drift reports with timestamps answer this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How quickly do you detect deviations?&lt;/strong&gt; "Within minutes" beats "at the next quarterly audit."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens when drift is detected?&lt;/strong&gt; You need a defined response: automated remediation or a documented manual review process.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Air-gapped environments often have stricter compliance requirements than connected ones. The irony is that their tooling for meeting those requirements is worse. Building local drift detection infrastructure closes that gap.&lt;/p&gt;
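&lt;p&gt;The first two auditor questions reduce to producing a timestamped artifact. A minimal sketch of what such a report could look like; the field names and the &lt;code&gt;environment&lt;/code&gt; value are illustrative, not any compliance standard:&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

def build_drift_report(drift_items, environment="airgap-prod"):
    """Assemble a timestamped drift report suitable for an audit trail."""
    return {
        "report_type": "infrastructure-drift",
        "environment": environment,
        # UTC timestamp proves *when* the deviation was detected
        "detected_at": datetime.now(timezone.utc).isoformat(),
        "drift_count": len(drift_items),
        "status": "DRIFT_DETECTED" if drift_items else "CLEAN",
        "items": drift_items,
    }

def write_report(drift_items, path):
    """Persist the report for export during the next sync window."""
    with open(path, "w") as f:
        json.dump(build_drift_report(drift_items), f, indent=2, sort_keys=True)
```

&lt;p&gt;Feeding it the output of a checker like &lt;code&gt;verify_baseline&lt;/code&gt; gives you a file per run that can be exported during sync windows and handed to auditors as-is.&lt;/p&gt;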

&lt;h2&gt;
  
  
  Lessons From the Field
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Treat sync windows as deployment events.&lt;/strong&gt; When new policy or desired state enters the air-gapped environment, that transfer should go through the same review process as a production deployment. Because it is one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Log everything locally, export periodically.&lt;/strong&gt; Build a local ELK or Loki stack inside the perimeter. Drift events, remediation actions, audit logs. Export summaries during sync windows for central visibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Test your drift detection in staging first.&lt;/strong&gt; Introduce intentional drift in a staging cluster and verify your agents catch it. I've seen teams deploy Gatekeeper and assume it works, only to discover six months later that their constraints had a typo that prevented enforcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Don't fight the air gap.&lt;/strong&gt; The biggest mistake is trying to poke holes in the network boundary to make connected tools work. Every hole is an attack surface. Build for disconnection. It's simpler in the long run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Version your baselines.&lt;/strong&gt; When the approved state changes (through a sync window), update the baseline checksums and keep the old ones. This gives you a historical record of what the environment should have looked like at any point in time.&lt;/p&gt;
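&lt;p&gt;Versioning can be as simple as never overwriting: each approved baseline gets its own timestamped file, and a pointer file marks the current one. A sketch under those assumptions; the directory layout and the &lt;code&gt;CURRENT&lt;/code&gt; pointer convention are hypothetical, not a standard:&lt;/p&gt;

```python
import json
import os
from datetime import datetime, timezone

def save_baseline_version(baseline, directory="/var/lib/drift/baselines"):
    """Write a newly approved baseline without overwriting history."""
    os.makedirs(directory, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = os.path.join(directory, f"baseline_{stamp}.json")
    with open(path, "w") as f:
        json.dump(baseline, f, sort_keys=True, indent=2)
    # CURRENT is a pointer file, not a copy: old versions stay immutable
    with open(os.path.join(directory, "CURRENT"), "w") as f:
        f.write(path + "\n")
    return path

def load_current_baseline(directory="/var/lib/drift/baselines"):
    """Resolve the pointer and load the active baseline."""
    with open(os.path.join(directory, "CURRENT")) as f:
        path = f.read().strip()
    with open(path) as f:
        return json.load(f)
```

&lt;p&gt;Because old files are never touched, answering "what should the environment have looked like on March 3rd?" is a directory listing, not an archaeology project.&lt;/p&gt;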

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>terraform</category>
      <category>security</category>
    </item>
    <item>
      <title>OpenClaw for SRE: Self-Hosted AI Agents That Actually Respond to Incidents</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Sat, 21 Feb 2026 06:27:44 +0000</pubDate>
      <link>https://forem.com/mateenali66/openclaw-for-sre-self-hosted-ai-agents-that-actually-respond-to-incidents-5279</link>
      <guid>https://forem.com/mateenali66/openclaw-for-sre-self-hosted-ai-agents-that-actually-respond-to-incidents-5279</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; OpenClaw is a self-hosted AI agent framework that connects to Slack, Teams, and other channels. For SRE teams, it's a way to build incident response automation that runs entirely on your infrastructure, with custom skills for runbook execution, alert triage, and operational context.&lt;/p&gt;




&lt;h2&gt;
  
  
  The SRE Automation Gap
&lt;/h2&gt;

&lt;p&gt;Every SRE team I've worked with has the same problem: too many alerts, not enough context, and runbooks that exist but don't get followed at 3 AM.&lt;/p&gt;

&lt;p&gt;The typical incident response flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;PagerDuty fires an alert&lt;/li&gt;
&lt;li&gt;On-call engineer wakes up, opens laptop&lt;/li&gt;
&lt;li&gt;Checks Slack for context (is anyone else awake?)&lt;/li&gt;
&lt;li&gt;Opens Grafana, tries to find the relevant dashboard&lt;/li&gt;
&lt;li&gt;Searches Confluence for the runbook&lt;/li&gt;
&lt;li&gt;Realizes the runbook is outdated&lt;/li&gt;
&lt;li&gt;Starts troubleshooting from scratch&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 2 through 6 consume 15 to 30 minutes before any real diagnosis begins. For a P1 incident at scale, that's the difference between a blip and an outage that hits the status page.&lt;/p&gt;

&lt;p&gt;SaaS tools like PagerDuty's AIOps and Rootly have started addressing this with AI-powered incident assistants. They work well, but they require sending your operational data to third-party services. For organizations with strict data residency requirements, that's a non-starter.&lt;/p&gt;

&lt;p&gt;OpenClaw fills that gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  What OpenClaw Actually Is
&lt;/h2&gt;

&lt;p&gt;OpenClaw is an open-source, self-hosted framework for running AI agents across messaging platforms. It launched in late 2025 as a personal AI assistant project and has rapidly grown into something more interesting: a platform for building operational automation.&lt;/p&gt;

&lt;p&gt;The core architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-channel gateway&lt;/strong&gt;: Connects to Slack, Microsoft Teams, Discord, WhatsApp, Telegram. Messages from any channel get normalized into a unified format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM provider abstraction&lt;/strong&gt;: Works with multiple model providers. You bring your own API keys. Switch providers without changing your skills or workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent memory&lt;/strong&gt;: Maintains conversational context across interactions. The agent remembers what happened in the last incident, what commands were run, what the outcome was.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills framework&lt;/strong&gt;: A plugin system that lets you extend the agent with custom capabilities. This is where the SRE value lives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything runs on your infrastructure. Docker Compose for simple setups, Kubernetes for production. Your data stays on your servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why SRE Teams Should Care
&lt;/h2&gt;

&lt;p&gt;The skills framework is what makes OpenClaw interesting for operations work. A "skill" in OpenClaw is essentially a structured capability with defined inputs, outputs, and permissions.&lt;/p&gt;

&lt;p&gt;For SRE, that means you can build skills like:&lt;/p&gt;

&lt;h3&gt;
  
  
  Incident Triage
&lt;/h3&gt;

&lt;p&gt;An agent that automatically pulls context when an alert fires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SKILL.md: incident-triage

Inputs: alert_name, service, severity
Actions:
  1. Query Prometheus for related metrics (last 30 min)
  2. Check recent deployments from deploy tracker
  3. Pull relevant runbook from internal wiki
  4. Summarize findings in incident channel

Permissions: read-only access to Prometheus API, deploy API, wiki API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When PagerDuty fires an alert and posts to Slack, the OpenClaw agent picks it up, runs the triage skill, and drops a summary into the incident channel before the on-call engineer has finished logging in.&lt;/p&gt;
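&lt;p&gt;The &lt;code&gt;prometheus_query.py&lt;/code&gt; helper behind such a skill can stay small. A standard-library sketch; the Prometheus endpoint is an assumption, and the 30-minute window matches the skill definition above:&lt;/p&gt;

```python
import json
import time
import urllib.parse
import urllib.request

def build_range_query(base_url, promql, minutes=30, step="30s"):
    """Construct a Prometheus /api/v1/query_range URL for the last N minutes."""
    end = int(time.time())
    start = end - minutes * 60
    params = urllib.parse.urlencode({
        "query": promql,
        "start": start,
        "end": end,
        "step": step,
    })
    return f"{base_url}/api/v1/query_range?{params}"

def fetch_metrics(base_url, promql, minutes=30):
    """Run the query and return the decoded series (read-only access only)."""
    with urllib.request.urlopen(build_range_query(base_url, promql, minutes)) as resp:
        return json.load(resp)["data"]["result"]
```

&lt;p&gt;The triage skill would call this with queries like &lt;code&gt;rate(http_requests_total{service="checkout"}[5m])&lt;/code&gt; and summarize the result for the incident channel.&lt;/p&gt;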

&lt;h3&gt;
  
  
  Runbook Execution
&lt;/h3&gt;

&lt;p&gt;Instead of linking to a Confluence page that may or may not be current, encode runbooks as executable skills:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SKILL.md: restart-service

Inputs: service_name, environment
Actions:
  1. Verify service exists in target environment
  2. Check current health status
  3. Execute rolling restart via Kubernetes API
  4. Monitor health checks for 5 minutes
  5. Report success/failure to incident channel

Permissions: kubernetes API (limited to restart operations)
Guardrails: requires confirmation for production, auto-approve for staging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The on-call engineer says "restart the payment service in staging" in Slack, and the agent executes the runbook step by step, reporting progress as it goes. No SSH-ing into bastion hosts. No copy-pasting commands from a wiki.&lt;/p&gt;
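&lt;p&gt;The &lt;code&gt;k8s_restart.py&lt;/code&gt; behind that skill doesn't need anything exotic: a rolling restart is just a patch that bumps a pod-template annotation, which is what &lt;code&gt;kubectl rollout restart&lt;/code&gt; does under the hood. A sketch, with the kubectl invocation as an assumed integration point:&lt;/p&gt;

```python
import json
import subprocess
from datetime import datetime, timezone

def build_restart_patch():
    """Patch body that bumps the pod template, triggering a rolling update."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "spec": {
            "template": {
                "metadata": {
                    # Changing this annotation forces new pods, old ones drain
                    "annotations": {"kubectl.kubernetes.io/restartedAt": now}
                }
            }
        }
    }

def rolling_restart(deployment, namespace):
    """Apply the patch via kubectl; needs RBAC scoped to patching deployments."""
    patch = json.dumps(build_restart_patch())
    subprocess.run(
        ["kubectl", "-n", namespace, "patch", "deployment", deployment,
         "--type", "strategic", "-p", patch],
        check=True,
    )
```

&lt;p&gt;Keeping the patch construction separate from the kubectl call makes the dangerous half easy to wrap in the skill's production-confirmation guardrail.&lt;/p&gt;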

&lt;h3&gt;
  
  
  Alert Correlation
&lt;/h3&gt;

&lt;p&gt;Connect the agent to your monitoring stack and let it correlate across signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SKILL.md: correlate-alerts

Inputs: primary_alert
Actions:
  1. Query AlertManager for alerts fired within +/- 5 minutes
  2. Query deployment tracker for recent changes
  3. Check dependent service health
  4. Identify common root cause patterns
  5. Suggest investigation path

Permissions: read-only AlertManager API, deploy tracker, service catalog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of an engineer manually checking five dashboards to figure out why the checkout service is slow, the agent correlates: "Three alerts fired in the last 10 minutes: high latency on checkout, connection pool exhaustion on payments DB, and a deployment to the payments service 12 minutes ago. Likely cause: the payments deploy."&lt;/p&gt;
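&lt;p&gt;The time-window half of that correlation is plain set logic. A minimal sketch; the alert and deploy dicts stand in for whatever your AlertManager and deploy-tracker APIs actually return:&lt;/p&gt;

```python
from datetime import timedelta

def correlate(primary, alerts, deploys, window_minutes=5):
    """Group alerts fired near the primary and flag deploys shortly before it."""
    t0 = primary["fired_at"]
    window = timedelta(minutes=window_minutes)
    related = [a for a in alerts
               if a is not primary and abs(a["fired_at"] - t0) <= window]
    # Deploys in the 15 minutes before the primary alert are prime suspects
    suspects = [d for d in deploys
                if timedelta(0) <= t0 - d["deployed_at"] <= timedelta(minutes=15)]
    return {"related_alerts": related, "suspect_deploys": suspects}
```

&lt;p&gt;The LLM's job is then narration, not arithmetic: it turns this structured output into the "likely cause: the payments deploy" summary.&lt;/p&gt;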

&lt;h2&gt;
  
  
  Setting It Up for SRE
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8911qxmaysp2hcwsyv6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8911qxmaysp2hcwsyv6i.png" alt=" " width="800" height="695"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Deploy the Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml (simplified)&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.8"&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openclaw&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openclaw/openclaw:latest&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./config:/home/openclaw/.openclaw&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./skills:/home/openclaw/skills&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Configure Messaging Channels
&lt;/h3&gt;

&lt;p&gt;Point it at your Slack workspace. The agent appears as a bot user in your incident channels. Teams that use Microsoft Teams or Discord can connect those instead: same agent, different channel.&lt;/p&gt;


&lt;h3&gt;
  
  
  Step 3: Build SRE Skills
&lt;/h3&gt;

&lt;p&gt;Each skill is a directory with a &lt;code&gt;SKILL.md&lt;/code&gt; that defines its behavior and a set of supporting scripts or API integrations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;skills/
├── incident-triage/
│   ├── SKILL.md
│   ├── prometheus_query.py
│   └── deploy_check.py
├── restart-service/
│   ├── SKILL.md
│   └── k8s_restart.py
├── correlate-alerts/
│   ├── SKILL.md
│   └── alertmanager_client.py
└── status-page-update/
    ├── SKILL.md
    └── statuspage_api.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Connect to Your Monitoring Stack
&lt;/h3&gt;

&lt;p&gt;The agent needs read access to your observability tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Integration&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Access Level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus/VictoriaMetrics&lt;/td&gt;
&lt;td&gt;Metrics queries&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AlertManager&lt;/td&gt;
&lt;td&gt;Alert correlation&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes API&lt;/td&gt;
&lt;td&gt;Service health, restarts&lt;/td&gt;
&lt;td&gt;Scoped RBAC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy tracker&lt;/td&gt;
&lt;td&gt;Recent changes&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal wiki&lt;/td&gt;
&lt;td&gt;Runbooks&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StatusPage&lt;/td&gt;
&lt;td&gt;Incident communication&lt;/td&gt;
&lt;td&gt;Write&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Principle of least privilege applies. The agent should have the minimum permissions needed for each skill.&lt;/p&gt;
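&lt;p&gt;On the Kubernetes side, "Scoped RBAC" can be as narrow as a single namespaced Role. A sketch (the name and namespace are illustrative):&lt;/p&gt;

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: openclaw-agent
  namespace: production
rules:
  # Read-only visibility for triage
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["get", "list", "watch"]
  # Just enough write access for the restart-service skill:
  # a rollout restart is a patch on the Deployment
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "patch"]
```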

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;Here's a realistic incident timeline with OpenClaw:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;00:00&lt;/strong&gt; - AlertManager fires: "Checkout latency &amp;gt; 2s for 5 minutes"&lt;br&gt;
&lt;strong&gt;00:01&lt;/strong&gt; - PagerDuty pages on-call, posts to #incident-checkout in Slack&lt;br&gt;
&lt;strong&gt;00:01&lt;/strong&gt; - OpenClaw agent detects the alert, runs incident-triage skill&lt;br&gt;
&lt;strong&gt;00:02&lt;/strong&gt; - Agent posts triage summary:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Incident Triage: checkout-latency-high&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related Alerts (last 10 min):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;payments-db-connection-pool-exhaustion (fired 00:00)&lt;/li&gt;
&lt;li&gt;payments-service-error-rate-high (fired 00:01)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recent Deployments:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;payments-service v2.14.3 deployed 12 min ago by &lt;a class="mentioned-user" href="https://dev.to/sarah"&gt;@sarah&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Relevant Runbook:&lt;/strong&gt; Payments DB Connection Pool&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suggested Action:&lt;/strong&gt; The payments deploy correlates with connection pool exhaustion. Consider rolling back payments-service to v2.14.2.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;00:03&lt;/strong&gt; - On-call engineer logs in, sees the full context already assembled&lt;br&gt;
&lt;strong&gt;00:04&lt;/strong&gt; - Engineer: "rollback payments-service to v2.14.2 in production"&lt;br&gt;
&lt;strong&gt;00:04&lt;/strong&gt; - Agent: "Rolling back payments-service to v2.14.2 in production. This will trigger a rolling update. Confirm? (yes/no)"&lt;br&gt;
&lt;strong&gt;00:04&lt;/strong&gt; - Engineer: "yes"&lt;br&gt;
&lt;strong&gt;00:05&lt;/strong&gt; - Agent executes rollback, monitors health checks&lt;br&gt;
&lt;strong&gt;00:08&lt;/strong&gt; - Agent: "Rollback complete. Checkout latency back to normal (avg 180ms). Payments DB connection pool utilization dropped from 98% to 45%."&lt;/p&gt;

&lt;p&gt;Total time from alert to resolution: 8 minutes. Without the agent, the same incident typically takes 25 to 40 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Guardrails Matter
&lt;/h2&gt;

&lt;p&gt;Letting an AI agent interact with production infrastructure requires guardrails. OpenClaw's skill framework supports this through permission scoping and confirmation gates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production safeguards:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skills that modify production require explicit confirmation&lt;/li&gt;
&lt;li&gt;Read-only skills execute automatically (triage, correlation)&lt;/li&gt;
&lt;li&gt;Write operations go through a confirmation flow in the messaging channel&lt;/li&gt;
&lt;li&gt;All actions are logged with who triggered them and what the agent did&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scope limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each skill declares its required permissions&lt;/li&gt;
&lt;li&gt;Kubernetes RBAC limits what the agent can actually do&lt;/li&gt;
&lt;li&gt;API keys are scoped to specific operations&lt;/li&gt;
&lt;li&gt;No "do anything" root access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a replacement for your incident commander or your on-call engineers. It's a tool that handles the first 5 minutes of context gathering so humans can focus on the hard parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Falls Short
&lt;/h2&gt;

&lt;p&gt;OpenClaw is still young. A few things to be aware of:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill development is manual.&lt;/strong&gt; There's no marketplace or library of pre-built SRE skills. You're building integrations from scratch. If you've built Slack bots or PagerDuty integrations before, the effort is similar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM costs add up.&lt;/strong&gt; Every incident interaction consumes API tokens. For high-alert-volume environments, the cost of LLM calls during incidents needs to be factored into the budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering is real work.&lt;/strong&gt; The quality of the agent's triage and correlation depends heavily on how well the skills are designed. Poorly defined skills produce noisy, unhelpful outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not a replacement for observability.&lt;/strong&gt; The agent is only as good as the data it can access. If your monitoring has gaps, the agent inherits those gaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use It
&lt;/h2&gt;

&lt;p&gt;OpenClaw for SRE makes sense when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your organization has data residency or security requirements that rule out SaaS incident tools&lt;/li&gt;
&lt;li&gt;You already have a solid observability stack (Prometheus, Grafana, AlertManager) and want to add an intelligence layer on top&lt;/li&gt;
&lt;li&gt;Your team has the engineering capacity to build and maintain custom skills&lt;/li&gt;
&lt;li&gt;Incident response time is a critical metric you're trying to improve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn't make sense when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're a small team that can handle alerts manually&lt;/li&gt;
&lt;li&gt;You don't have a mature observability foundation yet (fix that first)&lt;/li&gt;
&lt;li&gt;You want a turnkey solution with no custom development&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>opensource</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>The Hidden Tax on Your Cloud Bill: How Data Transfer Costs Are Silently Draining Your Budget</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Fri, 02 Jan 2026 14:36:49 +0000</pubDate>
      <link>https://forem.com/mateenali66/the-hidden-tax-on-your-cloud-bill-how-data-transfer-costs-are-silently-draining-your-budget-mc1</link>
      <guid>https://forem.com/mateenali66/the-hidden-tax-on-your-cloud-bill-how-data-transfer-costs-are-silently-draining-your-budget-mc1</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Cloud data transfer costs can account for 10-30% of your cloud bill, yet most teams don't understand the pricing until they get shocked by a massive invoice. I break down exactly where these costs hide, compare AWS, GCP, and Azure pricing, and show you how to potentially save 60-80% on egress fees.&lt;/p&gt;




&lt;h2&gt;
  
  
  The $2,657 Overnight Surprise
&lt;/h2&gt;

&lt;p&gt;A developer shared a 13.7 GB file that went viral. His AWS bill jumped from $23 to $2,657 overnight. Every download by every user worldwide was charged at $0.09/GB. No warning, no cap, just a bill.&lt;/p&gt;

&lt;p&gt;This story is more common than you think. And it is why I spent the last month researching cloud data transfer pricing across AWS, GCP, and Azure.&lt;/p&gt;

&lt;p&gt;What I found was eye opening.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Basics: Why Cloud Providers Love Egress
&lt;/h2&gt;

&lt;p&gt;Here is the fundamental asymmetry of cloud pricing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data IN (ingress):&lt;/strong&gt; FREE across all major providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data OUT (egress):&lt;/strong&gt; $0.05 to $0.23 per GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not accidental. Cloud providers want your data to flow in freely. Getting it out? That will cost you. It is often called the "Hotel California" model of cloud computing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current Pricing Comparison (2025)
&lt;/h2&gt;

&lt;p&gt;I verified these numbers across official documentation and third party sources:&lt;/p&gt;

&lt;h3&gt;
  
  
  Egress to Internet (US Regions)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;AWS&lt;/th&gt;
&lt;th&gt;GCP Premium&lt;/th&gt;
&lt;th&gt;Azure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;100 GB/month&lt;/td&gt;
&lt;td&gt;1 GiB&lt;/td&gt;
&lt;td&gt;100 GB/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First 10 TB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.09/GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.12/GiB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.087/GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10-50 TB&lt;/td&gt;
&lt;td&gt;$0.085/GB&lt;/td&gt;
&lt;td&gt;$0.11/GiB&lt;/td&gt;
&lt;td&gt;$0.083/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50-150 TB&lt;/td&gt;
&lt;td&gt;$0.07/GB&lt;/td&gt;
&lt;td&gt;$0.08/GiB&lt;/td&gt;
&lt;td&gt;$0.07/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;150+ TB&lt;/td&gt;
&lt;td&gt;$0.05/GB&lt;/td&gt;
&lt;td&gt;$0.08/GiB&lt;/td&gt;
&lt;td&gt;$0.05/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Quick math:&lt;/strong&gt; 10 TB of monthly egress costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS: $900&lt;/li&gt;
&lt;li&gt;GCP Premium: $1,100&lt;/li&gt;
&lt;li&gt;Azure: $870&lt;/li&gt;
&lt;li&gt;Cloudflare R2: $0 (yes, zero)&lt;/li&gt;
&lt;/ul&gt;
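&lt;p&gt;The tier table translates directly into a small calculator. A sketch using the AWS rates above (note the quick-math figures ignore the 100 GB free tier):&lt;/p&gt;

```python
def aws_egress_cost(gb: float) -> float:
    """Approximate AWS internet egress cost, using the tiered rates above."""
    free_gb = 100  # monthly free tier
    tiers = [      # (tier size in GB, $/GB)
        (10_000, 0.09),    # first 10 TB
        (40_000, 0.085),   # next 40 TB (10-50 TB)
        (100_000, 0.07),   # next 100 TB (50-150 TB)
    ]
    remaining = max(0.0, gb - free_gb)
    cost = 0.0
    for size, rate in tiers:
        billed = min(remaining, size)
        cost += billed * rate
        remaining -= billed
    return cost + remaining * 0.05  # everything past 150 TB

print(round(aws_egress_cost(10_000), 2))  # 891.0 -- close to the article's $900
```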

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcieqn8758pjpkjyzzwz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcieqn8758pjpkjyzzwz.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hidden Costs Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Standard egress is just the tip of the iceberg. Here is where the real money disappears.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. NAT Gateway: The Silent Budget Killer
&lt;/h3&gt;

&lt;p&gt;If you run workloads in private subnets (which you should for security), traffic to the internet goes through a NAT Gateway. The cost?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hourly charge:&lt;/strong&gt; $0.045/hour ($32.85/month per gateway)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data processing:&lt;/strong&gt; $0.045/GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let me break down a real scenario. You have 100 GB going to S3 through a NAT Gateway:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NAT processing&lt;/td&gt;
&lt;td&gt;100 GB x $0.045 = $4.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internet egress&lt;/td&gt;
&lt;td&gt;100 GB x $0.09 = $9.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$13.50&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But here is the thing: &lt;strong&gt;S3 traffic through a VPC Gateway Endpoint is FREE.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One developer at Geocodio documented a "$1,000 AWS mistake" where traffic to AWS services in the same region was routed through NAT Gateway. All of that was avoidable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56f1wjn8tx5ps3ry6slf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56f1wjn8tx5ps3ry6slf.png" alt=" " width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cross-AZ Traffic: Death by a Thousand Cuts
&lt;/h3&gt;

&lt;p&gt;Every time data crosses between Availability Zones, you pay $0.01/GB in each direction. That is $0.02/GB round trip.&lt;/p&gt;

&lt;p&gt;Seems small? Consider this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your app server is in AZ-1&lt;/li&gt;
&lt;li&gt;Your database (Multi-AZ RDS) is in AZ-2&lt;/li&gt;
&lt;li&gt;Every query response crosses zones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A database doing 10 TB of response traffic monthly costs an extra $200 just in cross-AZ fees. Multiply that across all your services.&lt;/p&gt;
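&lt;p&gt;The arithmetic, spelled out (cross-AZ transfer is billed on both the sending and the receiving side):&lt;/p&gt;

```python
RATE_OUT = 0.01  # $/GB charged in the sending AZ
RATE_IN = 0.01   # $/GB charged in the receiving AZ

# 10 TB of DB response traffic (1 TB = 1,000 GB, matching the article's round numbers)
gb = 10 * 1_000

monthly_cost = gb * RATE_OUT + gb * RATE_IN
print(f"${monthly_cost:.2f}/month in cross-AZ fees")  # $200.00/month
```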

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71tgf32b6dw4fk2io9ct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71tgf32b6dw4fk2io9ct.png" alt=" " width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Load Balancer Data Processing
&lt;/h3&gt;

&lt;p&gt;Your Application Load Balancer processes all that traffic. When requests come in on one AZ and targets live in another, you pay twice.&lt;/p&gt;

&lt;p&gt;GCP Load Balancing charges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inbound data processed: $0.008/GiB&lt;/li&gt;
&lt;li&gt;Outbound data processed: $0.008/GiB&lt;/li&gt;
&lt;li&gt;Plus forwarding rules: $0.025/hour (first 5), $0.01/hour each additional&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Public IPv4 Addresses (AWS, 2024)
&lt;/h3&gt;

&lt;p&gt;As of February 2024, AWS charges $0.005/hour for every public IPv4 address. That is $3.60/month per IP, in use or idle.&lt;/p&gt;

&lt;p&gt;10 public IPs sitting there? That is $36/month before you transfer any data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Gets Expensive: Multi-Cloud and Hybrid
&lt;/h2&gt;

&lt;p&gt;Moving data between clouds or to on-premises is where costs really add up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: Over the Internet (Expensive)
&lt;/h3&gt;

&lt;p&gt;AWS to GCP via public internet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS egress: $0.09/GB&lt;/li&gt;
&lt;li&gt;GCP ingress: FREE&lt;/li&gt;
&lt;li&gt;Total: $0.09/GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For 50 TB monthly: &lt;strong&gt;$4,500&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: Dedicated Interconnect (Better Economics)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Port Fee (10 Gbps)&lt;/th&gt;
&lt;th&gt;Data Transfer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS Direct Connect&lt;/td&gt;
&lt;td&gt;$2.25/hour (~$1,643/mo)&lt;/td&gt;
&lt;td&gt;$0.02/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure ExpressRoute&lt;/td&gt;
&lt;td&gt;$3,400/month&lt;/td&gt;
&lt;td&gt;$0.025/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCP Cross-Cloud Interconnect&lt;/td&gt;
&lt;td&gt;$5.60/hour (~$4,032/mo)&lt;/td&gt;
&lt;td&gt;Same as inter-region&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For high-volume transfers, dedicated connections pay for themselves quickly. At 50 TB monthly, Direct Connect's $0.02/GB rate saves ~$3,500 in transfer fees compared to internet egress, more than covering the port fee.&lt;/p&gt;
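&lt;p&gt;A quick break-even check using the table's numbers (Direct Connect port fee plus $0.02/GB, versus $0.09/GB internet egress):&lt;/p&gt;

```python
gb = 50_000  # 50 TB monthly, using 1 TB = 1,000 GB

internet = gb * 0.09                 # plain internet egress
direct_connect = 1_643 + gb * 0.02   # ~monthly 10 Gbps port fee + per-GB transfer

print(round(internet), round(direct_connect))  # 4500 2643
```

Transfer fees alone drop by ~$3,500; net of the port fee, the saving is closer to $1,850/month, and it grows with volume.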

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvncxswfaih2vfiouhh6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvncxswfaih2vfiouhh6u.png" alt=" " width="800" height="1560"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Companies That Solved This
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dropbox: $75 Million Saved
&lt;/h3&gt;

&lt;p&gt;Dropbox was one of S3's largest customers. In 2015-2016, they built their own storage infrastructure called "Magic Pocket" and migrated off AWS.&lt;/p&gt;

&lt;p&gt;The result: &lt;strong&gt;$74.6 million in savings over two years.&lt;/strong&gt; First year alone saved $39.5 million.&lt;/p&gt;

&lt;p&gt;At their scale, owning infrastructure beats renting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basecamp/37signals: $10 Million Over Five Years
&lt;/h3&gt;

&lt;p&gt;In 2023, Basecamp left AWS and Google Cloud. Their results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total projected savings: &lt;strong&gt;$10+ million&lt;/strong&gt; over five years&lt;/li&gt;
&lt;li&gt;Already saving: &lt;strong&gt;$1 million/year&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;S3 exit alone: &lt;strong&gt;$5,000/day&lt;/strong&gt; ($150K/month)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DHH (their founder) wrote extensively about this. They bought ~$600K in hardware and added no new staff. The payback period was less than a year.&lt;/p&gt;

&lt;h3&gt;
  
  
  Netflix: Built Their Own CDN
&lt;/h3&gt;

&lt;p&gt;Netflix does not stream videos out of AWS. The egress costs would be astronomical. Instead, they built Open Connect, their own CDN with appliances placed directly in ISP networks.&lt;/p&gt;

&lt;p&gt;Quote from an industry analyst: "The underlying economics of data transfer does not reflect how the cloud providers price for it. We are still paying 1990s prices for bandwidth when we are in the cloud."&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Reduce Your Egress Costs
&lt;/h2&gt;

&lt;p&gt;Based on my research, here are the highest impact optimizations:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. VPC Gateway Endpoints for S3/DynamoDB (100% Savings)
&lt;/h3&gt;

&lt;p&gt;These are free and route traffic directly to S3/DynamoDB without touching NAT Gateway or the internet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Terraform example&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;service_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"com.amazonaws.${var.region}.s3"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_endpoint_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Gateway"&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. CloudFront for Content Delivery (60-80% Reduction)
&lt;/h3&gt;

&lt;p&gt;Origin fetches from CloudFront to S3 are free, so you only pay CloudFront egress, which is cheaper than direct S3 egress. Request pricing drops too:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Cost per 10K requests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;S3 direct&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudFront&lt;/td&gt;
&lt;td&gt;$0.0075&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plus caching means you serve from edge instead of origin.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Compression Before Transfer (50-80% Reduction)
&lt;/h3&gt;

&lt;p&gt;Compress everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gzip for general purpose&lt;/li&gt;
&lt;li&gt;Brotli for text content (better ratio than Gzip)&lt;/li&gt;
&lt;li&gt;Delta encoding for incremental updates&lt;/li&gt;
&lt;/ul&gt;
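&lt;p&gt;Compression gains are easy to verify locally. A minimal check with Python's standard-library gzip on a repetitive JSON payload:&lt;/p&gt;

```python
import gzip

# Repetitive structured text (logs, JSON API responses) compresses extremely well
payload = ('{"event": "page_view", "user": 12345, "ts": 1700000000}\n' * 1_000).encode()

compressed = gzip.compress(payload)
saved = 1 - len(compressed) / len(payload)
print(f"{len(payload):,} bytes -> {len(compressed):,} bytes ({saved:.0%} saved)")
```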

&lt;h3&gt;
  
  
  4. Same-AZ Deployment (Eliminate Cross-AZ)
&lt;/h3&gt;

&lt;p&gt;If high availability is not critical for a workload, keep everything in one AZ. Same-AZ traffic is free.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Consider Cloudflare R2 for Storage Heavy Workloads
&lt;/h3&gt;

&lt;p&gt;R2 has zero egress fees. For a workload with 10 TB storage and 50 TB monthly egress:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS S3&lt;/td&gt;
&lt;td&gt;$230 (storage) + $4,500 (egress) = $4,730&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare R2&lt;/td&gt;
&lt;td&gt;$150 (storage only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is a 97% reduction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvcryfhot09djid1agno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvcryfhot09djid1agno.png" alt=" " width="800" height="1003"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Monitoring and Alerting
&lt;/h2&gt;

&lt;p&gt;You cannot optimize what you do not measure. Set up:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AWS Cost Explorer&lt;/strong&gt; with daily data transfer breakdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch alarms&lt;/strong&gt; on NAT Gateway bytes processed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget alerts&lt;/strong&gt; specifically for data transfer line items&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPC Flow Logs&lt;/strong&gt; to understand traffic patterns (but watch the logging costs)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check your data transfer costs (AWS CLI)&lt;/span&gt;
aws ce get-cost-and-usage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--time-period&lt;/span&gt; &lt;span class="nv"&gt;Start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2024-01-01,End&lt;span class="o"&gt;=&lt;/span&gt;2024-01-31 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--granularity&lt;/span&gt; MONTHLY &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metrics&lt;/span&gt; &lt;span class="s2"&gt;"UnblendedCost"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s1"&gt;'{"Dimensions":{"Key":"USAGE_TYPE_GROUP","Values":["EC2: Data Transfer - Internet (Out)","EC2: Data Transfer - Region to Region (Out)"]}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Regulatory Push
&lt;/h2&gt;

&lt;p&gt;The European Data Act (effective September 2025) is forcing cloud providers toward transparent pricing and easier data portability. All three major providers now offer egress fee waivers for complete cloud departures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS:&lt;/strong&gt; Waiver upon account team approval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GCP:&lt;/strong&gt; Waiver for full migration off platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure:&lt;/strong&gt; 100GB credits for 60 days&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The catch? These only apply to complete departures, not ongoing multi-cloud operations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingress is free, egress is not.&lt;/strong&gt; Plan your architecture with data gravity in mind.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;NAT Gateway is the biggest hidden cost.&lt;/strong&gt; Use VPC Gateway Endpoints for S3/DynamoDB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-AZ traffic adds up.&lt;/strong&gt; $0.01/GB each way on every hop.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;At scale, consider alternatives.&lt;/strong&gt; Dropbox saved $75M, Basecamp saves $1M/year.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CDN and compression are low-hanging fruit.&lt;/strong&gt; 60-80% reduction for content delivery.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloudflare R2 has zero egress.&lt;/strong&gt; Seriously consider it for storage-heavy workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor proactively.&lt;/strong&gt; One viral file can turn a $23 bill into $2,657 overnight.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/ec2/pricing/on-demand/" rel="noopener noreferrer"&gt;AWS Data Transfer Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vpc/network-pricing" rel="noopener noreferrer"&gt;GCP Network Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://azure.microsoft.com/en-us/pricing/details/bandwidth/" rel="noopener noreferrer"&gt;Azure Bandwidth Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://basecamp.com/cloud-exit" rel="noopener noreferrer"&gt;Basecamp Cloud Exit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.cloudflare.com/r2/pricing/" rel="noopener noreferrer"&gt;Cloudflare R2 Pricing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;What is the worst data transfer bill you have received? I would love to hear your horror stories and optimization wins in the comments.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>gcp</category>
      <category>cloud</category>
      <category>azure</category>
    </item>
    <item>
      <title>Never Commit Secrets Again: Generate .env Files from AWS Secrets Manager</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Fri, 12 Dec 2025 17:49:13 +0000</pubDate>
      <link>https://forem.com/mateenali66/never-commit-secrets-again-generate-env-files-from-aws-secrets-manager-46f4</link>
      <guid>https://forem.com/mateenali66/never-commit-secrets-again-generate-env-files-from-aws-secrets-manager-46f4</guid>
<description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Store secrets in AWS Secrets Manager. Generate .env files on demand with a Python script. Never commit credentials again.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Every team commits secrets eventually. GitHub detected over 12 million exposed credentials last year through their secret scanning.&lt;/p&gt;

&lt;p&gt;The usual approaches all have failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;.gitignore&lt;/strong&gt; fails when developers forget to add it, or clone fresh and ask for the file via Slack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOPS encryption&lt;/strong&gt; still puts files in git, adds key management overhead, and creates merge conflict nightmares&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;.env.example&lt;/strong&gt; templates get stale and require manual copying&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed something better: secrets that live outside the repository entirely, with a frictionless developer experience.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;Secrets live in AWS Secrets Manager. Developers run one command to generate their .env file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make &lt;span class="nb"&gt;env&lt;/span&gt;
&lt;span class="c"&gt;# .env is generated locally, ready to use&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The file is gitignored. It never touches version control. When secrets change in AWS, developers regenerate and get the latest values.&lt;/p&gt;
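&lt;p&gt;The &lt;code&gt;make env&lt;/code&gt; target is just a thin wrapper. A minimal sketch, assuming the script is saved as &lt;code&gt;generate_env.py&lt;/code&gt;:&lt;/p&gt;

```makefile
.PHONY: env env-prod

env:
	python3 generate_env.py dev

env-prod:
	python3 generate_env.py prod --force
```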

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hw0h6s61fxaz0puyn5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hw0h6s61fxaz0puyn5i.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Organize Secrets in AWS
&lt;/h3&gt;

&lt;p&gt;Structure your secrets by application and environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/myapp/dev/database      → {"DB_HOST": "...", "DB_PASSWORD": "..."}
/myapp/dev/api-keys      → {"STRIPE_KEY": "...", "SENDGRID_KEY": "..."}
/myapp/prod/database     → {"DB_HOST": "...", "DB_PASSWORD": "..."}
/myapp/prod/api-keys     → {"STRIPE_KEY": "...", "SENDGRID_KEY": "..."}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create secrets using AWS CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws secretsmanager create-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; /myapp/dev/database &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret-string&lt;/span&gt; &lt;span class="s1"&gt;'{"DB_HOST":"localhost","DB_PASSWORD":"devpass123"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. The Python Script
&lt;/h3&gt;

&lt;p&gt;Here's the full script that generates .env files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Generate .env file from AWS Secrets Manager.

Usage:
    python generate_env.py dev
    python generate_env.py prod --force
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NoCredentialsError&lt;/span&gt;

&lt;span class="c1"&gt;# Configuration
&lt;/span&gt;&lt;span class="n"&gt;APP_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myapp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;AWS_REGION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENV_FILE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;SECRET_KEYS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api-keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;third-party&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_secret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch a secret from AWS Secrets Manager.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secretsmanager&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_secret_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SecretId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecretString&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ResourceNotFoundException&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Warning: Secret &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_aws_credentials&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if AWS credentials are configured.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;identity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_caller_identity&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authenticated as: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Arn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;NoCredentialsError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: AWS credentials not found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Fix with one of:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  1. aws configure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  2. Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  3. Use IAM role (if on AWS)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_all_secrets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch all secrets for the environment.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;all_secrets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;SECRET_KEYS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;secret_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;APP_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Fetching: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;all_secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;get_secret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;all_secrets&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_env_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate .env content from secrets.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;# Auto-generated from AWS Secrets Manager&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;# DO NOT COMMIT THIS FILE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--force&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ENV_FILE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generating .env for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;validate_aws_credentials&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;secrets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_all_secrets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AWS_REGION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Error: No secrets found at /&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;APP_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; secret values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_env_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;force&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; exists. Overwrite? [y/N]: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generated: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
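&lt;p&gt;To sanity-check the quoting rule in isolation, &lt;code&gt;generate_env_content&lt;/code&gt; can be exercised without any AWS calls. Here it is reproduced standalone (same logic as above): values containing spaces get wrapped in double quotes so the file stays parseable by dotenv-style loaders.&lt;/p&gt;

```python
# Standalone copy of generate_env_content for a quick local check.
# Values containing spaces are wrapped in double quotes.
def generate_env_content(secrets):
    lines = [
        "# Auto-generated from AWS Secrets Manager",
        "# DO NOT COMMIT THIS FILE",
        "",
    ]
    for key, value in sorted(secrets.items()):
        if isinstance(value, str) and " " in value:
            value = f'"{value}"'
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"

content = generate_env_content({
    "DATABASE_URL": "postgres://localhost:5432/app",
    "WELCOME_MESSAGE": "hello world",
})
print(content)
```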



&lt;h3&gt;
  
  
  3. Shell Wrapper and Makefile
&lt;/h3&gt;

&lt;p&gt;Create a shell wrapper for convenience:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# generate-env.sh&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;span class="nv"&gt;ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;dev&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import boto3"&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;pip3 &lt;span class="nb"&gt;install &lt;/span&gt;boto3 &lt;span class="nt"&gt;--quiet&lt;/span&gt;
&lt;span class="k"&gt;fi

&lt;/span&gt;python3 &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;dirname&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/generate_env.py"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENV&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="p"&gt;@&lt;/span&gt;:2&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add Makefile targets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;.PHONY&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;env env-dev env-prod&lt;/span&gt;

&lt;span class="nl"&gt;env&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;./scripts/generate-env.sh dev

&lt;span class="nl"&gt;env-dev&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;./scripts/generate-env.sh dev

&lt;span class="nl"&gt;env-prod&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;./scripts/generate-env.sh prod

&lt;span class="nl"&gt;env-dry&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;./scripts/generate-env.sh dev &lt;span class="nt"&gt;--dry-run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
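&lt;p&gt;One caveat: the &lt;code&gt;env-dry&lt;/code&gt; target passes &lt;code&gt;--dry-run&lt;/code&gt;, which the script as written does not define. A minimal sketch of wiring it into the existing argparse setup (the flag and its behavior are an assumed extension, not part of the script above):&lt;/p&gt;

```python
import argparse

# Same parser as in main(), plus a hypothetical --dry-run flag.
parser = argparse.ArgumentParser()
parser.add_argument("environment", choices=["dev", "staging", "prod"])
parser.add_argument("-f", "--force", action="store_true")
parser.add_argument("-o", "--output", default=".env")
parser.add_argument(
    "--dry-run",
    action="store_true",
    help="Print generated content to stdout instead of writing the file",
)

# Simulate: python generate_env.py dev --dry-run
args = parser.parse_args(["dev", "--dry-run"])
print(args.dry_run)  # prints: True
```

&lt;p&gt;In &lt;code&gt;main()&lt;/code&gt;, check &lt;code&gt;args.dry_run&lt;/code&gt; right after generating the content and return before the overwrite prompt and &lt;code&gt;write_text&lt;/code&gt; call.&lt;/p&gt;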



&lt;h3&gt;
  
  
  4. GitHub Actions with OIDC
&lt;/h3&gt;

&lt;p&gt;No stored credentials needed. Use OIDC to assume an AWS role:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS Credentials&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/github-actions&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate .env&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;pip install boto3&lt;/span&gt;
          &lt;span class="s"&gt;python scripts/generate_env.py prod --force&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;# Your deployment commands&lt;/span&gt;
          &lt;span class="s"&gt;echo "Deploying..."&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cleanup&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always()&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rm -f .env&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
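&lt;p&gt;For the role assumption to work, the IAM role's trust policy must allow GitHub's OIDC provider. A sketch of that policy (the account ID, org, and repo are placeholders; scope the &lt;code&gt;sub&lt;/code&gt; condition as tightly as your workflows allow):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
        }
      }
    }
  ]
}
```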



&lt;h3&gt;
  
  
  5. GitLab CI
&lt;/h3&gt;

&lt;p&gt;Same pattern with GitLab's OIDC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pip install boto3&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;export $(printf "AWS_ACCESS_KEY_ID=%s AWS_SECRET_ACCESS_KEY=%s AWS_SESSION_TOKEN=%s"&lt;/span&gt;
      &lt;span class="s"&gt;$(aws sts assume-role-with-web-identity&lt;/span&gt;
      &lt;span class="s"&gt;--role-arn ${AWS_ROLE_ARN}&lt;/span&gt;
      &lt;span class="s"&gt;--role-session-name "gitlab-${CI_PIPELINE_ID}"&lt;/span&gt;
      &lt;span class="s"&gt;--web-identity-token ${CI_JOB_JWT_V2}&lt;/span&gt;
      &lt;span class="s"&gt;--query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]'&lt;/span&gt;
      &lt;span class="s"&gt;--output text))&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;python scripts/generate_env.py prod --force&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "Deploying..."&lt;/span&gt;
  &lt;span class="na"&gt;after_script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;rm -f .env&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
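&lt;p&gt;Note that &lt;code&gt;CI_JOB_JWT_V2&lt;/code&gt; is deprecated on recent GitLab releases; jobs should request an explicit ID token instead. A sketch of the replacement (the token name and audience are illustrative):&lt;/p&gt;

```yaml
deploy:
  id_tokens:
    AWS_OIDC_TOKEN:
      aud: https://sts.amazonaws.com
  script:
    # Pass the requested token in place of CI_JOB_JWT_V2
    - aws sts assume-role-with-web-identity --web-identity-token ${AWS_OIDC_TOKEN} --role-arn ${AWS_ROLE_ARN} --role-session-name "gitlab-${CI_PIPELINE_ID}"
```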



&lt;h2&gt;
  
  
  IAM Permissions
&lt;/h2&gt;

&lt;p&gt;Developers need read access to their environment's secrets. The trailing wildcard matters: it matches the random suffix Secrets Manager appends to every secret ARN.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"secretsmanager:GetSecretValue"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:secretsmanager:us-east-1:*:secret:/myapp/dev/*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
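&lt;p&gt;With that policy attached, a developer can sanity-check their access straight from the CLI. The secret name below is illustrative; substitute one that actually exists under your &lt;code&gt;/myapp/dev/&lt;/code&gt; prefix:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws secretsmanager get-secret-value \
  --secret-id /myapp/dev/DATABASE_URL \
  --query SecretString \
  --output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An &lt;code&gt;AccessDeniedException&lt;/code&gt; here against a &lt;code&gt;/myapp/prod/&lt;/code&gt; path is exactly what you want to see.&lt;/p&gt;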



&lt;p&gt;CI/CD roles, by contrast, are the only principals with access to prod secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"secretsmanager:GetSecretValue"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:secretsmanager:us-east-1:*:secret:/myapp/prod/*"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  OIDC Setup for GitHub Actions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Create the OIDC provider in AWS:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws iam create-open-id-connect-provider &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; https://token.actions.githubusercontent.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--client-id-list&lt;/span&gt; sts.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Create the trust policy:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Federated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:aud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringLike"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"repo:your-org/your-repo:*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
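&lt;p&gt;To finish the setup (not shown above), create the role using that trust policy and attach the prod secrets policy to it. The role, policy, and file names here are placeholders; adjust them to your naming convention:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws iam create-role \
  --role-name github-actions-deploy \
  --assume-role-policy-document file://trust-policy.json

aws iam put-role-policy \
  --role-name github-actions-deploy \
  --policy-name prod-secrets-read \
  --policy-document file://prod-secrets-policy.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;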



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After implementing this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Secrets in git&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rotation time&lt;/td&gt;
&lt;td&gt;2 hours&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;45 min&lt;/td&gt;
&lt;td&gt;10 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slack secret sharing&lt;/td&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;td&gt;Never&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Repository Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;myapp/
├── scripts/
│   ├── generate_env.py
│   └── generate-env.sh
├── .github/
│   └── workflows/
│       └── deploy.yml
├── .gitignore          # includes .env
├── .env.example        # dummy values for reference
├── Makefile
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The full code is available at &lt;a href="https://github.com/mateenali66/secrets-env-generator" rel="noopener noreferrer"&gt;github.com/mateenali66/secrets-env-generator&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Clone it, configure your AWS credentials, create some test secrets, and run &lt;code&gt;make env&lt;/code&gt;.&lt;/p&gt;
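&lt;p&gt;If you're wiring this into your own repo, the &lt;code&gt;make env&lt;/code&gt; target can be as simple as the sketch below. The target name and default environment are assumptions on my part; check the repo's actual Makefile:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENV ?= dev

.PHONY: env
env:
	python scripts/generate_env.py $(ENV)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then &lt;code&gt;make env ENV=prod&lt;/code&gt; targets a different environment without editing anything.&lt;/p&gt;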




&lt;p&gt;Questions? Contact me at &lt;a href="https://mateen.tech" rel="noopener noreferrer"&gt;mateen.tech&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>secret</category>
      <category>dotenv</category>
    </item>
    <item>
      <title>AWS DevOps Agent: What AWS Isn't Telling You (And Why Your Job Is Safe)</title>
      <dc:creator>Mateen Anjum</dc:creator>
      <pubDate>Tue, 09 Dec 2025 07:25:49 +0000</pubDate>
      <link>https://forem.com/mateenali66/aws-devops-agent-what-aws-isnt-telling-you-and-why-your-job-is-safe-2kdi</link>
      <guid>https://forem.com/mateenali66/aws-devops-agent-what-aws-isnt-telling-you-and-why-your-job-is-safe-2kdi</guid>
      <description>&lt;p&gt;AWS announced DevOps Agent at re:Invent 2025, calling it a "frontier agent that acts as an experienced DevOps engineer." The marketing promises autonomous incident investigation, root cause analysis, and proactive prevention.&lt;/p&gt;

&lt;p&gt;I spent the past week digging into the documentation, testing the preview, and analyzing what AWS carefully avoided mentioning. Here's what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;AWS DevOps Agent is a powerful diagnostic assistant, not an autonomous operator. It can investigate incidents and suggest fixes, but it cannot execute them. The preview is free, but AWS hasn't disclosed GA pricing. Your job is safe because someone still needs to actually fix things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DevOps Agent Actually Does
&lt;/h2&gt;

&lt;p&gt;Think of it as a 24/7 on-call engineer that never sleeps, never gets tired, and never forgets to check the logs. When an alert fires at 2 AM, it immediately starts investigating.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2piw1zrau48a53g70tr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2piw1zrau48a53g70tr2.png" alt="Investigation Flow" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Incident Investigation&lt;/td&gt;
&lt;td&gt;Correlates metrics, logs, traces, and code changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Root Cause Analysis&lt;/td&gt;
&lt;td&gt;Identifies probable cause using topology understanding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mitigation Plans&lt;/td&gt;
&lt;td&gt;Suggests steps to fix with rollback guidance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prevention Analysis&lt;/td&gt;
&lt;td&gt;Analyzes historical incidents to prevent recurrence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stakeholder Updates&lt;/td&gt;
&lt;td&gt;Posts findings to Slack channels and tickets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What makes it "frontier":&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS calls this a frontier agent because it can run autonomously for hours or days. It doesn't need you to guide it step by step. Give it an alert, and it figures out what to investigate, which logs to pull, which deployments to check.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Integration Ecosystem
&lt;/h2&gt;

&lt;p&gt;This is where DevOps Agent gets interesting. It's not locked into AWS services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqv43hu6m8gcpn8m0jd03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqv43hu6m8gcpn8m0jd03.png" alt="Integration Ecosystem" width="800" height="823"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in integrations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observability: CloudWatch, Datadog, Dynatrace, New Relic, Splunk&lt;/li&gt;
&lt;li&gt;CI/CD: GitHub Actions, GitLab CI/CD&lt;/li&gt;
&lt;li&gt;Ticketing: ServiceNow (native), PagerDuty (webhook)&lt;/li&gt;
&lt;li&gt;Collaboration: Slack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The MCP wildcard:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Through Model Context Protocol servers, you can connect anything: Prometheus, Grafana, custom internal tools, proprietary systems. This is the underrated feature. While competitors lock you into their ecosystem, AWS lets you bring your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AWS Isn't Telling You: Pricing
&lt;/h2&gt;

&lt;p&gt;Here's where it gets murky.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwuuzqu3o7n5hhunfbppa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwuuzqu3o7n5hhunfbppa.png" alt="Pricing Knows vs Unknowns" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preview limits (documented):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;20 incident resolution hours per month&lt;/li&gt;
&lt;li&gt;10 incident prevention hours per month&lt;/li&gt;
&lt;li&gt;1,000 chat messages per month&lt;/li&gt;
&lt;li&gt;10 Agent Spaces maximum&lt;/li&gt;
&lt;li&gt;3 concurrent investigations&lt;/li&gt;
&lt;li&gt;1 concurrent prevention task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GA pricing (unknown):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per hour? Per investigation? Per seat? Per account?&lt;/li&gt;
&lt;li&gt;Third-party tool API costs passed through?&lt;/li&gt;
&lt;li&gt;Bedrock model usage fees?&lt;/li&gt;
&lt;li&gt;Multi-region pricing?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hidden costs during preview:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Queries and API calls made to other AWS and non-AWS services may generate charges from those services."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Translation: Your CloudWatch and X-Ray bills might increase. If Datadog charges per query, those costs are on you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My speculation on GA pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Based on Bedrock pricing and similar services, expect something like $50-150 per investigation hour. A complex incident taking 4 hours of agent time could cost $200-600. For organizations with frequent incidents, this adds up quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DevOps Agent Cannot Do
&lt;/h2&gt;

&lt;p&gt;This is the part AWS marketing glosses over.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w3qqw092ul5bvxhp9li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w3qqw092ul5bvxhp9li.png" alt="Can vs Cannot" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It cannot:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute fixes (only recommends)&lt;/li&gt;
&lt;li&gt;Deploy code changes&lt;/li&gt;
&lt;li&gt;Modify infrastructure&lt;/li&gt;
&lt;li&gt;Make policy decisions&lt;/li&gt;
&lt;li&gt;Handle unprecedented situations&lt;/li&gt;
&lt;li&gt;Operate autonomously in regulated industries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The regulatory reality:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For healthcare, finance, or any regulated industry, DevOps Agent is a diagnostic assistant. It cannot be an autonomous operator. Compliance requires human decision-making for changes. This alone disqualifies the "replacement" narrative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Your Job Is Safe
&lt;/h2&gt;

&lt;p&gt;I've seen the LinkedIn panic. "AI is coming for DevOps jobs!" Let me explain why that's wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib3p7k76n0f0zjatl3dg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib3p7k76n0f0zjatl3dg.png" alt="Human Agent Collaboration" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects and monitors&lt;/li&gt;
&lt;li&gt;Investigates and correlates&lt;/li&gt;
&lt;li&gt;Reports and recommends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you still do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement the actual fix&lt;/li&gt;
&lt;li&gt;Deploy changes to production&lt;/li&gt;
&lt;li&gt;Verify the fix worked&lt;/li&gt;
&lt;li&gt;Make architectural decisions&lt;/li&gt;
&lt;li&gt;Handle the weird edge cases&lt;/li&gt;
&lt;li&gt;Build new infrastructure&lt;/li&gt;
&lt;li&gt;Coordinate across teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Commonwealth Bank example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS cited that Commonwealth Bank found a root cause in under 15 minutes using DevOps Agent, versus hours manually. Notice what they didn't say: the agent fixed it. An engineer still had to implement the solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DevOps Agent doesn't reduce headcount. It reduces MTTR.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your value isn't in correlating logs. Your value is in knowing what to do with that information. The agent accelerates the boring parts so you can focus on the interesting ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Spaces: The Security Model
&lt;/h2&gt;

&lt;p&gt;One thing AWS got right is the security boundary model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each Agent Space:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has its own dedicated IAM role&lt;/li&gt;
&lt;li&gt;Defines exactly which accounts it can access&lt;/li&gt;
&lt;li&gt;Controls which tools are connected&lt;/li&gt;
&lt;li&gt;Isolates data from other Agent Spaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Access patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Admins configure via AWS Console&lt;/li&gt;
&lt;li&gt;Operators interact via standalone web app&lt;/li&gt;
&lt;li&gt;IAM Identity Center or direct IAM authentication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resource discovery:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudFormation stacks (including CDK) are auto-discovered&lt;/li&gt;
&lt;li&gt;Terraform and console resources need tags&lt;/li&gt;
&lt;li&gt;No tags = invisible to the agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your infrastructure is a mess of untagged resources, DevOps Agent won't help much. This is actually a feature: it forces infrastructure hygiene.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Limitations I Found
&lt;/h2&gt;

&lt;p&gt;Testing revealed issues AWS documentation doesn't highlight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigation accuracy varies:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One tester reported that when two related alarms fired roughly 40 minutes apart, the agent couldn't connect them to find the root cause and required a re-run. The agent isn't infallible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;English only:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No multilingual support. If your team operates in other languages, this limits adoption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;US East only (for now):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent runs in us-east-1, though it can monitor resources in any region. Multi-region redundancy isn't available during preview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context dependency:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent's effectiveness directly correlates to how well you've connected tools and tagged resources. Garbage in, garbage out still applies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigation gaps feature:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To its credit, DevOps Agent explicitly shows "Investigation Gaps": things it couldn't analyze due to missing logs, absent SSH access, or incomplete telemetry. This transparency is valuable, but it also confirms the limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Maximize Effectiveness
&lt;/h2&gt;

&lt;p&gt;If you're going to use this, do it right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Connect everything:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't just connect CloudWatch. Add your GitHub repos so it can correlate deployments. Connect Slack so it updates your incident channel. Add Datadog or whatever you use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Use MCP for custom tools:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Have internal tools? Build an MCP server. The protocol is open and documented. This is how you get real value.&lt;/p&gt;
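&lt;p&gt;As a rough sketch of what that looks like with the official MCP Python SDK, here's a minimal server exposing one tool. The tool name and the stubbed lookup logic are hypothetical stand-ins for whatever your internal system actually provides:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-tools")

@mcp.tool()
def get_service_status(service: str) -&gt; str:
    """Return the current status of an internal service (stub)."""
    # Replace with a real call into your inventory or status system.
    return f"{service}: healthy"

if __name__ == "__main__":
    mcp.run()  # serves MCP over stdio by default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;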

&lt;p&gt;&lt;strong&gt;3. Tag your resources:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If it's not in CloudFormation, tag it. Use consistent key-value pairs across your infrastructure.&lt;/p&gt;
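&lt;p&gt;For resources created outside CloudFormation, the Resource Groups Tagging API can apply tags in bulk. The ARN and tag keys below are examples; align them with whatever convention your team standardizes on:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws resourcegroupstaggingapi tag-resources \
  --resource-arn-list arn:aws:lambda:us-east-1:123456789012:function:checkout \
  --tags team=payments,application=checkout,environment=prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;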

&lt;p&gt;&lt;strong&gt;4. Create runbooks:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DevOps Agent supports runbooks as "pre-loaded guidance." Create them for your common incident patterns. This gives the agent hints about where to look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Start with one Agent Space:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't create 10 spaces immediately. Start with one team or application, learn the patterns, then expand.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;AWS DevOps Agent is genuinely useful. It's not a gimmick. The ability to have something correlating data across 5 different tools at 3 AM while you sleep is valuable.&lt;/p&gt;

&lt;p&gt;But it's not magic. It's not replacing anyone. It's a sophisticated diagnostic tool that still requires human judgment to act on its findings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have frequent incidents and high MTTR&lt;/li&gt;
&lt;li&gt;Your observability tools are already well-integrated&lt;/li&gt;
&lt;li&gt;You want to reduce on-call burden (not headcount)&lt;/li&gt;
&lt;li&gt;You're willing to invest in proper tagging and MCP setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip it if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're in a heavily regulated industry requiring human approval for all changes&lt;/li&gt;
&lt;li&gt;Your infrastructure is poorly documented&lt;/li&gt;
&lt;li&gt;You expect it to fix things, not just find them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The preview is free. Try it. But go in with realistic expectations.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your take on AI agents in DevOps? Have you tested the preview? Drop a comment below.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/" rel="noopener noreferrer"&gt;AWS DevOps Agent Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/aws-devops-agent-helps-you-accelerate-incident-response-and-improve-system-reliability-preview/" rel="noopener noreferrer"&gt;AWS Blog Announcement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/devopsagent/latest/userguide/userguide-public-preview-pricing-and-limits.html" rel="noopener noreferrer"&gt;Preview Pricing &amp;amp; Limits&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>ai</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
