Forem: Infraforge

When ArgoCD shows Healthy but Keycloak silently strips JWT claims

Muhammad Hassaan Javed — Fri, 22 May 2026 22:19:02 +0000

ArgoCD reported Synced and Healthy. The Keycloak Helm release was green. And the downstream timeline service was returning 401 on every authenticated request. That was the call we got: every dashboard says the platform is fine, and authentication is broken across three services. The JWTs auth-service was issuing had stopped carrying the groups claim and the email_verified claim about 40 minutes earlier, right after an ArgoCD auto-sync rolled out a Keycloak chart bump. Six OIDC clients had silently lost protocol mappers and role mappings during that sync, and we did not yet know it.

Problem signals:

ArgoCD shows Synced and Healthy on the Keycloak application, but downstream services return 401 on tokens they accepted an hour ago
JWTs decoded at jwt.io are missing claims that production code depends on (groups, email_verified, audience)
Engineers have been making emergency fixes directly in the Keycloak admin console during recent incidents and not committing them back
The realm import ConfigMap in git has not been touched in weeks, yet the live realm has clearly changed
Helm values for the Keycloak chart set realm import strategy to OVERWRITE or leave it unset (which defaults to OVERWRITE on most charts)

The sync that looked clean and quietly stripped six clients

ArgoCD said Healthy. Auth said 401.

Our first guess was wrong. The team had been staring at auth-service for 25 minutes when we joined the bridge, because the tokens it was issuing were obviously malformed. The groups claim was gone. The email_verified claim was gone on a different client. Surely auth-service had shipped a bad release. Except auth-service had not shipped in nine days, and the failure had started 40 minutes ago, not nine days ago.

The shape of the failure is what gave it away. Three OIDC clients had each lost a different mapper at the same moment. Auth-service had lost a groups protocol mapper. The profile service had lost an email_verified client scope mapping. The api gateway had lost role mappings for a downstream audience. Three services do not lose three unrelated pieces of OIDC config simultaneously unless something upstream rewrote all of them at once. The only thing that had touched Keycloak in that window was an ArgoCD auto-sync of the Keycloak Helm release.

We pulled the ArgoCD sync history and found the sync 41 minutes earlier. It was a chart version bump, nothing that should have changed realm content. But the chart ships a realm import ConfigMap, and the realm JSON inside that ConfigMap had not been updated in weeks. Meanwhile the live realm in the Keycloak PostgreSQL database had been edited through the admin console at least a dozen times during recent incidents. None of those console changes had been committed back to git.

So the chart redeployed the ConfigMap. The Keycloak init container read it. And the realm import ran with the strategy set to OVERWRITE. Every console change made during the previous two weeks of incident response got reverted to the stale git version, silently, with no error and no event surfaced to ArgoCD.

Diffing live realm state against the ConfigMap before doing anything destructive

Six clients had drifted and the next sync would make it worse

The first thing we did was not a fix. The first thing we did was freeze. Auto-sync was still enabled on the Keycloak ArgoCD application. If anyone touched a Helm value for any reason in the next hour, another sync would fire and a second OVERWRITE pass would run against whatever state we had managed to reconstruct. We paused auto-sync first and disabled self-heal, then started the diagnosis.

# 1. Freeze the ArgoCD app so the next sync cannot fire mid-recovery
argocd app set keycloak --sync-policy none
argocd app set keycloak --self-heal=false

# 2. Pull live realm state from the Keycloak Admin REST API
TOKEN=$(curl -s -X POST "$KC/realms/master/protocol/openid-connect/token" \
  -d "grant_type=password" -d "client_id=admin-cli" \
  -d "username=$ADMIN_USER" -d "password=$ADMIN_PASS" | jq -r .access_token)

curl -s -H "Authorization: Bearer $TOKEN" \
  "$KC/admin/realms/primary/clients" | jq . > live-clients.json

curl -s -H "Authorization: Bearer $TOKEN" \
  "$KC/admin/realms/primary/client-scopes" | jq . > live-scopes.json

# 3. Extract the realm JSON ArgoCD just pushed
kubectl -n keycloak get cm keycloak-realm-import -o jsonpath='{.data.realm\.json}' \
  | jq . > configmap-realm.json

Snapshot live state before any reconciliation. The live API is now the source of truth, not the ConfigMap.

Diffing live-clients.json against the clients block in configmap-realm.json showed six clients with material differences. Two were missing protocol mappers entirely. Three had client scopes that had been removed. One had role mappings that were present in the ConfigMap but missing in production, which told us that client had also been changed in the console at some point and the change had been overwritten on a previous sync we had not even noticed. That last finding was the one that mattered most: this was not the first time the OVERWRITE strategy had quietly destroyed live config. It was just the first time the destruction had cascaded far enough to break downstream services.
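The comparison described above can be sketched in Go. This is a minimal illustration, not the tooling used during the incident: it assumes the only fields that matter are clientId, protocolMappers[].name and defaultClientScopes from the Keycloak client representation, and the sample JSON in main is hypothetical data standing in for live-clients.json and the clients block of configmap-realm.json.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// client holds only the fields of a Keycloak client representation that the
// comparison needs; every other field in the JSON is ignored.
type client struct {
	ClientID        string `json:"clientId"`
	ProtocolMappers []struct {
		Name string `json:"name"`
	} `json:"protocolMappers"`
	DefaultClientScopes []string `json:"defaultClientScopes"`
}

// labels flattens a client's mappers and default scopes into one set.
func labels(c client) map[string]bool {
	out := map[string]bool{}
	for _, m := range c.ProtocolMappers {
		out["mapper:"+m.Name] = true
	}
	for _, s := range c.DefaultClientScopes {
		out["scope:"+s] = true
	}
	return out
}

// missingFrom returns, per clientId, the labels present in want but absent
// from have. Clients with no differences do not appear in the result.
func missingFrom(have, want []client) map[string][]string {
	haveByID := map[string]map[string]bool{}
	for _, c := range have {
		haveByID[c.ClientID] = labels(c)
	}
	out := map[string][]string{}
	for _, c := range want {
		for label := range labels(c) {
			if !haveByID[c.ClientID][label] {
				out[c.ClientID] = append(out[c.ClientID], label)
			}
		}
		sort.Strings(out[c.ClientID])
	}
	return out
}

func main() {
	// Hypothetical sample data, not taken from the incident.
	liveJSON := `[{"clientId":"auth-service","protocolMappers":[],"defaultClientScopes":["profile"]}]`
	realmJSON := `{"clients":[{"clientId":"auth-service","protocolMappers":[{"name":"groups"}],"defaultClientScopes":["profile","email"]}]}`

	var live []client
	var realm struct {
		Clients []client `json:"clients"`
	}
	if err := json.Unmarshal([]byte(liveJSON), &live); err != nil {
		panic(err)
	}
	if err := json.Unmarshal([]byte(realmJSON), &realm); err != nil {
		panic(err)
	}
	// Run the comparison in both directions: drift can go either way.
	fmt.Println("in ConfigMap, missing live:", missingFrom(live, realm.Clients))
	fmt.Println("live, missing in ConfigMap:", missingFrom(realm.Clients, live))
}
```

Running it in both directions matters here, because some clients had lost live config while at least one had config in the ConfigMap that production lacked.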

Two write paths to the same realm. OVERWRITE makes one of them silently win.

Reconstructing realm state without invalidating active sessions

Why we did not re-import the ConfigMap

The obvious recovery path was to fix the realm JSON in git, commit it, and let ArgoCD re-sync. We did not do that, and the reason matters. A full realm re-import, even with the right content, runs through the Keycloak realm import flow on startup. Depending on the chart and the Keycloak version, that can rotate signing keys, drop active sessions, or invalidate refresh tokens. We had roughly 8,000 active user sessions at that moment. Forcing all of them to re-authenticate at 11pm during an active incident was not a recovery; it was a second outage on top of the first.

So we split the fix into two phases. Phase one was to restore live realm state using the Admin REST API directly, client by client, mapper by mapper. The REST API can add a protocol mapper or attach a client scope to a client without bouncing anything. Phase two was to update the ConfigMap in git to match the now-correct live state AND change the import strategy, so that the next ArgoCD sync would be a no-op rather than another OVERWRITE pass.

# Phase 1: restore each missing mapper live via Admin REST API
# Example: re-add the groups protocol mapper to auth-service client
CLIENT_ID=$(jq -r '.[] | select(.clientId=="auth-service") | .id' live-clients.json)

curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  "$KC/admin/realms/primary/clients/$CLIENT_ID/protocol-mappers/models" \
  -d '{
    "name": "groups",
    "protocol": "openid-connect",
    "protocolMapper": "oidc-group-membership-mapper",
    "config": {
      "claim.name": "groups",
      "full.path": "false",
      "id.token.claim": "true",
      "access.token.claim": "true",
      "userinfo.token.claim": "true"
    }
  }'

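# Verify a freshly issued token now carries the claim before moving on.
# JWT segments are base64url without padding, so translate the alphabet and
# re-pad before handing the payload to base64 -d.
curl -s -X POST "$KC/realms/primary/protocol/openid-connect/token" \
  -d 'grant_type=client_credentials' \
  -d "client_id=auth-service" -d "client_secret=$SECRET" \
  | jq -r .access_token | cut -d. -f2 | tr '_-' '/+' \
  | awk '{ while (length($0) % 4) $0 = $0 "="; print }' | base64 -d | jq .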

Restore each mapper live, then verify the issued token actually carries the claim before moving to the next client.

We worked through the six clients in dependency order: auth-service first because every other service consumed its tokens, then the api gateway, then profile, then the rest. After each client we curl'd a fresh token and base64-decoded the payload to confirm the claim was present. Twenty-two minutes from the start of restoration, timeline-service was returning 200s again. No sessions dropped. No users re-authenticated. The Keycloak pods were never restarted.

What we changed so the next sync becomes a no-op

The one Helm value that should never be OVERWRITE

With live state correct, the dangerous artifact in the system was still the stale realm JSON in the ConfigMap and the OVERWRITE strategy that would re-apply it on any future sync. We exported the now-correct realm via the Admin API, ran it through a diff against what was in git, and committed the result. We also patched the Keycloak Helm values to set the realm import strategy to IGNORE_EXISTING.

# values.yaml for the Keycloak chart (variable names differ between charts and Keycloak versions; confirm against yours)
extraEnv: |
  - name: KEYCLOAK_IMPORT_STRATEGY
    value: IGNORE_EXISTING
  # On Keycloak 22+ via Quarkus distribution:
  - name: KC_SPI_IMPORT_SINGLE_FILE_STRATEGY
    value: IGNORE_EXISTING

# For the operator/CR variant:
# spec:
#   realmImport:
#     strategy: IGNORE_EXISTING   # NOT OVERWRITE_EXISTING

IGNORE_EXISTING means the ConfigMap seeds a realm on first creation but never overwrites existing resources. This is the correct setting for any realm that humans also edit.

We re-enabled ArgoCD auto-sync and watched it run. The sync diffed clean: ConfigMap content matched live realm, import strategy was IGNORE_EXISTING, no resources were touched. Green for the right reason this time.

We changed two things in the way the team operates going forward. First, we wrote a small drift detector that runs nightly. It pulls the live realm via the Admin API, diffs it against the realm JSON in git, and posts to a Slack channel if they disagree. It is roughly 80 lines and it has caught two console-edits-not-committed in the six weeks since. Second, we now treat OVERWRITE as a forbidden value for any realm that is also editable in the admin console. If you want OVERWRITE semantics, you must also remove admin console write access for everyone except a break-glass account, because otherwise you are building a system where one of two writers silently destroys the other's work. We have written more about this category of GitOps failure in the ArgoCD and GitOps recovery cluster, and the same pattern shows up with Grafana dashboards, Argo Workflows templates, and anything else where humans and a controller both have write access to the same object.

When GitOps is silently rewriting your identity provider

If your realm config and your cluster disagree

The hard part of this kind of incident is not the Keycloak knowledge. It is recognizing that a green ArgoCD dashboard can coexist with a destroyed production configuration, and knowing which fixes preserve sessions versus which ones lock out every user in the building at midnight. The team we worked with had the Keycloak skills. What they did not have was a recovery sequence that prioritized live state capture over git reconciliation, and a clear rule about when to apply via the Admin API versus when to let ArgoCD do it.

We run these recovery engagements every week. The OVERWRITE-vs-IGNORE_EXISTING trap has hit two other teams this quarter, both on Keycloak, and we have seen the same shape on Grafana provisioning, Argo Workflows ClusterWorkflowTemplates, and a memorable case with Vault policies. The pattern is always: controller writes, human writes, controller wins on the next reconcile, nobody notices for hours.

If your identity provider, your dashboards, or any other system with human-editable state is sitting behind ArgoCD and you have ever wondered whether you are quietly losing changes, book an infrastructure review with our team and we will be on a bridge with you the same day. The first 30 minutes will tell you whether you have a drift problem, and from there we can scope a recovery that does not require kicking your users out.

Originally published at https://infraforge.agency/insights/keycloak-realm-overwrite-argocd-sync-drift/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.

Why a Terraform apply hangs 90 minutes on a custom provider with no timeout

Muhammad Hassaan Javed — Fri, 22 May 2026 10:14:24 +0000

Two hundred destroys that needed 40 seconds of real work hung for 90 minutes. The platform team kicked off a terraform apply to remove stale config entries from an internal service, watched the progress bar stop at minute 12, and then stared at a frozen terminal until someone finally ran kill -9. By that point the state file was half-updated, the DynamoDB lock was still held, and nobody was sure which of the 200 entries had actually been deleted. The custom Terraform provider doing the destroys had a synchronous HTTP call with no context timeout, and the backend behind it was rate-limiting at 5 RPS. Neither side was wrong on its own. The contract between them was broken.

Problem signals:

terraform apply prints no output for 20+ minutes after destroys begin, no progress, no errors
The backend service is healthy on its dashboard but throttling requests at a low RPS limit
kill -9 on the terraform process leaves the DynamoDB state lock held forever
After force-unlock, terraform state list shows resources that no longer exist in the cloud
The custom provider in use was written internally and has no timeouts {} block support documented

What the team thought was happening, and what was actually happening

Forty seconds of work, ninety minutes of silence

The first assumption was that the internal config service was hung. It was not. Its dashboard showed it healthy and serving requests, just slowly. The second assumption was that terraform was making progress and just not printing anything. That one was half true. Terraform was making progress, at exactly 5 deletes per second, which is the rate limit the backend was enforcing. With 200 entries that is 40 seconds of real work. The team waited 90 minutes.

The reason for the gap was a custom Terraform provider written by a previous platform team. Its DeleteResource function looked roughly like the snippet below. No context. No timeout. No retry-with-backoff. No progress emission back to Terraform's UI layer. When the backend returned a 429, the provider's HTTP client did its own internal retry, swallowed the error, and tried again. Forever. Because the provider never returned from Delete, Terraform's supervisor saw a working call and waited.

func resourceConfigEntryDelete(d *schema.ResourceData, meta interface{}) error {
    client := meta.(*ConfigClient)
    id := d.Id()

    // No context. No timeout. No bound on retries.
    for {
        err := client.DeleteEntry(id)
        if err == nil {
            return nil
        }
        if isRateLimited(err) {
            time.Sleep(1 * time.Second)
            continue
        }
        return err
    }
}

The shape of the broken Delete function (reconstructed from the provider source)

What this should have been is below. The schema.ResourceTimeout block lets users set a timeouts {} block on the resource. The context carries that deadline. When the deadline expires, the provider returns an error, Terraform fails the destroy and keeps the resource in state, and the operator gets a prompt back instead of a call that stays silently in progress for the rest of human history.

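import (
    "context"

    "github.com/hashicorp/terraform-plugin-sdk/v2/diag"
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/retry"
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

func resourceConfigEntryDelete(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
    client := meta.(*ConfigClient)
    id := d.Id()

    // RetryContext returns a plain error, so convert it to diagnostics.
    err := retry.RetryContext(ctx, d.Timeout(schema.TimeoutDelete), func() *retry.RetryError {
        err := client.DeleteEntryWithContext(ctx, id)
        if err == nil {
            return nil
        }
        if isRateLimited(err) {
            return retry.RetryableError(err)
        }
        return retry.NonRetryableError(err)
    })
    if err != nil {
        return diag.FromErr(err)
    }
    return nil
}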

What the Delete function should look like

The half-updated state and the stuck DynamoDB lock

Why kill -9 left us worse off

When the engineer finally ran kill -9 on the terraform process, two things happened that compounded the problem. First, the DynamoDB lock entry stayed exactly where it was. Terraform releases its lock on graceful shutdown, not on SIGKILL. So the next person who ran terraform plan got the familiar error and assumed someone else was still working on it. They were not. The lock was a ghost.

Second, because the destroys had been happening serially at 5 RPS for the 12 minutes before the hang became obvious (the team realized later they had actually waited longer than they thought before noticing the silence), roughly 60 of the 200 entries had actually been deleted from the backend. Terraform had updated the state file in memory as each delete returned, but it had not yet flushed state to the remote backend, because in the default terraform workflow state is written at the end of the apply, not after each resource. So all 60 of those successful deletes were lost from the state file. The cloud was missing 60 entries that tfstate still claimed existed.

Before doing anything else we confirmed the terraform process was actually dead on the operator's machine. ps aux | grep terraform, on the actual machine, not a tmux pane from yesterday. We have force-unlocked locks that turned out to belong to a process still doing useful work, and the damage is worse than a stuck lock. Once confirmed dead, terraform force-unlock with the lock ID from the error message released DynamoDB.

# 1. Confirm no terraform process is running on the operator's machine
ssh operator-host 'ps aux | grep -v grep | grep terraform'

# 2. Release the lock (lock ID comes from the error message)
terraform force-unlock 7c4a3e22-1b9d-4e8a-b6d7-9f2a8c5e4d11

# 3. See what state thinks vs what the cloud actually has
terraform plan -refresh-only

# 4. Apply the refresh so state matches reality
terraform apply -refresh-only

The recovery sequence after confirming the process is dead

Scripting state rm and import for 200 entries

Reconciling state against a half-finished destroy

After the refresh-only apply, state and cloud agreed on what existed. But the original goal, deleting all 200 entries, was still only partially done. We now had two populations to handle: entries that still existed both in tfstate and in cloud (the destroy had not gotten to them), and entries that had been removed from cloud during the hung apply but were no longer in tfstate either (the refresh had cleaned them up). The first group we could destroy normally. The second group needed nothing further.

Where it got annoying was a third population we discovered later: a set of entries that had been deleted from cloud by the hung apply, but where the refresh had failed to notice because the provider's Read function had the same no-timeout bug and was returning stale cached data. Those entries were ghosts in tfstate. For each one we had to run terraform state rm by address. With 47 of them, we scripted it from a diff.

# Pull current tfstate resource list
terraform state list | grep config_entry | sort > tfstate_entries.txt

# Pull live entries from the backend as resource addresses
# (single request; --retry backs off and retries on a 429)
curl -s --retry 5 "$CONFIG_API/entries" | jq -r '.[].id' \
  | sed 's|^|module.config.config_entry.|' | sort > live_entries.txt

# Entries in tfstate but not in cloud: these are ghosts
comm -23 tfstate_entries.txt live_entries.txt > ghosts.txt

# Remove them from state
while IFS= read -r addr; do
  terraform state rm "$addr"
done < ghosts.txt

Generating the state rm commands from a diff between tfstate and the live backend

For the inverse case (entry exists in cloud but not in tfstate), the recovery is terraform import. We did not hit this on this incident but we have hit it on similar ones, and the same diff approach works in the other direction. The general pattern for any half-finished Terraform operation against a custom provider is laid out in our Terraform state recovery playbook.
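Both directions of that diff can be sketched in one Go function. This is an illustration under an assumption the shell version also makes: that the resource name in the Terraform address equals the backend ID, under the module.config.config_entry. prefix. The IDs in main are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// toSet turns a slice of IDs into a lookup set.
func toSet(ids []string) map[string]bool {
	set := make(map[string]bool, len(ids))
	for _, id := range ids {
		set[id] = true
	}
	return set
}

// reconcile compares IDs tracked in state with IDs that exist in the backend.
// ghosts are in state only (candidates for `terraform state rm`); unmanaged
// are in the backend only (candidates for `terraform import`).
func reconcile(inState, inBackend []string) (ghosts, unmanaged []string) {
	state, backend := toSet(inState), toSet(inBackend)
	for id := range state {
		if !backend[id] {
			ghosts = append(ghosts, id)
		}
	}
	for id := range backend {
		if !state[id] {
			unmanaged = append(unmanaged, id)
		}
	}
	sort.Strings(ghosts)
	sort.Strings(unmanaged)
	return ghosts, unmanaged
}

func main() {
	// Assumed address prefix, matching the one used in the shell script.
	const prefix = "module.config.config_entry."

	// Hypothetical IDs for illustration.
	ghosts, unmanaged := reconcile(
		[]string{"a", "b", "c"}, // from terraform state list
		[]string{"b", "c", "d"}, // from the backend API
	)
	for _, id := range ghosts {
		fmt.Printf("terraform state rm '%s%s'\n", prefix, id)
	}
	for _, id := range unmanaged {
		fmt.Printf("terraform import '%s%s' %s\n", prefix, id, id)
	}
}
```

Printing the commands rather than executing them leaves room for a human to review the list before any state is changed.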

The contract every custom Terraform provider has to honor

What the provider should have done

A custom Terraform provider is a contract. Terraform's whole supervision model assumes the provider plays by it. The contract is short: Create, Read, Update, and Delete each accept a context, each respect the user's timeouts {} block, each emit clear errors when something goes wrong, and each return in bounded time. When a provider violates the contract, Terraform's user-facing behavior degrades in ways that look like Terraform bugs but are not.

Internal providers skip the contract more often than vendor ones, because the team that writes the provider also runs the backend it talks to, and they convince themselves they have full visibility. They do not. terraform-cli is a separate process. It cannot see your retry loop. It cannot see your in-flight HTTP call. All it sees is a function that has not yet returned. The fix for this provider was four changes:

Step	What it does
1. Accept context on every CRUD function	Migrate from the legacy schema.CreateFunc signatures to the context-aware schema.CreateContextFunc variants. terraform-plugin-sdk v2 deprecates the legacy signatures, and only the context-aware ones receive the operation deadline.
2. Declare and honor timeouts on every resource	Add a Timeouts: &schema.ResourceTimeout{Create: schema.DefaultTimeout(5 * time.Minute), Delete: schema.DefaultTimeout(5 * time.Minute)} block on every resource schema. Use d.Timeout(schema.TimeoutDelete) inside the function.
3. Replace internal retry loops with retry.RetryContext	The retry helper respects the context deadline and surfaces retryable vs non-retryable errors cleanly. Hand-rolled for-loops over time.Sleep do not.
4. Pin the fixed version via .terraform.lock.hcl	Release a new patch version of the provider, update the lockfile, and remove the old version from your internal registry so nobody can fall back to it.

The apply pattern itself also needed a change. Destroying 200 entries in one shot against a 5 RPS backend is asking for trouble even with a correct provider, because a 5-minute timeout per resource is generous when one resource genuinely takes 200ms but useless when the queue ahead of you is 199 other deletes. We split future bulk operations into batches of 10 using -target, or we push the backend team to expose a bulk delete endpoint. The provider then wraps the bulk endpoint as a single resource operation instead of looping.
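The batching step can be sketched as a small Go helper that splits resource addresses into groups and builds the -target arguments for each run. It is a sketch, not the team's script: batches and applyArgs are hypothetical names, the addresses are invented, and the commands are printed rather than executed.

```go
package main

import "fmt"

// batches splits addrs into consecutive groups of at most size elements.
func batches(addrs []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for start := 0; start < len(addrs); start += size {
		end := start + size
		if end > len(addrs) {
			end = len(addrs)
		}
		out = append(out, addrs[start:end])
	}
	return out
}

// applyArgs builds the argument list for one targeted terraform apply.
func applyArgs(batch []string) []string {
	args := []string{"apply"}
	for _, addr := range batch {
		args = append(args, "-target="+addr)
	}
	return args
}

func main() {
	// Hypothetical addresses for illustration.
	var addrs []string
	for i := 0; i < 25; i++ {
		addrs = append(addrs, fmt.Sprintf("module.config.config_entry.entry_%d", i))
	}
	for i, b := range batches(addrs, 10) {
		fmt.Printf("batch %d: terraform %v\n", i+1, applyArgs(b))
	}
}
```

Smaller batches keep each apply short enough that a stall is noticed within one batch rather than after the whole run.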

The relationship that broke and what fixes each side

When a custom provider has left your state in an unknown shape

If you are looking at a hung apply right now

Hung Terraform applies against internal providers are the kind of incident that sounds boring in a postmortem and feels terrifying in the moment. You cannot tell if the apply is still doing useful work or stuck forever. You cannot kill it without risking a half-finished state. You cannot force-unlock until you are certain the process is dead. And once you do recover, you do not actually know which resources got modified and which did not, because the provider did not emit progress and the state file was not flushed.

We run these recovery engagements often enough that the script above is templated. The no-timeout custom provider pattern shows up in maybe one in five of the Terraform recoveries we have done this year, almost always with internal providers written years ago by an engineer who has since left. The fix is mechanical once you know the shape of the failure: confirm process death, force-unlock, refresh-only plan, diff state against cloud, reconcile with state rm and import, then patch the provider so it cannot happen again.

If you are staring at a hung apply right now and you are not sure whether to kill it, book an infrastructure review with our team and we will be on a bridge with you the same day. If the apply is already dead and you are sorting through the wreckage, the same engagement covers the state reconciliation and the provider fix together.

Originally published at https://infraforge.agency/insights/terraform-apply-hung-custom-provider-no-timeout/.


Grafana 'No Data' after migration: 7 reconcilers we had to kill first

Muhammad Hassaan Javed — Thu, 21 May 2026 21:13:36 +0000

The first fix lasted 90 seconds. We had corrected the Grafana datasource URL from prometheus:9999 back to prometheus:9090, watched the pod roll, refreshed the dashboard, and seen one panel come alive. By the time we opened a second tab, the ConfigMap was back to 9999. That was the real incident. The 'No Data' dashboards were a symptom of an observability stack that someone, or something, was actively re-corrupting from at least seven places we had not yet found.

Problem signals:

Grafana dashboards show 'No Data' on every panel after a cluster migration, and kubectl edit fixes revert within 1-3 minutes
Prometheus targets page is empty or stuck on a namespace that does not exist anymore
ClusterRoleBindings you just recreated reference a ClusterRole name nobody on the team typed
ps aux shows kworker-looking processes with elevated CPU that hold open file descriptors to a kubeconfig
kubectl get cronjobs -A shows entries in namespaces nobody on the platform team remembers creating

Why we stopped fixing config and started looking for what was undoing it

The fix that lasted 90 seconds

The team that called us had been at this for nine hours. After a cluster migration, every Grafana dashboard was blank. The on-call had walked through the obvious things. The Prometheus datasource in Grafana pointed at port 9999. The Loki datasource pointed at port 3199. The Prometheus scrape config had annotation keys nobody recognized (prometheus_io_metrics_enabled instead of prometheus_io_scrape) and targeted a namespace that did not exist. The Grafana deployment had a config-validator init container running sleep 3600. Each one of those was a real bug. Each one of those, fixed in isolation, would revert before the next pod rolled out.

The shape of what they were describing was not a botched migration. A botched migration leaves bad state. This was bad state being re-applied. When manual kubectl edits revert in minutes, the question is no longer 'what is wrong with the manifest', it is 'what process has write access and is reconciling against a corrupt source of truth'. We told them to stop fixing config until we had inventoried every actor that could write to the cluster.

This sounds obvious written down. In the middle of an incident, with executives asking for an ETA on dashboards, the instinct is to keep patching. We have run this play enough times now to know the patching never converges. You burn three more hours and your changes still revert. The only path out is persistence-first triage.

Seven places state was being rewritten from

A kworker thread holding a kubeconfig

We started on the nodes. ps auxf on each worker showed a process named [kworker/u8:2-events_unbound]. Square brackets usually mean a kernel thread, and you learn early not to touch kernel threads. We almost moved on. The thing that snagged our attention was CPU: a real kernel worker thread on an idle-ish node should not be sitting at 12 percent. We pulled its open file descriptors.

$ ls -l /proc/$(pgrep -f 'kworker/u8:2')/fd/ 2>/dev/null | head
lr-x------ 1 root root 64 ... 3 -> /root/.kube/config
lrwx------ 1 root root 64 ... 7 -> socket:[884213]
lr-x------ 1 root root 64 ... 9 -> /opt/.reconciler/state.json
$ cat /proc/$(pgrep -f 'kworker/u8:2')/comm
kworker/u8:2-events_unbound
$ readlink /proc/$(pgrep -f 'kworker/u8:2')/exe
/opt/.reconciler/agent

Kernel threads do not hold kubeconfigs or have an exe link. This was a userspace binary with a spoofed comm name.
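The same check can be swept across every process on a node. The Go sketch below is a rough illustration of that heuristic, not the tool used in the incident: the list of kernel-looking name prefixes is an assumption and far from complete, and it needs to run as root on Linux, because reading another user's /proc/PID/exe link otherwise fails and the process would be skipped.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// kernelLike reports whether a comm value resembles a kernel thread name.
// The prefix list is illustrative, not exhaustive.
func kernelLike(comm string) bool {
	for _, p := range []string{"kworker/", "ksoftirqd/", "flush-", "mm_percpu_wq"} {
		if strings.HasPrefix(comm, p) {
			return true
		}
	}
	return false
}

// spoofed flags a process whose name looks like a kernel thread but whose
// exe link resolves to a file. Real kernel threads have no resolvable exe.
func spoofed(comm, exe string) bool {
	return kernelLike(comm) && exe != ""
}

func main() {
	dirs, _ := filepath.Glob("/proc/[0-9]*")
	for _, dir := range dirs {
		raw, err := os.ReadFile(filepath.Join(dir, "comm"))
		if err != nil {
			continue // process exited between listing and reading
		}
		comm := strings.TrimSpace(string(raw))
		// Readlink fails for real kernel threads, leaving exe empty.
		exe, _ := os.Readlink(filepath.Join(dir, "exe"))
		if spoofed(comm, exe) {
			fmt.Printf("pid=%s comm=%q exe=%s\n", filepath.Base(dir), comm, exe)
		}
	}
}
```

Anything this prints deserves a look at its open file descriptors, as in the listing above.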

That was reconciler one. The same trick was on every node, with comm names rotating through plausible kernel-thread names (flush-dm-0, mm_percpu_wq). We collected the binary, killed every instance, removed the systemd unit that was respawning it, and moved on. Then we did the boring sweep nobody wants to do in the middle of an incident.

kubectl get cronjobs -A surfaced config-audit in kube-system and prometheus-metrics-federation in cattle-monitoring-system. Neither was ours. Both ran every 60 seconds and wrote ConfigMaps.
systemctl list-timers on each node showed k8s-health-monitor.timer firing every two minutes against the API server with a node-local kubeconfig.
ls /etc/cron.d/ had a host cron entry running a script under /opt/.reconciler/ once a minute as a belt-and-braces backup to the systemd timer.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations turned up pod-policy-webhook, namespace-policy-webhook, and the one that hurt us most, rbac-policy-enforcer.
chattr +i had been set on /etc/cron.d/k8s-health and on the corrupted ConfigMap manifests staged on disk. Edits failed with 'operation not permitted'.
Finalizers on the CronJobs prevented kubectl delete from completing until we patched them off.
PodSecurity labels on cattle-monitoring-system were set to enforce a baseline that blocked our debug pods from running.

Seven places. Any one of them, left running, would have re-corrupted the stack within minutes of our fixes. Some teams have a reconciler. This cluster had a mesh of them, each one a backup for the others. That is not a thing healthy infrastructure does; it is a thing a previous incident or a hostile takeover does. Either way, the response is the same.

The order we neutralized things, and why order matters

Why we deleted the webhooks before touching RBAC

There is a trap in this kind of cleanup. If you fix the visible problem before you neutralize the actor reverting it, you have wasted a fix and burned credibility with the room. The worst version of this in our case was the RBAC webhook. The Prometheus ClusterRoleBinding had been deleted entirely, and the deployment had been swapped to the default service account. The obvious move was to recreate the CRB and patch the deployment back to a proper SA.

We tried it once, in a scratch namespace, just to see. The CRB came back with roleRef pointing at a ClusterRole that did not exist. The mutating webhook was matching anything with 'prometheus' or 'monitoring' in the name and silently rewriting the roleRef. If we had run that against the real CRB in production with the team watching, we would have looked like we did not know what we were doing, and the fix would not have worked.

Neutralize first, then fix. RBAC and any 'monitoring'-named resource go last because the webhook would mutate them on creation.

So the order was: strip finalizers from the CronJobs, chattr -i on the immutable files, delete the three webhook configurations, suspend and delete the CronJobs in kube-system and cattle-monitoring-system, mask the systemd timer, remove the host cron entry, kill the userspace reconciler processes on every node and remove their systemd unit. Then we sat for 60 seconds and watched. No ConfigMap mutations. No Deployment patches. Quiet cluster. That was the first time in nine hours the cluster had been quiet, and you could feel the room exhale.

Restoring the observability stack once writes were ours alone

The order we put it back together

With the reconcilers gone, the config fixes were the easy part. We did them top-down by data flow: scrape config, then service routing, then the consumers.

# 1. Prometheus ConfigMap: restore annotation keys, fix namespace, drop interval
kubectl -n monitoring get cm prometheus-config -o yaml > /tmp/prom-cm.yaml
# edit: prometheus_io_metrics_* -> prometheus.io/scrape, /metrics, port
#       namespaces: [bleater-nonexistent] -> the real app namespace
#       scrape_interval: 300s -> 30s
kubectl apply -f /tmp/prom-cm.yaml

# 2. Prometheus Service: targetPort 9099 -> 9090
kubectl -n monitoring patch svc prometheus --type=json \
  -p='[{"op":"replace","path":"/spec/ports/0/targetPort","value":9090}]'

# 3. Service account and RBAC (webhooks already deleted)
kubectl -n monitoring create sa prometheus
kubectl create clusterrolebinding prometheus \
  --clusterrole=prometheus --serviceaccount=monitoring:prometheus
kubectl -n monitoring set serviceaccount deploy/prometheus prometheus

# 4. Prometheus readiness probe: port 9099 /-/healthz -> 9090 /-/ready
# 5. Loki: drop -server.http-listen-port=3199 arg, fix svc selector loki-server -> loki
# 6. Grafana: remove init container, fix probe ports, drop GF_SERVER_HTTP_PORT,
#    fix volume refs (-v2 -> base name), reset admin secret
# 7. Delete NetworkPolicy grafana-egress-restrict
kubectl -n monitoring delete networkpolicy grafana-egress-restrict

We applied these as separate kubectl operations on purpose, not a single helm rollout, so we could verify each one stuck before moving on.

After every step we waited 30 seconds and re-read the resource. Nothing reverted. We rolled the Grafana deployment, watched it come up clean with no init container blocking startup, hit the Prometheus targets page and saw 11 active up series including the application pods, then loaded a dashboard. Data. The two-minute stability window passed with no drift. We held the bridge for another 20 minutes anyway, because the team needed to see it not break more than they needed us to leave.
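The wait-and-re-read check is easy to script so nobody has to eyeball it under pressure. The sketch below is one way to do it, not the tooling we used on the bridge: `verify_sticks` and `read_prometheus_target_port` are hypothetical names, and the reader assumes `kubectl` is on the PATH with access to the `monitoring` namespace.

```python
import subprocess
import time
from typing import Any, Callable


def verify_sticks(read: Callable[[], Any], expected: Any, checks: int = 4,
                  interval_s: float = 30.0,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Re-read a value `checks` times, `interval_s` apart.

    Returns False on the first read that differs from `expected`
    (something reverted the change), True if every read matched.
    """
    for _ in range(checks):
        sleep(interval_s)
        if read() != expected:
            return False
    return True


def read_prometheus_target_port() -> str:
    """Example reader: the live targetPort of the Prometheus Service."""
    result = subprocess.run(
        ["kubectl", "-n", "monitoring", "get", "svc", "prometheus",
         "-o", "jsonpath={.spec.ports[0].targetPort}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Usage after the Service patch (four reads over two minutes):
#   verify_sticks(read_prometheus_target_port, "9090")
```

Passing the reader in as a callable keeps the loop independent of kubectl, so the same check works for a ConfigMap key, a probe port, or a file on a node.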

Persistence-first triage is now the default for post-migration observability failures

What we changed in our own playbook

We have changed how we open any incident where fixes do not stick. The first 15 minutes are no longer spent on config. They are spent on a sabotage sweep: CronJobs in every namespace, not just the obvious ones (cattle-monitoring-system bit us and we have seen it bite others); systemd timers on every node; /etc/cron.d; validating and mutating webhooks; finalizers on resources we expect to delete; immutable file attributes on staged manifests; and a ps auxf on every node with an eye on anything in square brackets that has a resolvable /proc/<pid>/exe link, which a real kernel thread never has.
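The process part of that sweep can be automated. The following is a minimal sketch under stated assumptions: it must run as root on the node (otherwise exe links of other users' processes do not resolve and spoofs are missed), and `Proc`, `read_procs`, and `fake_kernel_threads` are hypothetical names. It flags anything that ps would render in square brackets but that still has a binary behind it.

```python
import os
from typing import List, NamedTuple, Optional


class Proc(NamedTuple):
    pid: int
    comm: str            # short name from /proc/<pid>/comm
    cmdline: str         # argv joined with spaces; empty for kernel threads
    exe: Optional[str]   # target of /proc/<pid>/exe, None if it does not resolve


def read_procs(proc_root: str = "/proc") -> List[Proc]:
    """Snapshot every numeric entry under /proc."""
    procs = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "comm")) as f:
                comm = f.read().strip()
            with open(os.path.join(base, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited between listdir and open
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        try:
            exe = os.readlink(os.path.join(base, "exe"))
        except OSError:
            exe = None  # kernel threads have no executable behind them
        procs.append(Proc(int(entry), comm, cmdline, exe))
    return procs


def fake_kernel_threads(procs: List[Proc]) -> List[Proc]:
    """Processes that look like kernel threads in ps but have a real binary."""
    return [
        p for p in procs
        if (p.cmdline == "" or p.cmdline.startswith("[")) and p.exe is not None
    ]
```

On a node, `for p in fake_kernel_threads(read_procs()): print(p)` gives a short list to account for. An empty result does not prove the node is clean; it only rules out this one disguise.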

We also changed how we think about kubectl edit during a live incident. If a change has to land and the cluster has any chance of having a reconciler we have not yet found, we apply through git and watch the apply, not edit the live object. It is slower by 90 seconds and saves you from spending an hour wondering why your fix evaporated. We have written more on the same instinct in our notes on Kubernetes release failures and on ArgoCD self-heal traps, which is the friendly version of this same pattern.

The non-obvious lesson from this incident is that hostile or accidental reconcilers do not announce themselves. The kworker spoof was the cleverest piece; it would have survived a casual ps. The cattle-monitoring-system namespace looked legitimate to anyone who had ever run Rancher. The webhook had a name (rbac-policy-enforcer) that sounded like something a security team would install. In each case the move that surfaced it was boring: enumerate the category exhaustively, then ask which entries the team can account for. Anything they cannot account for is the answer.

When fixes revert, the problem is not the fix

If your post-migration monitoring keeps un-fixing itself

The hard part of incidents like this is not the Prometheus annotation key or the Grafana port. Those take 20 minutes once the cluster stops fighting you. The hard part is having the discipline to stop patching and inventory every actor that can write to your cluster, especially when leadership is asking for an ETA and your instinct is to keep typing. The hard part is also knowing what the categories of reconciler are. If you have never had to look for a mutating webhook that rewrites RBAC, or a host process pretending to be a kworker, the search takes hours. If you have seen it before, it takes 15 minutes.

We run these recovery engagements every week. We have seen the kworker spoof twice this year, the cattle-monitoring-system CronJob trick three times, and the RBAC-mutating webhook in two unrelated post-migration incidents. The playbook is portable; the patience to run it before patching is the part teams in the middle of an outage struggle with, and that is usually why they call us.

If your dashboards are blank after a migration and your fixes are not sticking, book an infrastructure review with our team and we will be on a bridge with you the same day. Bring node SSH access, kubectl with cluster-admin, and a list of every namespace you can name. We will handle the rest.

Originally published at https://infraforge.agency/insights/grafana-no-data-after-migration-reconcilers-reverting-fixes/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.

When MinIO Deny Wins Cause Silent Upload Failure

Muhammad Hassaan Javed — Thu, 21 May 2026 01:44:45 +0000

The dashboards were green. The api-gateway logged 12,400 successful media POSTs over six hours, the storage service SDK reported 200 on every PutObject, and the fanout queue happily processed every notification. The MinIO bucket had gained zero new objects in the same window. Users were seeing broken image tiles in their feeds and the on-call team had spent three hours chasing the fanout service because that was the only place the symptom was visible. The actual problem was an explicit Deny on s3:PutObject sitting inside a bucket policy that had been added during a security hardening sprint two days earlier, and MinIO was doing exactly what S3 IAM semantics say it should do: deny wins, even when the user policy says Allow.

Problem signals:

Upload endpoints return HTTP 200 but the object never appears in the bucket
Bucket notification webhooks fire and downstream consumers process phantom events
Grafana shows upload throughput as healthy because SDK success metrics dominate the panel
Users report broken image links while every service-level dashboard is green
A recent IAM or bucket policy change correlates in time with the start of phantom uploads

The discrepancy that should have been the first alert

12,400 successful uploads, zero new objects

We came in on the third hour of the incident. The team had been chasing the fanout consumer because user reports were all of the form 'my avatar is broken' and the only service touching media after upload was fanout. Their working theory was that fanout was racing the CDN, or that the notification payload was missing a key, or that signed URLs were expiring early. They had three engineers staring at fanout-service logs and finding nothing wrong, because there was nothing wrong with fanout-service.

The question we asked, which is the question we always ask first when an upload pipeline misbehaves: how many objects has the bucket actually gained in the last hour? Not how many uploads the API recorded. Not how many notifications fanout received. How many real objects exist now that did not exist sixty minutes ago. We ran the listing with the mc client and the answer was zero. The bucket had not gained a single object since 02:14 that morning, which lined up almost exactly with the merge time of a security hardening PR the platform team had landed two days prior.

# count objects added in the last hour
mc find local/bleater-media --newer-than 1h | wc -l
# 0

# meanwhile the storage-service success counter
curl -s http://prometheus/api/v1/query \
  --data-urlencode 'query=sum(increase(storage_service_put_object_success_total[1h]))'
# {"status":"success","data":{"result":[{"value":[..., "2074"]}]}}

Two views of the same hour. The SDK was confident. The bucket was not.

Once we had that gap on a shared screen the room changed. The fanout investigation got paused. The new question was: why is the SDK reporting success for writes that never persisted?

Where the 200 came from when the object never landed

What the SDK thought, and what the server actually did

This is the part of the story that is worth understanding even if you never touch MinIO. The storage service was using a streaming PutObject path. The client opens a connection, the server accepts headers and begins reading the body, and the bucket notification configuration is wired to fire on the API receipt of the PutObject call. In a healthy run, the server then writes the object, the response is 200, and the notification correctly reflects a real write. In our broken run, the server accepted the headers, fired the notification, evaluated the IAM policies, hit the explicit Deny, and closed the stream. The client SDK saw the connection close after headers were ack'd and treated it as success because the response framing looked clean enough at the transport layer. The notification had already gone out. The audit log recorded the deny. Nobody was reading the audit log.

Enabling the MinIO audit target was the diagnostic turn. Two commands and the lie unwound itself.

mc admin config set local audit_webhook:1 \
  endpoint="http://collector:8080/minio-audit" enable=on
mc admin service restart local

# tail the collector for a few seconds
# {"api":{"name":"PutObject","bucket":"bleater-media",
#        "object":"avatars/u-83421.jpg","status":"AccessDenied",
#        "statusCode":403},
#  "requestClaims":{"accessKey":"storage-service"},
#  "error":{"message":"Access Denied.",
#           "source":["cmd/auth-handler.go:checkRequestAuthTypeCredential"]}}

Audit log showed 403 AccessDenied on every PutObject from the storage-service identity. The client never saw it.

The storage-service identity had a user policy that explicitly granted s3:PutObject on arn:aws:s3:::bleater-media/*. We confirmed this in two seconds. Which meant the deny had to be coming from somewhere else.

The bucket policy nobody had read since the hardening PR

Where the explicit Deny was hiding

MinIO, like S3, evaluates IAM in two layers. The user (or service account) policy attached to the identity is one layer. The bucket policy attached to the resource is the other. An explicit Deny in either layer overrides any Allow in either layer. The hardening PR had added a bucket policy intended to lock down a different identity, an analytics reader that had been overprovisioned, and the author had written it as a Deny with a NotPrincipal exception, which inverted the intended scope. The effective rule said: deny s3:PutObject on this bucket for everyone who is not the analytics-reader identity. Which of course included the storage service.

# quote the credentials and the URL so the shell does not glob the "?"
curl -s -u "$ADMIN:$SECRET" \
  "http://minio:9000/minio/admin/v3/get-bucket-policy?bucket=bleater-media" \
  | jq .

# {
#   "Version": "2012-10-17",
#   "Statement": [
#     {
#       "Sid": "RestrictWritesToAnalyticsReader",
#       "Effect": "Deny",
#       "NotPrincipal": { "AWS": ["arn:aws:iam:::user/analytics-reader"] },
#       "Action": ["s3:PutObject"],
#       "Resource": ["arn:aws:s3:::bleater-media/*"]
#     }
#   ]
# }

The bucket policy that swallowed every write. NotPrincipal with Deny is a footgun in any S3-compatible IAM.
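To see why the user-policy Allow never had a chance, it helps to walk the evaluation by hand. The sketch below is a deliberately simplified model of deny-wins evaluation, not MinIO's implementation: it ignores Condition blocks, treats wildcards with Python's fnmatch, and the storage-service ARN is an assumed value written in the same style as the analytics-reader ARN above.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    return value if isinstance(value, list) else [value]


def _principals(element: Any) -> List[str]:
    # Accepts "*", {"AWS": "..."} or {"AWS": [...]}.
    if isinstance(element, str):
        return [element]
    return _as_list(element.get("AWS", []))


def _principal_matches(stmt: Dict[str, Any], principal: str) -> bool:
    if "NotPrincipal" in stmt:
        return principal not in _principals(stmt["NotPrincipal"])
    if "Principal" in stmt:
        listed = _principals(stmt["Principal"])
        return "*" in listed or principal in listed
    return True  # identity policies carry no Principal element


def evaluate(policies: List[Dict[str, Any]], principal: str,
             action: str, resource: str) -> str:
    """Return "Deny", "Allow" or "ImplicitDeny" across all policy layers."""
    allowed = False
    for policy in policies:
        for stmt in _as_list(policy.get("Statement", [])):
            if not _principal_matches(stmt, principal):
                continue
            if not any(fnmatchcase(action, a) for a in _as_list(stmt["Action"])):
                continue
            if not any(fnmatchcase(resource, r) for r in _as_list(stmt["Resource"])):
                continue
            if stmt["Effect"] == "Deny":
                return "Deny"  # an explicit Deny ends the evaluation
            allowed = True
    return "Allow" if allowed else "ImplicitDeny"


USER_POLICY = {"Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject"],
    "Resource": ["arn:aws:s3:::bleater-media/*"],
}]}
HARDENING_POLICY = {"Statement": [{
    "Effect": "Deny",
    "NotPrincipal": {"AWS": ["arn:aws:iam:::user/analytics-reader"]},
    "Action": ["s3:PutObject"],
    "Resource": ["arn:aws:s3:::bleater-media/*"],
}]}

STORAGE = "arn:aws:iam:::user/storage-service"  # assumed ARN, for illustration
OBJECT = "arn:aws:s3:::bleater-media/avatars/u-83421.jpg"
print(evaluate([USER_POLICY, HARDENING_POLICY], STORAGE, "s3:PutObject", OBJECT))
```

The storage-service identity matches the Allow in its own policy and then matches the Deny, because it is not the one principal the NotPrincipal element exempts. The order of the two layers does not change the outcome.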

We have seen NotPrincipal misused in three separate engagements this year. It reads as if it means 'apply this rule to everyone except this principal' the same way a NotAction would, but the semantics interact badly with cross-account and service-account identities. If you are writing a Deny that you want scoped to a specific identity, write the Deny with Principal naming the identity you mean to block. Do not invert it. The blast radius of a wrong inversion is the entire bucket.

Before we touched anything we wanted to rule out the obvious adjacent causes, because removing a security-hardening policy at 06:00 without confirmation is the kind of fix that becomes its own incident. We checked credential expiry on the storage-service service account (valid for another 47 days), checked network policy for any new egress restrictions from the storage-service namespace (none), and confirmed bucket versioning was off so we were not chasing delete markers. The audit log had already told us the answer; we just wanted the rollback to be unambiguous when we wrote it up.

The four-minute patch and the queue we had to reconcile

Removing the Deny without re-opening the bucket

Two questions before patching. First, did we want to fix the bucket policy in place, or revert the hardening PR entirely? We chose patch in place. The hardening PR had also tightened three other identities correctly, and reverting would have undone work that was real. Second, did we want to leave the analytics-reader restriction in some form? Yes, but written correctly. We rewrote the statement as an explicit Deny on the analytics-reader principal for write actions, which is what the author had intended.

cat > /tmp/bleater-media-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BlockAnalyticsReaderWrites",
      "Effect": "Deny",
      "Principal": { "AWS": ["arn:aws:iam:::user/analytics-reader"] },
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::bleater-media/*"]
    }
  ]
}
EOF

curl -s -u $ADMIN:$SECRET \
  -X PUT \
  --data-binary @/tmp/bleater-media-policy.json \
  "http://minio:9000/minio/admin/v3/set-bucket-policy?bucket=bleater-media"

# validate with a real write from the storage-service identity
curl -s -X PUT -T /tmp/canary.bin \
  -H "Authorization: ...storage-service-sigv4..." \
  http://minio:9000/bleater-media/canary/$(date +%s).bin

mc ls local/bleater-media/canary/ | tail -1
# [2024-...] 4.0KiB STANDARD 1717420831.bin

Replace the inverted NotPrincipal with an explicit Principal Deny, then prove with a canary that the storage-service identity can write.

The canary landed. Real uploads from the application resumed within the next minute as new requests came in. That fixed the forward path. It did not fix the past six hours.

The phantom notification problem was harder to bound. The fanout service had processed roughly 12,400 notification events for objects that did not exist, which meant 12,400 user timelines contained references to media that would 404 forever. We pulled the notification log from the RabbitMQ stream and diffed against the actual object listing in the bucket. The count of phantom references came in at 12,387. We pushed a one-shot reconciliation job that re-emitted upload prompts to the affected users for any media uploaded in that window, because we had no way to recover the original bytes; the storage service had streamed them to a connection that was closed before persistence.

The notification fires before the deny evaluation completes. Every layer below MinIO sees success.
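The diff between the notification log and the bucket listing is a set difference. A minimal sketch, assuming both sides have already been reduced to plain object keys (the function name and the sample keys other than the avatar path are hypothetical):

```python
from typing import Iterable, List


def phantom_keys(notified_keys: Iterable[str], bucket_keys: Iterable[str]) -> List[str]:
    """Keys that a notification announced but the bucket does not contain.

    Duplicated notifications are reported once, in first-seen order.
    """
    present = set(bucket_keys)
    seen = set()
    phantoms = []
    for key in notified_keys:
        if key not in present and key not in seen:
            seen.add(key)
            phantoms.append(key)
    return phantoms


notified = ["avatars/u-83421.jpg", "avatars/u-1.jpg", "avatars/u-83421.jpg"]
in_bucket = ["avatars/u-1.jpg"]
print(phantom_keys(notified, in_bucket))
```

Deduplicating matters here because the count drives user-facing work: each phantom key becomes one re-upload prompt, not one per redelivered event.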

What we changed so the next deny-wins conflict is not silent

The synthetic that would have caught this in 90 seconds

The deeper lesson here is not about MinIO. It is that SDK success and server persistence are different facts, and most observability stacks conflate them. Every metric on the storage service dashboard came from the SDK return code. Every metric on the fanout dashboard came from notification receipt. Nothing in the stack was sourced from the only ground truth that mattered, which was the count of objects actually present in the bucket. The hardening PR could have done much worse than this and we would still have been blind.

We made three changes after this incident. First, a synthetic that writes a canary object every 60 seconds and then lists the bucket to confirm the canary is there. The metric is the gap between writes and confirmed reads, and it alerts at gap greater than two intervals. This is the kind of probe we now build into every object-storage path we touch. Second, the MinIO audit webhook now ships to the log aggregation pipeline with a Loki alert rule on any sustained rate of statusCode 403 for PutObject, scoped per identity. Third, we wrote a pre-merge check for bucket policy changes that flags any statement using NotPrincipal with Effect Deny and requires an explicit reviewer sign-off.
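The first of those changes, the canary synthetic, fits in a few lines. This is a sketch of the idea rather than our production probe: `CanaryProbe` is a hypothetical class, and `put` and `exists` are injected callables. With the MinIO Python SDK they would plausibly wrap `put_object` and `stat_object`, but that wiring is an assumption and is left out to keep the example dependency-free.

```python
from typing import Callable


class CanaryProbe:
    """Write a canary, confirm it is really there, track consecutive misses."""

    def __init__(self, put: Callable[[str], object],
                 exists: Callable[[str], bool], max_gap: int = 2) -> None:
        self.put = put
        self.exists = exists
        self.max_gap = max_gap
        self.gap = 0  # probe intervals since the last confirmed write

    def tick(self, key: str) -> bool:
        """Run one probe cycle. Returns True when the alert should fire."""
        try:
            self.put(key)
        except Exception:
            pass  # a loud failure still counts as an unconfirmed write
        try:
            confirmed = bool(self.exists(key))
        except Exception:
            confirmed = False
        self.gap = 0 if confirmed else self.gap + 1
        return self.gap > self.max_gap


# Simulate the incident: put "succeeds" but nothing is persisted.
store = set()
probe = CanaryProbe(put=lambda key: None, exists=lambda key: key in store)
print([probe.tick(f"canary/{i}.bin") for i in range(3)])
```

The point of the design is that the alert depends only on the read-back, never on what `put` returned, so a swallowed deny looks the same to the probe as a network failure.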

# Loki alert: deny-wins on PutObject for any service identity
# (| json flattens requestClaims.accessKey into the label requestClaims_accessKey)
- alert: MinioPutObjectDenied
  expr: |
    sum by (requestClaims_accessKey) (
      rate({job="minio-audit"}
        | json
        | api_name = "PutObject"
        | api_statusCode = "403"
        [5m])
    ) > 0
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "MinIO denying PutObject for {{ $labels.requestClaims_accessKey }}"
    runbook: "Check bucket policy and user policy for explicit Deny statements."

The alert that would have paged the on-call within five minutes of the hardening PR rolling out.
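The pre-merge check for bucket policies is the simplest of the three changes. A minimal sketch, assuming the policy is available as parsed JSON in CI (`flag_not_principal_deny` is a hypothetical name, and how the findings block the merge depends on your pipeline):

```python
from typing import Any, Dict, List


def flag_not_principal_deny(policy: Dict[str, Any]) -> List[str]:
    """Return the Sid (or index) of every Deny statement that uses NotPrincipal."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # a single statement may be a bare object
    findings = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") == "Deny" and "NotPrincipal" in stmt:
            findings.append(stmt.get("Sid", f"statement[{index}]"))
    return findings


hardening_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "RestrictWritesToAnalyticsReader",
        "Effect": "Deny",
        "NotPrincipal": {"AWS": ["arn:aws:iam:::user/analytics-reader"]},
        "Action": ["s3:PutObject"],
        "Resource": ["arn:aws:s3:::bleater-media/*"],
    }],
}
print(flag_not_principal_deny(hardening_policy))
```

The check does not try to decide whether the statement is wrong. It only forces a human to look at the one construct whose blast radius is the whole bucket.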

If your bucket notifications drive downstream business logic, you have the same shape of risk we did. The notification path and the persistence path are not the same path, and the IAM evaluation sits between them. Assume nothing about server persistence based on SDK return codes. Read the audit log.

When a hardening PR silently revokes write access in production

If your object store is quietly lying to your monitors

This class of incident is hard for a specific reason: every monitoring surface a normal team has built reports healthy, because every normal monitoring surface reads from the layer above the failure. The teams we work with that have hit this pattern were not careless. They had dashboards, they had alerts, they had error budgets. None of those instruments were positioned to see a server-side deny that the SDK swallowed. The fix is a small synthetic and an audit log alert, and they take an afternoon to build. Getting to the point of knowing you need them usually takes one bad incident.

We run object-storage and IAM recovery engagements often enough that this exact shape, a hardening PR introducing a deny-wins conflict against a service account, has come up three times this year on three different stacks (MinIO, Ceph RGW, and AWS S3 with an SCP). The mechanics are the same in all three. If your team is staring at green dashboards and broken user reports, the gap between SDK success and ground-truth persistence is the first place to look. If you want a second set of eyes on a hardening rollout before it lands, or you are inside one of these incidents right now, book an infrastructure review with our team and we will be on a bridge with you the same day. We also document the audit-log and synthetic patterns in more depth on the infrastructure audit readiness page if you want to read ahead.

Originally published at https://infraforge.agency/insights/minio-deny-wins-silent-upload-failure/.

ArgoCD Drift: Three Namespaces, One JWT Hotfix

Muhammad Hassaan Javed — Wed, 20 May 2026 22:09:53 +0000

The on-call team had been chasing a 30% 401 rate on profile-service for two hours when we got pulled in. Only profile-service, only some pods, only authenticated requests. The shape of that number is what gave it away: a 30% failure rate on a service backed by a 3-pod deployment is what you see when one pod out of three is running with a different config. Except it was not a config rollout in flight. It was a week-old JWT key rotation hotfix that had landed in the live cluster and never made it to Git, while ArgoCD auto-sync had been disabled across three applications and quietly left off. By the time we opened a terminal there were four copies of the same ConfigMap floating around: one in Git, three in three namespaces, carrying three different values between them.

Problem signals:

A service is returning 401s on a fraction of requests that matches a pod count ratio (roughly a third on 3 pods, a quarter on 4 pods)
ArgoCD shows applications as OutOfSync but auto-sync is disabled and nobody remembers turning it off
kubectl diff against the rendered Helm or Kustomize output shows changes nobody can attribute to a recent PR
Multiple namespaces have a propagated copy of the same ConfigMap and the copies disagree
A recent incident postmortem mentions a manual kubectl edit or kubectl patch that was never followed by a Git commit

The first 20 minutes: mapping how far the drift had spread

Four ConfigMaps, three different values

The initial theory from the on-call lead was that a pod had missed the last restart and was still holding the pre-rotation JWT public key. Reasonable theory. It was wrong, but only because it was incomplete.

We ran the obvious diff first. Pull the ConfigMap from each of the three namespaces, pull the manifest from the Git repo at HEAD, compare. What we expected to find was two values: a correct one in the cluster and a stale one in Git, or the reverse. What we actually found was three distinct values spread across four copies.

# auth-service namespace
$ kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
RS256 key-2024-11-rot

# like-service namespace (propagated copy)
$ kubectl -n like get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
RS256 key-2024-09

# profile-service namespace (propagated copy)
$ kubectl -n profile get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
HS256 key-2024-09

# Git, main branch
$ grep -E 'JWT_(ALGORITHM|PUBLIC_KEY_ID)' deploy/*/auth-config.yaml
deploy/auth/auth-config.yaml:  JWT_ALGORITHM: HS256
deploy/auth/auth-config.yaml:  JWT_PUBLIC_KEY_ID: key-2024-09
# (and the same stale pair in like and profile manifests)

What the diff actually showed. Four copies of the same ConfigMap, three distinct states.

We reconstructed the story behind those four copies quickly from the previous week's incident channel. During the rotation, an SRE had patched auth-service's ConfigMap directly with the new RS256 key. They then walked the change into the like-service namespace and got the algorithm right but typo'd the key ID, leaving the old one. They ran out of focus before reaching profile-service, intended to come back to it, and did not. ArgoCD auto-sync had been disabled across all three applications during the incident as a guardrail and never re-enabled, which is the only reason the cluster state had survived a week without ArgoCD reverting it to the stale Git values.

So the 30% 401 rate had a clean explanation. profile-service's pods had been restarted at some point and picked up the HS256 config from the unpatched ConfigMap. The auth-service was now issuing RS256-signed tokens. profile-service was trying to validate them as HS256 with the wrong key ID. The only requests that did not 401 were the ones that happened to skip the auth path entirely.
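When a ConfigMap is propagated into several namespaces, grouping the copies by value is quicker than reading diffs pairwise. A minimal sketch, assuming the fields have already been pulled with kubectl and grep as shown earlier (`group_by_value` is a hypothetical helper; the data is the pair of fields printed above):

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def group_by_value(copies: Dict[str, Dict[str, str]],
                   keys: Tuple[str, ...]) -> Dict[Tuple[Optional[str], ...], List[str]]:
    """Map each distinct combination of `keys` to the sources that carry it."""
    groups = defaultdict(list)
    for source, data in copies.items():
        groups[tuple(data.get(k) for k in keys)].append(source)
    return dict(groups)


copies = {
    "auth":    {"JWT_ALGORITHM": "RS256", "JWT_PUBLIC_KEY_ID": "key-2024-11-rot"},
    "like":    {"JWT_ALGORITHM": "RS256", "JWT_PUBLIC_KEY_ID": "key-2024-09"},
    "profile": {"JWT_ALGORITHM": "HS256", "JWT_PUBLIC_KEY_ID": "key-2024-09"},
    "git":     {"JWT_ALGORITHM": "HS256", "JWT_PUBLIC_KEY_ID": "key-2024-09"},
}
for value, sources in group_by_value(copies, ("JWT_ALGORITHM", "JWT_PUBLIC_KEY_ID")).items():
    print(value, sources)
```

Any group with a single member is a copy somebody touched by hand. Any namespace that groups with Git is one the hotfix never reached.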

The decision that almost broke production a second time

Why Git was the wrong source of truth

The instinct, when you find drift between Git and a cluster, is to trust Git. That is the whole point of GitOps. The pull request is the source of truth and the cluster is downstream. Run an ArgoCD sync, let it overwrite the live state, move on.

That instinct would have broken auth-service inside of 30 seconds. Git held the pre-rotation HS256 values. The new private key that auth-service was signing tokens with did not match the public key Git was about to push into the ConfigMap. A sync from Git would have invalidated every token in flight across all three services, not just 30% of them.

We had to invert the model. For this one incident, the auth-service namespace's live ConfigMap was the canonical truth, and Git was stale. The recovery had to flow live-to-Git first, then Git-to-cluster for the other two namespaces, and only then could auto-sync be turned back on. The order mattered.

Recovery flow. Live state was canonical for one application, Git was canonical after the commit for the other two.

How we got the canonical values into Git and synced the stragglers

Committing a live hotfix back to Git without breaking auth

The commit itself was unremarkable once we had a clear model. We pulled the auth-service ConfigMap, extracted the two fields, and updated all three manifests in the deploy repo in a single PR with a postmortem link in the description. The PR title was 'Hotfix reconcile: commit post-rotation JWT values from live state (incident #INC-441)' because future-us was going to want to know why these values arrived without an upstream change.

# 1. Export canonical values from auth-service namespace
KID=$(kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_PUBLIC_KEY_ID}')
ALG=$(kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM}')

# 2. Patch the three manifests in the Git checkout, commit, push
for d in deploy/auth deploy/like deploy/profile; do
  yq -i ".data.JWT_PUBLIC_KEY_ID = \"$KID\" | .data.JWT_ALGORITHM = \"$ALG\"" "$d/auth-config.yaml"
done
git add deploy/auth deploy/like deploy/profile
git commit -m 'Reconcile JWT config from live auth-service (post-rotation hotfix, INC-441)'
git push

# 3. Trigger ArgoCD sync per application, in order
for app in auth-service like-service profile-service; do
  argocd app sync $app --prune=false
  argocd app wait $app --health --timeout 180
done

The commit and the sync sequence. auth-service syncs first as a no-op safety check before we touch the broken ones.

We synced auth-service first deliberately. It was already correct, so the sync should be a no-op. If it had shown a diff we did not expect, that was our signal to stop and re-audit before touching like-service or profile-service. It came back clean, which told us our commit matched the live state exactly. Then like-service synced and went healthy. Then profile-service synced and within 40 seconds the 401 rate in Prometheus went from 31% to 0.

Auto-sync we left off until the 401 rate had been at zero for ten minutes and we had eyes on the Jaeger traces showing fresh successful auth flows end to end. Only then did we re-enable auto-sync on all three applications, in the same order as the sync. We have written more about the order-of-operations on multi-app reconciles in the ArgoCD and GitOps recovery playbook.

Two cheap controls that prevent the next split-state week

What we changed about hotfix discipline after this one

The technical recovery was straightforward once the model was right. The interesting part of this incident was how a one-hour rotation hotfix turned into a week of latent drift. Two things had to go wrong together: a manual change that did not get committed, and an auto-sync toggle that did not get turned back on. Either one of those failing alone would have been caught within an hour by ArgoCD's reconciliation loop.

We made two changes to the platform after this. The first was a scheduled job that lists ArgoCD applications with auto-sync disabled and posts to a channel if any of them have been in that state for more than four hours. It is twelve lines of bash around argocd app list -o json. It has caught the same pattern twice in the last quarter, both times within the same incident as the original change instead of a week later.

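# Posted to platform-alerts when auto-sync has been off for >4h on any app.
# finishedAt of the last sync operation stands in for "how long it has been off";
# the cutoff uses GNU date syntax, and ISO timestamps compare correctly as strings.
argocd app list -o json \
  | jq -r '.[] | select(.spec.syncPolicy.automated == null)
            | [.metadata.name, .status.operationState.finishedAt] | @tsv' \
  | awk -v cutoff="$(date -u -d '4 hours ago' +%FT%TZ)" '$2 < cutoff'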

The auto-sync watchdog. The cheapest control with the highest ROI we shipped this year.
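For teams that would rather not depend on GNU date and awk, the same filter can be sketched in Python over the output of `argocd app list -o json`. This is an illustration under the same assumption as the shell version, namely that the last operation's finishedAt is an acceptable proxy for how long auto-sync has been off. `stale_manual_apps` is a hypothetical name and the sample payload is trimmed to the fields the filter reads.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional, Tuple


def stale_manual_apps(apps: List[Dict[str, Any]], now: datetime,
                      max_age: timedelta = timedelta(hours=4)
                      ) -> List[Tuple[str, Optional[str]]]:
    """Apps without automated sync whose last operation is older than max_age.

    Apps that have never recorded an operation are reported with None.
    """
    stale = []
    for app in apps:
        policy = (app.get("spec") or {}).get("syncPolicy") or {}
        if policy.get("automated") is not None:
            continue  # auto-sync is on, even if the block is an empty object
        name = app["metadata"]["name"]
        state = (app.get("status") or {}).get("operationState") or {}
        finished = state.get("finishedAt")
        if finished is None:
            stale.append((name, None))
            continue
        ts = datetime.strptime(finished, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if now - ts > max_age:
            stale.append((name, finished))
    return stale


sample_apps = [
    {"metadata": {"name": "auth-service"},
     "spec": {"syncPolicy": {"automated": {}}},
     "status": {"operationState": {"finishedAt": "2026-05-20T10:00:00Z"}}},
    {"metadata": {"name": "like-service"},
     "spec": {"syncPolicy": {}},
     "status": {"operationState": {"finishedAt": "2026-05-20T10:00:00Z"}}},
    {"metadata": {"name": "profile-service"}, "spec": {}, "status": {}},
]
now = datetime(2026, 5, 20, 22, 0, tzinfo=timezone.utc)
print(stale_manual_apps(sample_apps, now))
```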

The second change was a rule we now apply to every incident we run: if a hotfix lands in the cluster via kubectl, the same incident does not close until the change is in a merged PR. Not the next day. Not 'we'll get to it'. The incident commander treats the Git commit as a recovery step, not a follow-up. That sounds like a process rule, and it is, but it has a sharp version: the on-call's runbook for manual ConfigMap patches now includes the export-and-PR commands at the bottom of the same page. The friction to do it right is now lower than the friction to defer it.

When the cluster and Git disagree and you cannot just sync your way out

If your GitOps is in a split state right now

The hard part of this kind of incident is not the kubectl or the argocd CLI. The hard part is figuring out which system is the source of truth for which field right now, when the answer is not 'Git, always'. Get that wrong and an ArgoCD sync will take production down a second time on top of whatever is already broken. We have seen the same shape of failure four times this year: a rotation, a migration, an emergency schema change, and a CRD upgrade, each of which left some subset of clusters carrying values that Git did not yet know about.

InfraForge runs these reconciles every week. We know the order to commit, the order to sync, the checks that catch a propagated copy you forgot about, and the questions to ask before you trust Git over the live state. If your auto-sync has been off for a week and you are not sure what would happen when you turn it back on, book an infrastructure review with our team and we will be on a bridge with you the same day to walk the drift before you touch anything.

Originally published at https://infraforge.agency/insights/argocd-drift-three-namespaces-jwt-configmap-hotfix/.

How we recovered tfstate after force-unlock raced a CI apply

Muhammad Hassaan Javed — Tue, 19 May 2026 22:37:05 +0000

The engineer pinged us at 4:48 pm on a Thursday. They had been trying to push a small IAM change to staging, terraform apply had failed with Error acquiring the state lock, and they did what most of us have done at least once: they ran terraform force-unlock with the ID from the error message and re-ran apply. The apply went through. Ten minutes later a teammate on a different branch ran terraform plan and the plan output wanted to destroy and recreate 38 resources that were sitting healthy in AWS, returning 200s, serving traffic. By the time we joined the bridge, the original engineer was halfway convinced they needed to let Terraform rebuild the whole staging environment. They did not. The cloud was fine. The state file was the thing that was broken.

Problem signals:

terraform plan shows -/+ destroy and recreate for resources nobody touched and that are healthy in the cloud
Teammates see Error: state snapshot was created by Terraform v1.5.7, which is newer than current v1.5.4
S3 bucket versioning shows two or three tfstate writes inside a 60 to 90 second window
The DynamoDB lock table is empty but the state file timestamps do not line up with anyone's apply log
Someone on the team ran terraform force-unlock in the last hour

A stale lock from a dead CI job

What the engineer thought it was

The first wrong model was reasonable. The engineer saw Error acquiring the state lock, looked at the lock ID, did not recognize it, and assumed it was a leftover from a CI job that had crashed earlier in the week. They had seen stale locks before. The fix last time was force-unlock. So they ran it again.

What they did not check was whether the lock holder was actually still alive. The CI job that held the lock was a scheduled terraform plan cycle running on a 15-minute cadence, and that particular run was on the slow side because the workspace had grown to about 600 resources. It was not stuck. It was just working. The force-unlock removed the lock entry from DynamoDB while the CI process was still very much holding an in-memory version of the state file, mid-refresh. Two writers, no coordination.

When the engineer's apply finished, it wrote its version of the state to S3. About forty seconds later, the CI run finished its refresh and wrote its version of the state to S3 on top of that. Two non-linear writes, each thinking it had the latest state, each clobbering parts of the other. S3 versioning preserved both, but the live state pointer was pointing at a Frankenstein.

Three S3 versions in 90 seconds, and a plan that wanted to destroy healthy infrastructure

The moment the real cause became visible

We pulled the S3 object versions for the state file first. That is the single most useful command in a Terraform state incident, and most teams do not run it until someone external suggests it.

# start the window a few minutes before the incident so the last clean write is listed too
aws s3api list-object-versions \
  --bucket acme-tfstate-staging \
  --prefix env/staging/terraform.tfstate \
  --query 'Versions[?LastModified>=`2024-01-18T16:40:00Z`].[VersionId,LastModified,Size]' \
  --output table

# Output (abridged):
# VersionId                          LastModified               Size
# 9f3aV2.JqL...                      2024-01-18T16:51:12Z       412847
# 8h2nB1.KpM...                      2024-01-18T16:50:31Z       408992
# 7g1mA0.LoN...                      2024-01-18T16:49:48Z       411203
# 6f0lZ9.MnO...                      2024-01-18T16:42:15Z       411198   <-- last known good

Three writes inside 84 seconds. The 16:42 version was the last clean write before the collision.

Three writes in 84 seconds was the smoking gun. A healthy workspace writes state once per apply, and the next write is usually hours away. Three writes that close together meant at least two processes had been racing. We cross-checked against the CI logs and the engineer's shell history and confirmed: the CI plan cycle had been refreshing state from 16:49:48 onwards, the engineer's force-unlock landed at 16:50:18, the engineer's apply wrote state at 16:50:31, and the CI refresh wrote its stale view back at 16:51:12. The 16:51 write was the one Terraform was now reading, and it had been built from a refresh that started before half the engineer's changes existed.
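
The by-eye check on the version listing can also be sketched in code. This is a minimal illustration, not the tooling used on the bridge: find_write_bursts and its 120-second default are hypothetical, and the timestamps are the four from the abridged listing above.

```python
from datetime import datetime, timezone

def parse_ts(ts):
    """Parse an S3 LastModified value such as 2024-01-18T16:51:12Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def find_write_bursts(timestamps, window_seconds=120):
    """Group writes whose neighbours are at most window_seconds apart.

    Returns only groups with more than one write, oldest first.
    The 120s default is an assumption; tune it to your apply cadence.
    """
    ordered = sorted(parse_ts(t) for t in timestamps)
    bursts, current = [], []
    for ts in ordered:
        # A gap larger than the window closes the current group.
        if current and (ts - current[-1]).total_seconds() > window_seconds:
            if len(current) > 1:
                bursts.append(current)
            current = []
        current.append(ts)
    if len(current) > 1:
        bursts.append(current)
    return bursts

versions = [
    "2024-01-18T16:51:12Z",
    "2024-01-18T16:50:31Z",
    "2024-01-18T16:49:48Z",
    "2024-01-18T16:42:15Z",  # last known good
]
bursts = find_write_bursts(versions)
for burst in bursts:
    span = (burst[-1] - burst[0]).total_seconds()
    print(f"{len(burst)} writes in {span:.0f}s, starting {burst[0].isoformat()}")
```

The write just before the first burst is the candidate for "last known good", which is how the 16:42 version was picked here.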

That explained the plan output. The state Terraform was reading said the resources had attributes that did not match reality. Plan diffed state against the cloud, saw the mismatch, and proposed the only thing it knows how to propose: destroy and recreate. The cloud was correct. The state was lying. If we had let the apply run, we would have taken a healthy staging environment offline for somewhere between 40 minutes and two hours to rebuild things that did not need rebuilding.

Restore the pre-collision state version, then import only what actually drifted

How we worked through it

The recovery had two parts and an order that mattered. First, replace the corrupted live state with the last clean S3 version. Second, figure out which resources genuinely changed during the collision window and re-import only those. Skipping the second step is how teams end up with the same incident a week later, because real changes from the engineer's apply have been silently rolled back.

Before touching anything we pulled a local backup of the current (broken) state. If our restore went wrong, we wanted a way back.

# 1. Backup the current broken state to local disk
aws s3api get-object \
  --bucket acme-tfstate-staging \
  --key env/staging/terraform.tfstate \
  ./tfstate.broken.$(date +%s).json

# 2. Restore the last known good version in place
aws s3api copy-object \
  --bucket acme-tfstate-staging \
  --key env/staging/terraform.tfstate \
  --copy-source 'acme-tfstate-staging/env/staging/terraform.tfstate?versionId=6f0lZ9.MnO...' \
  --metadata-directive REPLACE

# 3. Confirm the active version is now the restored one
aws s3api head-object \
  --bucket acme-tfstate-staging \
  --key env/staging/terraform.tfstate \
  --query 'VersionId'

The copy-object call writes the old version as a new current version. Do not delete versions; you want the audit trail intact.

With the state restored, we ran terraform plan. The output was much shorter, around six resources, and they were the ones the engineer had actually changed in their apply. That was the divergence window: changes that had been made for real in AWS but that the restored state did not know about. Each of those needed a terraform import to reattach the live resource to the state. We did them one at a time, ran plan between each, and watched the diff shrink.

# Example: the engineer had created a new IAM role during their apply.
# The restored state predates it, but the role exists in AWS.

terraform import \
  module.platform.aws_iam_role.svc_runner \
  acme-staging-svc-runner

# After each import, re-run plan and confirm the resource is no longer in the diff.
terraform plan -out=/tmp/plan.out

# Repeat for each resource genuinely changed during the divergence window:
# - 1 IAM role
# - 1 IAM role policy attachment
# - 2 security group rules
# - 1 SSM parameter
# - 1 Lambda permission

Import surgically. Do not bulk-import; you want a clean plan after each step so you can spot collateral damage.

After the sixth import, terraform plan returned No changes. That was the success signal. The state matched the cloud, the engineer's intended changes were preserved, and nothing healthy had been destroyed. Total time on the bridge from first page to clean plan was 2 hours 40 minutes. About 45 minutes of that was the investigation; the rest was careful, slow imports with verification between each one.

flowchart TD
  A[terraform plan shows mass destroy/recreate] --> B{Are the resources actually broken in cloud?}
  B -- No, healthy --> C[State file is the problem, not cloud]
  B -- Yes, broken --> Z[Different incident; investigate cloud-side]
  C --> D[list-object-versions on tfstate]
  D --> E{Multiple writes in short window?}
  E -- Yes --> F[Identify last clean version pre-collision]
  E -- No --> Y[Investigate other corruption causes]
  F --> G[Backup current broken state locally]
  G --> H[copy-object to restore clean version]
  H --> I[terraform plan: short diff = divergence window]
  I --> J[terraform import each drifted resource]
  J --> K{Plan empty?}
  K -- No --> J
  K -- Yes --> L[Recovery complete; write postmortem]

Decision flow we use for any state-collision incident. The first branch matters most: confirm the cloud is healthy before touching state.

Two tempting shortcuts that would have made it worse

What we tried that we will not try again

Two shortcuts came up on the bridge that we ruled out. They are worth naming because both of them sound reasonable when you are tired.

1. Let terraform apply rebuild everything: The plan was already there. Just type yes. This would have caused 30 to 90 minutes of staging downtime for resources that did not need rebuilding, broken any data-layer resources with state of their own, and lost the audit trail of what had actually changed.
2. terraform refresh to fix the state: Refresh updates state from the live infrastructure for known resources. It does not learn about resources the state has forgotten, and it cannot undo a structurally corrupted state. Refresh on a Frankenstein state can deepen the damage by writing the merged view back as the new truth.

We have written about the broader pattern in the Terraform state recovery playbook, specifically the rule we now apply on every state incident: the state file is the suspect until proven otherwise. Cloud is healthy until you have evidence it is not. That ordering keeps you from running destructive applies under time pressure.

A pre-apply lock check that prints the holder's age

What we changed afterwards

The team made two changes the week after the incident. Both are small. Both have already paid for themselves.

The first change is a pre-apply wrapper script that reads the DynamoDB lock table before terraform apply runs. If a lock exists, the script prints the lock holder, when the lock was acquired, and how long ago that was. If the lock is younger than the workspace's typical apply duration plus a safety margin, the script refuses to run and tells the engineer to wait. If the lock is genuinely old (older than any plausible live process), the script still does not force-unlock automatically; it prints the exact force-unlock command and makes the engineer paste it. The friction is the point.

#!/usr/bin/env bash
# pre-apply-lock-check.sh
# Requires: aws CLI, jq, and GNU date (for date -d).
set -euo pipefail

WORKSPACE="${1:?workspace name required}"
LOCK_TABLE="acme-tfstate-locks"
MAX_PLAUSIBLE_APPLY_SECONDS=1800  # 30 minutes

# The lock item's LockID is "<bucket>/<key>". The "<bucket>/<key>-md5" item in
# the same table holds the state digest, not the lock Info, so do not read it.
LOCK_ID="acme-tfstate-staging/env/${WORKSPACE}/terraform.tfstate"

# Fail closed: if the lookup itself fails (credentials, network), set -e stops
# the script here instead of reporting "No lock".
LOCK_ITEM=$(aws dynamodb get-item \
  --table-name "$LOCK_TABLE" \
  --key "{\"LockID\":{\"S\":\"${LOCK_ID}\"}}" \
  --consistent-read \
  --output json)

# Info is a JSON string written by Terraform (ID, Who, Operation, Created, ...).
INFO=$(echo "$LOCK_ITEM" | jq -r '.Item.Info.S // empty')

if [[ -z "$INFO" ]]; then
  echo "No lock. Safe to proceed."
  exit 0
fi

HOLDER=$(echo "$INFO" | jq -r '.Who + " @ " + .Operation')
CREATED=$(echo "$INFO" | jq -r '.Created')
LOCK_UUID=$(echo "$INFO" | jq -r '.ID')
AGE=$(( $(date +%s) - $(date -d "$CREATED" +%s) ))

echo "Lock present."
echo "  Holder:  $HOLDER"
echo "  Created: $CREATED"
echo "  Age:     ${AGE}s"

if (( AGE < MAX_PLAUSIBLE_APPLY_SECONDS )); then
  echo
  echo "REFUSING TO PROCEED. Lock is younger than max plausible apply duration."
  echo "Wait for the current holder to finish, or confirm out-of-band that it is dead."
  exit 1
fi

echo
echo "Lock is older than ${MAX_PLAUSIBLE_APPLY_SECONDS}s. It may be stale."
echo "To force-unlock, run manually (do NOT automate this):"
echo "  terraform force-unlock $LOCK_UUID"
exit 2

We run this from CI and from the terraform apply wrapper on engineer laptops. Same script, same rules, both places.

The second change is operational. The team's runbook now says: if you ever run force-unlock, page the on-call channel immediately with the lock ID and the reason. That single message would have caught this incident before it became one. The CI job would have replied within seconds that it was still running, and the engineer would have known to wait the eight minutes instead of clobbering the state.

We have stopped recommending that teams treat force-unlock as a routine command. It is a recovery command. It belongs in the same mental category as DROP TABLE: technically available, occasionally necessary, never the first thing you reach for. The lock has no automatic expiry, and that is on purpose: it stays until its holder releases it. Wait it out, or confirm the holder is dead. Those are the only two paths.

When the state file is the suspect and the clock is running

If you are looking at a destroy plan you do not trust

The hard part of state-collision incidents is not the recovery commands. The commands are mechanical once you know the shape of the problem. The hard part is the 20 minutes before that, when an apply plan is sitting in your terminal showing 30+ destroys, someone senior is asking on Slack whether you can just run it, and you have to decide whether the cloud is broken or the state is. Get that wrong under pressure and you cause the outage you were trying to prevent.

We run these recovery engagements every week. The force-unlock-collision pattern has shown up four times this quarter alone, in three different shapes: a CI plan racing an engineer apply (this one), two engineers applying simultaneously after a Slack misunderstanding, and a long-running import operation that an engineer killed because they thought it had hung. The recovery shape is the same. The diagnostic discipline of confirming the cloud is healthy before touching state is the same. The thing that changes is which version of state is the right one to restore to, and that takes practice to spot quickly.

If you are staring at a terraform plan that wants to destroy resources you know are healthy, do not run apply. Book an infrastructure review with our team and we will be on a bridge with you the same day to work through the state restore and the surgical imports. We have done this enough times that we can usually have you back to an empty plan inside three hours.

Originally published at https://infraforge.agency/insights/terraform-force-unlock-state-divergence-recovery/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.

Why a forgotten RDS replica added $8,600 to one AWS bill

Muhammad Hassaan Javed — Tue, 19 May 2026 17:23:31 +0000

The finance lead forwarded the AWS bill on a Monday morning with three question marks in the subject line. The number had gone from a steady $3,200/month to $11,800 in six days. The on-call engineer's first guess, sensible enough, was that a data scientist had left a cross-region Athena job running over the weekend. It was not. It was an RDS read replica in a different AZ from its primary, provisioned a month earlier for a one-off load test, never decommissioned, retrying a replication-stream write every 50 milliseconds because somebody had flipped the primary's binlog format mid-stream. Nobody had read from the replica in three weeks. It had been quietly burning cross-AZ data transfer the whole time.

Problem signals:

AWS bill jumped 2-4x in under a week with no traffic or feature change
Cost Explorer concentrates the spike on DataTransfer-Regional-Bytes and RDSInstance line items
An RDS read replica sits in a different AZ than its primary and shows jagged ReplicaLag (spikes to 30s, drops to 0.5s, repeats)
No application config or BI tool actually points at the replica's endpoint
Recent schema or replication change on the primary that nobody coordinated with replica owners

Chasing the analytics query that did not exist

What we thought it was first

Almost every cost spike I have seen in the last three years gets blamed on analytics first. There is usually a junior data person, a notebook, a forgotten SELECT *, and a story everyone tells themselves. So we did the natural thing. We pulled the Athena query history for the previous ten days. Nothing unusual. We checked Redshift, which the team barely uses. Idle. We checked the data warehouse cluster's autoscaling history. Flat.

The clue was in Cost Explorer, but only when we grouped by usage type instead of by service. The RDS line item was up, sure, but the line item that had really moved was DataTransfer-Regional-Bytes. That is the meter for cross-AZ traffic inside a single region. Analytics queries do not typically light that meter up unless somebody has put a compute node in one AZ and the data in another, which would have been a much weirder problem.

Cross-AZ data transfer at that volume meant something was constantly shipping bytes between two availability zones. The shape of the bill said: find the thing that talks to itself across AZs at high frequency.

How we found the orphan replica

The diagnostic turn

We listed every RDS instance in the account and compared the AZ of each replica to its primary. One read replica was in us-east-1b while its primary was in us-east-1a. That alone is not a problem; cross-AZ replicas exist for legitimate HA reasons. What was odd was that this replica carried no ownership tags. No Owner. No Purpose. No Environment. Just a Name tag, which read load-test-replica-temp.

# List replicas with their AZ and their primary's AZ
aws rds describe-db-instances \
  --query 'DBInstances[?ReadReplicaSourceDBInstanceIdentifier!=`null`].[DBInstanceIdentifier,AvailabilityZone,ReadReplicaSourceDBInstanceIdentifier,DBInstanceStatus]' \
  --output table

# Then for each primary, get its AZ
aws rds describe-db-instances \
  --db-instance-identifier <primary-id> \
  --query 'DBInstances[0].AvailabilityZone'

The two commands that surfaced the orphan in about 30 seconds.
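
The same comparison can be done in one pass over the JSON that describe-db-instances returns. The sketch below is an illustration under assumptions: cross_az_replicas is a hypothetical helper, the sample response is trimmed to the three fields it reads, and the primary's identifier app-primary is invented for the example.

```python
def cross_az_replicas(response):
    """Return (replica_id, replica_az, primary_id, primary_az) tuples for
    replicas whose AZ differs from their primary's AZ.

    `response` is the parsed JSON of `aws rds describe-db-instances`.
    Primaries missing from the listing (for example a cross-region source)
    are skipped rather than guessed at.
    """
    instances = response.get("DBInstances", [])
    az_by_id = {i["DBInstanceIdentifier"]: i.get("AvailabilityZone") for i in instances}
    findings = []
    for inst in instances:
        primary_id = inst.get("ReadReplicaSourceDBInstanceIdentifier")
        if not primary_id:
            continue  # not a read replica
        primary_az = az_by_id.get(primary_id)
        replica_az = inst.get("AvailabilityZone")
        if primary_az is not None and primary_az != replica_az:
            findings.append((inst["DBInstanceIdentifier"], replica_az, primary_id, primary_az))
    return findings

# Trimmed sample; "app-primary" is a made-up identifier.
sample = {
    "DBInstances": [
        {"DBInstanceIdentifier": "app-primary", "AvailabilityZone": "us-east-1a"},
        {
            "DBInstanceIdentifier": "load-test-replica-temp",
            "AvailabilityZone": "us-east-1b",
            "ReadReplicaSourceDBInstanceIdentifier": "app-primary",
        },
    ]
}
findings = cross_az_replicas(sample)
for replica_id, replica_az, primary_id, primary_az in findings:
    print(f"{replica_id} ({replica_az}) replicates {primary_id} ({primary_az})")
```

A hit from this check is only a lead. As noted above, cross-AZ replicas are often deliberate, so the tags and ReplicaLag still decide whether it is an orphan.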

The replica's CloudWatch ReplicaLag metric was the giveaway that this was not a healthy idle replica. It would spike to 30 seconds, drop to 0.5 seconds, spike again, every minute or so. That sawtooth pattern means the replication thread is failing and retrying. We pulled the replica's error log and found the same line repeating roughly every 50 milliseconds: a binlog format mismatch. Someone had changed the primary from MIXED to ROW format three weeks earlier, and the replica had been retrying the broken stream ever since.

Every retry shipped a chunk of binlog across the AZ boundary. At 50ms intervals, 24 hours a day, for three weeks. That was the bill.
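
To put a number on "every 50ms for three weeks", here is the retry count worked out. The bytes shipped per retry are not stated above, so this sketch stops at the count rather than guessing a dollar figure; the constant names are only for illustration.

```python
RETRY_INTERVAL_MS = 50   # from the replica's error log cadence
DAYS = 21                # three weeks

retries_per_second = 1000 // RETRY_INTERVAL_MS
retries_per_day = retries_per_second * 24 * 60 * 60
total_retries = retries_per_day * DAYS

print(f"{retries_per_second} retries/s")
print(f"{retries_per_day:,} retries/day")
print(f"{total_retries:,} retries over {DAYS} days")
```

That is roughly 36 million cross-AZ round trips from a replica nobody was reading. Even a small payload per retry adds up at that volume.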

The five-minute check that prevents the worse outcome

What we did before deleting anything

The instinct, when you have found the thing burning money, is to kill it immediately. We did not. The worse outcome here is not 'replica costs another hour of cross-AZ transfer'. The worse outcome is 'replica gets deleted, a quarterly BI dashboard breaks on Friday, and finance is back in your inbox with a different question'.

So we did the cheap verification first. We grepped the application monorepo for the replica's endpoint hostname. Zero hits. We checked the BI tool's data sources (Metabase in this case). Nothing pointed at it. We checked the data team's Airflow DAGs. Clean. We checked Terraform state to see how it had been created. It was in a workspace tagged load-test that had not been touched in a month, and the engineer who created it had left the company three weeks earlier.

If something had pointed at it: The right move would have been to keep the replica, fix the binlog format, and decide whether the read pattern actually justified cross-AZ. Deletion would have caused a worse incident than the cost spike.
Nothing pointed at it: Delete with --skip-final-snapshot. The replica was already corrupted by the binlog mismatch; a final snapshot was worthless. Cost stopped accruing within minutes.

aws rds delete-db-instance \
  --db-instance-identifier load-test-replica-temp \
  --skip-final-snapshot

The actual delete, once we were confident nothing depended on the replica.

Tag hygiene, expiration sweeps, and an anomaly budget that would have caught this on day 2

What we changed afterwards

Forgotten resources are the largest single category of cloud waste I see in client accounts. Bigger than oversized instances. Bigger than reserved-instance gaps. The fix is mechanical. Every cost-generating resource needs three tags: Owner, Purpose, ExpiresAt. ExpiresAt is the one most teams skip and the one that does the work.

We deployed a small Lambda on a weekly schedule that walks RDS, EC2, ELB, ElastiCache, and OpenSearch, finds resources past their ExpiresAt date or missing tags entirely, and posts to a Slack channel pinging the Owner. The owner has two weeks to either re-tag with a new ExpiresAt or delete. Resources with no Owner go to the platform team's queue. The first sweep flagged 47 resources across the account. Six of them were costing real money.

flowchart TD
  A[Weekly Lambda runs] --> B{Resource has<br/>Owner, Purpose,<br/>ExpiresAt tags?}
  B -- no --> C[Post to platform team queue]
  B -- yes --> D{ExpiresAt<br/>in past?}
  D -- no --> E[Skip]
  D -- yes --> F[DM the Owner in Slack]
  F --> G{Owner responds<br/>within 14 days?}
  G -- extends --> H[Update ExpiresAt]
  G -- no response --> I[Auto-tag for deletion<br/>review next sweep]

The sweep logic. About 180 lines of Python in practice.
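
The decision at the heart of the sweep is small enough to show on its own. This is a minimal sketch of the first two branches of the flow above, not the deployed Lambda: classify and its return labels are hypothetical, and it assumes ExpiresAt is stored as an ISO date (YYYY-MM-DD), which the text does not specify.

```python
from datetime import date

REQUIRED_TAGS = ("Owner", "Purpose", "ExpiresAt")

def classify(tags, today):
    """Map a resource's tags to one of three sweep outcomes.

    "platform_queue": a required tag is missing or ExpiresAt is unparseable
    "notify_owner":   fully tagged and ExpiresAt is in the past
    "skip":           fully tagged and not yet expired
    """
    if any(not tags.get(name) for name in REQUIRED_TAGS):
        return "platform_queue"
    try:
        expires = date.fromisoformat(tags["ExpiresAt"])
    except ValueError:
        # Assumption: an unreadable date is treated like a missing tag.
        return "platform_queue"
    return "notify_owner" if expires < today else "skip"

today = date(2026, 5, 19)
print(classify({"Name": "load-test-replica-temp"}, today))
print(classify({"Owner": "data-eng", "Purpose": "load test", "ExpiresAt": "2026-04-01"}, today))
print(classify({"Owner": "data-eng", "Purpose": "reporting", "ExpiresAt": "2026-12-31"}, today))
```

Keeping the decision a pure function of tags and date makes it easy to test without touching AWS. The service-specific code then only has to fetch tags and act on the label.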

The second change was AWS Budgets with anomaly detection scoped per service. The team had a single account-wide budget set at $5,000/month, which is useless for catching this kind of incident because the spike was concentrated in one service and the account total only crossed $5,000 on day five. A per-service budget on RDS set at $4,000 with a 20% variance threshold would have fired on day 2. The alert that matters is the one that fires before you have spent the money, not after.

The third change was a process one. The original binlog format change had been an uncoordinated database tweak from a senior engineer who had not realized a replica existed. Schema and replication changes now require a checklist that includes 'list all replicas of this primary and confirm they support the new config' as a pre-flight step. It is not glamorous. It would have prevented the entire incident.

Where cost spike triage gets stuck

If your AWS bill just jumped and you do not know why

The hard part of a cost spike is not finding the resource. It is being confident enough to delete it. Most teams we work with have at least one orphan RDS, ElastiCache, or NAT gateway they are afraid to touch because nobody remembers what depends on it. The triage takes a day; the courage to act takes a week of meetings. By then the bill has run another $2,000.

We run cost spike triage engagements every month. We have seen the orphan-replica case four times this year, the NAT-gateway-in-the-wrong-AZ case more often than that, and a half dozen variants of 'load test that never got cleaned up' across CloudWatch Logs, OpenSearch, and Aurora Serverless. The pattern is almost always the same: a resource that nobody owns, a tag policy that was never enforced, and a budget alert tuned too coarse to catch concentration in a single service. We have written more on the underlying patterns in the cloud cost spikes problem brief and across our services.

If your AWS bill jumped this month and you cannot point at the resource with confidence, book an infrastructure review with our team and we will start with a 30-minute diagnostic call this week. Cost stops accruing the day we find the orphan.

Originally published at https://infraforge.agency/insights/forgotten-rds-replica-cross-az-cost-spike/.

Why terraform apply fails when plan passes: the map(any) trap

Muhammad Hassaan Javed — Tue, 19 May 2026 17:14:53 +0000

The on-call engineer pinged me at 4:42pm on a Friday with the release window open until 5:30. terraform apply against the staging workspace had failed with Error: Unsupported argument deep inside a child module nobody on the team had touched in seven months. terraform plan against the same workspace ran clean. They had already re-run plan twice and got fresh no-op output both times. The shape of the failure was off. It is rare for plan and apply to diverge the way they were describing, and when they do it is usually on data sources that resolve at apply time, not on a static merge() call inside a module whose code had not changed in months.

Problem signals:

terraform plan succeeds locally but terraform apply fails on a specific environment
The error is Error: Unsupported argument or Inappropriate value deep inside a child module
The traceback points at a merge() or lookup() call inside a module that has not been edited in months
Your root module input list has crossed 20 variables and several are typed any or map(any)
There is no CI job that runs terraform plan against every environment on every PR

Three hypotheses, three dead ends, twenty-two minutes left in the release window

What we ruled out in the first 18 minutes

The first thing the on-call lead suggested was state drift. Someone, somewhere, had terraform import-ed a resource by hand. We checked the audit log. No import events in the past 30 days. We checked the lock table in DynamoDB. The lock had been released cleanly by the previous successful apply at 2:11pm.

The second hypothesis was provider version drift. The team had recently bumped hashicorp/aws from 5.62 to 5.71 in versions.tf. A breaking change in a resource schema can absolutely cause an Unsupported argument error if apply pulls a newer provider than plan resolved against. We pinned both runs to 5.71 explicitly, deleted .terraform/, re-ran init, then plan, then apply. Same error, same module, same line.

The third hypothesis was a stale workspace. The active workspace can silently differ from the intended one when an engineer has exported TF_WORKSPACE and forgotten about it, because that variable overrides whatever terraform workspace select chose. We ran terraform workspace show and verified it matched the intended target. The plan output even confirmed the right resource addresses.

Three explanations, three dead ends, twenty-eight minutes burned. The release window was now twenty-two minutes wide and shrinking. The on-call lead asked whether we should just roll back the deploy and figure it out Monday. I asked one more question first.

The 15th map(any) input that had been silently incubating for three weeks

Where the collision actually lived

I asked the on-call lead to walk me through what had merged into the workspace in the past three weeks. There were six commits. Five were obvious changes (image tags, a new IAM policy, a security group port). The sixth was a feature flag, added as a 15th map(any) input on the root module by an engineer who had joined six weeks earlier.

That was the lead.

The root module had 28 input variables. 14 of them were any-typed or map(any) to absorb per-environment overrides accumulated over six years of feature additions. The new feature flag added a 15th map(any) input named feature_overrides. Its values flowed through a merge() chain down to the database child module, which did its own merge(local.legacy_db_flags, var.feature_overrides) inside modules/services/database/locals.tf.

The two maps had a key collision. Both contained a key named read_replica_routing. The new input's value was a string. The legacy local's value was a map(object({ host = string, weight = number })). merge() resolves collisions by taking the value from the last argument that defines the key, so the string from feature_overrides replaced the legacy map wherever the new input actually set that key. It only set it in staging; everywhere else feature_overrides was empty and the legacy value passed through untouched.
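
The last-argument-wins rule and the check that would have flagged this collision can be modelled outside Terraform. The sketch below imitates merge() with plain dictionaries. It is an illustration only: merge_last_wins, colliding_keys and shape are hypothetical helpers, and the sample values (the replica host, the "weighted" string) are invented.

```python
def shape(value):
    """Coarse shape label: enough to tell a string from a map or a list."""
    if isinstance(value, dict):
        return "map"
    if isinstance(value, (list, tuple)):
        return "list"
    return type(value).__name__

def merge_last_wins(*maps):
    """Imitate Terraform's merge(): later arguments override earlier ones."""
    merged = {}
    for m in maps:
        merged.update(m)
    return merged

def colliding_keys(*maps):
    """Return {key: [shape, ...]} for keys that appear in more than one map
    with differing shapes, in argument order."""
    seen = {}
    for m in maps:
        for key, value in m.items():
            seen.setdefault(key, []).append(shape(value))
    return {k: s for k, s in seen.items() if len(s) > 1 and len(set(s)) > 1}

# Invented sample values; only the key name comes from the incident.
legacy_db_flags = {
    "read_replica_routing": {"replica-a": {"host": "db-replica.internal", "weight": 1}},
}
feature_overrides = {"read_replica_routing": "weighted"}

merged = merge_last_wins(legacy_db_flags, feature_overrides)
collisions = colliding_keys(legacy_db_flags, feature_overrides)
print(merged["read_replica_routing"])
print(collisions)
```

With feature_overrides empty, as it was outside staging, the collision report is empty and the legacy map survives the merge. That is why only one environment failed.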

sequenceDiagram
  participant Op as Operator
  participant Plan as terraform plan
  participant Apply as terraform apply
  participant Child as child module
  Op->>Plan: feature_overrides (map(any))
  Plan->>Child: merge(map(any), map(any))
  Child-->>Plan: any (type-check deferred)
  Plan-->>Op: 0 to add, 0 to change (PASS)
  Op->>Apply: same input
  Apply->>Child: merge resolved to concrete value
  Child-->>Apply: Error: Unsupported argument
  Apply-->>Op: FAIL at 4:42pm

How map(any) defers type-checking past plan and surfaces it at apply

The collision had been latent for three weeks. plan succeeded because Terraform's planner walked the call graph with both maps' element types collapsed to any. The merged value passed type-check as any, which type-checks against anything. apply, which actually constructs the resource, evaluated the merged value against the receiving attribute's concrete type signature and discovered the value was a string where an object was required.

That is the part that hurts. Terraform's any type defers all type-checking until apply. Every map(any) input on a root module is a future apply-time failure waiting on a contributor who does not know the implicit shape.

Three options, one open release window, seven minutes to pick

What we did before running apply again

We had three options and one open release window. I walked the on-call lead through them on the bridge call.

1. Delete the legacy key: Fastest. Also the riskiest: the legacy read_replica_routing key was referenced by three modules-of-modules three layers down. Deleting it would have moved the failure from staging to production an hour later.
2. Rename the new key: Safe-feeling. Left the underlying any-typed contract intact. Two months later a different contributor would add another map(any) input and we would be back on a Friday afternoon with the same shape of failure.
3. Rename plus add validation: Slower. Renamed the new key to feature_routing_overrides AND added a validation block on the input that explicitly rejected the colliding shape at plan time going forward. Stopped the immediate recurrence.

Option three carried the day. The rename took seven minutes. The validation block took twelve. apply succeeded at 5:14pm with sixteen minutes to spare on the release window. The release shipped on time.

The audit work behind option one (the one we did NOT take) is what stuck with me. The next morning, we grep-ed the entire terraform/ tree for read_replica_routing to map every consumer. Seven references across four modules. Three in modules/services/database/locals.tf itself. One in modules/monitoring/cloudwatch.tf. One in modules/services/cache/lookups.tf, which read the value to construct its own routing decision and would have broken silently if we had deleted the legacy key the night before. The remaining two were in a state-recovery helper module the team had forgotten existed. We had nearly fired the second shot of our own foot.

We left a tombstone comment on the legacy key and an open PR that would, the following week, replace its map(any) type with a proper object({ ... }) schema. That work landed five days later. The downstream consumers caught the change at plan time, and three of them needed minor patches before the type tightening could merge. None of those patches would have caught the original collision. They all caught real existing bugs the any type had been hiding.

Two policy changes and one structural fix

What we changed afterwards

Two policy changes came out of that night, and one structural fix took longer.

The first policy: no new map(any) or any-typed inputs on root modules. The team's terraform/ directory has a pre-commit hook (8 lines of grep) that fails the commit if any new variable block contains type = any or type = map(any). Existing instances are grandfathered, with a TODO list tracked against each module. Three of the original 14 have been converted to typed objects so far. The hook has fired four times in the six weeks since.
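
The team's hook is a few lines of grep. The same check is sketched here in Python so it is easy to test. This is an approximation under assumptions: loose_type_lines is a hypothetical helper, it matches line by line rather than parsing variable blocks, and it does not separate new variables from grandfathered ones (a real hook would feed it only the added lines from the staged diff).

```python
import re

# Matches `type = any` and `type = map(any)`, with optional trailing comment.
LOOSE_TYPE = re.compile(r"^\s*type\s*=\s*(any|map\(\s*any\s*\))\s*(#.*)?$")

def loose_type_lines(text):
    """Return (line_number, stripped_line) for every loosely typed declaration."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if LOOSE_TYPE.match(line)
    ]

sample = '''variable "feature_overrides" {
  type        = map(any)
  default     = {}
}

variable "region" {
  type = string
}

variable "extras" {
  type = any # legacy
}
'''

offenders = loose_type_lines(sample)
for number, line in offenders:
    print(f"line {number}: {line}")
# A hook would exit non-zero when offenders is non-empty.
```

Typed declarations such as map(string) or object({ ... }) pass untouched, so the check only adds friction for loosely typed inputs.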

The second policy: every PR runs terraform plan against every environment, not just the one the contributor cares about. A matrix job in CI runs plan -var-file=envs/<env>.tfvars across all four environments and fails the PR if any of them errors. This would not have caught the original collision (plan succeeded everywhere), but it catches a different class of failure where one environment's tfvars hits an unwritten code path.

# Before: latent any-typed input
variable "feature_overrides" {
  type        = map(any)
  default     = {}
  description = "Per-environment feature flag overrides"
}

# In modules/services/database/locals.tf
locals {
  merged_flags = merge(
    local.legacy_db_flags,
    var.feature_overrides,
  )
}

# Above passes plan even when the two maps have a key
# whose value types disagree. The mismatch surfaces only
# at apply, when the receiving attribute is evaluated.

# After: typed, explicit, errors at plan time
variable "feature_overrides" {
  type = map(object({
    enabled     = bool
    rollout_pct = optional(number, 0)
    routing     = optional(string, "default")
  }))
  default     = {}
  description = "Per-environment feature flag overrides"

  validation {
    condition = alltrue([
      for k, v in var.feature_overrides :
      v.rollout_pct >= 0 && v.rollout_pct <= 100
    ])
    error_message = "rollout_pct must be between 0 and 100."
  }
}

The same variable, before and after. The lower form fails plan, not apply, when a contributor passes the wrong shape.

The structural fix took longer. A 28-input root module is not a configuration problem, it is a service-boundary problem. The team running the database stack should own a database/ root module with four inputs, not a 14-input subtree of a shared 28-input root. We split the original root into three roots along ownership boundaries (network, services, observability) using a thin terragrunt overlay for the cross-cutting variables. The split took six weeks of careful state-mv work to land without downtime. We have written more on the structural fix in the Terraform and IaC debt playbook, which covers when a shared root module starts costing more than the consistency it buys.

What we tell every team now: strong types in Terraform are not bureaucracy, they are the documentation. The half-day cost to write object({ name = string, enabled = bool, ... }) instead of map(any) buys you a plan-time failure instead of an apply-time failure, and apply-time failures land at 4:42pm on Fridays. We have stopped accepting map(any) inputs in any client engagement that involves an IaC audit, and we have not had a single contributor push back once they saw the cost.

If you are looking at a 28-input root with map(any) sprinkled through it

When your own root module is past 20 inputs

If you are reading this and your terraform/ directory has a root module past 20 inputs with several map(any) types in the input list, the failure you are heading toward is not a surprise. It is a scheduled event. The trigger will be a new contributor who does not know the implicit contract, plus one bad-enough Friday. The hardest part of cleaning it up is not the typing work itself; it is the audit of downstream consumers that have been silently depending on the loose contract for years. Two layers of modules-of-modules can hide a reference that breaks the moment you tighten the type, and your CI will not warn you because plan will keep passing right up to the apply that surfaces it.

We run these recovery and audit engagements every week. The map(any) collision pattern is the third-most-common shape we see in seed-to-Series-B SaaS Terraform repos, right after stale state lock holders and provider-version-drift cascades. It is one variant of the broader terraform apply fear problem we engage on most weeks. On a typical engagement we map every any-typed input in your root modules within the first day, prioritize them by blast radius, and either convert them in-place or split the root if the input count is the real problem. If you are looking at a Terraform root with map(any) sprinkled through it and a release window that does not forgive a 4pm apply failure, book an infrastructure review with our team and we will start with a 30-minute diagnostic call this week.

Originally published at https://infraforge.agency/insights/terraform-apply-fails-map-any-trap/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — see /review.

Init container cascade when every kubectl patch reverts in 10 seconds

Muhammad Hassaan Javed — Fri, 15 May 2026 20:15:23 +0000

The Slack ping came in at 2:14 am. Two replicas of the fanout service were stuck in Init:1/3 and the deploy queue behind them had grown to seven changes. The on-call engineer had already tried the obvious move, kubectl edit deployment, and the changes had reverted within ten seconds. By the time we joined the bridge, they had patched the same field four times in twenty minutes and were starting to wonder if etcd was corrupted. The shape of the failure was wrong though. Init containers do not normally cascade across three different upstream dependencies at once; either something upstream was common, or the spec was being rewritten under us.

Problem signals:

Pods stuck in Init:0/3 or Init:1/3 with no forward progress and no clear log story
kubectl edit deployment changes revert within ten to fifteen seconds, every time
Three init containers each failing in a different protocol layer (TCP dial timeout, NXDOMAIN, AMQP ACCESS_REFUSED)
A topology or schema ConfigMap claims state that the live broker or database disagrees with
No activeDeadlineSeconds set on init containers, so transient failures wedge the Pod indefinitely

Two replicas wedged, seven changes queued, four failed patches

The 2 am page

When we joined the bridge, the on-call engineer had already burned forty minutes on what looked like a config drift bug. The fanout service in the platform namespace had two replicas, both stuck in Init:1/3. The init container chain had three steps (wait-for-redis, wait-for-mongodb, wait-for-rabbitmq) and the redis step was failing on a hardcoded IPv4 address that did not match the live Service. They patched the env var on the Deployment. The init container restarted. Ten seconds later the IP was back. They patched it again. Same thing.

Their working hypothesis was etcd corruption or a faulty kube-apiserver caching layer. We have seen both before, but neither matches the symptom shape here. Etcd corruption surfaces as 5xx responses to kubectl, not as silent successful PATCHes that revert. We needed to find what was doing the reverting before we wasted any more time on the symptoms.

Two wrong guesses before the real culprit became visible

What we thought it was first

The first guess was a GitOps controller with self-heal enabled. ArgoCD does this with syncPolicy.automated.selfHeal: true. Flux does this with its Kustomization controller. Both will revert a kubectl patch within seconds if the live spec drifts from the source of truth in git. We checked the cluster for both. No Argo Application referenced the fanout namespace. Flux was not installed at all.

The second guess was a mutating admission webhook. A custom webhook that rewrites init container specs at admission time could in theory produce this pattern, except admission webhooks fire on create and update, not on a ten-second timer. We ran kubectl get mutatingwebhookconfigurations and the output was empty. That ruled it out.

The reverting was not coming from inside the cluster. It had to be coming from the node itself. We SSHed to the node where one of the fanout pods was scheduled and went looking. Within two minutes we had it.

$ ssh node-01 'ps -ef | grep admission'
root  1842  ... /usr/bin/supervisord -c /etc/supervisor/conf.d/admission.conf
root  2104  ... /bin/bash /var/lib/apex/admission.sh

$ ssh node-01 'cat /etc/supervisor/conf.d/admission.conf'
[program:admission]
command=/var/lib/apex/admission.sh
autorestart=true
startsecs=5

A supervisord-managed script on the node was the reverter. autorestart=true meant killing it bought us at most a few seconds.
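A cluster-side check that can shorten this kind of search is the Deployment's managedFields, which record which field manager last wrote to the object. The following is a minimal sketch, not what we ran on the bridge. It assumes the Deployment is named fanout (the name is our assumption) and a kubectl recent enough to support --show-managed-fields. The helper name last_writer is hypothetical.

```shell
# last_writer: read "manager<TAB>operation<TAB>time" lines on stdin and print
# the line with the most recent timestamp. managedFields timestamps are
# fixed-width UTC strings, so a plain string sort orders them correctly.
last_writer() {
  sort -t "$(printf '\t')" -k3,3 | tail -n 1
}

# Hypothetical usage against the live object (Deployment name is an assumption):
#   kubectl get deployment fanout -n platform --show-managed-fields \
#     -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\t"}{.time}{"\n"}{end}' \
#     | last_writer
```

If the latest writer is a kubectl-style manager with a fresh timestamp while nobody on the bridge is patching, that points at a script shelling out to kubectl rather than at a controller with its own manager name.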

The stored ConfigMap was the source of truth, not the live Deployment

What was actually overwriting our patches

The script at /var/lib/apex/admission.sh ran every ten seconds. It read three fields (redis-host, mongodb-host, amqp-uri) from a ConfigMap called fanout-init-config and patched them straight into the init container env vars on the live Deployment. The ConfigMap was the source of truth. The Deployment was a downstream artifact. Patching the Deployment was about as durable as writing in pencil.

sequenceDiagram
  participant Engineer
  participant Deployment
  participant Admission as node script
  participant ConfigMap as fanout-init-config
  Engineer->>Deployment: kubectl edit (fix redis-host)
  Deployment-->>Engineer: spec updated
  Note over Admission: tick every 10s
  Admission->>ConfigMap: read fields
  ConfigMap-->>Admission: stale values
  Admission->>Deployment: patch init container env
  Deployment-->>Engineer: changes reverted

The reverting loop. Edit the ConfigMap, not the Deployment.


This pattern shows up in places where the original GitOps story had gaps and someone wrote a node-side enforcer as a stopgap. Then the team rotated, the wiki page got out of date, and the enforcer kept running. We have seen this exact shape three times in the last year. Twice with supervisord scripts. Once with a systemd timer. The fix is always the same: find the source of truth before patching anything, and if you cannot find it in under fifteen minutes, stop and look on the nodes.

What each failure actually told us, and the fourth fix that did not show in any log

Three init containers, three different protocols

Once we knew to edit the ConfigMap, we still had three concurrent faults to diagnose. Each init container was failing in a different layer of the network stack, and each one had its own diagnostic signature.

The redis init container was dialing 10.43.181.44 on port 6379 and getting i/o timeout after thirty seconds. We compared against the live Service and got back a different ClusterIP.

$ kubectl get svc redis -n platform -o jsonpath='{.spec.clusterIP}'
10.43.218.92

$ kubectl logs fanout-7d4b9c-xx -c wait-for-redis -n platform | tail -3
dial tcp 10.43.181.44:6379: i/o timeout
dial tcp 10.43.181.44:6379: i/o timeout
dial tcp 10.43.181.44:6379: i/o timeout

The hardcoded IP had no relationship to the live Service. ClusterIPs are not stable across Service recreation. Hardcoding one is a time bomb.
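A cheap guard against this class of bug is a lint that flags IP literals in host fields before they reach the cluster. The sketch below assumes the ConfigMap values can be dumped as key=value lines. The function names are hypothetical and the check is a loose dotted-quad pattern, not a full IPv4 validator.

```shell
# is_ipv4_literal VALUE: succeed if VALUE looks like a dotted-quad IPv4 address.
is_ipv4_literal() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# hardcoded_ip_keys: read "key=value" lines on stdin and print every key whose
# value is an IP literal instead of a DNS name.
hardcoded_ip_keys() {
  while IFS='=' read -r key value; do
    if is_ipv4_literal "$value"; then
      printf '%s\n' "$key"
    fi
  done
}

# Example input, using the key names from fanout-init-config:
#   printf 'redis-host=10.43.181.44\nmongodb-host=mongodb.platform.svc.cluster.local\n' \
#     | hardcoded_ip_keys
```

Any key this prints is a candidate for replacement with the Service DNS name, which survives Service recreation where a ClusterIP does not.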

The mongodb init container was logging 'lookup mongo.platform.svc.cluster.local on 10.43.0.10:53: no such host'. The live Service was named mongodb, not mongo. Two characters short, NXDOMAIN. We caught it by running kubectl get svc -n platform and reading the actual Service name out loud. The hostname in the ConfigMap had been typed from memory by someone who remembered the team's old naming convention.

The rabbitmq init container was the most interesting of the three. The TCP connection succeeded. The AMQP frame negotiation succeeded. Authentication succeeded. The vhost open returned ACCESS_REFUSED. The URI was amqp://app:app@rabbitmq:5672/fanout-internal. We port-forwarded to the management API and listed valid vhosts.

$ kubectl port-forward -n platform svc/rabbitmq 15672:15672 &
$ curl -s -u app:app http://localhost:15672/api/vhosts | jq -r '.[].name'
/
/platform

# fanout-internal does not exist on this broker

The URI parsed cleanly and authenticated cleanly. The failure was at vhost open. Always enumerate vhosts before blaming auth or credentials.
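The vhost comparison can be scripted so it runs before anyone starts rotating credentials. This is a minimal sketch with hypothetical helper names. It does not percent-decode the vhost segment, so a vhost written as %2F in the URI would need decoding first.

```shell
# amqp_vhost URI: print the vhost segment of an AMQP URI.
# A URI with no path means the broker's default vhost; that case prints an
# empty line. Percent-encoded segments are NOT decoded in this sketch.
amqp_vhost() {
  rest=${1#*://}                      # drop the scheme
  case "$rest" in
    */*) printf '%s\n' "${rest#*/}" ;;  # everything after the first slash
    *)   printf '\n' ;;
  esac
}

# vhost_exists VHOST: succeed if VHOST appears as a whole line on stdin.
vhost_exists() {
  grep -Fxq -- "$1"
}

# Usage with the management API call shown above:
#   curl -s -u app:app http://localhost:15672/api/vhosts | jq -r '.[].name' \
#     | vhost_exists "$(amqp_vhost 'amqp://app:app@rabbitmq:5672/fanout-internal')"
```

For this incident the check fails immediately, because fanout-internal is not in the list the broker returns.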

There was a fourth fix that did not show up in any log. None of the init containers had activeDeadlineSeconds set, and neither did the Pod spec. Even after the three protocol bugs were resolved, a transient DNS hiccup or broker restart would have hung an init container indefinitely instead of failing fast and letting the kubelet retry the Pod. We added activeDeadlineSeconds: 120 on every init container and 600 at the Pod level. Defense in depth, because init container deadlines do not always catch the case where the kubelet keeps reconciling a stuck container.
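A deadline can also live inside the wait script itself, so the init container exits non-zero on its own instead of retrying forever. The wait-for scripts are not shown in this write-up, so the following is a sketch under that assumption: wait_until is a hypothetical helper, and the nc probe in the usage comment is only one possible check.

```shell
# wait_until DEADLINE_SECONDS INTERVAL_SECONDS CMD [ARGS...]
# Re-run CMD until it succeeds. Give up with exit status 1 once
# DEADLINE_SECONDS have elapsed since the first attempt.
wait_until() {
  deadline=$1
  interval=$2
  shift 2
  start=$(date +%s)
  while :; do
    if "$@"; then
      return 0
    fi
    now=$(date +%s)
    if [ $((now - start)) -ge "$deadline" ]; then
      echo "gave up after ${deadline}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical init container entrypoint (probe command is an assumption):
#   wait_until 120 2 nc -z -w 2 redis.platform.svc.cluster.local 6379
```

A non-zero exit surfaces as a failed init container with a message in the logs, which is a much clearer signal than a Pod sitting in Init:0/3 with no forward progress.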

A second ConfigMap with the same shape, intentionally broken, was a load-bearing canary

The look-alike ConfigMap we almost broke

Before we patched fanout-init-config, we almost made one more mistake. There was a second ConfigMap in the same namespace called fanout-init-config-canary. Same shape, same broken-looking IP, same broken-looking AMQP URI. It was labeled role: protected and annotated with purpose: chaos-canary. A drift-detection job in the cluster read it every fifteen minutes to confirm its own detection logic still fired on broken inputs. If we had run a sed-style global replace across all matching ConfigMaps (which is exactly what a tired engineer at 3 am tends to do) we would have silenced the canary and the team would have learned about the next round of real drift only when a customer noticed.

When you patch infrastructure under pressure, target the named resource, not the pattern. Read the labels and annotations of every resource you are about to touch. A surprising number of clusters have load-bearing decoys you do not know about until you break them. We have written more on this in the Kubernetes and CI/CD stabilization pillar.
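The label check is easy to automate so that it still happens at 3 am. A minimal sketch, assuming the labels arrive as the comma-separated key=value list that kubectl prints with --show-labels; is_protected is a hypothetical helper name.

```shell
# is_protected LABELS: LABELS is a comma-separated key=value list.
# Succeeds only if the exact label role=protected is present.
is_protected() {
  case ",$1," in
    *,role=protected,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical usage before touching a ConfigMap by name:
#   labels=$(kubectl get configmap fanout-init-config-canary -n platform \
#     --no-headers --show-labels | awk '{print $NF}')
#   if is_protected "$labels"; then
#     echo "refusing to patch a protected resource" >&2
#     exit 1
#   fi
```

Wrapping the labels in commas before matching keeps a label such as role=protected-legacy from being mistaken for the protected marker.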

Source-of-truth guard, deadline defense, a validation Job, and convergence checks

What we changed afterwards

The fanout service was the visible failure, but the recovery exposed four underlying gaps in the team's release flow. We left one durable change in place for each before disconnecting from the bridge.

The fanout-init-config ConfigMap is now committed in git and synced via a real GitOps controller, and the node-side admission script was rewritten to refuse to overwrite a Deployment if the ConfigMap's content hash does not match a known-good baseline annotation. The script can still enforce, but it cannot enforce a broken state.
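The guard reduces to a hash comparison. The sketch below shows the shape of that check rather than the rewritten script itself. The helper names are hypothetical, and so is the annotation key in the usage comment, since the real key is not given here.

```shell
# content_hash: print the hex SHA-256 of whatever arrives on stdin.
content_hash() {
  sha256sum | cut -d ' ' -f 1
}

# matches_baseline BASELINE: succeed only if stdin hashes to BASELINE.
matches_baseline() {
  [ "$(content_hash)" = "$1" ]
}

# Hypothetical usage inside the enforcer (annotation key is an assumption):
#   baseline=$(kubectl get configmap fanout-init-config -n platform \
#     -o jsonpath='{.metadata.annotations.known-good-sha256}')
#   kubectl get configmap fanout-init-config -n platform -o jsonpath='{.data}' \
#     | matches_baseline "$baseline" || { echo "hash mismatch, not enforcing" >&2; exit 0; }
```

On a mismatch the enforcer skips the patch and leaves the Deployment alone, which is the property that matters: it can no longer push a broken ConfigMap into a live spec.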

Every Deployment in the platform namespace now has activeDeadlineSeconds set at both the init container level (120 seconds) and the Pod level (600 seconds). The pair matters. Init container deadlines fail-fast the individual container; the Pod-level deadline prevents the kubelet from looping retries on a Pod that is structurally wrong.

A pre-deployment validation Job runs as part of the release flow. It carries label validation: predeploy, restartPolicy: OnFailure, activeDeadlineSeconds: 120, and a validator that does real checks: the redis, mongodb, and rabbitmq Services must each have non-empty Endpoints, and the broker must report every binding the topology ConfigMap claims to have declared. Topology drift was the other half of this incident; the binding count had silently dropped from five to three after a partial migration three weeks earlier, and nobody had noticed because the topology-version annotation still said 5.
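The Endpoints half of the validator can be sketched as a filter over kubectl output. This is an illustration with a hypothetical helper name, not the validator's actual source. It assumes the input is one line per Service in the form name, tab, space-separated ready addresses.

```shell
# services_without_endpoints: read "name<TAB>addresses" lines on stdin and
# print each Service name whose address list is empty.
services_without_endpoints() {
  awk -F '\t' '$2 == "" { print $1 }'
}

# Hypothetical usage (Service names are the ones from this incident):
#   kubectl get endpoints redis mongodb rabbitmq -n platform \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.subsets[*].addresses[*].ip}{"\n"}{end}' \
#     | services_without_endpoints
```

Any name printed means the Service exists but has nothing ready behind it, so the validation Job should exit non-zero before the rollout starts.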

# Snippet from the topology-reconcile Job that fixed the broker drift
apiVersion: batch/v1
kind: Job
metadata:
  name: topology-reconcile-2026-05-15
  labels:
    validation: predeploy
spec:
  activeDeadlineSeconds: 120
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: reconcile
        image: rabbitmq:3.13-management
        command: ["/bin/bash", "-c"]
        args:
          - |
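            set -euo pipefail
            # Assumes yq, jq and curl are available in the image, and that USER
            # and PASS are injected from a Secret (not shown in this snippet).
            EXPECTED=$(yq '.bindings | length' /config/topology.yaml)
            # One compact JSON object per line, so values with spaces survive.
            yq -o=json '.bindings[]' /config/topology.yaml | jq -c . | while IFS= read -r b; do
              EX=$(jq -r '.exchange' <<<"$b")
              QU=$(jq -r '.queue' <<<"$b")
              # Bracket syntax: an unquoted .routing-key parses as a subtraction.
              RK=$(jq -r '.["routing-key"]' <<<"$b")
              rabbitmqadmin --host=rabbitmq --username="$USER" --password="$PASS" \
                declare binding source="$EX" destination="$QU" routing_key="$RK"
            done
            ACTUAL=$(curl -sf -u "$USER:$PASS" http://rabbitmq:15672/api/bindings | jq 'length')
            [ "$ACTUAL" -ge "$EXPECTED" ] || exit 1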

Reconcile via Job, not via kubectl exec. The Job is observable, retryable, and leaves an audit record.
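A count comparison can pass while a specific binding is still missing, because /api/bindings also returns bindings the topology file never declared, such as each queue's implicit default-exchange binding. A stricter check is a set difference. The sketch below assumes both sides have been flattened to one "exchange queue routing-key" line per binding; missing_bindings is a hypothetical helper.

```shell
# missing_bindings EXPECTED_FILE ACTUAL_FILE
# Each file holds one binding per line as "exchange queue routing-key".
# Prints the bindings present in EXPECTED_FILE but absent from ACTUAL_FILE.
missing_bindings() {
  exp_sorted=$(mktemp)
  act_sorted=$(mktemp)
  sort -u "$1" > "$exp_sorted"
  sort -u "$2" > "$act_sorted"
  comm -23 "$exp_sorted" "$act_sorted"   # lines unique to the expected set
  rm -f "$exp_sorted" "$act_sorted"
}

# Hypothetical way to build ACTUAL_FILE from the management API:
#   curl -sf -u "$USER:$PASS" http://rabbitmq:15672/api/bindings \
#     | jq -r '.[] | "\(.source) \(.destination) \(.routing_key)"' > actual.txt
```

Empty output means every declared binding exists on the broker; anything printed names the exact binding that drifted, which is more useful in an incident than a bare count.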

The team's rollback runbook now requires two consecutive green health observations twenty seconds apart before a rollout is declared finished. Single-shot green is not enough on a cluster that has a ten-second admission tick, because you can catch the Pod between reverts and declare victory ninety seconds before the next failure cascade. We learned to distrust single-shot green the hard way on a different engagement, and that is now the default in every recovery handover we ship.
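The consecutive-green rule is small enough to script. A minimal sketch, with a hypothetical helper name and a health command you would supply yourself:

```shell
# converged CHECKS INTERVAL_SECONDS CMD [ARGS...]
# Succeeds only if CMD succeeds CHECKS times in a row, INTERVAL_SECONDS apart.
# The first failure ends the run with exit status 1.
converged() {
  checks=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$checks" ]; do
    "$@" || return 1
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  return 0
}

# Hypothetical usage matching the runbook (Deployment name is an assumption):
#   converged 2 20 kubectl rollout status deployment/fanout -n platform --timeout=5s
```

With a twenty-second gap, the second observation lands after at least one ten-second enforcer tick, so a patch that is about to be reverted cannot pass both checks.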

If you are looking at a cluster where every patch reverts within seconds, do not patch faster. Stop patching and find what is doing the reverting. The fix itself is usually ten minutes once you know where the source of truth lives. Finding the source of truth is what takes the hour. If you want a second pair of eyes on a system that is in this state, request an infrastructure review and we will be on a bridge with you the same day.

Originally published at https://infraforge.agency/insights/init-container-cascade-reverting-patches/.
