<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shamsher Khan (Shamz)</title>
    <description>The latest articles on Forem by Shamsher Khan (Shamz) (@shamsher_khan).</description>
    <link>https://forem.com/shamsher_khan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3680815%2F45c25a7d-a36d-4a44-b375-bb952dcd5320.jpeg</url>
      <title>Forem: Shamsher Khan (Shamz)</title>
      <link>https://forem.com/shamsher_khan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shamsher_khan"/>
    <language>en</language>
    <item>
      <title>Production AI Agents in Kubernetes: A 7-Control Checklist for Platform Teams</title>
      <dc:creator>Shamsher Khan (Shamz)</dc:creator>
      <pubDate>Thu, 07 May 2026 00:33:26 +0000</pubDate>
      <link>https://forem.com/shamsher_khan/production-ai-agents-in-kubernetes-a-7-control-checklist-for-platform-teams-4jd5</link>
      <guid>https://forem.com/shamsher_khan/production-ai-agents-in-kubernetes-a-7-control-checklist-for-platform-teams-4jd5</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  📝 If you've ever watched an AI agent invent a service that doesn't exist or fabricate an alert it then "handles," this checklist is for you.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A vendor-neutral, engineer-focused guide to identity, authorization, audit, rate limits, rollback, and deterministic fallbacks — built on patterns proven in production Kubernetes.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qgbaqo5s9mar0vauvvc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qgbaqo5s9mar0vauvvc.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1. The seven control-layer guardrails standing between an AI agent and production infrastructure.&lt;/em&gt;&lt;/p&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why I wrote this.&lt;/strong&gt; I started drafting this checklist after watching one of our production Kubernetes proof-of-concepts go sideways — not because the AI agent was poorly built, but because the controls around it weren't there. The agent was working as designed; the platform around it had gaps. This piece is what I wish someone had handed me on day one of putting tool-using agents anywhere near a production cluster. It is, in many ways, a continuation of a thread I have been pulling on in &lt;a href="https://opscart.com/kubernetes-platform-engineering-and-autonomous-infrastructure/" rel="noopener noreferrer"&gt;&lt;em&gt;The Next Evolution After Kubernetes: Platform Engineering and Autonomous Infrastructure&lt;/em&gt;&lt;/a&gt; — the idea that the next era of infrastructure is less about new tools and more about new disciplines applied with old rigor.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Why This Checklist Exists
&lt;/h2&gt;

&lt;p&gt;In April 2026, an AI coding agent running inside Cursor encountered a credential mismatch in a staging environment, decided on its own to "fix" the problem, and deleted a startup's production database — along with the volume-level backups — in a single API call. The whole thing took nine seconds. The startup, PocketOS, builds software for car rental operators; its founder Jeremy Crane reported the incident publicly, and the agent itself produced a written explanation admitting it had violated multiple safety rules it was supposed to follow, including an explicit instruction never to run destructive commands without user approval. Railway, the infrastructure provider, confirmed the deletion publicly. The incident was reported by &lt;em&gt;The Register&lt;/em&gt;, &lt;em&gt;Decrypt&lt;/em&gt;, &lt;em&gt;Fast Company&lt;/em&gt;, and &lt;em&gt;Business Insider&lt;/em&gt;, among others.&lt;/p&gt;

&lt;p&gt;That is not a model failure. That is an envelope failure. The model did exactly what models do — it reasoned, it acted, it explained. What was missing was every single guardrail that should have stood between the agent's reasoning and the production system: scoped credentials, destructive-action gates, deny-by-default authorization, rate limits, rollback paths, deterministic fallbacks for high-risk operations. Those are not AI problems. They are platform engineering problems, and we already know how to solve them. We just have to insist on solving them &lt;em&gt;before&lt;/em&gt; the agent ships, not after the first incident.&lt;/p&gt;

&lt;p&gt;Tool-using AI agents are no longer experimental. They are calling APIs, mutating databases, opening pull requests, deploying code, and increasingly making decisions that touch regulated workloads. The discipline around them has not yet caught up to the risk they introduce.&lt;/p&gt;

&lt;p&gt;What follows is a seven-control production checklist for tool-using agents, synthesized from platform engineering patterns that have served Kubernetes operators well for years and adapted to the new failure modes agents introduce. It is vendor-neutral by design. The patterns map cleanly to AWS, Azure, GCP, on-prem Kubernetes, and managed agent platforms alike. Where I cite a concrete example, I do so to show what the control looks like when it is actually wired up — not to endorse a stack.&lt;/p&gt;

&lt;p&gt;If your team is shipping agents and cannot tick all seven boxes, you do not have a production system. You have a prototype with production traffic.&lt;/p&gt;


&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tool-using AI agents introduce new operational risks — but the solutions are familiar.&lt;/li&gt;
&lt;li&gt;Production readiness depends on enforcing identity, authorization, auditability, and rollback.&lt;/li&gt;
&lt;li&gt;This checklist translates proven platform engineering patterns into agent systems.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  A Mental Model: Three Layers, One Failure Surface
&lt;/h2&gt;

&lt;p&gt;Before the checklist, a framing that has helped every platform team I have walked through this material:&lt;/p&gt;

&lt;p&gt;Think of an agent system as three layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model layer&lt;/strong&gt; — the reasoning component (the LLM and its prompt).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control layer&lt;/strong&gt; — identity, authorization, rate limits, audit, rollback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution layer&lt;/strong&gt; — the tools and APIs the agent ultimately calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most agent failures in production are not failures of the model layer. They are failures of the &lt;strong&gt;control layer&lt;/strong&gt; — the connective tissue that decides what the model is allowed to do, observes what it does, and contains it when it misbehaves. (I explored a related version of this idea — that what you cannot see is what hurts you most — in &lt;a href="https://opscart.com/kubernetes-cluster-lying-60-second-scan/" rel="noopener noreferrer"&gt;&lt;em&gt;Your Kubernetes Cluster Is Lying to You: What a 60-Second War-Room Scan Reveals&lt;/em&gt;&lt;/a&gt;.) The seven controls below are all control-layer controls. That is where the work is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TB
    subgraph M["Model Layer"]
        LLM["LLM + Prompt&amp;lt;br/&amp;gt;(reasoning)"]
    end
    subgraph C["Control Layer — where production safety lives"]
        ID["Identity"]
        AUTHZ["Authorization"]
        ALLOW["Tool Allowlist"]
        AUDIT["Audit Logs"]
        RL["Rate Limits&amp;lt;br/&amp;gt;+ Circuit Breakers"]
        RB["Rollback"]
        FB["Deterministic&amp;lt;br/&amp;gt;Fallback"]
    end
    subgraph E["Execution Layer"]
        TOOLS["Tools / APIs&amp;lt;br/&amp;gt;(databases, K8s, cloud)"]
    end
    M --&amp;gt; C --&amp;gt; E
    style C fill:#1B2B6E,color:#fff,stroke:#7DC400,stroke-width:2px
    style M fill:#f5f5f5,color:#000
    style E fill:#f5f5f5,color:#000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The seven checklist items are the working contents of the control layer. Strip any one out and you have rebuilt the conditions for the PocketOS incident.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tbendpfi534e29kmrb1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tbendpfi534e29kmrb1.jpg" alt=" " width="800" height="454"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2. Three-layer mental model: the seven controls live in the middle layer, where production safety is enforced.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Identity: Every Agent Is a First-Class Workload
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Prevents credential leakage and unbounded access.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first question I ask of any agent design is: &lt;em&gt;what identity does this agent present when it calls a tool?&lt;/em&gt; Far too often the answer is "a long-lived API key in an environment variable" — which is the 2026 equivalent of leaving a root password in a Slack DM.&lt;/p&gt;

&lt;p&gt;An agent is a workload. Treat it like one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short-lived, federated credentials.&lt;/strong&gt; On Kubernetes, this is workload identity (AKS Workload Identity, GKE Workload Identity, EKS IRSA) bound to a ServiceAccount. The agent receives an OIDC token, exchanges it for a scoped cloud credential, and the credential expires in minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One identity per agent role, not per deployment.&lt;/strong&gt; A "customer-support-agent" and a "billing-reconciliation-agent" should be distinct identities even if they run from the same image. Identity is how you draw the blast radius.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No shared service accounts between agents and humans.&lt;/strong&gt; If a human can impersonate an agent's identity, you lose the ability to attribute any action to either.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A leaked long-lived token in an agent context can mean unbounded tool invocation by an attacker, with no expiry, often across multiple downstream APIs. Short-lived federated credentials reduce that window from "until someone notices" to "until the next rotation."&lt;/p&gt;
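&lt;p&gt;What "short-lived" buys you is inspectable in code. A projected service-account token is just a JWT whose &lt;code&gt;exp&lt;/code&gt; claim is minutes away. The sketch below (hypothetical helper names, standard library only) reads that claim so a runtime can flag an agent that is somehow holding a credential with no meaningful expiry:&lt;/p&gt;

```python
import base64
import json
import time

def jwt_expiry(token):
    """Return the 'exp' claim of a JWT without verifying the signature.
    (Signature verification belongs to the receiving API; here we only inspect TTL.)"""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return claims["exp"]

def remaining_ttl(token, now=None):
    """Seconds of validity left on the credential the agent is holding."""
    now = time.time() if now is None else now
    return jwt_expiry(token) - now
```

&lt;p&gt;On Kubernetes, the token itself is typically mounted as a projected volume; the point of the sketch is that a short TTL is a property you can assert on, not a promise you have to trust.&lt;/p&gt;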


&lt;h2&gt;
  
  
  2. Authorization: Least Privilege at the Tool Boundary
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Prevents agents from performing unintended or high-risk actions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Identity tells the system &lt;em&gt;who&lt;/em&gt; the agent is. Authorization tells the system &lt;em&gt;what&lt;/em&gt; it is allowed to do. These are different problems and they fail in different ways.&lt;/p&gt;

&lt;p&gt;The pattern that has worked for me, borrowed almost directly from Kubernetes RBAC, is &lt;strong&gt;capability-based tool grants&lt;/strong&gt;. Each agent role gets an explicit list of tool capabilities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;agent_role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer-support-agent&lt;/span&gt;
&lt;span class="na"&gt;allowed_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;crm.read_customer&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;crm.read_ticket&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;crm.update_ticket_status&lt;/span&gt;
&lt;span class="na"&gt;denied_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# deny-by-default&lt;/span&gt;
&lt;span class="na"&gt;constraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;crm.update_ticket_status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;allowed_values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending_customer"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;forbidden_values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closed_billing_dispute"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two design rules I do not bend on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Deny by default.&lt;/strong&gt; New tools are unavailable to existing agents until explicitly granted. Adding a tool to an allowlist is a code review event.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorization belongs in the platform, not in the prompt.&lt;/strong&gt; If the only thing stopping an agent from calling &lt;code&gt;delete_customer&lt;/code&gt; is a sentence in its system prompt, you have no authorization. Prompts are not security boundaries. The tool gateway is.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the same logical structure as a Kubernetes Role and RoleBinding: a small, explicit set of verbs over a small, explicit set of resources, enforced at a checkpoint the workload cannot bypass. (For a deeper walkthrough of how those primitives behave under real failures, see the &lt;a href="https://opscart.com/kubernetes-guide/kubernetes-debugging-handbook/" rel="noopener noreferrer"&gt;&lt;em&gt;Production Kubernetes Debugging Handbook&lt;/em&gt;&lt;/a&gt;.)&lt;/p&gt;
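&lt;p&gt;A minimal sketch of what the gateway-side check for the YAML policy above might look like. The names are hypothetical, and a real policy would key constraints by argument name rather than checking every argument value, but the shape is the point: membership in an explicit grant set, deny by default, enforced outside the prompt:&lt;/p&gt;

```python
# Deny-by-default: a call passes only if the tool is explicitly granted
# AND every constrained argument carries an allowed value.
POLICY = {
    "agent_role": "customer-support-agent",
    "allowed_tools": {
        "crm.read_customer",
        "crm.read_ticket",
        "crm.update_ticket_status",
    },
    "constraints": {
        "crm.update_ticket_status": {
            "allowed_values": {"resolved", "pending_customer"},
        },
    },
}

def authorize(policy, tool, args):
    """Return (allowed, reason). Anything not explicitly granted is denied."""
    if tool not in policy["allowed_tools"]:
        return False, "tool not in allowlist"
    rule = policy["constraints"].get(tool)
    if rule:
        # Crude for the sketch: a production policy would scope this per argument.
        for value in args.values():
            if value not in rule["allowed_values"]:
                return False, "argument value not permitted"
    return True, "ok"
```

&lt;p&gt;Note that &lt;code&gt;crm.delete_customer&lt;/code&gt; is denied not because anyone listed it, but because nobody granted it. That asymmetry is the whole control.&lt;/p&gt;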




&lt;h2&gt;
  
  
  3. Tool Allowlists: A Registry, Not a Convention
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Prevents ad-hoc tools from quietly expanding the agent's blast radius.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Closely related to authorization, but worth its own line on the checklist, is the &lt;strong&gt;tool registry&lt;/strong&gt;. Every tool an agent can call must exist as a versioned, reviewed entry in a central registry. Ad-hoc tools — a function someone added to a notebook last sprint and quietly wired into the agent runtime — are how you end up with an agent that can call &lt;code&gt;kubectl delete namespace&lt;/code&gt; because nobody noticed.&lt;/p&gt;

&lt;p&gt;A useful tool registry entry includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool name and version.&lt;/strong&gt; Versioning matters; a tool's behavior can change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input and output schema.&lt;/strong&gt; Strict schemas at the boundary catch a surprising number of model mistakes before they reach the downstream system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Side-effect classification.&lt;/strong&gt; Read-only, idempotent-write, or destructive. Destructive tools require additional gates (approval, confirmation, dry-run mode).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate-limit class&lt;/strong&gt; (see control #5).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Owner.&lt;/strong&gt; A human team accountable for the tool's correctness.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conceptually, the tool registry is to agents what the container registry plus admission control is to Kubernetes workloads: a single place where what is allowed to run is declared, and a checkpoint that refuses everything else. The same deny-by-default discipline that container security demands at the runtime layer — capability dropping, seccomp profiles, AppArmor — applies one level up at the agent's tool boundary. (I cover these container-side patterns end-to-end in the OpsCart &lt;a href="https://opscart.com/guides-tutorials/docker-security-guide/" rel="noopener noreferrer"&gt;&lt;em&gt;Docker Security Guide&lt;/em&gt;&lt;/a&gt;; the conceptual line from "what syscalls can this container make" to "what tools can this agent call" is shorter than it looks.)&lt;/p&gt;
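&lt;p&gt;The registry fields above can be sketched as a data structure plus a resolution checkpoint. This is a hypothetical shape, not any particular framework's API; the useful property is that resolution fails closed, and destructive tools carry an extra gate even when registered:&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Hypothetical registry entry: the metadata control #3 says every tool must carry.
@dataclass(frozen=True)
class ToolEntry:
    name: str
    version: str
    side_effect: str          # "read-only", "idempotent-write", or "destructive"
    rate_limit_class: str     # see control #5
    owner: str                # accountable human team
    input_schema: dict = field(default_factory=dict)

class ToolRegistry:
    """Refuses everything that was never declared: the admission-control analogy."""
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[(entry.name, entry.version)] = entry

    def resolve(self, name, version, approved=False):
        entry = self._entries.get((name, version))
        if entry is None:
            raise LookupError("tool not in registry: " + name)
        # Destructive tools need an additional gate even when registered.
        if entry.side_effect == "destructive" and not approved:
            raise PermissionError("destructive tool requires approval: " + name)
        return entry
```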




&lt;h2&gt;
  
  
  4. Audit Logs: Every Tool Call, Forever
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Turns "we think this happened" into "we know exactly what happened."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In regulated industries, the audit log is the difference between a recoverable incident and a regulatory finding. In every industry, it is the difference between &lt;em&gt;knowing&lt;/em&gt; and &lt;em&gt;guessing&lt;/em&gt;. Agents amplify this need because they generate far more tool invocations per unit of human intent than any prior class of system.&lt;/p&gt;

&lt;p&gt;The non-negotiables for an agent audit log:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Every tool call recorded.&lt;/strong&gt; Inputs, outputs, agent identity, parent reasoning step ID, timestamp, latency, tool version, outcome.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutable storage.&lt;/strong&gt; Append-only; ideally with a hash chain or write-once object storage. Auditors do not accept "we promise we did not edit it."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured, queryable format.&lt;/strong&gt; JSON lines indexed in your logging stack. If you cannot answer "what did agent X do between 14:00 and 14:15 yesterday" in under thirty seconds, your logs are not yet operational.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linked to the model trace.&lt;/strong&gt; The audit log should reference the prompt, model version, and reasoning trace that produced the tool call. Without this link, post-incident analysis is guesswork.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is also where lightweight observability tooling earns its keep. On the Kubernetes side, I maintain &lt;a href="https://github.com/opscart/opscart-k8s-watcher" rel="noopener noreferrer"&gt;&lt;code&gt;opscart-k8s-watcher&lt;/code&gt;&lt;/a&gt;, a small open-source watcher that surfaces resource and policy drift across clusters. The same architectural pattern — a sidecar or controller that observes, classifies, and reports without sitting in the request path — applies cleanly to agent runtimes. You want a parallel observation channel that cannot be silenced by a misbehaving agent. (The deeper "why" of this matters: Kubernetes itself loses critical evidence within seconds of a failure, as I documented in &lt;a href="https://opscart.com/kubernetes-evidence-horizons-h2-h3-h4-h5/" rel="noopener noreferrer"&gt;&lt;em&gt;Beyond the 90-Second Gap: Four More Ways Kubernetes Destroys Your Evidence&lt;/em&gt;&lt;/a&gt;. Agents will erase their own evidence even faster if you let them.)&lt;/p&gt;

&lt;p&gt;If you take only one control from this list, take this one. Audit logs you wrote &lt;em&gt;before&lt;/em&gt; the incident are worth ten times more than dashboards you built &lt;em&gt;after&lt;/em&gt; it.&lt;/p&gt;
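&lt;p&gt;The hash-chain idea from the immutability bullet is small enough to sketch in full. Each record embeds the hash of its predecessor, so any after-the-fact edit breaks verification. This is an in-memory illustration, assuming the real store is write-once object storage or append-only log infrastructure:&lt;/p&gt;

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only, hash-chained tool-call records (control #4)."""
    def __init__(self):
        self.records = []
        self._prev_hash = "0" * 64  # genesis value

    def record(self, agent_id, tool, inputs, outcome, trace_id):
        entry = {
            "ts": time.time(),
            "agent": agent_id,
            "tool": tool,
            "inputs": inputs,
            "outcome": outcome,
            "trace_id": trace_id,   # link back to the model trace
            "prev": self._prev_hash,
        }
        digest = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = digest
        self._prev_hash = digest
        self.records.append(entry)
        return entry

    def verify(self):
        """Recompute the chain; any tampered record changes its hash."""
        prev = "0" * 64
        for entry in self.records:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev"] != prev:
                return False
            recomputed = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if recomputed != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

&lt;p&gt;The chain does not stop tampering; it makes tampering detectable, which is what an auditor actually asks for.&lt;/p&gt;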




&lt;h2&gt;
  
  
  5. Rate Limits and Circuit Breakers: Bounded Blast Radius
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Contains runaway loops before they become runaway bills or runaway incidents.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem.&lt;/strong&gt; An agent in a loop is an outage waiting to happen. Multi-turn agent frameworks have no natural stopping condition; if a tool returns an error, the most plausible next turn is "try again, perhaps with a small variation." That is a loop, and at the speed of a tool API, loops compound fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The failure mode.&lt;/strong&gt; In one of our internal AKS proof-of-concepts, an operations agent attempting to diagnose a failing service began hallucinating service names — confidently generating queries, log fetches, and remediation attempts against services that did not exist. Each failed lookup fed back into its context, each retry produced a longer prompt, and each longer prompt surfaced another fabricated entity to chase. The pattern compounded. We caught it in non-production. In a production environment without a hard rate limit, that same loop would have generated tens of thousands of API calls per cluster per day — and we run more than a dozen clusters. The shape of this failure is now well documented in the FinOps community as well; the &lt;a href="https://www.finops.org/framework/capabilities/anomaly-management/" rel="noopener noreferrer"&gt;FinOps Foundation's anomaly management capability&lt;/a&gt; explicitly calls out cost spikes from "uncontrolled dashboard refresh, or an orchestration loop" as a leading cause of unbudgeted cloud spend. Agents are orchestration loops with a creative streak.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The control.&lt;/strong&gt; Rate limits applied at multiple layers, every one of them paired with a circuit breaker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-agent-instance limits.&lt;/strong&gt; Total tool calls per minute, hour, day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-tool limits.&lt;/strong&gt; A &lt;code&gt;send_email&lt;/code&gt; tool needs a far tighter ceiling than a &lt;code&gt;read_dashboard&lt;/code&gt; tool. Match the limit to the side-effect class.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-tenant limits.&lt;/strong&gt; In multi-tenant agent platforms, one tenant's runaway agent must not exhaust capacity for the rest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-based limits.&lt;/strong&gt; Token spend, downstream API spend, compute spend. A dollar-ceiling per agent per day is one of the cheapest controls you can implement and one of the most reassuring to finance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mutation limits.&lt;/strong&gt; Maximum scaling actions per hour, maximum cluster mutations per session, maximum pods created. Any tool that changes infrastructure state needs its own bounded budget, separate from read-only call budgets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a circuit breaker trips, fail fast. Return a structured error the agent can reason about, emit a high-priority alert, and stop. No silent retries. No "log and continue." The breaker tripping is signal — treat it as such.&lt;/p&gt;
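&lt;p&gt;One way to wire a per-tool ceiling to a breaker, sketched with hypothetical names. The budget model is deliberately simple: each window starts with a fixed call budget, and exhausting it trips the breaker, which then stays tripped, because resetting a breaker should be a human decision, not a timer:&lt;/p&gt;

```python
import time

class BreakeredLimiter:
    """Fixed-window call budget paired with a sticky circuit breaker (control #5)."""
    def __init__(self, limit_per_window, window_seconds):
        self.limit = limit_per_window
        self.window_seconds = window_seconds
        self._window_id = None
        self._remaining = limit_per_window
        self.tripped = False

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        wid = int(now // self.window_seconds)
        if wid != self._window_id:      # new window: reset the budget
            self._window_id = wid
            self._remaining = self.limit
        if self.tripped:
            return False                # fail fast: no silent retries
        if self._remaining == 0:
            self.tripped = True         # the trip is the alerting signal
            return False
        self._remaining -= 1
        return True
```

&lt;p&gt;Note that the budget resets each window but the trip does not. That asymmetry encodes the "breaker tripping is signal" rule from the paragraph above.&lt;/p&gt;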

&lt;p&gt;&lt;strong&gt;A worked example.&lt;/strong&gt; Imagine an operations agent given the standard toolbox — &lt;code&gt;kubectl&lt;/code&gt;, Prometheus query, Azure Monitor API, scale-deployment, Istio traffic shift — and turned loose across ten production clusters. Now imagine it enters a diagnostic loop firing once every five seconds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Per cluster&lt;/th&gt;
&lt;th&gt;Per fleet (×10)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool calls per minute&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per hour&lt;/td&gt;
&lt;td&gt;720&lt;/td&gt;
&lt;td&gt;7,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per day&lt;/td&gt;
&lt;td&gt;~17,000&lt;/td&gt;
&lt;td&gt;~170,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
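&lt;p&gt;The table's figures follow directly from the stated cadence of one probe every five seconds. As a sanity check:&lt;/p&gt;

```python
# One diagnostic probe every 5 seconds, across a 10-cluster fleet.
calls_per_minute = 60 // 5            # 12
per_hour = calls_per_minute * 60      # 720
per_day = per_hour * 24               # 17,280; the table rounds to ~17,000
clusters = 10
fleet_per_day = per_day * clusters    # 172,800; rounds to ~170,000
```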

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52q3dgt2s03gdpc2o415.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52q3dgt2s03gdpc2o415.jpg" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3. How a small recursive agent loop compounds into runaway cloud cost when no rate ceiling is in place.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is before second-order effects: telemetry volume rising, log ingestion exploding, the cluster autoscaler adding nodes to absorb the agent's own load, traffic mirroring spinning up because the agent decided it needed more data. Runaway agent diagnostics can produce hundred-megabyte artifact dumps per service per cycle — multiply that by a multi-hundred-service fleet on an hourly cadence and you reach terabytes of egress per day in the bad case. None of this requires the agent to be malicious. It only requires the absence of a ceiling.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Rollback Strategies: Programmatic, Not Heroic
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Replaces 2 a.m. heroics with a tested, callable revert path.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem.&lt;/strong&gt; Every agent deployment will eventually need to be rolled back. This will not happen at a convenient time. The question is whether your rollback path is a documented, tested, programmatic procedure — or a 2 a.m. heroic effort by whoever is on call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The failure mode.&lt;/strong&gt; Without programmatic rollback, an agent change that worsens production behavior — a prompt update that makes the agent more aggressive, a new tool grant that turns out to be too broad, a model upgrade that regresses on a critical task — has no fast undo. Teams end up redeploying old container images by hand, or worse, manually reverting configuration in a live environment under pressure. Both produce more incidents than they resolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The control.&lt;/strong&gt; The minimum viable rollback story for an agent system has four parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Versioned agent configuration.&lt;/strong&gt; Prompts, tool allowlists, model selection, and runtime parameters are all configuration. They live in source control, are deployed as artifacts, and can be reverted by version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioned tool definitions.&lt;/strong&gt; A tool's behavior change is a deployment event, with the same rollback semantics as application code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic shifting.&lt;/strong&gt; New agent versions go to a small percentage of traffic first. The rollback is a percentage change, not a redeploy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Programmatic deployment hooks.&lt;/strong&gt; The pipeline that deploys agents must expose APIs for impact analysis, deployment initiation, status polling, and rollback. Without these, every rollback is manual, and manual rollbacks are slow rollbacks.&lt;/li&gt;
&lt;/ol&gt;
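&lt;p&gt;Requirement #1 can be sketched in a few lines. This is a hypothetical in-memory shape, assuming the real backing store is source control plus a deploy pipeline; the property worth noticing is that rollback is a pointer flip over an append-only history, not a redeploy:&lt;/p&gt;

```python
class AgentConfigStore:
    """Agent configuration as versioned, revertible artifacts (rollback part 1)."""
    def __init__(self):
        self._versions = []        # append-only history
        self.active = None         # index of the live version

    def deploy(self, prompt, allowed_tools, model):
        self._versions.append({
            "prompt": prompt,
            "allowed_tools": list(allowed_tools),
            "model": model,
        })
        self.active = len(self._versions) - 1
        return self.active

    def rollback(self, to_version=None):
        """Rollback is an API call: flip the active pointer and return that config."""
        target = self.active - 1 if to_version is None else to_version
        if target not in range(len(self._versions)):
            raise ValueError("no such version")
        self.active = target
        return self._versions[target]
```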

&lt;p&gt;&lt;strong&gt;An example.&lt;/strong&gt; In Kubernetes-native delivery, this control is well-trodden ground. Argo CD and Argo Rollouts expose pause, promote, and abort as API operations. Flagger drives progressive delivery against service-mesh traffic weights. Even raw &lt;code&gt;kubectl rollout undo&lt;/code&gt; is, at the mechanical level, a callable operation against the API server — every revision of every Deployment is recoverable by version. The hard part is not the technology; it is insisting that agent deployments use the same primitives as application deployments. &lt;strong&gt;Rollback must be an API call, not a runbook.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your only rollback procedure is "redeploy the previous container image and hope," you do not have a rollback strategy. You have a coping mechanism.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Closing the loop on the opening case.&lt;/strong&gt; The PocketOS deletion — where an AI coding agent destroyed a production database and its backups in nine seconds — was, at its core, a rollback failure as much as it was an authorization failure. Volume-level backups stored &lt;em&gt;inside&lt;/em&gt; the volume that got deleted. No traffic-shifted canary. No programmatic revert path. Once the destructive call succeeded, there was nowhere to roll back &lt;em&gt;to&lt;/em&gt;. Controls #2 (authorization), #3 (allowlists), #6 (rollback), and #7 (deterministic fallback for destructive operations) would each, independently, have prevented the outage. Defense in depth is not an aesthetic preference — it is what survives the day a single layer fails.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  7. Deterministic Fallbacks: A Safe Path When the Agent Cannot Be Trusted
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Keeps the feature degraded but functional when the agent cannot, or should not, act.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The final control is the one most teams skip, and the one I now consider non-negotiable: &lt;strong&gt;every agent must have a deterministic fallback path&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There is a structural reason this matters. Multi-turn agent frameworks like AutoGen do not have a natural stopping condition — by default, the conversation between the assistant and the user-proxy continues until something explicitly terminates it (a configured maximum, a &lt;code&gt;TERMINATE&lt;/code&gt; keyword, a callback returning true, or a human-input requirement). Continuation is the default; stopping has to be designed in. In an AutoGen-based experiment of ours, an agent that had successfully resolved a real Docker container alert kept the loop going by inventing fictional alerts to handle next — fake &lt;code&gt;CrashLoopBackOff&lt;/code&gt; and &lt;code&gt;OOMKilled&lt;/code&gt; events for containers that did not exist. The model was not malfunctioning. It was doing exactly what was asked of it: produce a plausible next turn. With no real work remaining, the most plausible next turn was a fabricated alert. The fix was a &lt;code&gt;max_consecutive_auto_reply&lt;/code&gt; ceiling and an explicit handoff condition. &lt;em&gt;That handoff condition is the deterministic fallback.&lt;/em&gt; When the agent has no more real work, the safe behavior is to stop and return control — not to manufacture more.&lt;/p&gt;
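&lt;p&gt;The fix described above, a turn ceiling plus an explicit handoff, is framework-independent and fits in a dozen lines. The sketch below is not AutoGen's API; &lt;code&gt;next_turn&lt;/code&gt; stands in for the model call and the names are hypothetical. What matters is that continuation is earned by real pending work, and stopping is the default:&lt;/p&gt;

```python
def run_agent_loop(next_turn, pending_alerts, max_consecutive_auto_reply=8):
    """Bounded agent loop: hand off when real work runs out or the ceiling is hit."""
    transcript = []
    for _ in range(max_consecutive_auto_reply):
        if len(pending_alerts) == 0:
            # Deterministic stop: return control instead of fabricating an alert.
            transcript.append("HANDOFF: no real work remaining")
            return transcript
        alert = pending_alerts.pop(0)
        transcript.append(next_turn(alert))
    transcript.append("HANDOFF: turn ceiling reached")
    return transcript
```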

&lt;p&gt;A deterministic fallback is a non-AI code path that handles the agent's responsibility when the agent cannot, should not, or has not been authorized to act. It activates when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent's confidence falls below a threshold.&lt;/li&gt;
&lt;li&gt;A circuit breaker has tripped.&lt;/li&gt;
&lt;li&gt;The model provider is degraded or unavailable.&lt;/li&gt;
&lt;li&gt;The input matches a high-risk pattern that policy says must be handled by deterministic logic.&lt;/li&gt;
&lt;li&gt;The audit log shows the agent has exceeded a per-session error rate.&lt;/li&gt;
&lt;li&gt;The agent has run out of real, externally sourced work to do.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fallback does not need to be smart. It needs to be &lt;strong&gt;safe and predictable&lt;/strong&gt;. For a customer support agent, the fallback might be "route the ticket to a human queue with the conversation context attached." For a deployment agent, the fallback is "do nothing; page the on-call." For a reconciliation agent, the fallback is "hold the transaction for manual review." For an operations agent that has just resolved an incident, the fallback is "exit the loop."&lt;/p&gt;
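&lt;p&gt;Wired together, the trigger list and the fallback behaviors above reduce to a small dispatch gate. A minimal sketch, with hypothetical names; the trigger flags are supplied by the surrounding platform (breaker state, provider health, policy match), not computed by the agent itself:&lt;/p&gt;

```python
def handle(request, agent_path, fallback_path,
           breaker_tripped=False, provider_degraded=False,
           high_risk_match=False, low_confidence=False):
    """The agent is the optimistic path; the deterministic handler is the pessimistic one."""
    must_fall_back = any([
        breaker_tripped, provider_degraded, high_risk_match, low_confidence,
    ])
    if must_fall_back:
        return fallback_path(request)   # safe and predictable, not smart
    try:
        return agent_path(request)
    except Exception:
        # Agent errors degrade to the deterministic path instead of failing the feature.
        return fallback_path(request)
```

&lt;p&gt;The &lt;code&gt;except&lt;/code&gt; clause is the "degraded but functional" guarantee: an agent outage routes to the human queue rather than taking the feature down.&lt;/p&gt;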

&lt;p&gt;This is the same operational philosophy I have been exploring in my research on operational memory architectures for distributed systems — captured both in the open-source &lt;a href="https://github.com/opscart/k8s-causal-memory" rel="noopener noreferrer"&gt;&lt;code&gt;k8s-causal-memory&lt;/code&gt;&lt;/a&gt; project and in the OpsCart write-up &lt;a href="https://opscart.com/when-kubernetes-forgets-the-90-second-evidence-gap/" rel="noopener noreferrer"&gt;&lt;em&gt;When Kubernetes Forgets: The 90-Second Evidence Gap&lt;/em&gt;&lt;/a&gt; — where deterministic memory carries the load when probabilistic components fail. The agent is the optimistic path. The fallback is the pessimistic path. Production systems need both.&lt;/p&gt;

&lt;p&gt;Without a deterministic fallback, your incident response plan during an agent outage is "the feature is down." With one, it is "the feature is degraded but functional." That distinction is the difference between a postmortem and a press release.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Use This Checklist
&lt;/h2&gt;

&lt;p&gt;Print it. Put it on the wall next to whatever ticket board your platform team uses. Before any agent ships to production, walk the seven controls and ask, for each one: &lt;em&gt;can I demonstrate this is wired up, or am I hoping?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno1jaalzh4wetw69l4e0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno1jaalzh4wetw69l4e0.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 4. The seven-control checklist, with the demonstrable artifact required to tick each box.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;th&gt;Demonstrable Artifact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Identity&lt;/td&gt;
&lt;td&gt;Workload identity binding, no static keys in env&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Authorization&lt;/td&gt;
&lt;td&gt;Capability-based grants, deny-by-default policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Tool allowlist&lt;/td&gt;
&lt;td&gt;Versioned registry with schemas and owners&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Audit logs&lt;/td&gt;
&lt;td&gt;Immutable, structured, queryable, model-linked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Rate limits&lt;/td&gt;
&lt;td&gt;Per-agent, per-tool, per-tenant, cost-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Rollback&lt;/td&gt;
&lt;td&gt;Programmatic deployment APIs, versioned config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Deterministic fallback&lt;/td&gt;
&lt;td&gt;Non-AI code path with defined trigger conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
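&lt;p&gt;"Demonstrable artifact, or it doesn't ship" can itself be enforced in code. Here is a minimal, illustrative Python sketch of such a release gate — the control names come from the table above; how you record evidence is an assumption:&lt;/p&gt;

```python
# Illustrative sketch of the checklist as a release gate: each control in
# the table maps to a recorded artifact, and the gate fails on hope. The
# control names come from the checklist; the evidence store is assumed.

CONTROLS = [
    "identity", "authorization", "tool_allowlist", "audit_logs",
    "rate_limits", "rollback", "deterministic_fallback",
]

def release_gate(artifacts):
    """Return the controls that lack a demonstrable artifact."""
    return [c for c in CONTROLS if not artifacts.get(c)]

# Hypothetical agent: six controls wired up, the fallback still "planned".
submitted = {c: f"link-to-{c}-evidence" for c in CONTROLS}
submitted["deterministic_fallback"] = None   # hoping, not demonstrating
missing = release_gate(submitted)            # blocks the ship
```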




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;None of these controls are exotic. Every one is a direct application of patterns the platform engineering community has been refining for a decade — workload identity, RBAC, admission control, structured logging, rate limiting, progressive delivery, graceful degradation. The novelty is not the controls. The novelty is insisting we apply them to AI agents &lt;em&gt;before&lt;/em&gt; the first incident, with the same rigor we apply to any other production workload.&lt;/p&gt;

&lt;p&gt;Agents are not magic — they are API clients with autonomy. And autonomy without control is just another word for risk.&lt;/p&gt;

&lt;p&gt;The teams that succeed with AI agents will not be the ones with the most advanced models. They will be the ones with the strongest operational discipline around them. Build the envelope first. Then let the agent inside it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Quantum Computing Will Break Your Kubernetes Clusters — Here's When and What To Do Now</title>
      <dc:creator>Shamsher Khan (Shamz)</dc:creator>
      <pubDate>Tue, 10 Mar 2026 05:34:01 +0000</pubDate>
      <link>https://forem.com/shamsher_khan/quantum-computing-will-break-your-kubernetes-clusters-heres-when-and-what-to-do-now-3koa</link>
      <guid>https://forem.com/shamsher_khan/quantum-computing-will-break-your-kubernetes-clusters-heres-when-and-what-to-do-now-3koa</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Three layers break now (Secrets/mTLS, etcd, API Server TLS). Three are architectural foresight for 2030+ (runtime, scheduler, observability). One is already being fixed in Kubernetes v1.33+. Here's what you can actually verify in your clusters today.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I manage 8+ AKS clusters running 500+ cores for Fortune 500 pharmaceutical clients. When I started auditing our quantum exposure last year, I expected the problem to be distant and theoretical. It isn't. Two of the six Kubernetes layers I'm about to walk through require action this year — not in 2030.&lt;/p&gt;

&lt;p&gt;This isn't hype. IBM's Quantum Readiness Index (Dec 2025, 750 organisations, 28 countries) puts the average global quantum readiness score at &lt;strong&gt;28/100&lt;/strong&gt;. Only 5% of enterprises have any formal quantum-transition plan (arXiv:2509.01731). Meanwhile, NIST finalised post-quantum cryptography standards in August 2024. The clock is running whether your team is watching it or not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Wave Framework — Not a Single "Quantum Arrives" Moment
&lt;/h2&gt;

&lt;p&gt;The most important thing to understand is that quantum impact on Kubernetes infrastructure doesn't arrive as one event. It unfolds in three waves:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Wave&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Timeline&lt;/th&gt;
&lt;th&gt;Your Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Wave 1&lt;/td&gt;
&lt;td&gt;Security Migration&lt;/td&gt;
&lt;td&gt;Now → 2028&lt;/td&gt;
&lt;td&gt;Audit crypto, patch etcd gap, verify API server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wave 2&lt;/td&gt;
&lt;td&gt;Hybrid Optimisation&lt;/td&gt;
&lt;td&gt;2026 → 2030&lt;/td&gt;
&lt;td&gt;Design QPU-aware scheduling, hybrid workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wave 3&lt;/td&gt;
&lt;td&gt;Quantum-Native Ops&lt;/td&gt;
&lt;td&gt;2030+&lt;/td&gt;
&lt;td&gt;QPUs as k8s resources, new observability primitives&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Waves 2–3 are architectural foresight. Wave 1 is a sprint ticket.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wave 1: The Three Security Layers — Act Now
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Secrets &amp;amp; mTLS — Harvest Now, Decrypt Later
&lt;/h3&gt;

&lt;p&gt;Your Vault PKI, cert-manager-issued X.509 certificates, and Istio mTLS chains all rely on RSA and elliptic curve cryptography. Shor's algorithm — running on a fault-tolerant quantum computer — solves these in polynomial time.&lt;/p&gt;

&lt;p&gt;The threat that's active &lt;strong&gt;today&lt;/strong&gt; is called "harvest now, decrypt later." Adversaries record your encrypted control-plane traffic now and decrypt it when quantum hardware matures. For pharmaceutical data with 10–20 year regulatory lifetimes, that window is already open.&lt;/p&gt;

&lt;p&gt;The Global Risk Institute puts an &lt;strong&gt;11–31% probability on RSA being broken by 2030&lt;/strong&gt;. That's not certainty — but it's not theoretical either.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What NIST has already finalised (August 2024):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FIPS 203 — ML-KEM (formerly CRYSTALS-Kyber) — key encapsulation&lt;/li&gt;
&lt;li&gt;FIPS 204 — ML-DSA (formerly CRYSTALS-Dilithium) — digital signatures
&lt;/li&gt;
&lt;li&gt;FIPS 205 — SLH-DSA (formerly SPHINCS+) — hash-based signatures&lt;/li&gt;
&lt;li&gt;HQC — selected March 2025 as a backup algorithm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS announced it is removing CRYSTALS-Kyber from all endpoints in 2026 in favour of ML-KEM. The migration timeline is set externally — not by your team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inventory your cipher suites across all clusters&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="s1"&gt;'.items[].spec.containers[].env[] | select(.name | test("TLS|CIPHER|CRYPTO"))'&lt;/span&gt;

&lt;span class="c"&gt;# Check cert-manager issuer algorithms&lt;/span&gt;
kubectl get clusterissuers &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A5&lt;/span&gt; &lt;span class="s1"&gt;'privateKey'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. etcd — The PQC Gap Nobody Is Writing About
&lt;/h3&gt;

&lt;p&gt;This is the finding I haven't seen documented anywhere else. Here's the situation:&lt;/p&gt;

&lt;p&gt;Kubernetes 1.33+ ships hybrid post-quantum cryptography (&lt;code&gt;X25519MLKEM768&lt;/code&gt;) on the &lt;strong&gt;API server, kubelet, scheduler, and controller-manager&lt;/strong&gt; — enabled by default when running Go 1.24+. That's the good news.&lt;/p&gt;

&lt;p&gt;The problem: &lt;strong&gt;etcd deliberately runs an older Go version for stability.&lt;/strong&gt; Your most critical data store — every Secret, ConfigMap, and pod spec — is still protected in transit by classical key exchange only. The API server speaks ML-KEM hybrid PQC; the store it writes to does not.&lt;/p&gt;

&lt;p&gt;This is architectural, not a configuration error. It's a known trade-off etcd makes. But it means that verifying PQC on the API server alone gives you a false sense of security at the data layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check your etcd Go version — this is the one that matters&lt;/span&gt;
etcd &lt;span class="nt"&gt;--version&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; go
&lt;span class="c"&gt;# If Go &amp;lt; 1.24, your secrets store is classically encrypted&lt;/span&gt;

&lt;span class="c"&gt;# Contrast with API server&lt;/span&gt;
kubectl version &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="s1"&gt;'.serverVersion.goVersion'&lt;/span&gt;
&lt;span class="c"&gt;# Should be go1.24+ for hybrid PQC on the control plane&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are two independently encrypted channels. A green API server check does not mean your etcd is covered.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. API Server TLS — Already Being Fixed, But Watch for Silent Downgrade
&lt;/h3&gt;

&lt;p&gt;The good news first: Kubernetes 1.33+ on Go 1.24 enables &lt;code&gt;X25519MLKEM768&lt;/code&gt; hybrid key exchange by default on the API server. OpenShift 4.20 applies it across the full control plane. This is real and shipping.&lt;/p&gt;

&lt;p&gt;The gotcha: if your &lt;strong&gt;cluster&lt;/strong&gt; runs Go 1.23 but your &lt;strong&gt;kubectl&lt;/strong&gt; binary was built with Go 1.24, the connection silently downgrades to classical &lt;code&gt;X25519&lt;/code&gt; — no error, no warning, no log entry. This happens during version skew windows (common during rolling upgrades on managed clusters).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify both sides — they must match&lt;/span&gt;
kubectl version &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="s1"&gt;'{client: .clientVersion.goVersion, server: .serverVersion.goVersion}'&lt;/span&gt;

&lt;span class="c"&gt;# On managed clusters (AKS/EKS/GKE), the control plane Go version&lt;/span&gt;
&lt;span class="c"&gt;# is managed by the provider — verify before assuming PQC is active&lt;/span&gt;
az aks show &lt;span class="nt"&gt;-g&lt;/span&gt; &amp;lt;rg&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;cluster&amp;gt; &lt;span class="nt"&gt;--query&lt;/span&gt; kubernetesVersion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Managed clusters (AKS, EKS, GKE) set their own upgrade cadences for the Go toolchain independently of upstream Kubernetes releases. Don't assume v1.33 automatically means Go 1.24 on your control plane.&lt;/p&gt;
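&lt;p&gt;If you want to automate the skew check, here is an illustrative Python sketch that parses the &lt;code&gt;goVersion&lt;/code&gt; fields from &lt;code&gt;kubectl version -o json&lt;/code&gt; and flags one-sided ML-KEM support. The classification strings are my own; only the JSON field names come from kubectl's output:&lt;/p&gt;

```python
# Sketch of the skew check above in script form: parse the goVersion fields
# from `kubectl version -o json` and flag any side that cannot negotiate the
# hybrid X25519MLKEM768 group (Go 1.24+). Helper names are illustrative; the
# JSON field names (clientVersion/serverVersion.goVersion) match kubectl.
import json
import re

def go_minor(go_version):
    """Extract (major, minor) from a string like 'go1.24.1'."""
    match = re.match(r"go(\d+)\.(\d+)", go_version)
    if not match:
        raise ValueError(f"unrecognised Go version: {go_version!r}")
    return int(match.group(1)), int(match.group(2))

def pqc_status(version_json):
    """Classify a `kubectl version -o json` document for hybrid PQC."""
    doc = json.loads(version_json)
    client = go_minor(doc["clientVersion"]["goVersion"])
    server = go_minor(doc["serverVersion"]["goVersion"])
    threshold = (1, 24)
    if client >= threshold and server >= threshold:
        return "hybrid PQC possible on both sides"
    if client >= threshold or server >= threshold:
        return "silent downgrade risk: only one side supports ML-KEM"
    return "classical only"

sample = ('{"clientVersion": {"goVersion": "go1.24.1"}, '
          '"serverVersion": {"goVersion": "go1.23.8"}}')
print(pqc_status(sample))   # one-sided support: the downgrade is silent
```

&lt;p&gt;Run it against the real &lt;code&gt;kubectl version -o json&lt;/code&gt; output per cluster and you have the version-skew audit as a script instead of a habit.&lt;/p&gt;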




&lt;h2&gt;
  
  
  Waves 2–3: Architectural Foresight (Not Sprint Tickets)
&lt;/h2&gt;

&lt;p&gt;These layers are conceptual mappings to Kubernetes abstractions — design signals for platform architects, not immediate action items.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Container Runtime — The Pod Model Breaks for Quantum Workloads
&lt;/h3&gt;

&lt;p&gt;Classical containers assume: persistent state, restartable processes, deterministic health probes. Run the same container twice → same result.&lt;/p&gt;

&lt;p&gt;Quantum circuits violate all of this. Measuring a qubit collapses its superposition — the same circuit run twice returns different results &lt;strong&gt;by design&lt;/strong&gt;. Liveness probes have no quantum equivalent. Checkpointing is physically impossible.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Qubernetes project&lt;/strong&gt; (arXiv:2408.01436, Osaka University + Fujitsu, Jul 2024) mapped quantum circuits to Kubernetes CRDs as a &lt;em&gt;parallel model&lt;/em&gt; — not an extension of pods. Key finding: &lt;em&gt;"a quantum service cannot be deployed permanently like a classical counterpart — the circuit must be compiled and sent to the quantum device fresh at runtime."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a 2030+ concern. But if you're designing a platform engineering layer today, knowing this will prevent a painful refactor.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Scheduler / HPA — Two Independent Breaks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Break 1 — Coherence windows:&lt;/strong&gt; Quantum circuits must execute within microsecond-to-millisecond decoherence windows. kube-scheduler optimises against a resource budget. Quantum scheduling optimises against physics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Break 2 — No-cloning theorem:&lt;/strong&gt; HPA works by replicating identical instances horizontally. The no-cloning theorem makes it physically impossible to copy arbitrary quantum state. HPA is fundamentally broken for quantum workloads — not because of software limitations, but because of quantum mechanics.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Qonductor project&lt;/strong&gt; (arXiv:2408.04312, 2024) built the first hybrid quantum-classical Kubernetes scheduler balancing gate fidelity vs. job completion time. IBM Kookaburra (2026, 1,386-qubit multi-chip) will begin testing hybrid scheduling at commercial scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Observability — Measurement Destroys State
&lt;/h3&gt;

&lt;p&gt;The entire CNCF observability stack (Prometheus, OpenTelemetry, Loki, Jaeger) assumes one thing: observing a system doesn't change it. Prometheus scrapes targets indefinitely — nothing changes in the workload.&lt;/p&gt;

&lt;p&gt;Quantum measurement collapses superposition. The act of observation changes the result. Quantum workloads need entirely new primitives: fidelity metrics, shot-count statistical sampling, decoherence rate tracking, per-gate error rates.&lt;/p&gt;

&lt;p&gt;As of 2026, no production quantum observability toolchain exists. This is a 10+ year horizon — worth knowing so it doesn't surprise your platform architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hardware Timeline That Makes This Concrete
&lt;/h2&gt;

&lt;p&gt;This isn't speculative. Here's where the hardware actually is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2025&lt;/strong&gt; — Quantinuum Helios: 98 physical / 48 logical qubits, 99.9%+ fidelity, available commercially (cloud + on-premises). Nvidia GB200 integration via NVQLink. &lt;em&gt;(Quantinuum press release, Nov 2025)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2026&lt;/strong&gt; — IBM Kookaburra: 1,386-qubit multi-chip processor. Quantum advantage for specific workloads targeted. &lt;em&gt;(IBM Quantum roadmap)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2026&lt;/strong&gt; — Honeywell CEO at Citi Global Industrial Tech Conference (Feb 2026): &lt;em&gt;"12–36 months is the window"&lt;/em&gt; for commercial quantum impact. Banking and pharma named as primary markets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2029&lt;/strong&gt; — IBM Starling: first fault-tolerant QC, 200 logical qubits, 100M gates. &lt;em&gt;(IBM Quantum blog)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2033&lt;/strong&gt; — IBM Blue Jay: 2,000 logical qubits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The QCaaS market is projected at $18–74B by 2032–2033 across multiple analyst firms at 34–43% CAGR (Credence Research Dec 2025, SNS Insider Oct 2025). This is already a commercial cloud infrastructure category.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your Immediate Checklist
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do this week:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;code&gt;etcd --version | grep -i go&lt;/code&gt; on each cluster — document which are on Go &amp;lt; 1.24&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;kubectl version -o json | jq '.serverVersion.goVersion'&lt;/code&gt; — verify API server Go version&lt;/li&gt;
&lt;li&gt;Inventory all cert-manager issuers — identify which are still RSA/ECDSA only&lt;/li&gt;
&lt;li&gt;Check if Vault is configured with crypto-agility in mind (can you swap algorithms without re-issuing all secrets?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Design consideration for next platform iteration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build crypto-agility into your PKI from the start — algorithm swap without full re-issue&lt;/li&gt;
&lt;li&gt;Start tracking NIST FIPS 203/204/205 compliance in your security posture documentation&lt;/li&gt;
&lt;li&gt;Add quantum readiness to your next architecture review template&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Full Picture
&lt;/h2&gt;

&lt;p&gt;This is the quantum section from a larger article on the next evolution of Kubernetes infrastructure — covering Platform Engineering, Autonomous Infrastructure (k8sgpt, Robusta), and the full post-quantum migration stack with production war stories from managing AKS clusters for pharmaceutical clients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://opscart.com/kubernetes-platform-engineering-autonomous-infrastructure/" rel="noopener noreferrer"&gt;Read the full article on OpsCart.com →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The visual diagram of all six layers — with wave labels, source citations, and the IBM readiness data — is embedded in the full article.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Sources: arXiv:2408.01436 (Qubernetes, Jul 2024) · arXiv:2408.04312 (Qonductor, 2024) · arXiv:2509.01731 (Enterprise PQC Readiness, Sep 2025) · NIST FIPS 203/204/205 (Aug 2024) · IBM Quantum Readiness Index (Dec 2025) · IBM Enterprise in 2030 Study (Jan 2026) · Quantinuum Helios press release (Nov 2025) · IBM Quantum roadmap · Global Risk Institute · Credence Research (Dec 2025) · SNS Insider (Oct 2025)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>security</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Your Plumber Has More Job Security Than You: AI Threat Levels for Engineers vs. Blue-Collar Workers</title>
      <dc:creator>Shamsher Khan (Shamz)</dc:creator>
      <pubDate>Tue, 17 Feb 2026 11:00:00 +0000</pubDate>
      <link>https://forem.com/shamsher_khan/your-plumber-has-more-job-security-than-you-ai-threat-levels-for-engineers-vs-blue-collar-workers-237l</link>
      <guid>https://forem.com/shamsher_khan/your-plumber-has-more-job-security-than-you-ai-threat-levels-for-engineers-vs-blue-collar-workers-237l</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a condensed, engineer-focused version of my full research article &lt;a href="https://opscart.com/ai-automation-impact-on-jobs/" rel="noopener noreferrer"&gt;The Great Inversion&lt;/a&gt; on OpsCart.com.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;For decades, we assumed automation would hollow out physical labor first — factories, warehouses, construction sites. White-collar knowledge work was supposed to be the safe zone.&lt;/p&gt;

&lt;p&gt;That assumption is quietly breaking.&lt;/p&gt;

&lt;p&gt;As AI systems move from narrow tools to cognitive collaborators, the jobs most exposed aren’t the ones that lift heavy objects — they’re the ones that manipulate information. Engineers, analysts, testers, and writers now sit closer to the automation frontier than many skilled trades. The charts below don’t argue that these roles disappear — they show where AI pressure is already concentrating, and why the risk curve is inverting faster than most people expect.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Uncomfortable Chart
&lt;/h2&gt;

&lt;p&gt;Before I say anything else, look at this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkbumpu1yzfv4mks7ezs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkbumpu1yzfv4mks7ezs.jpg" alt="AI Threat Level by Role — White-Collar vs Blue-Collar" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: OpsCart.com analysis based on Brookings, OpenAI/UPenn, Goldman Sachs, Bloomberg task-exposure data (2023–2025)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QA Engineer: 82% task exposure. Plumber: 12%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Read that again. A QA engineer — someone who went to university, learned frameworks, mastered testing theory — has &lt;em&gt;nearly seven times&lt;/em&gt; the AI automation exposure of someone who fixes pipes.&lt;/p&gt;

&lt;p&gt;This isn't a thought experiment. It's what the data says. And it flips a century of assumptions upside down.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Great Inversion, Explained in 60 Seconds
&lt;/h2&gt;

&lt;p&gt;For decades, everyone assumed automation would eat blue-collar work first. Robots on factory floors. Self-driving trucks. Warehouse drones. Then, eventually — &lt;em&gt;maybe&lt;/em&gt; — software would come for the knowledge workers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The opposite is happening.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Generative AI targets &lt;em&gt;cognitive screen work&lt;/em&gt; — tasks performed through keyboards, monitors, and code editors. Everything that lives inside a terminal or IDE is in AI's native medium. Meanwhile, physical dexterity — unclogging drains, pulling cable through conduit, diagnosing a weird HVAC noise — remains a frontier challenge for robotics.&lt;/p&gt;

&lt;p&gt;The Brookings Institution confirmed this: AI exposure is &lt;em&gt;"exactly opposite"&lt;/em&gt; to prior automation waves. Goldman Sachs estimates 300 million full-time jobs globally are exposed. The WEF projects 92 million jobs displaced by 2030.&lt;/p&gt;

&lt;p&gt;The plumber is fine. The senior DevOps engineer writing Terraform all day? That's a different conversation.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why YOUR Job Is in the Strike Zone
&lt;/h2&gt;

&lt;p&gt;Here's the technical reason: LLMs are pattern-completion engines trained on code, docs, configs, reports — the exact output of your daily work. Every task that decomposes into &lt;code&gt;read input → apply pattern → generate output&lt;/code&gt; is within current AI capability.&lt;/p&gt;

&lt;p&gt;Think about your average week:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# What AI can already do for a Software Engineer:
- Generate code from specs and tickets               ⚙️ AI
- Write unit and integration tests                   ⚙️ AI
- Code review for patterns and bugs                  ⚙️ AI
- Refactor and optimize existing code                ⚙️ AI
- Generate API boilerplate                           ⚙️ AI
- Write documentation, READMEs, PR descriptions      ⚙️ AI
- Dependency updates and migration scripts           ⚙️ AI

# What still needs YOU:
- Architecture decisions with real trade-offs        🧠 Human
- Novel algorithm design for edge cases              🧠 Human
- Cross-team negotiation ("your API breaks our SLA") 🧠 Human
- Production incident judgment at 3 AM               🧠 Human
- Technical debt prioritization                      🧠 Human
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;That's roughly 78% AI-amenable, 22% human-required.&lt;/strong&gt; (Based on Bloomberg task analysis, 2025; OpenAI/UPenn occupation exposure models, 2023.)&lt;/p&gt;

&lt;p&gt;And it gets worse for some roles.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Role-by-Role Breakdown
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb28664ykbtd75mfrxzs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb28664ykbtd75mfrxzs.jpg" alt="AI vs Human Task Breakdown by Role" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: OpsCart.com analysis based on Bloomberg, OpenAI/UPenn, enterprise adoption patterns&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's where every major engineering role lands:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;AI-Amenable&lt;/th&gt;
&lt;th&gt;Human-Required&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;QA / Tester&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~82%&lt;/td&gt;
&lt;td&gt;~18%&lt;/td&gt;
&lt;td&gt;Testing is fundamentally pattern-matching. AI generates test cases, runs regression, drafts bug reports. What survives: exploratory testing intuition, risk-based strategy, compliance judgment.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Software Engineer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~78%&lt;/td&gt;
&lt;td&gt;~22%&lt;/td&gt;
&lt;td&gt;Code generation, reviews, refactoring, docs — all automatable. What survives: architecture decisions, novel design, cross-team negotiation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Business Analyst&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~75%&lt;/td&gt;
&lt;td&gt;~25%&lt;/td&gt;
&lt;td&gt;Requirements docs, reports, meeting notes, process mapping — AI-drafted. What survives: stakeholder politics, ambiguity resolution, change management.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DevOps Engineer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~70%&lt;/td&gt;
&lt;td&gt;~30%&lt;/td&gt;
&lt;td&gt;Terraform, pipeline YAML, runbooks, log analysis — AI-generated. What survives: architecture trade-offs, incident response, compliance judgment, &lt;em&gt;physical DC work&lt;/em&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security Engineer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~68%&lt;/td&gt;
&lt;td&gt;~32%&lt;/td&gt;
&lt;td&gt;Vuln scanning, log correlation, policy docs, compliance reports — AI-processed. What survives: threat modeling, adversarial thinking, incident leadership, zero-day judgment.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network Engineer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~65%&lt;/td&gt;
&lt;td&gt;~35%&lt;/td&gt;
&lt;td&gt;Config generation, monitoring, capacity planning — AI-handled. What survives: &lt;em&gt;physical cabling&lt;/em&gt;, on-site troubleshooting, vendor negotiations, disaster recovery execution.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice the pattern? &lt;strong&gt;The more physical your role, the more it resists automation.&lt;/strong&gt; Network engineers have the lowest exposure because you can't AI your way through pulling fiber or swapping a failed SFP at 2 AM.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; These are &lt;em&gt;task exposure&lt;/em&gt; percentages — not job elimination rates. 80% task automation ≠ 80% job loss. But it does mean headcount compression, junior-role collapse, and fundamentally different job descriptions. (See the &lt;a href="https://opscart.com/ai-automation-impact-on-jobs/" rel="noopener noreferrer"&gt;full methodology notes&lt;/a&gt; on OpsCart.com.)&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Meanwhile, on the Blue-Collar Side...
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;AI Exposure&lt;/th&gt;
&lt;th&gt;Why It's Protected&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nurse / Healthcare Aide&lt;/td&gt;
&lt;td&gt;~10%&lt;/td&gt;
&lt;td&gt;Physical care, human empathy, unpredictable environments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plumber&lt;/td&gt;
&lt;td&gt;~12%&lt;/td&gt;
&lt;td&gt;Every job is different; physical dexterity in confined spaces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Electrician&lt;/td&gt;
&lt;td&gt;~15%&lt;/td&gt;
&lt;td&gt;Code compliance + hands-on in unstructured environments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HVAC Technician&lt;/td&gt;
&lt;td&gt;~18%&lt;/td&gt;
&lt;td&gt;Diagnosis requires physical inspection; repair requires hands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Construction Worker&lt;/td&gt;
&lt;td&gt;~20%&lt;/td&gt;
&lt;td&gt;Unstructured outdoor environments; heavy physical labor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto Mechanic&lt;/td&gt;
&lt;td&gt;~22%&lt;/td&gt;
&lt;td&gt;Diagnostics computerized; repair still fully manual&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is &lt;strong&gt;Moravec's Paradox&lt;/strong&gt; in action: tasks trivial for a five-year-old (grasping, walking, perceiving physical space) remain frontier challenges for robots. Tasks hard for humans (chess, calculation, code generation) are easy for AI.&lt;/p&gt;

&lt;p&gt;Amazon has 750,000 warehouse robots and &lt;em&gt;still&lt;/em&gt; can't automate physical dexterity. Tesla launched its robotaxi in Austin with human safety monitors &lt;em&gt;in the passenger seat&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Your plumber doesn't have that problem.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Your Job Title Looks Like in 10 Years
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53b8euq6n36w180kgsp6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53b8euq6n36w180kgsp6.jpg" alt="IT Job Title Evolution 2025-2045" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: OpsCart.com forecast based on enterprise adoption patterns and WEF Future of Jobs 2025&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The trajectory is consistent across every role: &lt;strong&gt;Execution → Orchestration → Architecture → Autonomy Governance.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Software Engineer&lt;/strong&gt; → AI-Assisted Software Engineer → AI Systems Orchestrator → Human-Machine Integration Architect&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevOps Engineer&lt;/strong&gt; → AI-Ops Engineer → Platform Intelligence Engineer → Infrastructure Autonomy Architect&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QA Engineer&lt;/strong&gt; → AI Test Validation Specialist → Quality Automation Architect → Trust &amp;amp; Verification Engineer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Engineer&lt;/strong&gt; → AI Security Analyst → Threat Intel Orchestrator → Adversarial Systems Architect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You stop writing code. You start governing systems that write code. The question is whether you're steering that transition or being displaced by it.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Career Path Fork
&lt;/h2&gt;

&lt;p&gt;You have two defensible directions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path A — Up (Strategic Orchestration)&lt;/strong&gt;&lt;br&gt;
Governing AI systems, designing architectures, managing risk, making judgment calls under uncertainty. The Stratosphere. Small headcount, high compensation, high abstraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path B — Out (Physical Systems)&lt;/strong&gt;&lt;br&gt;
Maintaining infrastructure, managing hardware fleets, edge deployment, DC operations. The Ground Layer. Larger headcount, moderate compensation, high tactile skill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The indefensible position:&lt;/strong&gt; Staying in the middle — performing routine cognitive screen work. That's the part AI eats first, fastest, and most completely.&lt;/p&gt;


&lt;h2&gt;
  
  
  What to Do Monday Morning
&lt;/h2&gt;
&lt;h3&gt;
  
  
  🛑 STOP
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Writing boilerplate code, configs, or docs by hand&lt;/li&gt;
&lt;li&gt;Running manual test suites you could automate&lt;/li&gt;
&lt;li&gt;Treating AI tools as optional "nice-to-have"&lt;/li&gt;
&lt;li&gt;Assuming your current skill set has a 10-year shelf life&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  ✅ DOUBLE DOWN
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Architecture &amp;amp; system design (the &lt;em&gt;why&lt;/em&gt; behind decisions)&lt;/li&gt;
&lt;li&gt;Cross-team communication &amp;amp; stakeholder translation&lt;/li&gt;
&lt;li&gt;Compliance judgment (GxP, SOC2, HIPAA, GDPR)&lt;/li&gt;
&lt;li&gt;Incident response leadership under pressure&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  📚 LEARN NEXT
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Physically:&lt;/em&gt; Hardware troubleshooting, DC operations, edge deployment&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Architecturally:&lt;/em&gt; AI agent orchestration, model governance, drift detection&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Adversarially:&lt;/em&gt; Threat modeling, red team thinking, security architecture&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  ⚡ DELEGATE TO AI IMMEDIATELY
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;First-draft code, configs, IaC templates&lt;/li&gt;
&lt;li&gt;PR descriptions, commit messages, runbook drafts&lt;/li&gt;
&lt;li&gt;Log analysis, alert correlation, CVE triage&lt;/li&gt;
&lt;li&gt;Meeting summaries, status reports, documentation&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  The Skill Decay Trap
&lt;/h2&gt;

&lt;p&gt;Here's the subtle danger nobody talks about: &lt;strong&gt;the more you delegate to AI, the faster your skills atrophy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I manage 8+ production AKS clusters for a Fortune 500 pharma client. The moment I stop doing &lt;code&gt;kubectl debug&lt;/code&gt; by hand and rely entirely on AI diagnostics, my ability to reason about pod scheduling and resource contention degrades. AI gives me an answer. But when it gives me the &lt;em&gt;wrong&lt;/em&gt; answer at 3 AM during a production outage, I need the expertise to catch it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Skills that DECAY with AI reliance:
  - Syntax fluency (any language)
  - Boilerplate config writing
  - Manual log parsing

# Skills that COMPOUND with AI proliferation:
  - System-level failure analysis ("why did this cascade?")
  - Architecture under constraint ("GxP + cost + performance")
  - Stakeholder translation ("here's why migration takes 6 months")
  - Threat modeling ("what attack surface does this AI create?")
  - Physical infrastructure judgment ("this rack layout fails cooling")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The engineers who thrive in the Inversion will be the ones who use AI for the 70–80% while &lt;em&gt;deliberately practicing&lt;/em&gt; the 20–30% that AI can't touch. Delegate the boilerplate. Protect the judgment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;The question that should keep every engineer awake at night is not &lt;em&gt;"Will AI take my job?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If 75–85% of your task content can be automated today, and 95% within a decade, what is the 5% that justifies your presence — and are you investing in becoming indispensable at it?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question has no comfortable answer. But the engineers who confront it honestly will navigate the Inversion. The rest will discover, too late, that the glacier they were standing on has already melted.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;📖 Read the full 6,700-word research article with all data sources, methodology notes, and the complete role-by-role analysis: &lt;a href="https://opscart.com/ai-automation-impact-on-jobs/" rel="noopener noreferrer"&gt;The Great Inversion on OpsCart.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;💬 Disagree with the numbers? Think I'm overstating the case? Drop a comment — I'll respond with sources.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>career</category>
      <category>devops</category>
      <category>programming</category>
    </item>
    <item>
      <title>You Blocked docker.sock. Your Containers Are Still Not Safe.</title>
      <dc:creator>Shamsher Khan (Shamz)</dc:creator>
      <pubDate>Tue, 03 Feb 2026 02:58:46 +0000</pubDate>
      <link>https://forem.com/shamsher_khan/you-blocked-dockersock-your-containers-are-still-not-safe-4fii</link>
      <guid>https://forem.com/shamsher_khan/you-blocked-dockersock-your-containers-are-still-not-safe-4fii</guid>
      <description>&lt;p&gt;I spent the last two weeks building out a full runtime escape lab — five attack scenarios, automated defense scripts, Falco rules, the works. Scenario 1 (docker.sock mounting) already has its own &lt;a href="https://dzone.com/articles/docker-runtime-escape-docker-sock" rel="noopener noreferrer"&gt;deep dive on DZone&lt;/a&gt;. Everyone knows that one.&lt;/p&gt;

&lt;p&gt;But scenarios 3 and 4 are what actually kept me up at night. Not because they're loud and dramatic. Because they're quiet. They pass the checks your team runs. They look normal in &lt;code&gt;docker inspect&lt;/code&gt;. And they hand an attacker a path to the host.&lt;/p&gt;

&lt;p&gt;This post covers both. One is an audit blind spot. The other is a two-container escalation chain. Neither requires &lt;code&gt;--privileged&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Audit Blind Spot: CAP_SYS_ADMIN
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6g8gfkatk64q1ql99y9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6g8gfkatk64q1ql99y9y.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;br&gt;
                &lt;em&gt;CAP_SYS_ADMIN Section (Audit Gap)&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Here's the thing most Docker security checklists do: they scan for &lt;code&gt;Privileged: true&lt;/code&gt;. If the flag is false, they move on. Green checkmark. Threat neutralized.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt; alone — without &lt;code&gt;--privileged&lt;/code&gt; — gives a container almost everything privileged mode does. Mount filesystems. Manipulate namespaces. In some kernel configurations, escape to the host entirely. And it shows up in the audit as just another capability in a list. Not as a red flag.&lt;/p&gt;

&lt;p&gt;This is what I call the &lt;strong&gt;audit gap&lt;/strong&gt;. It's the single capability that consistently slips through automated scans because scanners are tuned to look for the &lt;strong&gt;Privileged&lt;/strong&gt; boolean, not for what individual capabilities can actually do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it actually looks like&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# This is what your scanner sees
docker inspect my-container | jq '.[] | {Privileged: .HostConfig.Privileged}'
# Output: { "Privileged": false }   &amp;lt;-- Scanner says: all clear

# This is what's actually running
docker inspect my-container | jq '.[] | .HostConfig.CapAdd'
# Output: ["SYS_ADMIN"]             &amp;lt;-- Scanner doesn't flag this
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters in production&lt;/strong&gt;&lt;br&gt;
In my lab (Scenario 3 of Lab 09), I ran a container with only &lt;code&gt;--cap-add=SYS_ADMIN&lt;/code&gt;. No privileged flag. No other dangerous capabilities. From inside that container, I was able to mount the host's &lt;code&gt;/etc&lt;/code&gt;directory and read credential files directly. The container passed every Privileged: false check along the way.&lt;br&gt;
T&lt;br&gt;
his isn't theoretical. FUSE filesystems, certain monitoring agents, and some CI tooling legitimately request &lt;code&gt;SYS_ADMIN&lt;/code&gt;. It's in production configs right now, and most teams have no idea what it enables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three-line audit that actually catches it&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Don't just check Privileged. Check capabilities.
docker ps -q | xargs docker inspect --format \
  '{{.Name}}: Privileged={{.HostConfig.Privileged}} CapAdd={{.HostConfig.CapAdd}}'

# Flag anything with SYS_ADMIN, SYS_PTRACE, or SYS_MODULE
# These three are the ones that cross the line from "useful" to "escape vector"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Three lines. Run it against your production containers right now. If you see &lt;code&gt;SYS_ADMIN&lt;/code&gt; in the output, you have a conversation to have with whoever owns that container.&lt;/p&gt;
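&lt;p&gt;If you'd rather run that audit from a script (say, in CI), the same check is a few lines of Python over parsed &lt;code&gt;docker inspect&lt;/code&gt; output. The function name and sample data below are illustrative, not from the lab repo:&lt;/p&gt;

```python
# The three capabilities that cross the line from "useful" to "escape vector"
DANGEROUS_CAPS = {"SYS_ADMIN", "SYS_PTRACE", "SYS_MODULE"}

def flag_dangerous_caps(inspect_output):
    """Given parsed `docker inspect` output (a list of container dicts),
    return (name, capability) pairs worth a conversation with the owner."""
    findings = []
    for container in inspect_output:
        cap_add = container.get("HostConfig", {}).get("CapAdd") or []
        for cap in cap_add:
            # Entries may appear as "SYS_ADMIN" or "CAP_SYS_ADMIN"; normalize
            if cap.removeprefix("CAP_") in DANGEROUS_CAPS:
                findings.append((container.get("Name", "?"), cap))
    return findings

# Illustrative data: both containers pass a Privileged-only check
sample = [
    {"Name": "/web", "HostConfig": {"Privileged": False, "CapAdd": None}},
    {"Name": "/fuse-agent", "HostConfig": {"Privileged": False, "CapAdd": ["SYS_ADMIN"]}},
]
print(flag_dangerous_caps(sample))  # [('/fuse-agent', 'SYS_ADMIN')]
```

&lt;p&gt;Feed it the output of &lt;code&gt;docker inspect $(docker ps -q)&lt;/code&gt; via &lt;code&gt;json.loads&lt;/code&gt; and fail the pipeline on any finding.&lt;/p&gt;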

&lt;h2&gt;
  
  
  The Escalation Chain: Host Mounts + docker.sock
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65qiibjze1as3a4bcesu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65qiibjze1as3a4bcesu.png" alt=" " width="800" height="443"&gt;&lt;/a&gt;&lt;br&gt;
              &lt;em&gt;Escalation Chain (Two Containers)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This one is more subtle, and it's the pattern that worries me most for real-world environments. It's not a single misconfiguration. It's two containers working together — not by design, but because an attacker can chain them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How the chain works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scenario 4 in the lab demonstrates this with two containers:&lt;br&gt;
&lt;strong&gt;Container A&lt;/strong&gt; — a bind mount that exposes /etc from the host. Legitimate use case: an application that needs to read host configuration. Totally normal setup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d --name app-container \
  -v /etc:/host-etc:ro \
  ubuntu:22.04 sleep infinity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Container B&lt;/strong&gt; — has access to &lt;code&gt;docker.sock&lt;/code&gt;. Also common. CI tools, monitoring agents, Portainer. The socket is there for a reason.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d --name monitor \
  -v /var/run/docker.sock:/var/run/docker.sock \
  ubuntu:22.04 sleep infinity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Neither container alone is the vulnerability. Container A can't create new containers. Container B can't read host files (it doesn't have a bind mount to &lt;code&gt;/etc&lt;/code&gt;). But an attacker who controls Container B can use the Docker API through the socket to &lt;strong&gt;create a new container that mounts the same paths Container A has access to&lt;/strong&gt; — and make it privileged.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From inside Container B (has docker.sock):
docker run --privileged \
  -v /etc:/stolen-etc \
  -v /var/run/docker.sock:/var/run/docker.sock \
  alpine cat /stolen-etc/shadow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chain is: socket access → create privileged container → mount host paths → full credential access. No single container had all the permissions. The attacker assembled them.&lt;/p&gt;
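&lt;p&gt;Worth noting: the attacker doesn't even need the docker CLI inside Container B. The &lt;code&gt;docker run&lt;/code&gt; above boils down to roughly one Engine API create request over the mounted socket, followed by a start request. Sketched as the payload (paths copied from the scenario; the exact request body here is my reconstruction, not from the lab):&lt;/p&gt;

```python
import json

# Roughly the Engine API request that the `docker run` above translates to.
# Any HTTP client that can speak to the mounted unix socket can POST this
# to /containers/create, then POST /containers/<id>/start.
create_request = {
    "Image": "alpine",
    "Cmd": ["cat", "/stolen-etc/shadow"],
    "HostConfig": {
        "Privileged": True,               # the flag neither container had
        "Binds": [
            "/etc:/stolen-etc",           # Container A's host path, re-mounted
            "/var/run/docker.sock:/var/run/docker.sock",  # keep the foothold
        ],
    },
}
print(json.dumps(create_request["HostConfig"], indent=2))
```

&lt;p&gt;That's the point of the chain: the permissions live in the API, not in any one container's config.&lt;/p&gt;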

&lt;h2&gt;
  
  
  Why this is hard to catch
&lt;/h2&gt;

&lt;p&gt;Each container, in isolation, passes a standard audit. Container A has a bind mount — auditors see it, note it as "read-only, accepted." Container B has the socket — it's flagged in some scans, but it's a monitoring tool, so it gets an exception. The combination is what's dangerous, and nobody audits combinations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What detection actually looks like&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The defense script I built for this scenario generates a Falco rule that watches for the pattern specifically — a socket-mounted container spawning a new privileged container with host path mounts. Here's the core logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- rule: Escalation Chain Detected
  desc: &amp;gt;
    Container with docker.sock access created a new privileged
    container that mounts host paths. Likely escalation chain.
  condition: &amp;gt;
    container.image.digest != "" and
    evt.type = container_start and
    container.privileged = true and
    container.mount.dest in ("/etc", "/root", "/var/run")
  output: &amp;gt;
    Escalation chain: socket container spawned privileged mount
    (container=%container.name image=%container.image)
  priority: CRITICAL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches the pattern, not just the individual container. That's the shift in thinking that makes the difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hands-On Lab
&lt;/h2&gt;

&lt;p&gt;All five scenarios — including the two I covered here — are in the open-source lab repository. Each scenario has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;demo.sh&lt;/code&gt; — runs the attack so you can see it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;defense.sh&lt;/code&gt; — generates the detection artifacts (Falco rules, audit scripts)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;validate.sh&lt;/code&gt; — verifies your defenses actually work&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cleanup.sh&lt;/code&gt; — tears everything down cleanly
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/opscart/docker-security-practical-guide
cd labs/09-runtime-escape

# Run Scenario 3 (CAP_SYS_ADMIN audit gap)
cd scenario-3-sys-admin
chmod +x *.sh &amp;amp;&amp;amp; ./demo.sh

# Run Scenario 4 (host mount escalation chain)
cd ../scenario-4-host-mount
chmod +x *.sh &amp;amp;&amp;amp; ./demo.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works on Docker Desktop (macOS/Windows) and Linux. The README has Docker Desktop-specific notes for the parts that behave differently on the VM layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to actually do about it&lt;/strong&gt;&lt;br&gt;
Two things. Both take five minutes.&lt;/p&gt;

&lt;p&gt;First, run the capability audit from above against your environment. Look for &lt;code&gt;SYS_ADMIN&lt;/code&gt;, &lt;code&gt;SYS_PTRACE&lt;/code&gt;, &lt;code&gt;SYS_MODULE&lt;/code&gt;. If you find them, trace back to why they're there. Half the time, nobody remembers.&lt;/p&gt;

&lt;p&gt;Second, audit for the escalation chain pattern: any container with &lt;code&gt;docker.sock&lt;/code&gt; mounted in the same environment as containers with host path bind mounts. If both exist, even if they're unrelated services, the attack surface is there.&lt;/p&gt;
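&lt;p&gt;That combination audit can also be scripted. A minimal Python sketch (hypothetical function name; feed it parsed &lt;code&gt;docker inspect&lt;/code&gt; output for every running container) that pairs socket holders with host-path mounters:&lt;/p&gt;

```python
def find_escalation_pairs(inspect_output):
    """Cross-container audit: pair each container that mounts docker.sock
    with each container that bind-mounts a host path. Neither is a finding
    on its own; together they form the Scenario 4 attack surface."""
    socket_holders, host_mounters = [], []
    for container in inspect_output:
        name = container.get("Name", "?")
        binds = container.get("HostConfig", {}).get("Binds") or []
        for bind in binds:
            src = bind.split(":")[0]  # "/etc:/host-etc:ro" -> "/etc"
            if src == "/var/run/docker.sock":
                socket_holders.append(name)
            elif src.startswith("/"):
                host_mounters.append((name, src))
    # Every (socket container, host-mount container, host path) triple
    return [(s, m, path) for s in socket_holders for (m, path) in host_mounters]

# Illustrative data matching the two containers from the scenario
sample = [
    {"Name": "/monitor", "HostConfig": {"Binds": ["/var/run/docker.sock:/var/run/docker.sock"]}},
    {"Name": "/app-container", "HostConfig": {"Binds": ["/etc:/host-etc:ro"]}},
]
print(find_escalation_pairs(sample))  # [('/monitor', '/app-container', '/etc')]
```

&lt;p&gt;Each triple in the output is one assembled chain an attacker could use, even when the two services are owned by different teams.&lt;/p&gt;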

&lt;p&gt;&lt;strong&gt;Related reading&lt;/strong&gt;&lt;br&gt;
If you want the full picture on container runtime escapes, these are the other pieces in the series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker Runtime Escape:&lt;/strong&gt; &lt;a href="https://dzone.com/articles/docker-runtime-escape-docker-sock" rel="noopener noreferrer"&gt;Why Mounting docker.sock Is Worse Than Running Privileged Containers&lt;/a&gt; — Scenario 1, the socket escape chain (DZone)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Security:&lt;/strong&gt; &lt;a href="https://dzone.com/articles/docker-security-audit-to-ai-protection" rel="noopener noreferrer"&gt;6 Practical Labs From Audit to AI Protection&lt;/a&gt; — Labs 01–06, the foundation (DZone)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Docker Security:&lt;/strong&gt; &lt;a href="https://dzone.com/articles/advanced-docker-security-from-supply-chain-transparency-to-network-defense" rel="noopener noreferrer"&gt;From Supply Chain Transparency to Network Defense&lt;/a&gt; — Labs 07–08 (DZone)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The lab repo has everything: &lt;a href="//github.com/opscart/docker-security-practical-guide"&gt;github.com/opscart/docker-security-practical-guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blog: &lt;a href="https://opscart.com" rel="noopener noreferrer"&gt;opscart.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/opscart" rel="noopener noreferrer"&gt;github.com/opscart&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="//linkedin.com/in/shamsherkhan"&gt;linkedin.com/in/shamsherkhan&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>devsecops</category>
      <category>containers</category>
      <category>linux</category>
    </item>
    <item>
      <title>What a 60-second war-room scan reveals</title>
      <dc:creator>Shamsher Khan (Shamz)</dc:creator>
      <pubDate>Tue, 27 Jan 2026 20:45:11 +0000</pubDate>
      <link>https://forem.com/shamsher_khan/what-a-60-second-war-room-scan-reveals-352i</link>
      <guid>https://forem.com/shamsher_khan/what-a-60-second-war-room-scan-reveals-352i</guid>
      <description>&lt;p&gt;&lt;strong&gt;What a 60-Second War-Room Scan Revealed in Production&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything was green.&lt;br&gt;
Dashboards looked perfect.&lt;br&gt;
Alerts were quiet.&lt;/p&gt;

&lt;p&gt;And yet production was unstable.&lt;/p&gt;

&lt;p&gt;After too many late-night war rooms chasing "ghost issues" in Kubernetes, I learned an uncomfortable truth:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Kubernetes clusters can report "healthy" while hiding serious operational, security, and cost risks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’ve seen this pattern repeatedly in production — even in “stable” clusters.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Your Monitoring Stack Isn't Telling You
&lt;/h2&gt;

&lt;p&gt;Most Kubernetes monitoring answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is CPU or memory spiking?&lt;/li&gt;
&lt;li&gt;Are pods running?&lt;/li&gt;
&lt;li&gt;Is latency increasing?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it often misses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Containers running as root in production&lt;/li&gt;
&lt;li&gt;Privileged workloads with host access&lt;/li&gt;
&lt;li&gt;Namespaces idle for weeks, burning money&lt;/li&gt;
&lt;li&gt;Pods crash-looping thousands of times without alerts&lt;/li&gt;
&lt;li&gt;Security misconfigurations that don't fail fast — but fail catastrophically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your cluster can show 99.9% uptime while quietly accumulating risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 60-Second War-Room Scan&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To expose these blind spots, I built opscart-k8s-watcher — a Kubernetes scanner designed for incidents, not audits.&lt;/p&gt;

&lt;p&gt;It answers the questions engineers ask during outages, not after postmortems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Security Blind Spots (Pod-Level CIS Signals)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While debugging an incident, this is what surfaced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔴 CRITICAL FINDINGS:
- Containers running as root: 31
  └─ PRODUCTION: 10 (⚠️ immediate risk)
- Privileged containers: 3
  └─ SYSTEM: 3 (expected)
- HostPath volumes detected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of overwhelming you with hundreds of controls, the scan focuses on high-impact pod risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Root execution&lt;/li&gt;
&lt;li&gt;Privileged containers&lt;/li&gt;
&lt;li&gt;Host namespace access&lt;/li&gt;
&lt;li&gt;Missing resource limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All findings are &lt;strong&gt;environment-aware&lt;/strong&gt; — because a privileged pod in &lt;code&gt;kube-system&lt;/code&gt; is normal, but the same pod in production is a serious incident.&lt;/p&gt;
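&lt;p&gt;As a sketch of what environment-aware triage means in code (the namespace sets and severity labels here are illustrative, not the scanner's actual implementation):&lt;/p&gt;

```python
# Namespaces where privileged/system workloads are expected by design
SYSTEM_NAMESPACES = {"kube-system", "kube-public", "kube-node-lease"}

def classify(finding, namespace, prod_namespaces):
    """Map the same pod-level finding to a different severity
    depending on where it runs."""
    if namespace in SYSTEM_NAMESPACES:
        return "EXPECTED"   # e.g. a privileged CNI pod in kube-system
    if namespace in prod_namespaces:
        return "CRITICAL"   # identical finding, but in production
    return "WARNING"        # dev/staging: fix it, but it's not an incident

prod = {"payments", "checkout"}
print(classify("privileged", "kube-system", prod))  # EXPECTED
print(classify("privileged", "payments", prod))     # CRITICAL
```

&lt;p&gt;The finding is constant; the context decides whether it's noise or a page.&lt;/p&gt;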

&lt;p&gt;&lt;strong&gt;2. Resource Waste Hiding in Plain Sight&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Clusters don't just fail — they quietly waste money:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPTIMIZATION OPPORTUNITIES:
- staging idle for 21+ days (0.3 CPU, 0.4 GB)
- dev idle for 14+ days (0.2 CPU, 0.2 GB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are immediate wins, not theoretical optimizations.&lt;br&gt;
Idle namespaces, over-allocated workloads, and prod-grade resources running dev environments often go unnoticed for months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Silent Failures That Don't Trigger Alerts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some of the most dangerous problems never cross alert thresholds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔴 CRITICAL:
kubernetes-dashboard
Status: CrashLoopBackOff
Restarts: 2157
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A pod restarting 2,000+ times is not healthy — yet many clusters tolerate this indefinitely.&lt;/p&gt;

&lt;p&gt;These silent failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mask deeper configuration issues&lt;/li&gt;
&lt;li&gt;Degrade cluster stability&lt;/li&gt;
&lt;li&gt;Eventually cascade into outages&lt;/li&gt;
&lt;/ul&gt;
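&lt;p&gt;The detection logic for this class of failure is almost trivially simple, which is exactly why it's worth automating. A hypothetical sketch of the restart-count check:&lt;/p&gt;

```python
def silent_failures(pods, restart_threshold=50):
    """Flag pods whose cumulative restart count crossed the threshold,
    even if their current status looks fine on a dashboard."""
    return [
        (pod["name"], pod["restarts"])
        for pod in pods
        if pod["restarts"] >= restart_threshold
    ]

# Illustrative data mirroring the finding above
pods = [
    {"name": "kubernetes-dashboard", "phase": "CrashLoopBackOff", "restarts": 2157},
    {"name": "api-gateway", "phase": "Running", "restarts": 3},
]
print(silent_failures(pods))  # [('kubernetes-dashboard', 2157)]
```

&lt;p&gt;No alert threshold on CPU, memory, or latency would ever surface this; only the restart counter tells the story.&lt;/p&gt;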

&lt;p&gt;&lt;strong&gt;Why Traditional Monitoring Misses This&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitoring tools are excellent at answering:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Is it down right now?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They're bad at answering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Is this safe?"&lt;/li&gt;
&lt;li&gt;"Is this wasteful?"&lt;/li&gt;
&lt;li&gt;"What will fail next?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Structural risk rarely looks like an outage — until it suddenly becomes one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Teams Discover in Their First Scan&lt;/strong&gt;&lt;br&gt;
Within 60 seconds, teams usually uncover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Root containers running in production&lt;/li&gt;
&lt;li&gt;Privileged workloads with host access&lt;/li&gt;
&lt;li&gt;Crash-looping pods running for weeks&lt;/li&gt;
&lt;li&gt;30–40% hidden resource waste&lt;/li&gt;
&lt;li&gt;Dev environments consuming prod-grade capacity&lt;/li&gt;
&lt;li&gt;Failing most pod-level CIS controls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;All while dashboards remain green.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 60-Second Challenge&lt;/strong&gt;&lt;br&gt;
Run this against your cluster — right now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./opscart-scan security --cluster your-prod-cluster
./opscart-scan emergency --cluster your-prod-cluster
./opscart-scan resources --cluster your-prod-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will find something surprising.&lt;br&gt;
You will probably find several things that make you uncomfortable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your cluster is lying to you.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The full war-room walkthrough, diagrams, screenshots, and installation steps are available here:&lt;br&gt;
👉 &lt;strong&gt;Full war-room walkthrough:&lt;/strong&gt; &lt;a href="https://opscart.com/kubernetes-cluster-lying-60-second-scan/" rel="noopener noreferrer"&gt;the deep dive on OpsCart.com&lt;/a&gt;&lt;br&gt;
👉 &lt;strong&gt;Open source project:&lt;/strong&gt; &lt;a href="https://github.com/opscart/opscart-k8s-watcher" rel="noopener noreferrer"&gt;opscart-k8s-watcher&lt;/a&gt; on GitHub&lt;/p&gt;

&lt;p&gt;Run it once — and you'll never trust a "green" dashboard the same way again.&lt;/p&gt;

&lt;p&gt;Connect: &lt;a href="https://linkedin.com/in/shamsher-khan"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://github.com/opscart" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://opscart.com/" rel="noopener noreferrer"&gt;OpsCart.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>sre</category>
      <category>cloudnative</category>
    </item>
  </channel>
</rss>
