Forem: Sodiq Jimoh

I Build ML Infrastructure for a Living — Here's Why Hermes Agent Changes the Game for Platform Engineers

Sodiq Jimoh — Sat, 23 May 2026 22:30:00 +0000

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

I've spent the past year building NeuroScale — an open-source AI inference platform on Kubernetes. 108 commits. 21 automated smoke checks across 6 milestones. The kind of platform where a developer fills in a Backstage form and gets a production-grade inference endpoint with drift control, policy guardrails, and cost attribution — no kubectl required.

I'm telling you this because I need you to understand where I'm coming from when I say: Hermes Agent isn't just another AI coding assistant. It's the first agent framework that actually thinks like a platform engineer.

I don't say that lightly.

The Problem Nobody Talks About: AI Agents Are Stateless in a Stateful World

Building ML infrastructure teaches you one thing fast: everything is state.

Your ArgoCD sync status is state. Your Kyverno policy violations are state. The drift between what's in Git and what's running in the cluster — state. The fact that someone ran kubectl apply directly at 2am and broke the GitOps contract — that's state too.

Every AI agent I've used before Hermes treats each conversation like a blank canvas. You explain your architecture. You describe the problem. You get a plausible answer. Then you close the tab and do it all over again tomorrow.

Groundhog Day for infrastructure debugging.

Hermes Agent is architecturally different, and the difference matters specifically for the kind of work platform engineers do.

Three-Layer Memory: What It Actually Means for Infrastructure

Most people writing about Hermes focus on the memory system as a convenience feature. "It remembers your preferences." "It knows your name."

That's not what makes it interesting.

Hermes runs a three-layer memory architecture:

Short-term — current conversation context (same as every other agent)
Medium-term — session summaries that persist between conversations, built through periodic "memory nudges"
Long-term — Skill Documents that capture how it solved specific types of problems, stored as reusable procedures

For a platform engineer, this maps directly to something we already understand: runbooks.

When I troubleshoot an ArgoCD sync failure, I don't start from first principles. I check the runbook. Token expiry? Webhook misconfiguration? Sync wave ordering? The runbook encodes prior incident resolution as a procedure.

Hermes does this automatically. After roughly 15 tasks, its GEPA loop kicks in (GEPA stands for Genetic-Pareto, a reflective prompt-evolution method published at ICLR 2026 as an Oral): it reviews its own performance, identifies patterns, and writes new Skill Documents. Agents with 20+ self-generated skills complete similar future tasks 40% faster than fresh instances.

That's not "remembering your name." That's an agent building its own runbook library. It's the difference between a junior on-call engineer and a senior who's seen every failure mode before.

Where Hermes Creates Real Value in an ML Platform Stack

Abstract possibilities are cheap. Let me be specific about where this matters in a stack like NeuroScale.

1. Configuration Drift Diagnosis

NeuroScale uses ArgoCD with selfHeal: true — drift is auto-corrected. But detecting drift before ArgoCD catches it, and understanding why it happened, is a different problem.

Here's what a Hermes scheduled audit looks like in practice:

hermes task add --cron "0 */6 * * *" \
  "Check the diff between Git-declared state in infrastructure/apps/ \
   and live cluster state. If they diverge, summarize what changed, \
   correlate with recent kubectl audit logs, and flag whether the \
   change was human-initiated or a controller reconciliation. \
   Send results to Telegram."

Most agents can run a diff. Hermes does the part that matters: building a pattern library over time. After a month of audits, it knows that drift in the serving-stack namespace is almost always a Knative autoscaler update (harmless), while drift in kyverno/policies/ is almost always someone bypassing admission control (critical).

That context accumulates in Skill Documents. I haven't seen another agent framework that does this out of the box.

Here's what a drift report from Hermes actually looks like after a few weeks of accumulated context:

📋 Drift Audit — 2026-05-23 12:00 UTC

Cluster: neuroscale-prod
Namespaces scanned: 4

✅ serving-stack: 2 diffs detected
   → Both are Knative autoscaler reconciliations (harmless)
   → Matches pattern from Skill: "knative-autoscaler-drift"
   → No action required.

⚠️ kyverno/policies: 1 diff detected
   → ClusterPolicy "require-resource-limits" modified in-cluster
   → Not present in Git (infrastructure/policies/)
   → kubectl audit: manual apply by user "ops-admin" at 03:12 UTC
   → FLAGGED: Possible admission control bypass.
   → Recommend: Revert in-cluster change or commit to Git.

📎 Context: This is the 3rd manual policy edit in 14 days.
   Previous incidents resolved by reverting. See Skill:
   "kyverno-drift-response" for standard procedure.

Notice the last three lines. That's not a generic diff. That's an agent referencing its own operational history — correlating today's anomaly with patterns it learned from previous audits. A fresh agent instance can't do that. One with a month of Skill Documents can.
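To make the "pattern library" idea concrete, here is a minimal sketch of how past audit verdicts could drive today's classification. Everything in it is an assumption for illustration: the `DriftPatternLibrary` class, the verdict strings, and the `min_samples` threshold are hypothetical and say nothing about how Hermes actually stores Skill Documents.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DriftPatternLibrary:
    """Toy stand-in for accumulated audit history (hypothetical format)."""

    # Maps a scope (namespace or path) to counts of past verdicts.
    history: dict[str, Counter] = field(default_factory=dict)

    def record(self, scope: str, verdict: str) -> None:
        """Store the verdict assigned to one past drift event."""
        self.history.setdefault(scope, Counter())[verdict] += 1

    def classify(self, scope: str, min_samples: int = 3) -> str:
        """Return the most common past verdict, or 'unknown' without enough history."""
        seen = self.history.get(scope)
        if seen is None or sum(seen.values()) < min_samples:
            return "unknown"
        return seen.most_common(1)[0][0]


library = DriftPatternLibrary()
for _ in range(5):
    library.record("serving-stack", "harmless")
for _ in range(3):
    library.record("kyverno/policies", "critical")

print(library.classify("serving-stack"))
print(library.classify("kyverno/policies"))
print(library.classify("backstage"))  # no history yet, so a fresh agent's answer
```

The last call is the point: a scope with no history comes back as "unknown", which is where every fresh agent instance starts.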

2. Policy Validation Before Merge

NeuroScale enforces 5 Kyverno ClusterPolicies — requiring resource limits, standard labels, non-root containers, no :latest tags. But violations caught at admission mean the deploy already failed. The earlier you catch them, the cheaper the fix.

This is where Skill Documents become genuinely powerful. You write one that encodes your specific policies:

# Skill: NeuroScale Policy Pre-Check
## When to Use
When reviewing PRs that modify files under `apps/` or `infrastructure/`.
## Procedure
1. Check for `owner` and `cost-center` labels on all InferenceService manifests
2. Verify `resources.requests` and `resources.limits` are set
3. Flag any image tag that is `latest` or missing
4. Verify `securityContext.runAsNonRoot: true`
## Known False Positives
- ClusterServingRuntime objects are exempt from label requirements

That's not a prompt. It's a procedural memory document — loaded on-demand, zero tokens until needed, self-improving based on new violations it discovers.
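The four steps in that skill are mechanical enough to sketch as a plain function. This is an illustration under stated assumptions, not NeuroScale's actual check: the manifest is a simplified dict with a top-level `containers` list (a real InferenceService nests containers deeper), and `precheck` is a hypothetical name.

```python
def precheck(manifest: dict) -> list[str]:
    """Return policy violations for one manifest, following the skill's four steps."""
    violations = []
    labels = manifest.get("metadata", {}).get("labels", {})

    # Step 1: owner and cost-center labels. ClusterServingRuntime is the
    # known false positive, so it skips the label check.
    if manifest.get("kind") != "ClusterServingRuntime":
        for key in ("owner", "cost-center"):
            if key not in labels:
                violations.append(f"missing label: {key}")

    for container in manifest.get("containers", []):
        name = container.get("name", "<unnamed>")

        # Step 2: both requests and limits must be present and non-empty.
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            if not resources.get(section):
                violations.append(f"{name}: resources.{section} not set")

        # Step 3: the tag is whatever follows ':' in the last path segment.
        last_segment = container.get("image", "").rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            violations.append(f"{name}: image tag is 'latest' or missing")

        # Step 4: runAsNonRoot must be explicitly true.
        security = container.get("securityContext", {})
        if security.get("runAsNonRoot") is not True:
            violations.append(f"{name}: securityContext.runAsNonRoot is not true")

    return violations


# Hypothetical manifest that trips all four checks.
bad = {
    "kind": "InferenceService",
    "metadata": {"labels": {"owner": "team-alpha"}},
    "containers": [
        {
            "name": "predictor",
            "image": "ghcr.io/example/model:latest",
            "resources": {"requests": {"cpu": "1"}},
        }
    ],
}

for violation in precheck(bad):
    print(violation)
```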

3. Incident Response That Compounds

Real scenario from NeuroScale development: Backstage went into a CrashLoop. Root cause was a token refresh issue with the Kubernetes service account. I documented it in INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md.

With Hermes running persistently — which you can do on a $5 VPS or a serverless backend that hibernates when idle — it would have:

Detected the CrashLoop via scheduled health check
Correlated with recent changes (cert rotation? secret update?)
Checked its Skill Documents for prior Backstage incidents
Either resolved it or escalated with a structured diagnosis

Next time a similar issue occurs, it resolves faster because the skill from incident #1 already exists. That's the compounding effect that makes experienced SREs more valuable over time — now encoded in an agent's memory.
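The first step in that list, detection, is the easiest to pin down. A minimal sketch, assuming the health check feeds the parsed output of `kubectl get pods -o json` into a function; `find_crashlooping` and the pod names are hypothetical, but the `containerStatuses[].state.waiting.reason` path is the standard Kubernetes pod status shape.

```python
def find_crashlooping(pod_list: dict) -> list[tuple[str, str, int]]:
    """Return (pod, container, restart_count) for containers in CrashLoopBackOff."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            # 'waiting' is absent or None while a container is running.
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                hits.append((pod_name, status["name"], status.get("restartCount", 0)))
    return hits


# Hypothetical sample in the shape `kubectl get pods -o json` returns.
pods = {
    "items": [
        {
            "metadata": {"name": "backstage-7d9f"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "backstage",
                        "restartCount": 12,
                        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                    }
                ]
            },
        },
        {
            "metadata": {"name": "argocd-server-0"},
            "status": {
                "containerStatuses": [
                    {"name": "server", "restartCount": 0, "state": {"running": {}}}
                ]
            },
        },
    ]
}

print(find_crashlooping(pods))
```

Correlation and skill lookup are where the agent earns its keep; this only shows that the trigger itself is a few lines of status parsing.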

What Hermes Gets Right That Other Frameworks Don't

I've looked at the landscape — LangChain, CrewAI, AutoGen. Here's what Hermes gets structurally right for infrastructure:

Local-first data residency. Everything lives in a local SQLite database. For platform engineers working with cluster credentials and deployment configs, this isn't a feature — it's a prerequisite. I'm not sending my policy violations through someone else's API.

Terminal backends that work. Seven backends including local, Docker, SSH, and serverless options. SSH means Hermes runs commands on your actual infrastructure. Docker means you can sandbox it. Serverless means it hibernates when idle, wakes on demand. This is infrastructure-native thinking, not "here's a chat UI that can run Python."

Built-in cron scheduling. Natural-language-configured scheduled tasks with delivery to Telegram, Discord, Slack, or Signal. For infrastructure monitoring, this is table stakes — and Hermes is one of the few agent frameworks that ships it natively, no external cron daemon or YAML required.

200+ model support. Switch between cheap models for routine audits and powerful ones for complex diagnosis with a single command. No code changes. Operational flexibility that platform engineers actually need.

What Hermes Doesn't Solve (Yet)

Honesty about limitations matters more than hype when we're talking about tools that touch production infrastructure.

Domain reasoning is shallow. Hermes can follow procedures and build skill documents, but it can't replace a senior engineer's intuition about why a particular autoscaler configuration causes cascading latency under specific traffic patterns. The skill system captures what to do, not why it works.

Multi-cluster coordination is manual. NeuroScale runs on a single cluster. For federated infrastructure across regions, Hermes' per-instance memory doesn't federate. Each agent builds its own skill library independently. There's no skill-sharing protocol between agents yet.

Approval workflows need hardening. The --yolo flag bypasses all approval prompts. For infrastructure work, that's terrifying. The approval system needs declarative rules about what the agent can and cannot do — something like Kyverno's admission policies, not just per-command approve/deny. The tools/ directory has approval pinning in progress, but it's not production-ready for high-stakes operations.

The Bigger Picture: Agents as Infrastructure Primitives

Here's the perspective I haven't seen anyone else articulate.

Hermes Agent isn't just a tool for platform engineers. It's a new kind of infrastructure primitive.

Think about the trajectory: manual server management → configuration management → infrastructure as code → GitOps → platform engineering. Each layer abstracted the layer below and added intelligence.

Hermes represents the next step: infrastructure as conversation. Not in the shallow "chat with your cluster" sense. In the sense that an agent with persistent memory, self-improving procedures, and scheduled automation can become a layer in your control plane.

A layer that:

Observes continuously (cron + terminal access)
Learns from incidents (GEPA → Skill Documents)
Enforces patterns (skill-driven validation)
Communicates across channels (Telegram/Slack/Discord)
Costs almost nothing when idle (serverless backends)

That's not a chatbot. That's an operator — in the Kubernetes sense of the word.

Why I'm Betting on Hermes

The tools that win aren't the ones with the most features. They're the ones with the right architecture for compounding.

ArgoCD won over manual deploys because GitOps compounds — every deployment is auditable, reproducible, reversible. Kyverno won over manual policy checks because admission policies compound — every new policy protects every future deployment.

Hermes Agent's architecture compounds the same way. Every task makes it better at the next one. Every incident resolution becomes a skill document. Every audit pattern becomes a scheduled automation.

164,000 GitHub stars in under three months. MIT licensed. Runs on a $5 VPS. Data stays on your machine.

For platform engineers who've spent years building systems that self-heal, self-monitor, and self-govern — Hermes Agent is the first AI framework that actually speaks our language.

I'm Sodiq, and I build ML infrastructure platforms. NeuroScale is open source: github.com/sodiq-code/neuroscale-platform — PRs welcome. If you want to see how Hermes Agent could fit into a real Kubernetes-based ML platform, that's where I'd start.

Star the repo if this perspective was useful. And if you've tried Hermes against your own infrastructure — what broke first? I want to know.

I Spent 108 Commits Building Infrastructure. Google I/O 2026 Shipped It as One API Call.

Sodiq Jimoh — Sat, 23 May 2026 21:25:59 +0000

Six days before Google I/O, I pushed the final commit to my open-source project. Then, at I/O, Google announced they'd productized the hardest parts of it.

That's the story. And it's not a complaint — it's the most useful thing I can share about what I/O 2026 actually means for how we build software.

The Setup: What I Spent Six Months Building

NeuroScale is a self-service platform that lets developers deploy AI models to production without understanding the infrastructure underneath. A developer fills in a form, a pull request gets created automatically, automated checks validate it, it deploys itself, and a working AI endpoint goes live.

The developer sees a form and an endpoint. They don't see the 13 things that happen in between.

Building those 13 invisible things is where the 108 commits went.

Here's what "safe AI deployment" actually requires — and why each piece exists:

An isolation layer. When your AI model runs, it needs to run in its own contained environment. If it crashes, it shouldn't take anything else down. If it leaks data, it shouldn't reach other teams' workloads. Building and managing these containers is non-trivial.

A policy enforcement layer. Left to their own devices, developers will deploy AI models with no memory limits, no ownership labels, no cost attribution. In a shared environment, one badly-configured model can exhaust the entire cluster's memory. You need something that physically blocks bad deployments before they happen — not a warning, a block.

A cost attribution layer. "AI is expensive" is not useful information. "Team Alpha's recommendation model cost $47 this month and Team Beta's classifier cost $112" is. Someone has to plumb the labeling, the metrics collection, and the cost analysis to make that possible.

A drift correction and validation layer. Production systems get manually modified: someone "just quickly" adjusts a setting directly. Without continuous watching, your infrastructure drifts from what you think it is. And before any configuration reaches production, it should be checked against the organization's policies in a pull request, where a fix is far cheaper than catching the problem in production.

Each of those layers generated its own postmortem.

The isolation layer? I spent three hours debugging why no AI model could deploy at all. The cause: a single configuration flag, disableIstioVirtualHost: true, that wasn't in any documentation. Found it by reading the source code of the deployment tool. One boolean. Three hours.

The policy layer? For two weeks, our automated checks reported "all policies passed" on every pull request — a green checkmark, every time. They were lying. The command-line tool we used exits with a success code even when it finds policy violations. We had to rewrite the check to parse the tool's text output because the exit code couldn't be trusted. Two weeks of false confidence.

These aren't edge cases. They're what platform engineering actually looks like. Every layer that developers never see was built by someone who got paged at 2 AM because it didn't exist yet.
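The policy-check postmortem generalizes: when an exit code can't be trusted, make the gate parse the output and fail closed. A minimal sketch, assuming the tool prints a summary line of the form `pass: N, fail: N, warn: N, error: N, skip: N`; that format and the `policy_check_passed` helper are assumptions for illustration, so check them against the tool version you actually run.

```python
import re

# Assumed summary format; adjust the pattern to match your tool's real output.
SUMMARY = re.compile(
    r"pass:\s*(\d+),\s*fail:\s*(\d+),\s*warn:\s*(\d+),\s*error:\s*(\d+),\s*skip:\s*(\d+)"
)


def policy_check_passed(stdout: str) -> bool:
    """Decide pass/fail from the tool's text output instead of its exit code."""
    match = SUMMARY.search(stdout)
    if match is None:
        # No summary line means we cannot prove the check ran: fail closed.
        return False
    _passed, failed, _warned, errored, _skipped = map(int, match.groups())
    return failed == 0 and errored == 0


print(policy_check_passed("pass: 3, fail: 0, warn: 0, error: 0, skip: 0"))
print(policy_check_passed("pass: 2, fail: 1, warn: 0, error: 0, skip: 0"))
print(policy_check_passed(""))
```

Failing on a missing summary line matters as much as failing on violations. A check that silently stops producing output should not turn green.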

The I/O Announcement That Stopped Me Cold

At Google I/O 2026, Google announced Managed Agents in the Gemini API. Here's how they described it:

"A single API call provisions a remote Linux environment where the agent can reason, plan and call tools; execute code and manage files in an isolated sandbox; and browse the web to fetch and process live data."

The parallels to what I'd built were immediate:

Isolation layer? Each agent interaction gets its own sandboxed environment — similar problem, radically simpler solution.

Policy layer? Antigravity has "cross-platform terminal sandboxing, credential masking, and hardened Git policies." Not identical to organizational policy enforcement, but covering the same ground at the infrastructure level.

State persistence? Environments carry state between calls. It's not the same as GitOps-style drift reconciliation — Google isn't watching your production cluster and reverting unauthorized changes — but it solves the narrower problem of agents losing context between interactions.

And here's the code:

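from google import genai

# With no arguments, the client reads the API key from the environment.
client = genai.Client()

# One call: the agent plans, calls tools, and runs code in a remote sandbox.
interaction = client.interactions.create(
    agent="antigravity-preview-05-2026",
    input="Deploy this model and run inference on the test dataset.",
    environment="remote",
)

print(interaction.output_text)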

That's it. No configuration file. No 13-step pipeline. No three-hour debugging sessions hunting boolean flags in source code.

The infrastructure abstraction I spent six months building — Google offered a dramatically simpler path to the same developer outcome.

What Google Actually Got Right

I don't mean this as criticism of what I built. The NeuroScale platform works. But Google's announcement revealed something important about where I was solving the problem.

I solved it at the infrastructure layer — Kubernetes, container orchestration, policy engines, GitOps controllers. Powerful. Flexible. Also: requires a developer to know what all of those words mean before they can onboard.

Google solved it at the interface layer. "Write a sentence describing what you want. We handle everything underneath."

The difference isn't capability. It's entry cost.

My platform asks developers to trust a form that maps to a manifest that feeds a policy engine that connects to a deployment controller. Three layers of abstraction they can't see, maintained by someone (me) they have to trust.

Google's Managed Agents ask developers to write a sentence.

There's also something deeper in how Google defined agents. They introduced AGENTS.md and SKILL.md — plain markdown files that define what an agent knows how to do and how it should behave. No special syntax. No proprietary format. Just files.

Compare that to my platform's equivalent: a Backstage plugin with a custom template schema, connected to a Kubernetes custom resource definition, validated by a policy engine configured in YAML. Same goal — "define how deployments should behave" — wildly different complexity.

Google's insight is that infrastructure behavior should be readable by the people who define it, not just by the engineers who build the platforms that enforce it.

What Managed Agents Still Can't Do

Here's where I stop nodding and start pushing back.

The organizational layer is still a human problem.

Managed Agents provision environments. They don't model the organizational reality that surrounds those environments.

When NeuroScale blocks a deployment, it's not because a technical constraint was violated. It's because this organization decided that every AI workload must carry an owner label and a cost-center label before it can run. That policy came from a finance conversation, not a technical specification. It was encoded into the platform as a business rule.

Google's Managed Agents don't know about your finance team's requirements. The Antigravity SDK (also announced at I/O) gives you the infrastructure to build that enforcement — but you still have to build it.

Shared infrastructure still needs shared accounting.

On NeuroScale, every AI workload carries labels. Those labels flow into a cost monitoring tool. At the end of the month, I can tell you exactly which team's models cost how much. That's not a technical feature — it's an organizational requirement that someone had to translate into a technical implementation.

Managed Agents bill at the API level. You know what your API calls cost. You don't automatically know which business unit those costs should be attributed to. That mapping is still manual.
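The manual mapping is not complicated, it just has to exist. A minimal sketch, assuming you tag each call with a cost center on your side and log its cost; the record shape, the `cost_center` key, and `rollup_by_cost_center` are all hypothetical, not part of any Google API.

```python
from collections import defaultdict


def rollup_by_cost_center(usage_records: list[dict], default: str = "unattributed") -> dict:
    """Sum per-call costs into per-cost-center totals."""
    totals = defaultdict(float)
    for record in usage_records:
        # Untagged calls land in a visible bucket instead of disappearing.
        totals[record.get("cost_center", default)] += record["cost_usd"]
    return dict(totals)


# Hypothetical usage log kept by the caller.
records = [
    {"cost_center": "team-alpha", "cost_usd": 20.0},
    {"cost_center": "team-alpha", "cost_usd": 27.0},
    {"cost_center": "team-beta", "cost_usd": 112.0},
    {"cost_usd": 5.0},
]

print(rollup_by_cost_center(records))
```

The "unattributed" bucket is the design choice worth keeping: it turns missing labels into a number someone has to explain.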

Compliance environments don't get to outsource their audit trails.

Healthcare, finance, government. For these industries, "who deployed what, when, with what configuration, and who approved it" is not a preference — it's a requirement. That audit trail has to live somewhere you control, in a format your auditors accept.

Google's infrastructure is excellent. It's not your infrastructure. There's a difference that matters in regulated industries, and Managed Agents don't close it.

The Broader Signal Most I/O Coverage Missed

Everyone wrote about the flashy announcements. Gemini Omni. Universal Cart. Gemini Spark doing your tasks while your laptop is off.

But for developers who build things for other developers — the platform engineers, the DevOps practitioners, the infrastructure leads — the signal at I/O 2026 was quieter and more consequential.

Google announced a complete rethinking of what "developer infrastructure" means:

Managed Agents: The unit of deployment is no longer a container. It's an agent interaction.
Antigravity SDK: Google's own agent harness, self-hostable on your infrastructure.
AGENTS.md / SKILL.md: Infrastructure behavior defined as readable markdown files.
Chrome DevTools for Agents: Your debugging tools follow your agents into the browser.
WebMCP: A proposed standard for AI agents to interact with any website as a tool.

These aren't separate products. They're an architecture. Google is describing a world where:

You describe what you want in natural language or markdown.
An agent figures out how to do it in an isolated, managed environment.
The infrastructure is someone else's problem.

For solo developers and small teams, this is unambiguously good. Stop building platforms. Start building products.

For platform engineers at organizations with compliance requirements, cost accountability, and multi-team governance — the work isn't going away. But the tools are getting better, and the layer where the work happens is shifting upward.

A Framework for Every Developer Reading This

The most useful thing I can offer from six months of platform building:

You do NOT need custom infrastructure if:

You're one developer or a small team
You're prototyping or experimenting with a new idea
Your data and compliance requirements are standard
Speed to working product matters more than organizational control

→ Use Managed Agents. Use Google AI Studio. Ship your product, not your platform.

You still need custom infrastructure if:

You're in a regulated industry with specific audit requirements
Multiple teams share compute and need separate cost accountability
Your organization has security or data residency constraints that preclude managed services
You need to enforce business rules at the infrastructure layer — not just as guidelines, but as technical blocks

→ The Antigravity SDK exists for you. Take Google's agent harness, host it yourself, build your organizational layer on top.

The honest summary: most developers building most things should use what Google shipped at I/O. The exceptions are real, they're specific, and they're usually organizational rather than technical.

Want to feel the difference yourself? Install the SDK (pip install google-genai), grab an API key from Google AI Studio, and run the code block above. In under a minute, you'll have an agent running in an isolated sandbox — doing what took me weeks of Kubernetes configuration to set up. That gap between "weeks" and "one minute" is the entire argument of this post.

What I'm Actually Taking Away From I/O 2026

NeuroScale is staying up. The organizational requirements it solves are real.

But the part of it I'm most proud of — the developer experience, where someone fills in a form and infrastructure appears — is now a solved problem. Google solved it. One API call.

That frees me to focus on the part that isn't solved: the organizational layer that every company has to build for themselves, because it encodes their specific rules, their specific accountability structures, their specific compliance requirements.

Platform engineering used to be 80% "build the scaffolding so developers don't have to" and 20% "encode the organization's requirements."

After I/O 2026, I think those numbers flip.

The scaffolding is increasingly a service you call. The organization's requirements are still yours to build. And that's the more interesting work anyway — it requires understanding the business, not just the toolchain.

108 commits taught me what the invisible infrastructure actually contains. Google I/O 2026 told me where the next 108 should go.

NeuroScale is open source: github.com/sodiq-code/neuroscale-platform. Every failure, every fix, every postmortem is documented. The platform still works. The smoke tests still pass. The world just got a shortcut to where I started.

Gemma 4 Broke My Kubernetes Resource Model. Here's What I Measured.

Sodiq Jimoh — Sat, 23 May 2026 20:21:24 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I build a self-service AI inference platform called NeuroScale. Developers fill a Backstage form, the platform generates KServe manifests via PR, ArgoCD deploys them, and a production inference endpoint goes live. 108 commits, 21 smoke tests, 6 milestone postmortems.

When I wrote Gemma 4's InferenceService manifest, the resource requests looked identical to a dense model of the same size. The numbers said otherwise: 48 GB of VRAM holding a model that activates 3.8B of 25.2B parameters per token, GPU compute utilization at 40% while p95 latency doubled, and OpenCost billing two teams the same rate for workloads with 2× different per-token costs. Every broken assumption traces back to one architectural decision: Gemma 4 doesn't replace its dense FFN with experts — it runs both.

The Architecture That Breaks the Assumptions

You need to see why Gemma 4's MoE breaks things that Mixtral and DeepSeek don't.

Standard MoE (Mixtral, DeepSeek, Qwen): the dense FFN is replaced by sparse experts. Router picks a subset per token. Dense path is gone.

Gemma 4 26B keeps three pathways running in parallel:

Gemma 4 26B MoE block (actual architecture):

                 Input
                   │
                   ▼
              [Attention]
                   │
       ┌───────────┼───────────┐
       ▼           ▼           ▼
  [Dense FFN] [Shared Exp]  [Router]
  (always on)  (always on)     │
       │           │    ┌──────┼──────┐
       │           │    ▼      ▼      ▼
       │           │ [Exp 1] [Exp 2]...[Exp 128]
       │           │    │      │       │
       │           │    └──────┴───────┘
       │           │       (8 fire)
       └───────────┴───────────┘
                   │
           [Sum all three]
                   │
                   ▼
                Output

128 routed experts, 8 active per token, one shared expert, plus a dense FFN — all summed together. Total parameters: ~25.2B. Active per token: ~3.8B. Active-to-total ratio: 0.15.

For comparison: Mixtral 8x7B activates 13B of 47B (ratio: 0.28). DeepSeek-V2 activates 21B of 236B (ratio: 0.09, but across a far larger total). Gemma 4's ratio is the lowest among sub-30B MoE models because the always-on dense FFN and shared expert consume parameter budget without contributing to the "active" count in the way you'd expect. The dense FFN is a structural safety net — if the router picks wrong experts, the always-on pathways carry the signal. This is why Gemma 4 trains more stably and degrades gracefully under quantization.
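A toy version of the block as drawn above, to show what "sum all three" means in practice. This is a sketch of the diagram, not Gemma 4's implementation: the hidden state is a single float, the experts are simple lambdas, and renormalizing router scores with a softmax over only the chosen experts is an assumption.

```python
import math


def moe_block(x, dense_ffn, shared_expert, experts, router_weights, top_k):
    """Sum an always-on dense path, an always-on shared expert, and top_k routed experts."""
    # Router: one score per expert, keep the top_k highest.
    scores = [w * x for w in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]

    # Softmax over the chosen experts only, so their weights sum to 1.
    peak = max(scores[i] for i in chosen)
    exps = {i: math.exp(scores[i] - peak) for i in chosen}
    norm = sum(exps.values())
    routed = sum(exps[i] / norm * experts[i](x) for i in chosen)

    # Every expert sits in memory; only top_k of them did arithmetic.
    return dense_ffn(x) + shared_expert(x) + routed, chosen


# Toy sizes: 8 experts, 2 active (the article's model uses 128 and 8).
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_weights = [0.1 * i for i in range(8)]

output, chosen = moe_block(
    x=1.0,
    dense_ffn=lambda x: 2.0 * x,
    shared_expert=lambda x: 0.5 * x,
    experts=experts,
    router_weights=router_weights,
    top_k=2,
)
print(chosen, round(output, 3))
```

Six of the eight toy experts never run for this token, yet all eight have to be resident. Scale that up and you get the memory-versus-compute gap the rest of this post is about.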

For platform engineers, the consequence: the gap between "parameters in memory" and "parameters doing compute" is wider than any previous MoE at this scale. That gap breaks three things in Kubernetes.

Consequence 1: The Numbers That Broke My Cost Model

On NeuroScale, every InferenceService requires a cost-center label. Our Kyverno admission policy blocks any deployment without one, and OpenCost reads these labels to attribute GPU-hours to teams.

Here's what the numbers look like when you deploy both on equivalent hardware via vLLM:

Metric	26B MoE (A4B)	31B Dense
VRAM consumed (BF16)	48.1 GB	62 GB
Active params per token	3.8B	31B
Decode throughput (single user, H100)	177.1 tok/s	40.3 tok/s
Decode throughput (concurrency 16, H100)	—	375 tok/s
TTFT (concurrency 1, H100)	—	67.7 ms
Per-token cost (cloud API)	$0.06 / 1M tokens	$0.12 / 1M tokens
GPU allocated	1× A100	1× A100

(H100 throughput: JarvisLabs SPEED-Bench on vLLM 0.8.5; VRAM: controlled A6000 BF16 benchmark; API pricing: Google Cloud Vertex AI. To reproduce the throughput test, run vllm serve google/gemma-4-26B-A4B-it --dtype bfloat16 --max-model-len 8192, then hit /v1/completions with a 512-token prompt.)

OpenCost attributes cost based on resource allocation, not utilization. Both models allocate 1× A100. Both get billed identically per GPU-hour. But the MoE delivers 4.4× higher single-user throughput and 2× lower per-token cost.

In a multi-tenant cluster where Team A runs the 26B MoE and Team B runs the 31B Dense, OpenCost bills them the same rate. Team A gets 4.4× the throughput on identical hardware. The fair billing unit for MoE isn't GPU-hours — it's GPU-hours weighted by active parameter ratio. No Kubernetes cost attribution tool I've found supports this.
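To see how far allocation-based billing drifts from delivered work, convert the same GPU-hour price into cost per million tokens. A rough sketch: the $2.00 GPU-hour price is a made-up placeholder, the throughput figures are the single-user numbers from the table, and it assumes the GPU is busy the whole hour, which real traffic never is.

```python
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_second: float) -> float:
    """Spread one GPU-hour's price over the tokens generated in that hour."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


GPU_HOUR_USD = 2.00  # hypothetical price, identical for both teams

moe_cost = cost_per_million_tokens(GPU_HOUR_USD, 177.1)   # 26B MoE, single user
dense_cost = cost_per_million_tokens(GPU_HOUR_USD, 40.3)  # 31B Dense, single user

print(f"MoE:   ${moe_cost:.2f} per 1M tokens")
print(f"Dense: ${dense_cost:.2f} per 1M tokens")
print(f"Dense/MoE cost ratio: {dense_cost / moe_cost:.1f}x")
```

The placeholder price cancels out of the ratio, so the gap between the two teams is the throughput gap regardless of what the GPU actually costs.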

The math: Gemma 4's total-to-active ratio is 6.6:1 (25.2B / 3.8B). Mixtral's is 3.6:1 (47B / 13B). The cost attribution error for Gemma 4 is almost double Mixtral's — a direct consequence of the 3-pathway design keeping more parameters loaded but inactive.
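The ratios quoted in this post are easy to recompute from the parameter counts. A small check, using only the figures already given here (in billions of parameters):

```python
# (active params, total params) in billions, as quoted in this post.
MODELS = {
    "Gemma 4 26B MoE": (3.8, 25.2),
    "Mixtral 8x7B": (13.0, 47.0),
    "DeepSeek-V2": (21.0, 236.0),
}


def ratios(active_b: float, total_b: float) -> tuple[float, float]:
    """Return (active-to-total, total-to-active) for one model."""
    return active_b / total_b, total_b / active_b


for name, (active, total) in MODELS.items():
    active_to_total, total_to_active = ratios(active, total)
    print(f"{name}: active/total = {active_to_total:.2f}, total/active = {total_to_active:.1f}:1")

gemma_gap = ratios(*MODELS["Gemma 4 26B MoE"])[1]
mixtral_gap = ratios(*MODELS["Mixtral 8x7B"])[1]
print(f"Gemma 4 vs Mixtral: {gemma_gap / mixtral_gap:.2f}x")
```

The final line comes out a little over 1.8, which is the "almost double" figure.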

Consequence 2: GPU Utilization Lies to the Autoscaler

KServe supports HPA-based autoscaling. The standard setup:

metadata:
  annotations:
    serving.kserve.io/autoscalerClass: "hpa"
    serving.kserve.io/targetUtilizationPercentage: "70"
    serving.kserve.io/metric: "cpu"

For dense models, GPU compute utilization roughly tracks inference load. More requests → more compute → higher utilization. Scaling at 70% is a reasonable proxy.

For the MoE, this proxy breaks. The causal chain:

LLM decode is memory-bandwidth-bound, not compute-bound. Each output token reads model weights from VRAM. Google Cloud's analysis found a cluster pushing 1M tokens/second showed only 4.4% GPU FLOPS utilization — tensor cores finish in microseconds, then wait for data.

Gemma 4 amplifies this: 25.2B of weights sit in VRAM, but only 3.8B do arithmetic per token. The memory bus shuttles expert weights that may not even fire.

Result: memory bandwidth saturates — throughput degrades, latency climbs — while GPU compute utilization reads 30–40%.

HPA sees 40% against a 70% target. Decision: don't scale.
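A back-of-envelope roofline makes the bandwidth argument tangible. This sketch assumes BF16 weights (2 bytes per parameter), a peak memory bandwidth of roughly 3.35 TB/s (the published figure for the H100 SXM, used here as an assumption about the test hardware), and that every active weight is read from VRAM once per generated token. It ignores KV cache traffic and kernel overhead, so it is a ceiling, not a prediction.

```python
def decode_ceiling_tok_s(active_params_b: float, bytes_per_param: int, bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed when weight reads are the bottleneck."""
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


H100_BANDWIDTH_GB_S = 3350  # assumption: H100 SXM peak, ~3.35 TB/s

moe_ceiling = decode_ceiling_tok_s(3.8, 2, H100_BANDWIDTH_GB_S)
dense_ceiling = decode_ceiling_tok_s(31.0, 2, H100_BANDWIDTH_GB_S)

print(f"MoE ceiling:   {moe_ceiling:.0f} tok/s (measured single-user: 177.1)")
print(f"Dense ceiling: {dense_ceiling:.0f} tok/s (measured single-user: 40.3)")
```

The estimate counts only active parameters, which is the best case for the MoE. As concurrency rises, different tokens route to different experts, so the bytes touched per step grow toward the full 25.2B while the utilization gauge stays low.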

The correct autoscaling metric for MoE isn't utilization — it's request latency:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma4-moe-hpa
spec:
  scaleTargetRef:
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    name: gemma4-moe
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_request_p95_latency_seconds
        target:
          type: AverageValue
          averageValue: "2.0"

This requires a custom metrics pipeline (Prometheus adapter → HPA custom metrics API). Significantly more infrastructure than built-in CPU scaling. But without it, the default autoscaling path that works for every dense model silently fails for MoE.
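The "don't scale" decision follows directly from the replica formula in the Kubernetes HPA documentation: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). A minimal sketch of that formula; the clamp uses the minReplicas and maxReplicas from the manifest above, and the latency readings are made-up inputs.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=4, tolerance=0.1):
    """Replica count from the documented HPA formula, clamped to min/max."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Utilization metric: 40% observed against a 70% target.
print(desired_replicas(1, 40, 70))

# Latency metric: hypothetical p95 of 4.0 s against the 2.0 s target.
print(desired_replicas(1, 4.0, 2.0))
```

Same overloaded pod, two different metrics, two different answers. The formula is fine; the input is what misleads it.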

Consequence 3: Expert Parallelism Crashes Under Data Parallelism

vLLM supports expert parallelism for MoE models — distributing individual experts across GPUs:

vllm serve google/gemma-4-26B-A4B-it \
    --tensor-parallel-size 2 \
    --enable-expert-parallel \
    --max-model-len 32768

Swapping --enable-expert-parallel for --data-parallel-size 2 to scale horizontally looks harmless: weights load, CUDA graphs capture, API servers start. Then it crashes on the first inference request.

Reproduce it yourself (4 GPUs required: tensor-parallel 2 × data-parallel 2):

vllm serve google/gemma-4-26B-A4B-it \
    --tensor-parallel-size 2 \
    --data-parallel-size 2 \
    --max-model-len 32768
# Send any request → crash

The exact error from vLLM issue #38999:

File "vllm/distributed/device_communicators/cuda_communicator.py"
  in _all_gather_single
AssertionError: 1 != 36

Root cause: The MoE fused expert layer assumes that multiple GPUs means expert parallelism, triggering inter-GPU all_gather operations for expert routing. Under data parallelism, each GPU should run an independent full copy — no cross-GPU expert communication. The DP workers initialize with EP-style communication, causing a tensor shape mismatch.

Dense models (31B, E4B) work fine with --data-parallel-size > 1. The crash is specific to MoE + DP.

The vLLM collaborator confirmed: you must pass --enable-expert-parallel when deploying MoE in DP mode. Without it, DP mode crashes. With it, you get expert parallelism semantics (experts distributed across GPUs) instead of data parallelism semantics (full model copies per GPU).

For a platform that lets developers self-serve model deployments, this means: you cannot expose "number of GPUs" as a user-facing knob for MoE models. The parallelism strategy determines whether the deployment runs or crashes, and the correct choice depends on the model architecture — not the model size.
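
One way a platform could make that choice for the user is to inspect the model's config.json before rendering the serving command. The sketch below is hedged accordingly: the function names are hypothetical, and the JSON keys it looks for (num_local_experts, num_experts, n_routed_experts) are assumptions borrowed from other MoE model families, so the key Gemma 4 actually uses should be checked before relying on it. It also keeps tensor parallelism equal to the GPU count, mirroring the working command above.

```shell
# Returns 0 when config.json declares more than one expert.
# ASSUMPTION: the expert count lives under one of these keys; adjust per model family.
is_moe_config() {
  local config_file="$1"
  grep -Eq '"(num_local_experts|num_experts|n_routed_experts)"[[:space:]]*:[[:space:]]*([2-9]|[1-9][0-9]+)' "$config_file"
}

# Prints the vLLM flags the platform would inject for a given config and GPU count.
vllm_parallel_flags() {
  local config_file="$1" gpu_count="$2"
  local flags="--tensor-parallel-size ${gpu_count}"
  if is_moe_config "$config_file"; then
    # MoE: expert parallelism, never plain data parallelism
    flags+=" --enable-expert-parallel"
  fi
  echo "$flags"
}
```

Called as vllm_parallel_flags ./config.json 2, this prints "--tensor-parallel-size 2 --enable-expert-parallel" for a config that declares multiple experts and "--tensor-parallel-size 2" otherwise. The developer still picks a GPU count; the platform owns the parallelism flags.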

The Outage That Taught Me to Distrust Defaults

I wouldn't have found any of this without getting burned first.

KServe's default config assumes Istio. We ran Kourier (~100 MB vs Istio's ~1 GB) because our k3d dev cluster didn't have the RAM. Result: all InferenceService creation blocked cluster-wide. Three hours of downtime. The fix:

ingress:
  disableIstioVirtualHost: true

One boolean, not in the getting-started docs. Found it in the controller source.

The lesson applies directly: when the YAML looks right and everything deploys green, check what the YAML doesn't say. The MoE manifest is valid. The pod starts. The cost model, autoscaler, and parallelism strategy are all silently wrong.

The CI That Lied for Two Weeks

For two weeks, Kyverno policy checks showed green on every PR. Policies weren't enforced at all.

The bug: kyverno-cli apply exits with code 0 even when policies are violated. Violations print to stdout, while the exit code (the only thing CI checks) says "success." The fix:

OUTPUT=$(kyverno-cli apply ./policies/ --resource "$manifest" 2>&1)
# Match a non-zero count in the summary line; a bare "fail" would also match "FAIL: 0"
if echo "$OUTPUT" | grep -qiE "(fail|error): [1-9]"; then
    echo "Policy violation detected"
    exit 1
fi

Any developer could have deployed an InferenceService without resource limits, cost labels, or ownership metadata. CI was green. Governance was broken.

Same pattern as the MoE manifest: the configuration is valid, the deployment succeeds, and the assumptions underneath are wrong.

What Platform Engineers Should Do

After measuring all of this, here's the playbook:

1. Tag InferenceServices with architecture metadata.

labels:
  model-architecture: "moe"
  active-param-ratio: "0.15"    # 3.8B / 25.2B
  total-params: "25.2B"
  cost-center: "cc-ml-inference"

Your cost attribution, alerting, and capacity planning all need to know that this 48 GB model computes like a 4B model. Gemma 4's 3-pathway design (dense FFN + shared expert + routed experts) makes the total-to-active ratio higher than other MoE architectures — the metadata must capture this.

2. Autoscale on latency, not utilization.

GPU compute utilization is a lagging, misleading indicator for MoE. Memory bandwidth saturation hits first. Use vllm_request_p95_latency_seconds or vllm_tokens_per_second as your HPA metric.

3. Don't expose parallelism strategy as a user knob.

Your platform should detect MoE vs Dense from the model config and set --enable-expert-parallel accordingly. Self-serve GPU count selection will produce runtime crashes for MoE models if the parallelism strategy is wrong — the model loads, CUDA graphs capture, then it crashes on the first request.

4. Budget VRAM for total params, compute for active params.

Resource requests = total parameters (scheduling). Capacity planning = active parameters (throughput). One A100 running Gemma 4 MoE: 177 tok/s. Same A100 running 31B Dense: 40 tok/s. Same GPU request, 4.4× throughput difference. Your capacity model must account for this or you'll over-provision MoE by roughly 4×.
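
The capacity arithmetic can be sketched in a few lines. The 1,000 tok/s target is a hypothetical workload, the function names are illustrative, and the calculation assumes throughput scales linearly with replica count, which real deployments only approximate.

```shell
# replicas = ceil(target_tok_per_s / per_replica_tok_per_s), integer arithmetic
replicas_for_throughput() {
  local target_tps="$1" per_replica_tps="$2"
  echo $(( (target_tps + per_replica_tps - 1) / per_replica_tps ))
}

# Active-parameter share as a percentage, e.g. 3.8B active of 25.2B total
active_param_percent() {
  awk -v active="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * active / total }'
}

replicas_for_throughput 1000 177   # Gemma 4 MoE, one A100 per replica
replicas_for_throughput 1000 40    # 31B Dense, one A100 per replica
active_param_percent 3.8 25.2
```

This prints 6, 25, and 15.1. A capacity model that sizes the MoE deployment from its dense-equivalent VRAM footprint would ask for about four times the GPUs the workload needs.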

The Deeper Pattern: Architecture-Aware Scheduling

These three failures (cost, scaling, parallelism) point to a structural gap in Kubernetes: the scheduler is architecture-blind.

resources.requests.nvidia.com/gpu: 1 tells the scheduler to find a node with a free GPU. It says nothing about whether that GPU will be memory-bandwidth-bound or compute-bound, whether the model activates 15% or 100% of its parameters, or whether horizontal scaling requires expert parallelism flags. The scheduler treats a 26B MoE and a 31B Dense as identical workloads because they request the same resource.

This was fine when every model was dense. With Gemma 4's 3-pathway MoE entering production, the abstraction leaks. The fix isn't just labels and custom metrics — it's treating model architecture as a first-class scheduling dimension, the same way we treat GPU type, memory, and topology today.

Every claim in this post — cost attribution errors, autoscaler blindness, DP crashes — is reproducible. The vLLM issue is public. The benchmarks are from published controlled tests (JarvisLabs, Google Cloud). The KServe outage and Kyverno bug happened on a real platform with 108 commits of history.

The model works. The YAML looks right. The CI is green.

The governance is broken. And now I can prove it.

NeuroScale is open source: github.com/sodiq-code/neuroscale-platform. The BEFORE.md and AFTER.md in each milestone directory tell the full story of what went wrong and what I learned.

I Revived a Broken MLOps Platform — Now It's Self-Service, Policy-Guarded, and Operationally Credible

Sodiq Jimoh — Sat, 23 May 2026 00:42:39 +0000

Submitted for the GitHub Copilot Challenge — deadline June 7, 2026. Built with GitHub Copilot as an active architectural and debugging partner.

On April 4th, I abandoned this platform. Backstage crashed. ArgoCD was broken. KServe couldn't serve a single model. I walked away and left it for dead.

48 days later, I came back and rebuilt it into a self-service AI inference platform with GitOps, policy enforcement, and deterministic recovery.

21 checks. 0 failures. Reproducible on any machine.

Repo: github.com/sodiq-code/neuroscale-platform

Watch It Run: 21 Checks, 0 Failures

This is not a claim. The video below shows every check running live against a real k3d cluster.

▶ Watch the full smoke test demo — 21 checks, 0 failures

━━━ Milestone A — GitOps Spine (ArgoCD) ━━━
  [✓ PASS] All ArgoCD pods are Running
  [✓ PASS] ArgoCD Applications: 7 Healthy, 0 Progressing, 7 total
  [✓ PASS] ArgoCD sync visibility: no Unknown states (7/7 Synced)
  [✓ PASS] Drift self-heal: nginx-test recreated and Ready in ~20s

━━━ Milestone B — AI Serving Baseline (KServe) ━━━
  [✓ PASS] KServe controller-manager: 1 replica available
  [✓ PASS] InferenceServices: 2/2 Ready=True
  [✓ PASS] Inference request: demo-iris-2 returned predictions
           ↳ Response: {"predictions":[1,1]}

━━━ Milestone C — Golden Path (Backstage) ━━━
  [✓ PASS] Backstage deployment: 1 replica available
  [✓ PASS] demo-iris-2 InferenceService exists (scaffolder output)
  [✓ PASS] demo-iris-2 ArgoCD Application exists (ApplicationSet output)

━━━ Milestone D — Guardrails (Kyverno + CI) ━━━
  [✓ PASS] Kyverno pods running: 3
  [✓ PASS] Kyverno ClusterPolicies installed: 5 policies
  [✓ PASS] Admission block: non-compliant InferenceService correctly denied

━━━ Milestone F — Production Hardening ━━━
  [✓ PASS] ApplicationSet neuroscale-model-endpoints exists
  [✓ PASS] ArgoCD has 7 Applications (ApplicationSet + static)
  [✓ PASS] ResourceQuota exists in namespace default
  [✓ PASS] LimitRange exists in namespace default
  [✓ PASS] Non-root admission block: root-container Deployment denied
  [✓ PASS] OpenCost deployment healthy: 1 replica available

  PASS 21 / FAIL 0 / SKIP 1
  ✓ All checks passed. Platform is healthy and ready to demo.

The single SKIP is the drift self-heal pre-condition check — normal after a previous test run. The drift self-heal itself passed, visible in the output above and in the video.

Reproducible on any machine:

bash scripts/bootstrap.sh     # ~5 minutes — requires Docker + k3d
bash scripts/smoke-test.sh    # 21 checks, all green

The Problem: A Platform That Was Abandoned and Dangerous

NeuroScale started in February 2026 as an AI inference platform on Kubernetes. By early April it was abandoned — Backstage crashing, ArgoCD broken, KServe unable to serve a single model. The last commit before this challenge was April 4th. Then 48 days of silence.

Here's what I found when I came back:

Backstage: CrashLoopBackOff — 14 restarts. A Helm values nesting bug caused probe timings to be silently ignored.
ArgoCD repo-server: CrashLoopBackOff — every application showed Unknown, meaning ArgoCD couldn't even evaluate their state.
KServe: READY=False — default config assumed Istio for ingress, but the cluster ran Kourier. Error: "virtual service not found".
Policy enforcement: None. Root containers, no resource limits, :latest tags — deployed freely.
Drift detection: None. Manual kubectl changes accumulated silently.

The deployment process was vim → kubectl apply → hope. Developers feared deploying models. The platform was technically worse than not having one.

What I Built: Five Enforcement Layers

Layer 1: Self-Service Golden Path

A developer fills in a Backstage form. The platform does everything else.

Backstage form → PR created → CI validates → Merge → ArgoCD syncs
  → ApplicationSet discovers → KServe endpoint live → Predictions working

No kubectl. No YAML editing. No tribal knowledge. The template.yaml generates a compliant InferenceService manifest, opens a PR, and the neuroscale-model-endpoints ApplicationSet auto-discovers it.

Five fields. No kubectl. No YAML. DNS pattern enforced client-side, cost center required. Click Next and Backstage does the rest.

Two steps, 9 seconds total. PR opened, ApplicationSet picks it up on next ArgoCD sync.

Layer 2: GitOps Drift Control

Git is the source of truth. Drift is auto-corrected.

$ kubectl delete deploy nginx-test -n default
# 20 seconds later...
$ kubectl get deploy nginx-test -n default
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
nginx-test   1/1     1            1           8s   # Auto-recreated by ArgoCD

selfHeal: true and prune: true. Manual cluster changes cannot persist.

Layer 3: Policy Guardrails — Shift-Left + Shift-Down

At PR time (CI): kubeconform validates schemas. kyverno-cli simulates all 5 policies against rendered manifests with a dual exit-code + stdout check to guard against false-greens. Full pipeline at .github/workflows/guardrails-checks.yaml.

At admission time (cluster): Kyverno blocks non-compliant resources before they reach the cluster.

Five enforced policies:

Policy	What It Blocks
`require-standard-labels-inferenceservice`	Missing `owner` + `cost-center` labels
`require-standard-labels-deployment`	Missing ownership labels on Deployments
`require-resource-requests-limits`	No CPU/memory requests or limits
`disallow-latest-image-tag`	Floating `:latest` image tags
`disallow-root-containers`	Containers without `runAsNonRoot: true`

Layer 4: Cost Attribution

Every workload carries owner and cost-center labels enforced by Kyverno — you can't deploy without them. OpenCost reads these via Prometheus for per-team cost breakdowns. The CI pipeline also comments on PRs with CPU/memory deltas and flags workloads exceeding thresholds.

Layer 5: Operational Recovery

Documented runbooks for every failure mode encountered. 3-command, 2-minute recovery procedures. Full runbook at docs/runbook.md.

Copilot Partnership: Three Moments That Mattered

Copilot didn't write this platform. It functioned as a senior infrastructure advisor at three exact moments where I could have stayed stuck for days.

Moment 1: The Architectural Decision — Kourier vs Istio

Problem: KServe stuck at READY=False. Error: "virtual service not found" — an Istio concept on a cluster running Kourier.

Copilot searched the actual repo files, confirmed the non-Istio setup was already correct, and identified the root cause: stale cached config. Critical tradeoff it surfaced — Istio adds ~1GB memory overhead; Kourier is under 200MB. On a shared 8GB dev node, Istio would have killed reproducibility.

Fix: Reapply the serving-stack overlay, verify disableIstioVirtualHost=true in ConfigMaps, restart control plane pods. Result: working inference, 800MB freed.

Moment 2: The Silent Bug — CI Guardrails That Can't False-Green

Problem: kyverno-cli apply looked green in CI. Then I tested with a deliberately non-compliant manifest. It still passed. The guardrail was checking nothing.

Two undocumented kyverno-cli behaviors Copilot surfaced:

A single --resource flag with multiple paths silently ignores every path after the first.
Exit code is 0 even when violations are printed to stdout.

The fix (live in .github/workflows/guardrails-checks.yaml):

# Per-resource fan-out: one file per --resource flag
set -o pipefail  # without this, the pipeline's status is tee's, not kyverno-cli's
mapfile -t app_files < <(find apps -type f \( -name '*.yaml' -o -name '*.yml' \) | sort)
failed=0

for resource in "${app_files[@]}"; do
  log="$(mktemp)"
  if ! docker run --rm -v "$PWD:/work" -w /work ghcr.io/kyverno/kyverno-cli:v1.12.5 \
      apply infrastructure/kyverno/policies/*.yaml --resource "$resource" 2>&1 | tee "$log"; then
    failed=1
  fi
  # Dual check: exit code AND stdout, never one signal alone.
  # Match non-zero counts only; a clean run still prints "FAIL: 0" and "ERROR: 0".
  if grep -qiE '(fail|error): [1-9]' "$log"; then
    failed=1
  fi
done

exit "$failed"

A guardrail that silently passes is worse than no guardrail. Every team using kyverno-cli in CI without this pattern has a potential false-green.

Moment 3: The Recurring Crash — CrashLoopBackOff and the Runbook

Problem: Backstage in CrashLoopBackOff — 7 restarts. Pod failing health checks it never properly configured.

I pasted the raw kubectl get pods -n backstage -w output directly into Copilot:

Root cause: Backstage is a dependency chart — app settings must be nested under backstage.backstage.*. Keys at the wrong level meant Helm silently used defaults, including failureThreshold: 3 with aggressive timings and no startupProbe. The container kept failing before the plugin system finished initializing.

ArgoCD hit the same pattern: Kyverno installation disrupted the repo-server's gRPC channel, which doesn't auto-reconnect — causing all apps to show Unknown. Copilot identified both as the same root pattern and helped write a deterministic recovery:

kubectl -n argocd rollout restart deploy/argocd-repo-server
kubectl -n argocd rollout status deploy/argocd-repo-server --timeout=120s
kubectl -n argocd get applications  # All Synced/Healthy

That became docs/runbook.md. The platform doesn't just work — it's recoverable. That's the difference between a demo and a real platform.

Before vs After

For Developers

Before	After
Edit YAML by hand, `kubectl apply`, hope	Fill a Backstage form, review a PR, merge
No policy feedback until deployment fails	CI blocks non-compliant manifests before merge
No cost visibility	PR comment shows CPU/memory delta
Tribal knowledge required	Anyone can deploy

For Operators

Before	After
Manual cluster inspection for drift	ArgoCD self-heals in ~20 seconds
No runbooks — "ask the person who built it"	Documented recovery for every failure mode
No smoke tests	21-check automated verification, any machine
No namespace governance	ResourceQuota + LimitRange enforced

For the Platform

Before	After
Abandoned since April 4th	Finished, documented, reproducible
Collection of broken parts	6 milestones, 21 verified checks, 0 failures
Manual and error-prone	Automated and policy-enforced end-to-end

What Made This Real: The Failures

This was not built on the happy path. Every milestone hit real failures:

Milestone	Key Failure	What It Taught Me
A — GitOps Spine	ArgoCD `Unknown` ≠ `Error` — comparison engine couldn't run	Don't confuse UI status with root cause
B — KServe Serving	Istio/Kourier mismatch — undocumented KServe default	Always verify infrastructure defaults on constrained clusters
C — Golden Path	Backstage CrashLoopBackOff from Helm mis-nesting — probes silently ignored	CI must validate rendered manifests, not just source YAML
D — Guardrails	Kyverno webhook disrupts all ArgoCD apps during install window	Admission controllers need deployment ordering
E — Cost & CI	`kyverno-cli` false-green: exit 0 with actual violations	Dual-check exit code AND stdout — never trust one signal
F — Hardening	ApplicationSet replaced per-app files — requires skeleton alignment	Scaffolder templates must match GitOps discovery patterns

Full failure log and recovery steps in docs/runbook.md.

Try It Yourself

# Clone and bootstrap (requires Docker + k3d + kubectl + helm)
git clone https://github.com/sodiq-code/neuroscale-platform.git
cd neuroscale-platform
bash scripts/bootstrap.sh     # ~5 minutes

# Verify everything works
bash scripts/smoke-test.sh    # 21 checks, 0 failures

# Open all UIs
bash scripts/port-forward-all.sh

After port-forward-all.sh:

Backstage at http://localhost:7010 — developer portal
ArgoCD at http://localhost:8080 — 7 synced applications
OpenCost at http://localhost:9090 — per-workload cost attribution

5 minutes from git clone to a fully working platform. The smoke test proves it all.

The Bottom Line

Copilot helped at the exact points where strong engineering judgment mattered most: an architectural tradeoff that saved 800MB of memory, a silent CI bug that every kyverno-cli user faces, and operational recovery that turns a 2-hour outage into a 2-minute runbook.

21 checks. 0 failures. Reproducible on any machine.

What's one abandoned project you wish you had finished? Drop it in the comments.

Your Kyverno CI Is Lying to You: Why kyverno-cli Exits 0 on Policy Violations

Sodiq Jimoh — Tue, 14 Apr 2026 06:08:38 +0000

This article is about a CI system that looked healthy for two weeks while
enforcing absolutely nothing.

The symptom was simple: a PR with a deliberately non-compliant Kubernetes
manifest was passing CI. The Kyverno policy check step showed green.
The manifest was missing a required label. Kyverno should have blocked it.
It did not.

This is the story of why, and the exact fix.

This is part of a series on building a production-hardened AI inference
platform on Kubernetes:

Project repo:
github.com/sodiq-code/neuroscale-platform

The context: what I was enforcing

The NeuroScale platform requires every InferenceService and Deployment
in the default namespace to carry two labels:

owner — which team owns the workload
cost-center — which budget the resource consumption is charged to

These labels feed directly into OpenCost for cost attribution. Without them,
workloads appear as uncategorised spend. With Kyverno admission policies
enforcing them at the cluster level, every resource is guaranteed to carry
cost attribution metadata.

The Kyverno ClusterPolicy looks like this:

# infrastructure/kyverno/policies/
#   require-standard-labels-inferenceservice.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-standard-labels-inferenceservice
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-owner-and-cost-center-on-isvc
      match:
        any:
          - resources:
              kinds:
                - InferenceService
      validate:
        message: >
          InferenceService resources must set
          metadata.labels.owner and metadata.labels.cost-center.
        pattern:
          metadata:
            labels:
              owner: "?*"
              cost-center: "?*"

Admission enforcement works. Apply an InferenceService without those labels
and Kyverno blocks it at the API server:

$ kubectl apply -f bad-model.yaml
Error from server: admission webhook
  "clusterpolice.kyverno.svc" denied the request:
  resource InferenceService/default/bad-model was blocked
  due to the following policies
  require-standard-labels-inferenceservice:
    check-owner-and-cost-center-on-isvc:
      'validation error: InferenceService resources must set
      metadata.labels.owner and metadata.labels.cost-center.'

That part worked correctly. The CI part did not.

The false-green: what it looked like

The CI workflow ran kyverno-cli against every changed manifest in apps/
on every pull request. The intent was to catch non-compliant manifests before
merge — shift-left enforcement before the manifest ever reached the cluster.

The original CI command:

docker run --rm -v "$PWD:/work" -w /work \
  ghcr.io/kyverno/kyverno-cli:v1.12.5 \
  apply infrastructure/kyverno/policies/*.yaml \
  --resource "${app_files[@]}"

A test PR was created with a manifest that deliberately lacked cost-center:

# apps/test-bad-model/inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: test-bad-model
  namespace: default
  labels:
    owner: platform-team
    # cost-center intentionally missing

Expected result: CI fails.
Actual result: CI passed.

$ git push origin feature/test-bad-policy
# ... CI runs ...
# Result: validate-policies-against-app-manifests: ✅ PASSED

The violation was printed to stdout. The job still showed green.

Root Cause Part 1: kyverno-cli apply exits 0 on violations

This is the core issue. kyverno-cli apply in the version used (v1.12.x)
exits with code 0 when it finds policy violations. It prints them to
stdout, but returns a success exit code.

You can verify this directly:

docker run --rm -v "$PWD:/work" -w /work \
  ghcr.io/kyverno/kyverno-cli:v1.12.5 \
  apply infrastructure/kyverno/policies/*.yaml \
  --resource apps/test-bad-model/inference-service.yaml

# Output:
# PASS: 0, FAIL: 1, WARN: 0, ERROR: 0, SKIP: 0
#
# policy require-standard-labels-inferenceservice ->
#   resource default/InferenceService/test-bad-model
#   FAIL: check-owner-and-cost-center-on-isvc
#
echo $?
0   # <-- exits 0 despite the FAIL

The violation is visible in stdout. The exit code is 0. Any CI step that
only checks the exit code will report success.

Root Cause Part 2: $? captures tee, not kyverno

The CI command piped output through tee to capture it for logging:

docker run ... kyverno-cli apply ... \
  2>&1 | tee /tmp/kyverno-output.txt

Even if kyverno-cli had exited non-zero, $? in bash captures the exit
code of the last command in the pipe — which is tee. tee always
exits 0 if it can write to the file.

This means two separate problems were stacked:

kyverno-cli apply exits 0 on violations (kyverno behavior)
$? captures tee exit code, not kyverno exit code (bash pipe behavior)

Either problem alone would have caused the false-green.
Together they made the enforcement completely invisible.

The fix: dual check with ${PIPESTATUS[0]}

${PIPESTATUS[0]} captures the exit code of the first command in a
pipe, regardless of what the subsequent commands return. Combined with
stdout parsing for violation markers, this creates a reliable enforcement
check.

set +e
docker run --rm -v "$PWD:/work" -w /work \
  ghcr.io/kyverno/kyverno-cli:v1.12.5 \
  apply infrastructure/kyverno/policies/*.yaml \
  --resource "${app_files[@]}" \
  2>&1 | tee /tmp/kyverno-output.txt
kyverno_exit="${PIPESTATUS[0]}"
set -e

if [ "${kyverno_exit}" -ne 0 ] \
    || grep -qiE "^[[:space:]]*FAIL" /tmp/kyverno-output.txt \
    || grep -qiE "fail: [1-9][0-9]*" /tmp/kyverno-output.txt; then
  echo "Kyverno policy violations detected. Failing CI."
  exit 1
fi

Why two checks instead of one:

The ${PIPESTATUS[0]} check handles cases where kyverno-cli itself
exits non-zero — which may happen in future versions or on error conditions.
The stdout grep checks handle the current v1.12.x behavior where violations
print FAIL to stdout but exit 0. Together they cover both current behavior
and future behavior changes.
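
The stdout half of the check can be isolated into a small helper and tested without Docker or a cluster. This is a sketch, not the repo's implementation: the function name is hypothetical, and it assumes the summary line has the shape shown earlier ("PASS: 0, FAIL: 1, WARN: 0, ERROR: 0, SKIP: 0"), matched case-insensitively in case the casing differs between versions.

```shell
# Reads kyverno-cli output on stdin.
# Exits 0 when the summary reports a non-zero fail or error count, 1 otherwise.
# ASSUMPTION: counts appear as "fail: N" / "error: N" in any letter case.
kyverno_output_has_violations() {
  grep -qiE '(fail|error): *[1-9][0-9]*'
}

# Usage sketch with a captured summary line
summary='PASS: 0, FAIL: 1, WARN: 0, ERROR: 0, SKIP: 0'
if printf '%s\n' "$summary" | kyverno_output_has_violations; then
  echo "violations found"
fi
```

Matching on a non-zero count rather than on the bare word "fail" matters: a fully compliant run still prints "FAIL: 0", and a looser pattern would turn every run red.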

Why set +e before the command:

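When the step runs with both errexit and pipefail enabled (bash -eo pipefail),
a non-zero exit from kyverno would abort the script before ${PIPESTATUS[0]}
could be captured. With errexit alone the pipeline's status is tee's and
nothing aborts, but set +e makes the intent explicit either way: it disables
errexit temporarily so we can capture and evaluate the exit code explicitly.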
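
The pipe behavior is easy to confirm locally without kyverno at all. In this minimal sketch, false stands in for a failing kyverno-cli and cat stands in for tee; the function name is illustrative.

```shell
# Runs "<command> | cat" and reports the exit code of each side of the pipe.
# cat stands in for tee: like tee, it succeeds as long as it can write.
pipe_exit_codes() {
  "$@" 2>/dev/null | cat >/dev/null
  # PIPESTATUS must be copied immediately; the next command overwrites it.
  local codes=("${PIPESTATUS[@]}")
  echo "first=${codes[0]} last=${codes[1]}"
}

result="$(pipe_exit_codes false)"
echo "$result"
```

This prints "first=1 last=0": the first command failed, yet the value $? would report is the last command's 0. Reading ${PIPESTATUS[0]} on the very next line is what recovers the real exit code.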

Verification: proving the fix works

After the fix, the same test PR with the missing cost-center label now
fails CI correctly:

$ git push origin feature/test-bad-policy

# CI output:
validate-policies-against-app-manifests
  Running kyverno policy check...

  PASS: 0, FAIL: 1, WARN: 0, ERROR: 0, SKIP: 0
  policy require-standard-labels-inferenceservice ->
    resource default/InferenceService/test-bad-model
    FAIL: check-owner-and-cost-center-on-isvc

  Kyverno policy violations detected. Failing CI.

# Result: validate-policies-against-app-manifests: ❌ FAILED

A compliant manifest with both labels passes:

$ kubectl apply -f apps/demo-iris-2/inference-service.yaml
# CI result: validate-policies-against-app-manifests: ✅ PASSED

The complete GitHub Actions workflow step

Here is the full implementation used in the NeuroScale platform:

# .github/workflows/guardrails-checks.yaml
- name: Validate policies against app manifests
  run: |
    app_files=()
    while IFS= read -r -d '' f; do
      app_files+=("--resource" "$f")
    done < <(find apps/ -name "*.yaml" -print0)

    if [ ${#app_files[@]} -eq 0 ]; then
      echo "No app manifests found. Skipping policy check."
      exit 0
    fi

    set +e
    docker run --rm -v "$PWD:/work" -w /work \
      ghcr.io/kyverno/kyverno-cli:v1.12.5 \
      apply infrastructure/kyverno/policies/*.yaml \
      "${app_files[@]}" \
      2>&1 | tee /tmp/kyverno-output.txt
    kyverno_exit="${PIPESTATUS[0]}"
    set -e

    echo "--- Kyverno output ---"
    cat /tmp/kyverno-output.txt
    echo "--- Exit code: ${kyverno_exit} ---"

    if [ "${kyverno_exit}" -ne 0 ] \
        || grep -qiE "^[[:space:]]*FAIL" /tmp/kyverno-output.txt \
        || grep -qiE "fail: [1-9][0-9]*" /tmp/kyverno-output.txt; then
      echo "Kyverno policy violations detected. Failing CI."
      exit 1
    fi

    echo "All manifests passed policy checks."

Why this matters beyond a single platform

The kyverno-cli apply exit code behavior is not a bug — it is documented
behavior for the apply subcommand. But it is not prominently surfaced in
the getting-started documentation, and most CI examples in the wild use
the exit code check alone.

If your team is using Kyverno for compliance or security enforcement in CI,
and your CI step looks like this:

kyverno apply policies/ --resource manifests/
if [ $? -ne 0 ]; then
  echo "Policy violation detected"
  exit 1
fi

Your enforcement is silently not enforcing. The admission webhook at the
cluster level is still blocking violations — but your shift-left CI gate
is not. Developers will only discover policy violations after merging and
watching ArgoCD fail, not before.

The distinction between "guardrails exist" and "guardrails enforce" is
exactly what separates platform engineering from platform theater.

The two-layer enforcement model

The fix is not just about the CI step. The NeuroScale platform uses two
enforcement layers that work together:

Layer 1 — PR time (shift-left): kyverno-cli in CI catches violations
before merge. Developers get fast feedback without needing a running cluster.

Layer 2 — Admission time (shift-down): Kyverno admission webhook blocks
non-compliant resources at the Kubernetes API server. Even if CI is bypassed
or misconfigured, nothing non-compliant reaches the cluster.

PR opened
    ↓
CI: kyverno-cli apply + ${PIPESTATUS[0]} check
    ↓ (blocks here if non-compliant)
PR merged
    ↓
ArgoCD sync
    ↓
Kyverno admission webhook
    ↓ (blocks here as second layer)
Resource created in cluster

Layer 1 gives fast developer feedback. Layer 2 is the safety net.
Both are required. Neither alone is sufficient.

Debugging Commands Reference

# Test a policy manually against a manifest
docker run --rm -v "$PWD:/work" -w /work \
  ghcr.io/kyverno/kyverno-cli:v1.12.5 \
  apply infrastructure/kyverno/policies/require-standard-labels-inferenceservice.yaml \
  --resource apps/demo-iris-2/inference-service.yaml

# Check Kyverno admission webhook registrations
kubectl get validatingwebhookconfigurations | grep kyverno
kubectl get mutatingwebhookconfigurations | grep kyverno

# Verify Kyverno pods are healthy
kubectl -n kyverno get pods
kubectl -n kyverno get endpoints kyverno-svc

# List all installed ClusterPolicies and their enforcement mode
kubectl get clusterpolicies -o wide

# Review Kyverno admission decisions in controller logs
kubectl -n kyverno logs deploy/kyverno --tail=50 | \
  grep -i "admit\|deny\|block"

What I Would Add Next

A kyverno test subcommand integration in CI for cases that require explicit pass/fail test fixtures rather than live policy simulation
Background scan results surfaced as PR comments using the Kyverno policy report API
Separate policy validation for Deployment and InferenceService resources to give more specific failure messages per resource type

Deploying Backstage on Kubernetes with the Helm Chart: The Infrastructure-First Guide

Sodiq Jimoh — Mon, 06 Apr 2026 02:02:26 +0000

Who this is for: Engineers deploying Backstage on Kubernetes via the
official Helm chart who want a working portal, not just a running pod.
This guide starts where most tutorials end — after helm install succeeds
but before anything actually works.

A few weeks ago I published an article called
"Nine Ways Backstage Breaks Before Your Developer Portal Works".
A Backstage maintainer read it and gave me structured feedback. The core of
it was this: several of the failures I documented were caused by not
following the official getting-started documentation before using the Helm
chart, and by using the demo image as if it were a production-ready base.

They were right. This article is the follow-up they suggested — and the one
I should have written first.

It does not repeat the previous article. It starts earlier, goes deeper on
Helm-specific configuration, and correctly attributes failures to their
actual causes rather than blaming Backstage for things that are ArgoCD,
Traefik, or operator error.

Official resources you should read alongside this guide:

Project repo referenced throughout:
github.com/sodiq-code/neuroscale-platform

The one thing you must understand before installing the Helm chart

The Backstage Helm chart uses a demo image by default. The chart README
contains this explicit warning:

The Backstage chart is not an official Backstage project and is not
supported by the Backstage core team. The default image used in this chart
is for demo purposes only.

This single fact explains most of the configuration friction you will
encounter. The demo image does not behave like a real Backstage application
built with backstage new app. It has different startup characteristics,
different configuration defaults, and different failure modes.

What this means practically:

If you are building a real developer portal — not just running a demo — you
should follow the official getting started guide
to create your own Backstage application first, build a custom Docker image
from it, and then use the Helm chart to deploy that image. The chart's
image.repository and image.tag values are where you point to your
own image.

If you are experimenting, learning, or building an integration platform
where Backstage is one component (as in the NeuroScale project), the demo
image path is workable — but you need to understand its limitations and
configure it correctly.

This guide covers the Helm chart path specifically, with the official docs
as the reference point throughout.

The values hierarchy that breaks everything silently

This is the most important configuration concept in the entire Helm chart.
Get this wrong and every override you write will be silently ignored.

The Backstage Helm chart is a wrapper chart — Backstage itself is a
dependency inside it. The dependency is named backstage. This means
configuration for the Backstage application container must be nested under
backstage.backstage.*, not backstage.*.

Wrong — values are silently ignored:

# This looks correct but is placed at the wrong hierarchy level
backstage:
  appConfig:
    app:
      title: My Platform
  startupProbe:
    initialDelaySeconds: 120
  resources:
    requests:
      cpu: 100m

Correct — values reach the Backstage container:

backstage:
  backstage:           # <-- this second level is required
    appConfig:
      app:
        title: My Platform
    startupProbe:
      initialDelaySeconds: 120
    resources:
      requests:
        cpu: 100m

The Helm chart processes the outer backstage key as the dependency name.
Values placed directly under backstage.* are interpreted as chart-level
configuration, not as container configuration. The rendered Deployment then
falls back to chart defaults, including probe timings, rather than your
overrides.
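A cheap guard can flag the wrong nesting level before Helm even renders anything. The sketch below is a heuristic, not a YAML parser: it assumes two-space indentation as in the examples above, and the function name container_key_indent is hypothetical. It prints how many spaces precede the first line that defines a key; for container settings such as startupProbe the expected answer under backstage.backstage is 4.

```shell
# Print the number of leading spaces on the first line that defines KEY.
# Heuristic only: assumes the file is indented with spaces, not tabs.
container_key_indent() {
  key="$1"
  file="$2"
  awk -v key="$key" '
    {
      line = $0
      sub(/^ +/, "", line)                 # strip leading spaces
      if (index(line, key ":") == 1) {     # line starts with "KEY:"
        print length($0) - length(line)    # indentation width
        exit
      }
    }
  ' "$file"
}
```

Calling container_key_indent startupProbe infrastructure/backstage/values.yaml and comparing the result with 4 is enough to catch the flattened layout shown in the "Wrong" example, where the same call prints 2.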

How to verify your values are actually applied:

Render the Helm chart before applying it and inspect the output Deployment
spec directly:

helm template neuroscale-backstage backstage/backstage \
  -f infrastructure/backstage/values.yaml \
  --namespace backstage \
  | grep -A 30 "startupProbe"

If you see initialDelaySeconds: 120 in the output, your probe override
reached the container. If you see initialDelaySeconds: 5 or a very small
number, your values are at the wrong nesting level.

This verification step should be part of your CI pipeline. In the NeuroScale
platform, scripts/ci/render_backstage.sh runs this check on every PR:

#!/bin/bash
# scripts/ci/render_backstage.sh
helm template neuroscale-backstage backstage/backstage \
  -f infrastructure/backstage/values.yaml \
  --namespace backstage \
  | grep "initialDelaySeconds" \
  | grep -q "120" || {
    echo "ERROR: startupProbe initialDelaySeconds not set correctly"
    exit 1
  }
echo "Helm values nesting verified"
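One limitation of a plain grep for 120 is that it also matches when only the readinessProbe carries that value. A stricter variant reads the value from inside the startupProbe block itself. The function below is a sketch under the assumption that helm template emits block-style YAML with space indentation; probe_initial_delay is a hypothetical name, not part of the NeuroScale repository.

```shell
# Read a rendered manifest on stdin and print the initialDelaySeconds value
# that belongs to the named probe (for example startupProbe).
probe_initial_delay() {
  probe="$1"
  awk -v probe="$probe" '
    {
      line = $0
      sub(/^ +/, "", line)
      indent = length($0) - length(line)
    }
    # Entering the probe block: remember how deeply it is indented.
    line == (probe ":") { in_probe = 1; probe_indent = indent; next }
    # A non-empty line at the same or shallower indentation ends the block.
    in_probe && line != "" && indent <= probe_indent { in_probe = 0 }
    in_probe && line ~ /^initialDelaySeconds:/ {
      sub(/^initialDelaySeconds: */, "", line)
      print line
      exit
    }
  '
}
```

Piping the helm template output from above into probe_initial_delay startupProbe and comparing the result with 120 gives a check that cannot be satisfied by a different probe.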

Required configuration keys for the demo image

The demo image requires specific configuration keys to be present at
startup. Missing any of them causes the frontend to crash on load with a
JavaScript error that is only visible in browser developer tools — the page
itself shows a blank white screen with no visible error.

The minimum required appConfig block:

backstage:
  backstage:
    appConfig:
      app:
        title: Your Platform Name    # required — crash if absent
        baseUrl: http://localhost:7010
      backend:
        baseUrl: http://localhost:7010
        cors:
          origin: http://localhost:7010
        database:
          client: better-sqlite3
          connection: ':memory:'

Why baseUrl matters:

The app.baseUrl and backend.baseUrl values must match the URL you are
actually using to access Backstage. If you port-forward on port 7010 but
the config says port 7007, the frontend React app loads but all API calls
fail — the UI appears to work while the backend connection is broken.
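Because a port mismatch produces no visible error, it can help to compare the values file against the port you intend to forward. This is a minimal sketch that assumes the localhost URLs used in this guide and one baseUrl per line with nothing after the port; mismatched_base_urls is a hypothetical helper name.

```shell
# Print every baseUrl line in FILE that does not point at localhost:PORT.
# Prints nothing when all baseUrl values agree with the port-forward.
mismatched_base_urls() {
  port="$1"
  file="$2"
  grep -n 'baseUrl:' "$file" | grep -v "localhost:${port}\$" || true
}
```

Running mismatched_base_urls 7010 infrastructure/backstage/values.yaml before starting the port-forward lists the offending lines with their line numbers, or prints nothing if the file is consistent.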

Why better-sqlite3 for local deployments:

The demo image ships with SQLite support. For local Kubernetes deployments
where you want zero external dependencies, the in-memory SQLite connection
is sufficient. For production, replace this with a PostgreSQL connection
pointing at a managed database service. The chart includes an optional
PostgreSQL deployment; see
the chart's database configuration docs.

Probe timings: the demo image starts slowly

Backstage is a Node.js application. The demo image takes approximately 60
to 90 seconds to complete startup on a typical Kubernetes node. Kubernetes
default probe timings assume a 2-second initial delay. The result is
predictable: the startup probe fires before the application is ready, the
pod fails the probe, Kubernetes kills it, and the pod enters
CrashLoopBackOff.

This is not a Backstage bug. It is a configuration requirement that the
Helm chart does not prominently surface. The correct probe settings:

backstage:
  backstage:
    startupProbe:
      httpGet:
        path: /healthcheck
        port: 7007
      initialDelaySeconds: 120    # give Node.js time to start
      periodSeconds: 10
      failureThreshold: 30        # 30 × 10s = 5 minutes of retries after the initial delay
    readinessProbe:
      httpGet:
        path: /healthcheck
        port: 7007
      initialDelaySeconds: 120
      periodSeconds: 10
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /healthcheck
        port: 7007
      initialDelaySeconds: 300    # only check liveness after 5 minutes
      periodSeconds: 30
      failureThreshold: 3
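The three startup probe numbers combine into a single time budget: roughly the initial delay plus one probe period per allowed failure. The arithmetic can be sketched as a small helper; the function name is hypothetical, and the result is an approximation of kubelet behavior rather than an exact guarantee.

```shell
# Approximate seconds before the kubelet gives up on a starting container:
# initialDelaySeconds + failureThreshold * periodSeconds.
startup_budget_seconds() {
  initial_delay="$1"
  period="$2"
  failure_threshold="$3"
  echo $(( initial_delay + failure_threshold * period ))
}

startup_budget_seconds 120 10 30   # startupProbe values from above, prints 420
```

With the settings in this guide the container gets about seven minutes in total, which leaves a wide margin over the 60 to 90 seconds the demo image normally needs.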

How to diagnose probe failures:

# Watch pod status in real time
kubectl get pods -n backstage -w

# When you see CrashLoopBackOff, describe the pod
kubectl describe pod -n backstage <pod-name>

# Look for this in Events:
# Warning  Unhealthy  kubelet  Startup probe failed: connection refused

# Check logs from the previous container instance
kubectl logs -n backstage <pod-name> --previous --tail=100

If you see Startup probe failed: connection refused in events but the
previous container logs show normal Node.js startup messages, the
application is starting correctly — the probe is just firing too early.
Increase initialDelaySeconds.

A full incident postmortem for this specific failure, including the exact
Kubernetes events and the Helm values diff before and after the fix, is in
infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md.

Authentication: local dev vs production

The Backstage new backend architecture (introduced in version 1.x) includes
a default authentication policy that rejects any request to a backend plugin
that does not carry valid credentials. This affects how the scaffolder
frontend talks to the scaffolder backend, a call that was unauthenticated in
older versions.

For local development only, the quickest fix is to use the guest auth
provider:

backstage:
  backstage:
    appConfig:
      auth:
        providers:
          guest:
            dangerouslyAllowOutsideDevelopment: true

This keeps the auth subsystem active and provides a real
user:default/guest identity to all plugins — which is safer than
disabling auth entirely with dangerouslyDisableDefaultAuthPolicy: true.
Plugins that assume a user context will behave correctly.

For production, use the GitHub OAuth provider:

# infrastructure/backstage/values-prod.yaml
backstage:
  backstage:
    appConfig:
      auth:
        environment: production
        providers:
          github:
            production:
              clientId: ${GITHUB_CLIENT_ID}
              clientSecret: ${GITHUB_CLIENT_SECRET}

Store GITHUB_CLIENT_ID and GITHUB_CLIENT_SECRET as Kubernetes secrets,
not in values.yaml. The Helm chart's extraEnvVarsSecrets field handles
this:

backstage:
  backstage:
    extraEnvVarsSecrets:
      - backstage-secrets

Then create the secret:

kubectl create secret generic backstage-secrets \
  -n backstage \
  --from-literal=GITHUB_CLIENT_ID="your-client-id" \
  --from-literal=GITHUB_CLIENT_SECRET="your-client-secret"

How to verify auth is configured correctly:

# Check the scaffolder actions API directly
curl http://localhost:7010/api/scaffolder/v2/actions

# If you get 401: the request carried no credentials the backend accepts
# If you get 200 with a JSON list of actions: the backend accepted the request

A bare curl sends no token, so with a sign-in provider enabled a 401 here is
expected; repeat the request with a valid Backstage token in an
Authorization: Bearer header to confirm the provider works. When the browser
itself receives a 401 with
{"error":{"name":"AuthenticationError","message":"Missing credentials"}},
the scaffolder form will load but render blank: the page returns HTTP 200
but has no data to display. This is only visible in browser developer tools.
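When this check runs from a script, the status code alone is usually enough to tell the failure modes apart. The sketch below maps a code to a likely cause; the function name and the message wording are illustrative assumptions, while the 000 case relies on curl reporting 000 through -w '%{http_code}' when it receives no HTTP response.

```shell
# Map the HTTP status of the scaffolder actions endpoint to a likely cause.
explain_actions_status() {
  case "$1" in
    200) echo "ok: the backend returned the action list" ;;
    401) echo "unauthenticated: the request carried no valid credentials" ;;
    000) echo "unreachable: no HTTP response (is the port-forward alive?)" ;;
    *)   echo "unexpected status $1: check the Backstage server logs" ;;
  esac
}
```

A typical call captures the code with curl -s -o /dev/null -w '%{http_code}' http://localhost:7010/api/scaffolder/v2/actions and passes the result to explain_actions_status.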

Catalog configuration: registering templates

The Backstage catalog applies security rules to what entity kinds are
accepted from each registered location. The default allow list for
repository-based locations does not include Template.

This is documented in
the catalog rules documentation
and the
adding templates documentation.

The registration pattern that works:

backstage:
  backstage:
    appConfig:
      catalog:
        locations:
          - type: url
            target: https://github.com/your-org/your-repo/blob/main/backstage/templates/your-template/template.yaml
            rules:
              - allow: [Template]

Without the rules: - allow: [Template] block, the entity is silently
rejected at ingestion time. The only signal is a warning in the Backstage
server logs — nothing appears in the UI.

How to diagnose catalog ingestion failures:

kubectl logs -n backstage deploy/backstage --tail=100 \
  | grep -i "warn\|error\|forbidden\|NotAllowedError"

Look for NotAllowedError: Forbidden: entity of kind Template is not allowed
from that location. If you see this, your rules block is missing or at the
wrong YAML nesting level.
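If several locations are registered, it is quicker to list which kinds were rejected than to read the raw log. The following sketch assumes each log record sits on a single line and contains the message text quoted above; rejected_kinds is a hypothetical helper name.

```shell
# Read Backstage server logs on stdin and print each entity kind that the
# catalog rejected, one per line, without duplicates.
rejected_kinds() {
  grep -o 'entity of kind [A-Za-z]* is not allowed' \
    | awk '{ print $4 }' \
    | sort -u
}
```

Feeding it the output of kubectl logs -n backstage deploy/backstage --tail=200 prints Template when the allow rule is missing and nothing at all when ingestion is clean.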

After updating the config, restart Backstage to re-ingest:

kubectl rollout restart deploy/backstage -n backstage
kubectl rollout status deploy/backstage -n backstage --timeout=300s

The template should appear in /create within 60 seconds of the pod
becoming ready.

You can validate your app-config.yaml structure using the Backstage CLI:

npx @backstage/cli config:check --config app-config.yaml

GitHub integration: the token secret

The scaffolder requires a GitHub token to open pull requests. The token
must be present as an environment variable in the running Backstage pod.

backstage:
  backstage:
    appConfig:
      integrations:
        github:
          - host: github.com
            token: ${GITHUB_TOKEN}

Store the token as a Kubernetes secret:

# Create or update the secret
kubectl create secret generic backstage-github-token \
  -n backstage \
  --from-literal=GITHUB_TOKEN="ghp_your_token_here" \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart to reload the environment variable
kubectl rollout restart deploy/backstage -n backstage

Critical: environment variables from Kubernetes secrets are injected at
pod start time. Updating the secret does not update the running pod. You
must restart the deployment after updating the secret for the new value
to take effect.

How to verify the token is present without exposing the value:

# Check character length — a valid GitHub token is 40+ characters
kubectl exec -n backstage deploy/backstage -- \
  sh -c 'echo "Token length: ${#GITHUB_TOKEN}"'

If this returns Token length: 0 or Token length: 17 (the length of a
placeholder like <YOUR_TOKEN_HERE>), the secret was not updated correctly
or the pod was not restarted after the update.
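The length check is easy to turn into a pass or fail signal for a script. The thresholds below are assumptions drawn from this guide (0 means the variable is unset, anything under 40 is shorter than the token length mentioned above), and classify_token_length is a hypothetical name.

```shell
# Classify a token length reported by the running container.
# Thresholds are illustrative, not an official GitHub rule.
classify_token_length() {
  length="$1"
  if [ "$length" -eq 0 ]; then
    echo "missing"
  elif [ "$length" -lt 40 ]; then
    echo "suspicious"
  else
    echo "plausible"
  fi
}
```

Pass it the number printed by the kubectl exec command above. A result of plausible only says the value has the right shape; it does not prove that GitHub accepts the token.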

A working minimal values.yaml for local development

This is the minimum configuration that produces a functioning Backstage
portal on a local Kubernetes cluster with the demo image:

# infrastructure/backstage/values.yaml
backstage:
  backstage:
    image:
      registry: ghcr.io
      repository: backstage/backstage
      tag: latest           # pin to a specific version in production

    appConfig:
      app:
        title: Your Platform
        baseUrl: http://localhost:7010

      backend:
        baseUrl: http://localhost:7010
        cors:
          origin: http://localhost:7010
        database:
          client: better-sqlite3
          connection: ':memory:'

      auth:
        providers:
          guest:
            dangerouslyAllowOutsideDevelopment: true

      integrations:
        github:
          - host: github.com
            token: ${GITHUB_TOKEN}

      catalog:
        locations:
          - type: url
            target: https://github.com/your-org/your-repo/blob/main/backstage/templates/your-template/template.yaml
            rules:
              - allow: [Template]

    startupProbe:
      httpGet:
        path: /healthcheck
        port: 7007
      initialDelaySeconds: 120
      periodSeconds: 10
      failureThreshold: 30

    readinessProbe:
      httpGet:
        path: /healthcheck
        port: 7007
      initialDelaySeconds: 120
      periodSeconds: 10
      failureThreshold: 3

    livenessProbe:
      httpGet:
        path: /healthcheck
        port: 7007
      initialDelaySeconds: 300
      periodSeconds: 30
      failureThreshold: 3

    resources:
      requests:
        cpu: 100m
        memory: 512Mi

    extraEnvVarsSecrets:
      - backstage-github-token

  postgresql:
    enabled: false    # using in-memory SQLite for local dev

Deploying and verifying

Install:

helm repo add backstage https://backstage.github.io/charts
helm repo update

kubectl create namespace backstage

# Create the GitHub token secret first
kubectl create secret generic backstage-github-token \
  -n backstage \
  --from-literal=GITHUB_TOKEN="your-token"

# Install
helm install backstage backstage/backstage \
  -n backstage \
  -f infrastructure/backstage/values.yaml

Watch the startup:

kubectl get pods -n backstage -w

Expect the pod to stay in Running 0/1 for at least 120 seconds: the startup
probe does not run its first check until initialDelaySeconds has elapsed. Do
not interpret this as a failure. The startup probe will not pass until the
application is ready.

Access the portal:

kubectl -n backstage port-forward svc/backstage 7010:7007
# Open: http://localhost:7010

Verify the backend is responding:

curl http://localhost:7010/healthcheck
# Expected: {"status":"ok"}

curl http://localhost:7010/api/scaffolder/v2/actions
# Expected: JSON list of available scaffolder actions

Verify catalog ingestion:

kubectl logs -n backstage deploy/backstage --tail=50 \
  | grep -i "processed\|warn\|error"

Look for Processed N entities with no NotAllowedError lines.

The production values profile

Separate your dev and prod configuration into two files. The difference is
significant enough that sharing a single file creates dangerous defaults
in production.

# infrastructure/backstage/values-prod.yaml
backstage:
  backstage:
    image:
      registry: ghcr.io
      repository: your-org/your-backstage-app   # your own image
      tag: "1.2.3"                               # pinned, never latest

    replicaCount: 2

    appConfig:
      app:
        baseUrl: https://backstage.your-domain.com
      backend:
        baseUrl: https://backstage.your-domain.com
        database:
          client: pg
          connection:
            host: ${POSTGRES_HOST}
            port: 5432
            user: ${POSTGRES_USER}
            password: ${POSTGRES_PASSWORD}
            database: backstage

      auth:
        environment: production
        providers:
          github:
            production:
              clientId: ${GITHUB_CLIENT_ID}
              clientSecret: ${GITHUB_CLIENT_SECRET}

    startupProbe:
      initialDelaySeconds: 60     # your own image starts faster
      failureThreshold: 18        # 3 minutes maximum

    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: "1"
        memory: 1Gi

Apply both files together:

helm upgrade backstage backstage/backstage \
  -n backstage \
  -f infrastructure/backstage/values.yaml \
  -f infrastructure/backstage/values-prod.yaml

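The prod values file overrides only what it specifies. Everything else
comes from the base values.yaml. Helm merges the two files key by key, so
settings that appear only in the base file, such as the guest auth provider,
remain in the rendered production config unless the prod file overrides them
explicitly.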

Diagnostic command reference

# Pod status and events
kubectl get pods -n backstage
kubectl describe pod -n backstage <pod-name>

# Application logs
kubectl logs -n backstage deploy/backstage --tail=100
kubectl logs -n backstage deploy/backstage --previous --tail=100

# Catalog ingestion errors
kubectl logs -n backstage deploy/backstage --tail=200 \
  | grep -i "warn\|error\|forbidden"

# Verify rendered Helm values
helm template backstage backstage/backstage \
  -f infrastructure/backstage/values.yaml \
  --namespace backstage \
  | grep -A 5 "startupProbe"

# Verify token is loaded in the running container
kubectl exec -n backstage deploy/backstage -- \
  sh -c 'echo "GITHUB_TOKEN length: ${#GITHUB_TOKEN}"'

# Health check endpoints
curl http://localhost:7010/healthcheck
curl 'http://localhost:7010/api/catalog/entities?limit=1'   # quoted so the shell does not glob the ?
curl http://localhost:7010/api/scaffolder/v2/actions

What this guide does not cover

This guide covers the Helm chart deployment path specifically. It does not
cover:

Building your own Backstage application — start with the official getting started guide and backstage new app for a real production portal
Writing custom plugins — see plugin development docs
TechDocs integration — covered separately in the TechDocs docs
Production ingress and TLS — specific to your cloud provider and ingress controller

9 Failures That Hit Me Building a Backstage Golden Path for KServe — Every Error, Every Fix

Sodiq Jimoh — Mon, 30 Mar 2026 23:12:36 +0000

Edit (Apr 2026): Updated title and added framing context based on community feedback.

Series context: This is Part 3 of building a production-hardened AI inference platform.

Part 1: Why Your KServe InferenceService Won't Become Ready

Part 2: 5 GitOps Failure Modes That Break KServe Deployments

Project repo: github.com/sodiq-code/neuroscale-platform

If you have ever deployed Backstage and stared at a blank /create page wondering what went wrong, this article is for you.

Most Backstage tutorials end at "the portal is running." This one starts there.

One important framing note before we begin: this article documents the path starting from the official Backstage Helm chart, not from backstage new app. If you're building a real Backstage application from source, some of these failures won't apply. But if you're doing what a lot of platform engineers do, reaching for helm install first, then every single one of these will.

This is a complete production failure log from implementing a Backstage Golden Path that deploys KServe model inference endpoints on Kubernetes. Nine distinct failures. Every one with exact error output, root cause, and the fix that worked.

The goal: a developer fills a Backstage form, a GitHub PR opens, the PR merges, ArgoCD deploys a KServe InferenceService, and the endpoint responds to predictions.
Getting there took nine failures across three days.

What I was trying to build

The Golden Path demo contract:

Backstage form → PR opened → merge → ArgoCD sync → InferenceService Ready=True → curl returns {"predictions":[1,1]}

Stack:

Backstage (Helm chart, self-hosted on k3d)
ArgoCD (GitOps reconciliation)
KServe (model inference endpoints)
GitHub (scaffolder target)
Kyverno (admission policies)

Failure 1: Template Not Visible in Catalog — Silent Rejection With No UI Error

Time lost: 30 minutes

Symptom

After adding the template file and registering it in infrastructure/backstage/values.yaml, the template did not appear in Backstage's /create page. No error was visible in the UI. The page simply showed an empty catalog.

Digging In

$ kubectl -n backstage logs deploy/neuroscale-backstage --tail=50
...
[backstage] warn  Failed to process location
  {"location":{"type":"url","target":"https://github.com/sodiq-code/
  neuroscale-platform/blob/main/backstage/templates/model-endpoint/
  template.yaml"},
  "error":"NotAllowedError: Forbidden: entity of kind Template
  is not allowed from that location"}

The error only appears in server logs. The UI shows nothing.

Root Cause

Backstage's catalog configuration allows only specific entity kinds from each registered location. The default allow list for repository-based locations does not include Template. Without an explicit allow: [Template] rule, entities of kind Template are silently rejected. This is security-by-default behavior — but the complete silence in the UI makes it look like a misconfiguration rather than a permission issue.

Fix

# infrastructure/backstage/values.yaml
backstage:
  backstage:
    appConfig:
      catalog:
        locations:
          - type: url
            target: https://github.com/sodiq-code/neuroscale-platform/blob/main/backstage/templates/model-endpoint/template.yaml
            rules:
              - allow: [Template]

After rolling out the updated Backstage deployment:

$ kubectl -n backstage rollout restart deploy/neuroscale-backstage
$ kubectl -n backstage rollout status deploy/neuroscale-backstage --timeout=300s
deployment "neuroscale-backstage" successfully rolled out

The template appeared in /create within 60 seconds.

Lesson

For a platform team deploying Backstage for internal users, this silent failure means developers see an empty template catalog and assume the platform is broken — not that a config rule is missing. Always check server logs, not just the UI, when Backstage catalog ingestion seems to fail.

Failure 2: Scaffolder /create Page Loads Blank — 401 on Actions API

Time lost: 45 minutes

Symptom

After the template was visible, clicking into it showed a blank form. The browser developer console revealed:

GET /api/scaffolder/v2/actions HTTP/1.1 401 Unauthorized
{"error":{"name":"AuthenticationError","message":"Missing credentials"}}

The page route returned HTTP 200 — the React app loaded — but the actions API returned 401, so the form had no data to render.

Root Cause

Backstage's new backend architecture (introduced in 1.x) adds a default authentication policy that rejects any request to a backend plugin without valid credentials. The scaffolder frontend calls the backend to list available actions. Because no auth provider was configured for local development, this call carried no credentials and was rejected. This is a breaking change from older Backstage versions where the actions endpoint was unauthenticated.

Fix

# infrastructure/backstage/values.yaml
backstage:
  backstage:
    appConfig:
      backend:
        auth:
          dangerouslyDisableDefaultAuthPolicy: true

Production note: dangerouslyDisableDefaultAuthPolicy: true is acceptable for local development only. For production, configure GitHub OAuth via values-prod.yaml with a proper sign-in policy. A safer option for local work is auth.providers.guest.dangerouslyAllowOutsideDevelopment: true, which keeps the auth subsystem active and provides a real user:default/guest identity rather than disabling auth entirely.

Lesson

An empty scaffolder form is indistinguishable from a misconfigured form to an end user. The 401 error is only visible in browser developer tools. This is the second failure in this series that generated zero visible error in the UI.

Failure 3: Frontend Crashes With Blank White Screen — Missing Required Config Key

Time lost: 20 minutes

Symptom

After the auth policy fix, reloading Backstage showed a blank white screen. The browser console:

Uncaught Error: Missing required config value at 'app.title' in 'app'
    at validateConfigSchema (config.esm.js:234)
    at BackstageApp.render (app.esm.js:891)

Root Cause

The Backstage frontend requires app.title to be present in the runtime configuration. This key was absent from the appConfig section of values.yaml, so the React application crashed on initialization before any content could render. Nothing in the documentation flags the key prominently as required on first boot.

Fix

# infrastructure/backstage/values.yaml
backstage:
  backstage:
    appConfig:
      app:
        title: NeuroScale Platform
        baseUrl: http://localhost:7010
      backend:
        baseUrl: http://localhost:7010
        cors:
          origin: http://localhost:7010

Note: app.baseUrl and backend.baseUrl also needed to match the port used for port-forwarding (7010), not the default 7007.

Lesson

A blank white screen with no network errors means the JavaScript runtime crashed before rendering. Always check the browser console — not just network requests — for Backstage frontend failures.
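A pre-deploy check for the keys that caused this crash can be sketched in a few lines. It is a heuristic that only looks for the key names title and baseUrl at the start of a line and does not verify where they are nested; missing_required_keys is a hypothetical helper, not something shipped with Backstage or the chart.

```shell
# Print each required appConfig key that does not appear in FILE.
# Heuristic: matches "key:" at the start of any line, ignoring nesting.
missing_required_keys() {
  file="$1"
  for key in title baseUrl; do
    grep -q "^ *${key}:" "$file" || echo "$key"
  done
}
```

Running missing_required_keys infrastructure/backstage/values.yaml prints nothing when both keys are present and one key name per line otherwise.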

Failure 4: Backstage CrashLoopBackOff — Helm Dependency Values Mis-Nesting

Time lost: 2 hours | Impact: Developer portal completely unavailable

Symptom

$ kubectl get pods -n backstage -w
NAME                                    READY   STATUS             RESTARTS   AGE
neuroscale-backstage-7d9f5b8c4-xqr2m   0/1     CrashLoopBackOff   8          12m

$ kubectl describe pod neuroscale-backstage-7d9f5b8c4-xqr2m -n backstage
...
Events:
  Warning  Unhealthy  30s  kubelet
    Startup probe failed: connect: connection refused

Root Cause

The Backstage Helm chart is a wrapper chart with backstage as a dependency. Configuration for the Backstage container itself must be nested under backstage.backstage.*, not backstage.*. The misconfiguration meant probe settings and resource requests were silently ignored, so Kubernetes used default probe timings — a 2-second initial delay — that were far too aggressive for Backstage's ~90-second startup time.

The pod was killed before it could become healthy, triggering CrashLoopBackOff.

Backstage requires:

startupProbe:
  initialDelaySeconds: 120
  failureThreshold: 30

The default gives it 2 seconds.

Fix

Correct the values hierarchy and harden probe timings:

# infrastructure/backstage/values.yaml
backstage:
  backstage:           # <-- must be nested here, not at backstage.*
    appConfig:
      ...
    startupProbe:
      initialDelaySeconds: 120
      failureThreshold: 30
    readinessProbe:
      initialDelaySeconds: 120
    livenessProbe:
      initialDelaySeconds: 300
    resources:
      requests:
        cpu: 100m
        memory: 512Mi

Lesson

If a Helm chart is a wrapper with a dependency, configuration for the dependency must be nested under the dependency's alias key. Values placed at the wrong hierarchy level are silently ignored — Kubernetes uses chart defaults, not your overrides. This incident directly motivated adding CI validation for rendered Helm manifests: if the final Deployment spec had been checked in CI, the wrong probe values would have been caught before deployment. Full RCA: infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md

Failure 5: PR Creation Fails — GitHub Token Secret Contains Placeholder Value

Time lost: 30 minutes

Symptom

After the portal was stable, the scaffolder's "Open pull request" step spun for 30 seconds then failed:

Error: Request failed with status 401: Bad credentials

No PR was created in GitHub.

Root Cause

The Kubernetes Secret neuroscale-backstage-secrets contained a placeholder GITHUB_TOKEN value, literally <YOUR_TOKEN_HERE>. The key was present, so the kubectl describe secret output looked healthy, but the value was not a valid token.

A secondary issue: after updating the secret with the correct token, the running pod did not pick up the change. Environment variables from Secrets are injected at pod start time, not dynamically. The pod needed a restart.

Fix

# Update the secret with a valid token
read -rs GITHUB_TOKEN   # -s hides the input, -r keeps backslashes literal
kubectl -n backstage create secret generic neuroscale-backstage-secrets \
  --from-literal=GITHUB_TOKEN="$GITHUB_TOKEN" \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart to reload env vars
kubectl -n backstage rollout restart deploy/neuroscale-backstage
kubectl -n backstage rollout status deploy/neuroscale-backstage --timeout=300s

# Verify token is present — check length, never the value
kubectl -n backstage exec deploy/neuroscale-backstage -- \
  sh -c 'echo ${#GITHUB_TOKEN} chars'

Lesson

kubectl describe secret shows the key exists and has bytes. It does not show whether the value is a valid token or a placeholder string. Always verify token presence by checking character length in the running container, never by reading the secret value directly.
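The stale-pod half of this failure can be detected by comparing the length stored in the Secret with the length seen inside the container. The helper below is a sketch: decoded_length is a hypothetical name, and it assumes a base64 binary that accepts -d (GNU coreutils and recent macOS do). It never prints the value itself.

```shell
# Read a base64-encoded secret value on stdin and print its decoded length
# in bytes, so the value itself never reaches the terminal.
decoded_length() {
  base64 -d | wc -c | tr -d ' '
}
```

To use it, pipe the output of kubectl -n backstage get secret neuroscale-backstage-secrets -o jsonpath='{.data.GITHUB_TOKEN}' into decoded_length and compare the number with the length reported by the kubectl exec check above. Different lengths mean the pod is still running with the old value and needs a restart; equal lengths are consistent with a fresh pod but do not prove the values match.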

Failure 6: PR Merged But ArgoCD Stays OutOfSync — Fix Not Committed to Git

Time lost: 1 hour of confusion

Symptom

The Backstage scaffolder created the PR correctly. CI passed. The PR was merged. ArgoCD detected the new application. But the child app immediately showed OutOfSync/Degraded:

$ kubectl -n argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync      Degraded

$ kubectl -n argocd describe application demo-iris-2
...
Message: Internal error occurred: failed calling webhook
  "inferenceservice.kserve-webhook-server.validator.webhook":
  no endpoints available for service "kserve-webhook-server-service"

Root Cause

This was the kube-rbac-proxy ImagePullBackOff failure from earlier — reappearing after a cluster restart. The fix had been applied with kubectl patch directly, not committed to Git. ArgoCD's selfHeal: true reverted it on the next sync cycle. The cluster restart exposed that the fix was never persisted.

Fix

# Verify the patch is in kustomization.yaml
grep -A2 patches infrastructure/serving-stack/kustomization.yaml

# Commit and push
git add infrastructure/serving-stack/
git commit -m "serving-stack: persist kube-rbac-proxy removal patch"
git push origin main

ArgoCD picked up the change within 3 minutes.

Lesson

Any fix applied with kubectl directly in a GitOps-managed cluster is temporary. The next sync cycle will revert it. Every fix must be committed to Git to survive. The PR-merged-but-nothing-deployed experience is the worst possible failure for a Golden Path demo — the developer did everything correctly and the platform failed silently.
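Since this kind of drift stays invisible until something restarts, a periodic scan of application status is a cheap early warning. The filter below is a sketch that assumes the three-column layout shown in the kubectl output above (name, sync status, health status); unhealthy_apps is a hypothetical name.

```shell
# Read `kubectl -n argocd get applications` output on stdin and print the
# name of every application that is not both Synced and Healthy.
unhealthy_apps() {
  awk 'NR > 1 && ($2 != "Synced" || $3 != "Healthy") { print $1 }'
}
```

Piping kubectl -n argocd get applications into unhealthy_apps prints nothing when every application is Synced and Healthy, which makes it easy to use as a gate in a smoke check.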

Failure 7: Inference Endpoint Returns HTTP 307 Redirect — Traefik Intercepts Before Kourier

Time lost: 45 minutes

Symptom

After demo-iris-2 became Ready=True, the inference test returned an unexpected redirect:

$ curl -v \
  -H 'Content-Type: application/json' \
  -d '{"instances":[[6.8,2.8,4.8,1.4]]}' \
  http://172.20.0.3/v1/models/demo-iris-2:predict

< HTTP/1.1 307 Temporary Redirect
< Location: https://172.20.0.3/v1/models/demo-iris-2:predict

Root Cause

k3d's built-in Traefik ingress was intercepting the request and applying an HTTP-to-HTTPS redirect before it reached Kourier. The request never reached the Knative routing layer at all.

Fix

Use direct pod port-forward for canonical local verification, bypassing Traefik and Kourier entirely:

# Find predictor pod and keep its name for the next step
POD="$(kubectl -n default get pods \
  -l serving.knative.dev/revision=demo-iris-2-predictor-00001 \
  -o jsonpath='{.items[0].metadata.name}')"

# Port-forward directly to the pod
kubectl -n default port-forward "pod/$POD" 18080:8080

# Predict
curl -sS \
  -H "Content-Type: application/json" \
  -d '{"instances":[[6.8,2.8,4.8,1.4],[6.0,3.4,4.5,1.6]]}' \
  http://127.0.0.1:18080/v1/models/demo-iris-2:predict

{"predictions":[1,1]}

Lesson

A healthy inference endpoint can look completely broken if your test path hits an unexpected intermediary. For local k3d clusters, disable Traefik at cluster creation:

k3d cluster create neuroscale \
  --k3s-arg "--disable=traefik@server:0"
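On clusters where Traefik is still running, the interception can be spotted from the response headers alone. The function below is a minimal sketch that reads headers on stdin; detect_redirect is a hypothetical name, and it assumes a single response with the status line first.

```shell
# Read HTTP response headers on stdin and report whether an intermediary
# answered with a redirect.
detect_redirect() {
  awk '
    { sub(/\r$/, "") }                         # strip CR from header lines
    NR == 1 { status = $2 }                    # e.g. "HTTP/1.1 307 Temporary Redirect"
    tolower($1) == "location:" { target = $2 }
    END {
      if (status ~ /^3/) print "redirect " status " -> " target
      else print "no redirect (status " status ")"
    }
  '
}
```

Adding -s -o /dev/null -D - to the failing curl command from the symptom above sends only the headers to stdout, ready to pipe into detect_redirect.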

Failure 8: Catalog Ingestion Silently Rejects Template After Values Update

Time lost: 20 minutes

Symptom

After updating values.yaml and rolling out a new Backstage deployment, the template disappeared from /create again — the same symptom as Failure 1, but after it had been working.

Root Cause

The rolling update caused a brief period where the new pod was starting and the old pod was terminating. During this window, the catalog re-ingested all locations. The updated values.yaml had a YAML indentation error in the catalog.locations block, which caused the allow rule for Template to be silently dropped during parsing.

Fix

# Check catalog ingestion in the new pod logs immediately after rollout
kubectl -n backstage logs deploy/neuroscale-backstage --tail=100 | \
  grep -i "warn\|error\|fail\|forbidden"

Fixed the YAML indentation:

# Correct indentation
catalog:
  locations:
    - type: url
      target: https://github.com/...
      rules:
        - allow: [Template]   # must be under rules:, not misaligned

Lesson

YAML indentation errors in Backstage config values that still parse as valid YAML are not surfaced as errors; the misplaced field is simply ignored. After every Backstage rollout that touches appConfig, immediately verify catalog ingestion by checking server logs and confirming the template appears in /create.

Failure 9: Scaffolder Task Hangs Then Fails — Port-Forward Session Died Mid-Task

Time lost: 15 minutes

Symptom

The scaffolder task started successfully, the progress spinner ran for 60 seconds, and then the task failed with a network error. The Backstage UI showed the task as failed with no specific error message. A second attempt worked immediately.

Root Cause

The kubectl port-forward session for Backstage had silently died between opening the browser and submitting the scaffolder form. The React app was loaded from cache — so the page appeared fully functional — but all API calls were failing because the backend was unreachable. The scaffolder task started, sent the first API call, and failed on the network layer.

Fix

# Before running any Backstage scaffolder task, verify the port-forward is alive
curl -s "http://localhost:7010/api/catalog/entities?limit=1" | head -c 100

# If it returns nothing or errors, restart the port-forward
kubectl -n backstage port-forward svc/neuroscale-backstage 7010:7007

Alternatively, use scripts/port-forward-all.sh from the repository, which starts all required port-forwards as background processes with clean shutdown handling.

Lesson

A React app loaded from browser cache looks fully functional even when the backend is unreachable. Always verify the backend API is responding before running a scaffolder task, not just that the UI loaded.
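
The same liveness check can be scripted so it runs before every scaffolder task. This is a minimal sketch using only the Python standard library. The function name is my own, and treating any 2xx response as "alive" is an assumption that may be too loose if your backend requires auth.

```python
import http.client
import urllib.error
import urllib.request


def backend_is_alive(url: str, timeout: float = 3.0) -> bool:
    """Return True only if the URL answers with an HTTP 2xx status.

    A dead port-forward shows up as a refused connection, which is
    reported here as False instead of raising.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False
```

Calling backend_is_alive("http://localhost:7010/api/catalog/entities?limit=1") before submitting the form turns the silent failure into an explicit False.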

What the Golden Path Actually Proves After Nine Failures

Final working state:

$ kubectl -n default get inferenceservice demo-iris-2
NAME          URL                                       READY   AGE
demo-iris-2   http://demo-iris-2.default.example.com   True    25m

$ curl -sS \
  -H "Content-Type: application/json" \
  -d '{"instances":[[6.8,2.8,4.8,1.4],[6.0,3.4,4.5,1.6]]}' \
  http://127.0.0.1:18080/v1/models/demo-iris-2:predict

{"predictions":[1,1]}

The Golden Path demo is a chain of seven moving parts: Backstage config, GitHub auth, ArgoCD app-of-apps, KServe controller, Knative routing, Kourier gateway, and the predictor pod. In production, any link in that chain can fail independently.

The debugging process for these nine failures maps directly to what a platform SRE does on an on-call shift.

Debugging Commands Reference

# Backstage catalog ingestion errors
kubectl -n backstage logs deploy/neuroscale-backstage | \
  grep -i "warn\|error\|fail\|forbidden"

# Backstage runtime config
kubectl -n backstage describe configmap neuroscale-backstage-app-config

# Verify GitHub token is present (check length only)
kubectl -n backstage exec deploy/neuroscale-backstage -- \
  sh -c 'echo ${#GITHUB_TOKEN} chars'

# ArgoCD child app sync status
kubectl -n argocd get applications
kubectl -n argocd describe application demo-iris-2

# InferenceService conditions
kubectl -n default describe inferenceservice demo-iris-2

# Admission webhook endpoints
kubectl -n kserve get endpoints kserve-webhook-server-service

The Pattern Across All Nine Failures

Looking back at the nine failures, they fall into three categories:

Silent failures (no UI error, log only):
Failures 1, 2, 8 — catalog ingestion rejections and auth failures that show nothing in the UI. Rule: always check server logs, not just the browser.

Configuration hierarchy failures:
Failures 3, 4 — missing required keys and wrong Helm nesting. Rule: validate rendered manifests in CI before applying them.

State and dependency failures:
Failures 5, 6, 7, 9 — stale secrets, unreversioned fixes, intercepting proxies, dead sessions. Rule: verify the complete dependency chain before debugging the thing that appears broken.

Beyond InferenceService Readiness: 5 GitOps Failure Modes That Break KServe Deployments

Sodiq Jimoh — Mon, 30 Mar 2026 22:52:16 +0000

A sequel to my KServe readiness post — five GitOps control-plane failure modes with exact terminal output, diagnostics, and repeatable fixes for ArgoCD + KServe stacks.

This post is a follow-up to my earlier KServe piece on endpoint readiness:

👉 Why Your KServe InferenceService Won't Become Ready: Four Production Failures and Fixes

That article focused on why an InferenceService may not become Ready.

This one zooms out to a broader question:

What breaks when the GitOps control plane itself is unstable?

Most GitOps + AI serving tutorials still focus on the happy path — install ArgoCD, apply KServe, deploy InferenceService, done. But in real platform work, the happy path is the easy part.

The hard part is when your app is OutOfSync, the webhook has no endpoints, and everything looks healthy except the thing you actually need.

This post covers the five failure modes that repeatedly broke KServe deployments in a real production-grade platform build, with exact terminal output, root causes, and the fixes that worked.

All failures come from hands-on implementation work documented here:
Project repo: github.com/sodiq-code/neuroscale-platform

The platform context

Stack:

ArgoCD — GitOps reconciliation
KServe — model serving (InferenceService, runtimes)
Knative + Kourier — serving networking
Kyverno — policy guardrails
Backstage — self-service PR generation

GitOps root app:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: neuroscale-infrastructure
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/sodiq-code/neuroscale-platform.git
    targetRevision: main
    path: infrastructure/apps

Failure Mode 1: Webhook Has No Endpoints — Sync Fails Cluster-Wide

Time lost: ~1 hour | Impact: All InferenceService operations blocked

Symptom

ArgoCD syncs child apps and hits this:

$ kubectl -n argocd describe application ai-model-alpha
...
Message: admission webhook
  "inferenceservice.kserve-webhook-server.validator.webhook"
  denied the request: Internal error occurred:
  no endpoints available for service "kserve-webhook-server-service"

Meanwhile the KServe controller pod shows only 1 of 2 containers ready:

$ kubectl -n kserve get pods
NAME                                        READY   STATUS
kserve-controller-manager-8d7c5b9f4-xr2lm  1/2     Running

$ kubectl -n kserve describe pod kserve-controller-manager-8d7c5b9f4-xr2lm
...
  kube-rbac-proxy:
    State:   Waiting
    Reason:  ImagePullBackOff
    Image:   gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
Events:
  Warning  Failed  kubelet
    Failed to pull image: unexpected status code 403 Forbidden

Root Cause

The kube-rbac-proxy sidecar inside kserve-controller-manager was pulling from gcr.io/kubebuilder/, a registry that restricted access in late 2025. The manager container was healthy, but a pod is only listed in a Service's endpoints once all of its containers are ready. With the sidecar stuck in ImagePullBackOff, the pod stayed at 1/2 and kserve-webhook-server-service had no endpoints. Result: every InferenceService apply or update was blocked cluster-wide.

Fix

Remove the sidecar via Kustomize strategic merge patch:

# infrastructure/serving-stack/patches/
#   kserve-controller-kube-rbac-proxy-image.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kserve-controller-manager
  namespace: kserve
spec:
  template:
    spec:
      containers:
        - name: kube-rbac-proxy
          $patch: delete

Verify webhook endpoints are restored after re-sync:

$ kubectl -n kserve get endpoints kserve-webhook-server-service
NAME                           ENDPOINTS          AGE
kserve-webhook-server-service  10.42.0.23:9443    45s

Lesson

When webhook endpoints are missing, your app YAML is never the real problem. Diagnose the controller first. An external registry access change can silently kill your entire admission layer cluster-wide with no obvious error in the app itself.
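
"Diagnose the controller first" can be reduced to one question: which container in the controller pod is not ready, and why? The sketch below answers it from the pod's JSON, as returned by kubectl -n kserve get pod <pod-name> -o json. The helper name is hypothetical and the sample dict is trimmed to the fields it reads.

```python
def unready_containers(pod: dict) -> list[tuple[str, str]]:
    """Return (container name, reason) for every container that is not ready."""
    results = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        if status.get("ready"):
            continue
        state = status.get("state", {})
        waiting = state.get("waiting") or {}
        terminated = state.get("terminated") or {}
        reason = waiting.get("reason") or terminated.get("reason") or "Unknown"
        results.append((status.get("name", "<unnamed>"), reason))
    return results


# Trimmed sample mirroring the describe output above (not a full Pod object).
sample_pod = {"status": {"containerStatuses": [
    {"name": "manager", "ready": True, "state": {"running": {}}},
    {"name": "kube-rbac-proxy", "ready": False,
     "state": {"waiting": {"reason": "ImagePullBackOff"}}},
]}}

print(unready_containers(sample_pod))  # [('kube-rbac-proxy', 'ImagePullBackOff')]
```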

Failure Mode 2: CRD Deleted by a Misapplied Patch — All Endpoints Gone Instantly

Time lost: 4 minutes recovery | Impact: SEV-1 equivalent — all InferenceServices deleted

Symptom

All InferenceService objects disappeared silently:

$ kubectl -n default get inferenceservices
No resources found in default namespace.

$ kubectl -n argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync      Missing

Root Cause

A Kustomize patch file named remove-inferenceservice-crd.yaml was mistakenly applied directly with kubectl apply -f instead of being used as a build-time patch inside kustomization.yaml. The file contained a $patch: delete directive:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: inferenceservices.serving.kserve.io
$patch: delete

When applied directly, it deleted the actual CRD from Kubernetes. When a CRD is deleted, Kubernetes immediately garbage-collects every custom resource of that type. Every InferenceService was gone within seconds.

Fix

Restore the CRD immediately:

kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.1/kserve.yaml

kubectl wait crd/inferenceservices.serving.kserve.io \
  --for=condition=Established --timeout=60s

kubectl -n argocd patch application demo-iris-2 \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

Lesson

$patch: delete in a Kustomize file is a build-time instruction — it tells kustomize build to omit that resource from output. It must never be applied directly with kubectl apply -f. Ambiguous filenames like remove-inferenceservice-crd.yaml are dangerous footguns. In a production cluster with 50 deployed models this would be a full SEV-1.
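
A cheap guard is to refuse to apply any file that carries the directive. The following is a sketch of such a check. The function name and regex are mine, and it only looks for the literal directive on its own line, so treat it as a first line of defense and not a full YAML analysis.

```python
import re

# Matches a line that consists of the `$patch: delete` directive,
# optionally indented, list-prefixed, or followed by a comment.
PATCH_DELETE = re.compile(r"^[ \t-]*\$patch:[ \t]*delete[ \t]*(#.*)?$", re.MULTILINE)


def contains_patch_delete(manifest_text: str) -> bool:
    """Return True if the text contains a Kustomize `$patch: delete` directive."""
    return PATCH_DELETE.search(manifest_text) is not None


crd_patch = """apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: inferenceservices.serving.kserve.io
$patch: delete
"""
plain_manifest = """apiVersion: v1
kind: ConfigMap
metadata:
  name: example
"""

print(contains_patch_delete(crd_patch))       # True
print(contains_patch_delete(plain_manifest))  # False
```

Wrapping kubectl apply -f in a script that runs this check first would have stopped the misapplied file before it reached the API server.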

⚠️ Rule: Any file containing $patch: delete must only ever be referenced inside a kustomization.yaml patches block, never applied directly.

Failure Mode 3: Permanent OutOfSync Due to Label Key Mismatch

Time lost: 2 weeks undetected | Impact: CI was green while policy enforcement was silently broken

Symptom

A PR is merged, ArgoCD syncs, but the InferenceService stays OutOfSync/Degraded:

$ kubectl -n argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync      Degraded

Kyverno denies the resource at admission:

Error from server: error when creating "STDIN":
  admission webhook "clusterpolice.kyverno.svc" denied the request:
  resource InferenceService/default/test-model was blocked due to the following policies
  require-standard-labels-inferenceservice:
    check-owner-and-cost-center-on-isvc: 'validation error:
    InferenceService resources must set metadata.labels.owner and
    metadata.labels.cost-center.
    rule check-owner-and-cost-center-on-isvc failed at path
    /metadata/labels/cost-center/'

But the label is present in the manifest:

$ kubectl -n default get inferenceservice demo-iris-2 \
    -o jsonpath='{.metadata.labels}' | python3 -m json.tool
{
    "owner": "platform-team",
    "costCenter": "ai-platform"
}

Root Cause

costCenter (camelCase) and cost-center (kebab-case) are completely different Kubernetes label keys. The Backstage template skeleton was generating costCenter, while the Kyverno policy required cost-center. CI stayed green on that manifest, so the mismatch only surfaced at admission time.

Additionally, kyverno-cli apply exits with code 0 even when policy violations are found. CI was checking $? rather than ${PIPESTATUS[0]}, so the CI step appeared green while enforcement was completely broken for two weeks.

Fix

Standardize on kebab-case throughout (Kubernetes convention):

# Backstage template skeleton
# apps/${{ values.name }}/inference-service.yaml
labels:
  owner: platform-team
  cost-center: ai-platform   # was: costCenter
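
A lint step can flag this class of mistake before the policy engine ever sees it. The sketch below is a hypothetical helper, not something from the repo. It reports required label keys that are missing but have a lookalike that differs only in casing or separators.

```python
import re


def normalize_key(key: str) -> str:
    """Lowercase a label key and drop '-', '_' and '.' so styles compare equal."""
    return re.sub(r"[-_.]", "", key).lower()


def near_miss_labels(labels: dict, required: list[str]) -> dict[str, str]:
    """Map each missing required key to a present key that differs only in style."""
    present = {normalize_key(key): key for key in labels}
    mismatches = {}
    for key in required:
        if key in labels:
            continue
        lookalike = present.get(normalize_key(key))
        if lookalike is not None:
            mismatches[key] = lookalike
    return mismatches


labels = {"owner": "platform-team", "costCenter": "ai-platform"}
print(near_miss_labels(labels, ["owner", "cost-center"]))
# {'cost-center': 'costCenter'}
```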

Fix the CI Kyverno check to catch actual violations:

set +e
docker run --rm -v "$PWD:/work" -w /work ghcr.io/kyverno/kyverno-cli:v1.12.5 \
  apply infrastructure/kyverno/policies/*.yaml \
  --resource "${app_files[@]}" \
  2>&1 | tee /tmp/kyverno-output.txt
kyverno_exit="${PIPESTATUS[0]}"
set -e

if [ "${kyverno_exit}" -ne 0 ] \
    || grep -qE "^FAIL" /tmp/kyverno-output.txt \
    || grep -qE "fail: [1-9][0-9]*" /tmp/kyverno-output.txt; then
  echo "Kyverno policy violations detected. Failing CI."
  exit 1
fi

Lesson

In a pipeline, $? holds the exit code of the last command (tee here), not kyverno. ${PIPESTATUS[0]} captures kyverno's actual exit code. "Guardrails exist" and "guardrails enforce" are different states. The most dangerous failure mode for a policy system is a silent false green: everything looks green while nothing is actually being enforced.
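
The dual check in the CI step is easier to unit-test when the decision is pulled out into a pure function. Here is a sketch of that decision in Python, using the same two patterns as the grep calls above. The sample summary strings in the usage lines are illustrative, not verbatim kyverno-cli output.

```python
import re

FAIL_LINE = re.compile(r"^FAIL", re.MULTILINE)   # same pattern as grep -qE "^FAIL"
FAIL_COUNT = re.compile(r"fail: [1-9][0-9]*")    # a non-zero fail counter


def violations_detected(exit_code: int, output: str) -> bool:
    """Fail if the tool exited non-zero OR its output reports any failure."""
    return (
        exit_code != 0
        or FAIL_LINE.search(output) is not None
        or FAIL_COUNT.search(output) is not None
    )


print(violations_detected(0, "pass: 3, fail: 0"))  # False
print(violations_detected(0, "pass: 2, fail: 1"))  # True, despite exit code 0
print(violations_detected(1, ""))                  # True
```

The second case is the one that matters: an exit code of 0 alone would have reported green.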

Failure Mode 4: Kyverno Install Breaks ArgoCD Reconciliation Loop

Time lost: 2–5 minutes per cluster | Impact: All ArgoCD apps enter Unknown state

Symptom

After adding Kyverno to the platform, previously healthy apps enter Unknown state:

$ kubectl -n argocd get applications
NAME                       SYNC STATUS   HEALTH STATUS
neuroscale-infrastructure  Synced         Healthy
serving-stack              Unknown        Unknown    # was Healthy 10 minutes ago
policy-guardrails          Synced         Healthy

$ kubectl -n argocd describe application serving-stack
...
Message: rpc error: code = Unavailable desc = connection refused

Root Cause

Kyverno installs its own ValidatingWebhookConfiguration and MutatingWebhookConfiguration during install. While Kyverno is initializing, the webhook configurations are registered but point to endpoints that do not exist yet. During this window, any kubectl apply operation — including ArgoCD's sync reconciliation loop — times out waiting for a response from a not-yet-running webhook server. This cascades into the ArgoCD repo-server losing its connection.

Fix

Add a Kyverno webhookAnnotations ConfigMap patch to suppress automatic webhook registration during the initialization window:

# infrastructure/kyverno/kustomization.yaml
patches:
  - target:
      kind: ConfigMap
      name: kyverno
    patch: |-
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: kyverno
        namespace: kyverno
      data:
        webhookAnnotations: "{}"

After Kyverno reaches Running state, force a hard refresh:

kubectl -n argocd patch application serving-stack \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

Lesson

Adding a policy engine to an existing cluster disrupts all other ArgoCD-managed applications during the install window. In production, this requires a maintenance window or a canary install strategy. Kyverno must be fully healthy before any other component syncs.
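
"Fully healthy" needs a concrete definition before it can gate anything. One possible definition, sketched below, is that a Deployment has at least as many ready replicas as it asked for. The input is the JSON from kubectl get deploy <name> -o json. The function name is hypothetical, and the check says nothing about whether the webhook itself is answering.

```python
def deployment_is_ready(deployment: dict) -> bool:
    """Return True when readyReplicas has reached the desired replica count."""
    desired = deployment.get("spec", {}).get("replicas", 1)
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    return desired > 0 and ready >= desired


starting = {"spec": {"replicas": 1}, "status": {}}  # pod still initializing
running = {"spec": {"replicas": 1}, "status": {"readyReplicas": 1}}

print(deployment_is_ready(starting))  # False
print(deployment_is_ready(running))   # True
```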

Failure Mode 5: Stale Admission Webhook Blocks All Resource Creation

Time lost: 30+ minutes | Impact: All Deployments in the namespace silently blocked

Symptom

After fixing the repo-server, apps sync but Deployments never appear:

$ kubectl get applications -n argocd
NAME                       SYNC STATUS   HEALTH STATUS
neuroscale-infrastructure  Synced         Healthy
test-app                   Synced         Progressing   # stuck

$ kubectl get deploy -n default
No resources found in default namespace.

ArgoCD shows the Deployment as "synced" but it does not exist — a contradiction. Checking conditions:

$ kubectl -n argocd get application test-app -o yaml | grep -A 20 conditions
  conditions:
  - message: 'Failed sync attempt: one or more objects failed to apply,
      reason: Internal error occurred: failed calling webhook
      "validate.nginx.ingress.kubernetes.io":
      failed to call webhook: Post
      "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/...":
      dial tcp 10.96.x.x:443: connect: connection refused'
    type: SyncError

Root Cause

A ValidatingWebhookConfiguration from a previous cluster experiment was still registered but pointing to a service that no longer existed. Kubernetes admission webhooks are cluster-scoped. The stale ingress-nginx webhook was intercepting every resource creation attempt and failing them — the error only appears in ArgoCD events, not on the Deployment itself.

Fix

# Discover stale webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Delete the stale one
kubectl delete validatingwebhookconfiguration ingress-nginx-admission

# Force ArgoCD to retry
kubectl -n argocd patch application test-app \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

After deletion:

$ kubectl get deploy -n default
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
nginx-test    1/1     1            1           23s

Lesson

A stale webhook from a previous workload can silently block resource creation for hours without any obvious error message. The admission error appears only in the ArgoCD Application's conditions and events, not on the resource itself. Always check for stale webhooks before blaming manifests.
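
Finding stale webhooks by eye gets tedious once a cluster has more than a few. The sketch below cross-references the JSON from kubectl get validatingwebhookconfigurations -o json against the set of Services that actually exist. The helper name is mine, the sample data is trimmed to the fields it reads, and URL-based webhooks are deliberately skipped.

```python
def stale_webhooks(webhook_configs: dict, existing_services: set[tuple[str, str]]) -> list[str]:
    """Return 'config/webhook' names whose backing Service does not exist."""
    stale = []
    for config in webhook_configs.get("items", []):
        config_name = config.get("metadata", {}).get("name", "<unnamed>")
        for hook in config.get("webhooks", []):
            service = hook.get("clientConfig", {}).get("service")
            if not service:
                continue  # URL-based webhook: out of scope for this sketch
            reference = (service.get("namespace"), service.get("name"))
            if reference not in existing_services:
                stale.append(f"{config_name}/{hook.get('name', '<unnamed>')}")
    return stale


# Trimmed sample shaped like a kubectl List response.
configs = {"items": [{
    "metadata": {"name": "ingress-nginx-admission"},
    "webhooks": [{
        "name": "validate.nginx.ingress.kubernetes.io",
        "clientConfig": {"service": {
            "namespace": "ingress-nginx",
            "name": "ingress-nginx-controller-admission",
        }},
    }],
}]}
services = {("kserve", "kserve-webhook-server-service")}

print(stale_webhooks(configs, services))
# ['ingress-nginx-admission/validate.nginx.ingress.kubernetes.io']
```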

The Triage Sequence That Saves Hours

When a KServe app is failing in ArgoCD, run these checks in this exact order before touching any manifest:

# 1. Environment gate — if this fails, stop and fix environment first
kubectl get nodes
kubectl -n argocd get applications

# 2. Control-plane health
kubectl -n kserve get deploy,pods,svc,endpoints
kubectl get crd | grep serving.kserve.io

# 3. Controller logs
kubectl -n kserve logs deploy/kserve-controller-manager --tail=100

# 4. Webhook availability
kubectl -n kserve get endpoints kserve-webhook-server-service

# 5. Stale webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# 6. App-level sync error detail
kubectl -n argocd get application <app-name> -o yaml | grep -A 20 conditions

Only after every step above passes should you edit app manifests.
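
The "stop at the first failing step" rule is simple to encode. The sketch below is one way to do it, with two loud assumptions: the step names are my own labels, and a zero exit code from kubectl only proves the command ran, not that its output is healthy. A real gate would inspect the output as well. The demo uses a fake runner so it executes without a cluster.

```python
import subprocess
from typing import Callable, Optional

# Ordered (step name, command) pairs taken from the sequence above.
TRIAGE_STEPS = [
    ("environment", ["kubectl", "get", "nodes"]),
    ("control-plane", ["kubectl", "-n", "kserve", "get", "deploy,pods,svc,endpoints"]),
    ("controller-logs", ["kubectl", "-n", "kserve", "logs",
                         "deploy/kserve-controller-manager", "--tail=100"]),
    ("webhook", ["kubectl", "-n", "kserve", "get", "endpoints",
                 "kserve-webhook-server-service"]),
    ("stale-webhooks", ["kubectl", "get", "validatingwebhookconfigurations"]),
]


def run_command(cmd: list[str]) -> bool:
    """Run one command; treat exit code 0 as 'step passed' (a simplification)."""
    try:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    except FileNotFoundError:  # kubectl is not installed on this machine
        return False


def first_failing_step(steps, runner: Callable[[list[str]], bool] = run_command) -> Optional[str]:
    """Return the name of the first step whose command fails, or None."""
    for name, cmd in steps:
        if not runner(cmd):
            return name
    return None


# Fake runner: every command succeeds except the webhook endpoint lookup.
fake = lambda cmd: "kserve-webhook-server-service" not in cmd
print(first_failing_step(TRIAGE_STEPS, runner=fake))  # webhook
```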

Why This Matters for Platform Teams

A platform is credible when it supports both:

Self-service delivery — the Golden Path works
Self-service recovery — failures are understandable and fixable without a platform expert

Most teams build the first and postpone the second. That creates operational debt fast.

The fix is not more dashboards. It is better failure-model documentation, tighter GitOps guardrails, and the discipline to document what breaks — not just what works.

A platform is not "done" when the happy path works. It's done when the failure path is understandable and recoverable.

What I Would Improve Next

Pre-merge CI assertions for probe and resource fields in rendered manifests
Explicit dependency ordering using ArgoCD sync waves to prevent Kyverno install disruption
Conformance checks for Helm dependency values nesting to catch silently ignored overrides
Policy test fixtures that verify both pass and fail cases in CI

Why Your KServe InferenceService Won't Become Ready: Four Production Failures and Fixes

Sodiq Jimoh — Mon, 30 Mar 2026 03:37:24 +0000

A practitioner's account of the errors the KServe getting-started documentation doesn't tell you about — with exact terminal output, root causes, and working Kustomize patches.

This article documents four production failures I encountered while deploying KServe on a local k3d cluster as part of building NeuroScale — a self-service AI inference platform. None of these failures appear in the official KServe getting-started documentation. If you are deploying KServe without Istio, this will save you several hours of debugging.

What I Was Building

NeuroScale is a self-service AI inference platform on Kubernetes. The goal was simple: one InferenceService named sklearn-iris reaches Ready=True and responds to a prediction request.

The install had to be GitOps-managed via ArgoCD — not "I ran some scripts." Getting there took two days and four distinct failures. Here is every one of them.

Stack: k3d (local Kubernetes) · KServe 0.12.1 · ArgoCD · Kourier (no Istio) · Knative Serving

📝 Author's Note: This article was originally documented in the NeuroScale platform repository.
File: docs/REALITY_CHECK_MILESTONE_2_KSERVE_SERVING.md
Repo: github.com/sodiq-code/neuroscale-platform

Failure 1: KServe InferenceService Stuck Not Ready — Istio vs Kourier Ingress Mismatch Causes ReconcileError Loop

Time lost: ~3 hours

Symptom

After applying the KServe installation via ArgoCD (serving-stack app), the InferenceService was created but never became Ready:

$ kubectl -n default get inferenceservice sklearn-iris
NAME           URL   READY   PREV   LATEST   AGE
sklearn-iris         False          100      8m

# READY=False with no URL = KServe controller did not complete ingress setup.
# No Knative Route was created. No external URL was assigned.

Digging In

$ kubectl -n default describe inferenceservice sklearn-iris
...
Status:
  Conditions:
    Message: Failed to reconcile ingress
    Reason:  ReconcileError
    Status:  False
    Type:    IngressReady

$ kubectl -n kserve logs deploy/kserve-controller-manager --tail=50
...
ERROR controller.inferenceservice Failed to reconcile ingress
  {"error": "virtual service not found: sklearn-iris.default.svc.cluster.local"}

The error referenced a virtual service — that is an Istio concept. But we were running Kourier. The KServe controller was attempting to create an Istio VirtualService in a cluster that had no Istio control plane.

Root Cause: Default KServe Ingress Mode Assumes Istio

KServe's default inferenceservice-config ConfigMap expects Istio as the ingress provider. It sets ingressClassName: istio and the key disableIstioVirtualHost defaults to false. When Istio is absent, the controller enters an error loop trying to create resources that will never exist.

Setting disableIstioVirtualHost: true tells KServe to skip Istio and fall back to Knative route objects that Kourier can handle.

Why Kourier instead of Istio: Istio adds ~1 GB of memory overhead. On a local k3d cluster shared with Docker Desktop, Backstage, and the KServe controller, that exhausts available RAM. Kourier's entire footprint is under 200 MB.

The Fix: ConfigMap Patch in serving-stack

# infrastructure/serving-stack/patches/inferenceservice-config-ingress.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: inferenceservice-config
  namespace: kserve
data:
  ingress: |-
    {
      "ingressGateway": "knative-serving/knative-ingress-gateway",
      "ingressDomain": "example.com",
      "ingressClassName": "istio",
      "urlScheme": "http",
      "disableIstioVirtualHost": true,
      "disableIngressCreation": false
    }

After this patch was applied and the KServe controller restarted:

$ kubectl -n default get inferenceservice sklearn-iris
NAME           URL                                       READY   AGE
sklearn-iris   http://sklearn-iris.default.example.com   True    2m

Business impact: This failure cost approximately 3 hours. The KServe documentation does not prominently state that the default configuration requires Istio. The error message "virtual service not found" is Istio-specific vocabulary that only makes sense if you already know Istio is the default — a classic undocumented assumption in infrastructure tooling.
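
Since the whole failure hinges on one boolean inside a JSON string, it is worth asserting on it directly. The sketch below assumes you have already extracted the ingress value from the inferenceservice-config ConfigMap as a string. The function name is hypothetical.

```python
import json


def istio_virtual_host_disabled(ingress_value: str) -> bool:
    """Return True if the ConfigMap's `ingress` JSON sets disableIstioVirtualHost to true."""
    ingress = json.loads(ingress_value)
    return ingress.get("disableIstioVirtualHost") is True


patched = """{
  "ingressGateway": "knative-serving/knative-ingress-gateway",
  "ingressDomain": "example.com",
  "ingressClassName": "istio",
  "urlScheme": "http",
  "disableIstioVirtualHost": true,
  "disableIngressCreation": false
}"""
unpatched = '{"ingressClassName": "istio"}'  # key absent, so the default applies

print(istio_virtual_host_disabled(patched))    # True
print(istio_virtual_host_disabled(unpatched))  # False
```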

Failure 2: ArgoCD Serving-Stack Sync Fails — Duplicate Knative CRD Exceeds 256 KB Annotation Size Limit

Time lost: ~30 minutes

Symptom

$ kubectl -n argocd get application serving-stack
NAME            SYNC STATUS   HEALTH STATUS
serving-stack   OutOfSync     Degraded

$ kubectl -n argocd describe application serving-stack
...
Message: CustomResourceDefinition "services.serving.knative.dev"
  is invalid: metadata.annotations:
  Too long: may not be more than 262144 bytes

Root Cause

With client-side apply, the full manifest is stored in the kubectl.kubernetes.io/last-applied-configuration annotation. For large CRDs, that pushes the object's annotations past Kubernetes' 256 KB (262144-byte) limit. The Knative CRD is approximately 400 KB as a YAML object.

A rendering overlap compounded the issue: the kserve.yaml bundle already includes its own version of the Knative Serving CRDs, and we were also referencing serving-core.yaml directly. This created two attempts to manage the same CRDs, causing comparison instability.
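
The arithmetic is easy to check ahead of time. The sketch below estimates the size of the last-applied annotation by serializing a manifest to compact JSON and comparing it with the 262144-byte figure from the error message. It is an approximation: the serialization kubectl actually writes may differ slightly, and the limit covers all annotations on the object, not just this one.

```python
import json

ANNOTATION_LIMIT_BYTES = 256 * 1024  # 262144, the figure in the error message


def last_applied_size(manifest: dict) -> int:
    """Approximate the byte size of the manifest when stored as a JSON annotation."""
    return len(json.dumps(manifest, separators=(",", ":")).encode("utf-8"))


def exceeds_annotation_limit(manifest: dict) -> bool:
    """Return True if client-side apply would likely hit the annotation limit."""
    return last_applied_size(manifest) > ANNOTATION_LIMIT_BYTES


small = {"kind": "ConfigMap", "metadata": {"name": "tiny"}}
# Stand-in for a large CRD: about 400 KB of schema text.
large = {"kind": "CustomResourceDefinition", "spec": {"schema": "x" * 400_000}}

print(exceeds_annotation_limit(small))  # False
print(exceeds_annotation_limit(large))  # True
```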

Fix

# infrastructure/serving-stack/kustomization.yaml

# 1. Use server-side apply to bypass the annotation size limit
commonAnnotations:
  argocd.argoproj.io/sync-options: ServerSideApply=true

# 2. Ignore runtime-mutated fields on Knative CRDs
#    (In ArgoCD Application spec)
ignoreDifferences:
  - group: apiextensions.k8s.io
    kind: CustomResourceDefinition
    name: services.serving.knative.dev
    jsonPointers:
      - /spec/preserveUnknownFields

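Business impact: ArgoCD's error says "Too long" but does not tell you which annotation or why it got too long. Debugging requires knowing that client-side apply stores the whole manifest in the kubectl.kubernetes.io/last-applied-configuration annotation and that server-side apply does not.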

Failure 3: kube-rbac-proxy ImagePullBackOff Blocks KServe Admission Webhook — gcr.io Access Restriction

Time lost: ~1 hour | Cluster-wide impact

Symptom

$ kubectl -n argocd describe application ai-model-alpha
...
Message: admission webhook
  "inferenceservice.kserve-webhook-server.validator.webhook"
  denied the request: no endpoints available for
  service "kserve-webhook-server-service"

$ kubectl -n kserve get pods
NAME                            READY   STATUS
kserve-controller-manager-xxx   1/2     Running   # only 1 of 2 ready

$ kubectl -n kserve describe pod kserve-controller-manager-xxx
  kube-rbac-proxy:
    State:   Waiting
    Reason:  ImagePullBackOff
    Image:   gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
Events:
  Warning  Failed  kubelet
    Failed to pull image: unexpected status code 403 Forbidden

Root Cause

KServe 0.12.1's kserve-controller-manager Deployment includes a kube-rbac-proxy sidecar from gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1. Google Container Registry restricted access to kubebuilder images in late 2025.

The manager container itself was healthy (1 of 2 ready). But with the sidecar stuck in ImagePullBackOff the pod never became Ready, so it was not listed in the webhook Service's endpoints and the admission webhook had no healthy endpoints. The alternative registry.k8s.io/kube-rbac-proxy:v0.13.1 did not exist at the new location either.

Fix: Remove the Sidecar via Kustomize Strategic Merge Patch

# infrastructure/serving-stack/patches/
#   kserve-controller-kube-rbac-proxy-image.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kserve-controller-manager
  namespace: kserve
spec:
  template:
    spec:
      containers:
        - name: kube-rbac-proxy
          $patch: delete

After this patch and a re-sync:

$ kubectl -n kserve get pods
NAME                            READY   STATUS
kserve-controller-manager-yyy   1/1     Running   # fixed

$ kubectl -n kserve get endpoints kserve-webhook-server-service
NAME                            ENDPOINTS          AGE
kserve-webhook-server-service   10.42.0.23:9443    45s

Known tradeoff: Removing kube-rbac-proxy disables the Prometheus metrics proxy endpoint for the KServe controller. In production, source a verified replacement image from an accessible registry before deploying.

Business impact: An external registry access change cascaded into a complete admission webhook outage. Any InferenceService creation or update was blocked cluster-wide while the sidecar was failing. This class of failure has no good solution without upstream monitoring of your image dependencies.

Failure 4: Inference Request Returns HTTP 405 — IngressDomain Placeholder Resolves to Public Internet

Time lost: ~1 hour

Symptom

$ kubectl -n default get inferenceservice sklearn-iris \
    -o jsonpath='{.status.url}'
http://sklearn-iris.default.example.com

$ curl -sS \
  -H "Content-Type: application/json" \
  -d '{"instances":[[5.1,3.5,1.4,0.2]]}' \
  http://sklearn-iris.default.example.com/v1/models/sklearn-iris:predict
<html><head><title>405 Not Allowed</title></head>...

# The request hit the public example.com server, not our Kourier gateway.

Root Cause

The ingressDomain in the KServe ConfigMap was set to example.com — a literal placeholder. The generated URL resolves publicly to Cloudflare/IANA servers, not the local cluster.

Additionally, Kourier routes by Host header, not by IP. Just port-forwarding Kourier and hitting 127.0.0.1 does not work without the correct Host header.

Fix: Direct Predictor Pod Port-Forward

Bypass Knative routing and Kourier entirely for local verification:

# Step 1: Get the predictor pod name
kubectl -n default get pods \
  -l serving.knative.dev/revision=sklearn-iris-predictor-00001

# Step 2: Port-forward directly to the predictor container
kubectl -n default port-forward \
  pod/sklearn-iris-predictor-00001-deployment-<hash> 18080:8080

# Step 3: Predict (no Host header, no Kourier, no DNS needed)
curl -sS \
  -H "Content-Type: application/json" \
  -d '{"instances":[[5.1,3.5,1.4,0.2],[6.2,3.4,5.4,2.3]]}' \
  http://127.0.0.1:18080/v1/models/sklearn-iris:predict

{"predictions":[0,2]}

For the full Kourier routing path, always pass the Host header:

kubectl -n kourier-system port-forward svc/kourier 18080:80

curl -sS \
  -H 'Host: sklearn-iris-predictor.default.127.0.0.1.sslip.io' \
  -H "Content-Type: application/json" \
  -d '{"instances":[[5.1,3.5,1.4,0.2]]}' \
  http://127.0.0.1:18080/v1/models/sklearn-iris:predict

Business impact: A false alarm in inference verification. A healthy endpoint looked broken because the test URL resolved to the wrong server. Always verify each part of the network path (DNS resolution, ingress routing, pod health) as a separate step rather than assuming a single curl test is conclusive.
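
For scripted smoke tests, the Host-header request can be built in code so the address you connect to and the name you route by stay visibly separate. This is a minimal sketch with the standard library. The function name is mine, and the Host value is copied from the curl example above. The sketch only builds the request; pass it to urllib.request.urlopen to send it.

```python
import json
import urllib.request


def build_predict_request(base_url: str, host: str, model: str, instances: list) -> urllib.request.Request:
    """Build a predict request that connects to base_url but routes by Host header."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/models/{model}:predict",
        data=body,
        headers={"Host": host, "Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(
    "http://127.0.0.1:18080",                             # port-forwarded Kourier
    "sklearn-iris-predictor.default.127.0.0.1.sslip.io",  # what Kourier routes on
    "sklearn-iris",
    [[5.1, 3.5, 1.4, 0.2]],
)
print(request.full_url)
print(request.get_header("Host"))
```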

What This Proves After the Failures

After working through the above failures, the inference baseline worked:

$ kubectl -n default get inferenceservice sklearn-iris
NAME           URL                                       READY   AGE
sklearn-iris   http://sklearn-iris.default.example.com   True    45m

$ curl -sS \
  -H "Content-Type: application/json" \
  -d '{"instances":[[5.1,3.5,1.4,0.2],[6.2,3.4,5.4,2.3]]}' \
  http://127.0.0.1:18080/v1/models/sklearn-iris:predict

{"predictions":[0,2]}

The Istio/Kourier mismatch is the canonical example of why "default configuration" is dangerous in complex systems. KServe's default assumes a specific network topology that is not disclosed in the getting-started docs. Recognizing this class of failure — configuration that works in the tool author's environment but not yours — is a senior platform engineering competency.

What This Setup Does NOT Solve (Known Tradeoffs)

No Istio service mesh: No mTLS between services, no advanced traffic management. Acceptable for local dev; requires a replacement security layer in production.
kube-rbac-proxy removed: Prometheus metrics from the KServe controller are unavailable. Re-add this sidecar from a working registry before any production deployment.
Port-forward for inference: The Host-header workaround is local only. Cloud deployment requires a real ingress with DNS and TLS. On EKS, swap Kourier for an ALB and set ingressDomain to your real domain. See the Cloud Promotion Guide in the repository.

Debugging Commands Reference

Run these in order when an InferenceService will not become Ready.

1 — InferenceService Conditions

kubectl -n default describe inferenceservice sklearn-iris
kubectl -n kserve logs deploy/kserve-controller-manager --tail=50
kubectl -n kserve logs deploy/kserve-controller-manager -c manager --tail=50

2 — Webhook Endpoint Availability

kubectl -n kserve get endpoints kserve-webhook-server-service
kubectl -n kserve describe endpoints kserve-webhook-server-service
kubectl -n default get ksvc
kubectl -n default get route

3 — ConfigMap and Pod Status

kubectl -n kserve get configmap inferenceservice-config -o yaml
kubectl -n kserve get pods -o wide
kubectl -n kserve describe pod <pod-name>

The One Thing to Remember

KServe's default configuration assumes Istio is installed. This assumption is not prominently stated in the getting-started documentation. Anyone running KServe with that default on k3d, k3s, GKE Autopilot, or any other non-Istio cluster is likely to hit ReconcileError and see error messages referencing "virtual services" (an Istio concept) with no obvious resolution path.

The fix is one ConfigMap patch. It takes 30 seconds to apply. Finding it took three hours.

The kube-rbac-proxy 403 from gcr.io is an external dependency failure that silently kills your admission webhook cluster-wide. The $patch: delete Kustomize strategy is the fastest recovery path when no alternative registry image is available.

Full platform source (all six Reality Check documents, Backstage Golden Path, Kyverno policy enforcement, cost attribution, and a Cloud Promotion Guide to EKS/GKE): github.com/sodiq-code/neuroscale-platform


Why this matters beyond a single platform

The two-layer enforcement model

Debugging Commands Reference

What I Would Add Next

See Also

Deploying Backstage on Kubernetes with the Helm Chart: The Infrastructure-First Guide

The one thing you must understand before installing the Helm chart

The values hierarchy that breaks everything silently

Required configuration keys for the demo image

Probe timings: the demo image starts slowly

Authentication: local dev vs production

Catalog configuration: registering templates

GitHub integration: the token secret

A working minimal values.yaml for local development

Deploying and verifying

The production values profile

Diagnostic command reference

What this guide does not cover

See Also

9 Failures That Hit Me Building a Backstage Golden Path for KServe — Every Error, Every Fix

What I was trying to build

Failure 1: Template Not Visible in Catalog — Silent Rejection With No UI Error

Symptom