<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sodiq Jimoh</title>
    <description>The latest articles on Forem by Sodiq Jimoh (@sodiqjimoh).</description>
    <link>https://forem.com/sodiqjimoh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3850139%2Fd2edc9dc-4ca6-4299-8708-8fb3c454bd56.jpg</url>
      <title>Forem: Sodiq Jimoh</title>
      <link>https://forem.com/sodiqjimoh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sodiqjimoh"/>
    <language>en</language>
    <item>
      <title>I Build ML Infrastructure for a Living — Here's Why Hermes Agent Changes the Game for Platform Engineers</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Sat, 23 May 2026 22:30:00 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/i-build-ml-infrastructure-for-a-living-heres-why-hermes-agent-changes-the-game-for-platform-1k9h</link>
      <guid>https://forem.com/sodiqjimoh/i-build-ml-infrastructure-for-a-living-heres-why-hermes-agent-changes-the-game-for-platform-1k9h</guid>
      <description>&lt;p&gt;&lt;em&gt;Hermes Agent Challenge Submission: &lt;a href="https://dev.to/challenges/hermes"&gt;Write About Hermes Agent Challenge Page&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I've spent the past year building NeuroScale — an open-source AI inference platform on Kubernetes. 108 commits. 21 automated smoke checks across 6 milestones. The kind of platform where a developer fills in a Backstage form and gets a production-grade inference endpoint with drift control, policy guardrails, and cost attribution — no &lt;code&gt;kubectl&lt;/code&gt; required.&lt;/p&gt;

&lt;p&gt;I'm telling you this because I need you to understand where I'm coming from when I say: Hermes Agent isn't just another AI coding assistant. It's the first agent framework that actually thinks like a platform engineer.&lt;/p&gt;

&lt;p&gt;I don't say that lightly.&lt;/p&gt;




&lt;h2&gt;The Problem Nobody Talks About: AI Agents Are Stateless in a Stateful World&lt;/h2&gt;

&lt;p&gt;Building ML infrastructure teaches you one thing fast: everything is state.&lt;/p&gt;

&lt;p&gt;Your ArgoCD sync status is state. Your Kyverno policy violations are state. The drift between what's in Git and what's running in the cluster — state. The fact that someone ran &lt;code&gt;kubectl apply&lt;/code&gt; directly at 2am and broke the GitOps contract — that's state too.&lt;/p&gt;

&lt;p&gt;Every AI agent I've used before Hermes treats each conversation like a blank canvas. You explain your architecture. You describe the problem. You get a plausible answer. Then you close the tab and do it all over again tomorrow.&lt;/p&gt;

&lt;p&gt;Groundhog Day for infrastructure debugging.&lt;/p&gt;

&lt;p&gt;Hermes Agent is architecturally different, and the difference matters specifically for the kind of work platform engineers do.&lt;/p&gt;




&lt;h2&gt;Three-Layer Memory: What It Actually Means for Infrastructure&lt;/h2&gt;

&lt;p&gt;Most people writing about Hermes focus on the memory system as a convenience feature. "It remembers your preferences." "It knows your name."&lt;/p&gt;

&lt;p&gt;That's not what makes it interesting.&lt;/p&gt;

&lt;p&gt;Hermes runs a three-layer memory architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short-term&lt;/strong&gt; — current conversation context (same as every other agent)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium-term&lt;/strong&gt; — session summaries that persist between conversations, built through periodic "memory nudges"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term&lt;/strong&gt; — Skill Documents that capture how it solved specific types of problems, stored as reusable procedures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a platform engineer, this maps directly to something we already understand: runbooks.&lt;/p&gt;

&lt;p&gt;When I troubleshoot an ArgoCD sync failure, I don't start from first principles. I check the runbook. Token expiry? Webhook misconfiguration? Sync wave ordering? The runbook encodes prior incident resolution as a procedure.&lt;/p&gt;

&lt;p&gt;Hermes does this automatically. After roughly 15 tasks, its GEPA loop kicks in (GEPA, short for Genetic-Pareto, is a reflective prompt-optimization method published at ICLR 2026 as an Oral): it reviews its own performance, identifies patterns, and writes new Skill Documents. Agents with 20+ self-generated skills reportedly complete similar future tasks 40% faster than fresh instances.&lt;/p&gt;

&lt;p&gt;That's not "remembering your name." That's an agent building its own runbook library. It's the difference between a junior on-call engineer and a senior who's seen every failure mode before.&lt;/p&gt;




&lt;h2&gt;Where Hermes Creates Real Value in an ML Platform Stack&lt;/h2&gt;

&lt;p&gt;Abstract possibilities are cheap. Let me be specific about where this matters in a stack like NeuroScale.&lt;/p&gt;

&lt;h3&gt;1. Configuration Drift Diagnosis&lt;/h3&gt;

&lt;p&gt;NeuroScale uses ArgoCD with &lt;code&gt;selfHeal: true&lt;/code&gt; — drift is auto-corrected. But detecting drift &lt;em&gt;before&lt;/em&gt; ArgoCD catches it, and understanding &lt;em&gt;why&lt;/em&gt; it happened, is a different problem.&lt;/p&gt;

&lt;p&gt;Here's what a Hermes scheduled audit looks like in practice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes task add &lt;span class="nt"&gt;--cron&lt;/span&gt; &lt;span class="s2"&gt;"0 */6 * * *"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Check the diff between Git-declared state in infrastructure/apps/ &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
   and live cluster state. If they diverge, summarize what changed, &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
   correlate with recent kubectl audit logs, and flag whether the &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
   change was human-initiated or a controller reconciliation. &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
   Send results to Telegram."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most agents can run a diff. Hermes does the part that matters: building a pattern library over time. After a month of audits, it knows that drift in the &lt;code&gt;serving-stack&lt;/code&gt; namespace is almost always a Knative autoscaler update (harmless), while drift in &lt;code&gt;kyverno/policies/&lt;/code&gt; is almost always someone bypassing admission control (critical).&lt;/p&gt;

&lt;p&gt;That context accumulates in Skill Documents. I haven't seen another agent framework that does this out of the box.&lt;/p&gt;

&lt;p&gt;Here's what a drift report from Hermes actually looks like after a few weeks of accumulated context:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📋 Drift Audit — 2026-05-23 12:00 UTC

Cluster: neuroscale-prod
Namespaces scanned: 4

✅ serving-stack: 2 diffs detected
   → Both are Knative autoscaler reconciliations (harmless)
   → Matches pattern from Skill: "knative-autoscaler-drift"
   → No action required.

⚠️ kyverno/policies: 1 diff detected
   → ClusterPolicy "require-resource-limits" modified in-cluster
   → Not present in Git (infrastructure/policies/)
   → kubectl audit: manual apply by user "ops-admin" at 03:12 UTC
   → FLAGGED: Possible admission control bypass.
   → Recommend: Revert in-cluster change or commit to Git.

📎 Context: This is the 3rd manual policy edit in 14 days.
   Previous incidents resolved by reverting. See Skill:
   "kyverno-drift-response" for standard procedure.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the last three lines. That's not a generic diff. That's an agent referencing its own operational history — correlating today's anomaly with patterns it learned from previous audits. A fresh agent instance can't do that. One with a month of Skill Documents can.&lt;/p&gt;
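&lt;p&gt;Stripped of the accumulated context, the comparison underneath an audit like this is small. Here is a minimal Python sketch, assuming both the Git-declared state and the live state have already been loaded into plain dictionaries keyed by resource name. The function name &lt;code&gt;diff_state&lt;/code&gt; and the dictionary layout are illustrative assumptions, not part of Hermes or ArgoCD.&lt;/p&gt;

```python
# Hypothetical helper: compares declared (Git) state with live (cluster) state.
# Both arguments are assumed to be flat dicts mapping resource name to spec.


def diff_state(declared, live):
    """Return a sorted list of (resource_name, reason) drift entries."""
    drift = []
    for name in sorted(set(declared) | set(live)):
        if name not in live:
            drift.append((name, "missing in cluster"))
        elif name not in declared:
            drift.append((name, "not present in Git"))
        elif declared[name] != live[name]:
            drift.append((name, "modified in cluster"))
    return drift


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    git_state = {"require-resource-limits": {"mode": "enforce"}}
    cluster_state = {"require-resource-limits": {"mode": "audit"}}
    for resource, reason in diff_state(git_state, cluster_state):
        print(resource + ": " + reason)
```

&lt;p&gt;Everything the report adds on top of this (who made the change, whether it matches a known pattern) has to come from audit logs and stored history, which the sketch leaves out.&lt;/p&gt;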

&lt;h3&gt;2. Policy Validation Before Merge&lt;/h3&gt;

&lt;p&gt;NeuroScale enforces 5 Kyverno ClusterPolicies, among them required resource limits, standard labels, non-root containers, and no &lt;code&gt;:latest&lt;/code&gt; tags. But violations caught at admission mean the deploy already failed. The earlier you catch them, the cheaper the fix.&lt;/p&gt;

&lt;p&gt;This is where Skill Documents become genuinely powerful. You write one that encodes your specific policies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Skill: NeuroScale Policy Pre-Check&lt;/span&gt;
&lt;span class="gu"&gt;## When to Use&lt;/span&gt;
When reviewing PRs that modify files under &lt;span class="sb"&gt;`apps/`&lt;/span&gt; or &lt;span class="sb"&gt;`infrastructure/`&lt;/span&gt;.
&lt;span class="gu"&gt;## Procedure&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Check for &lt;span class="sb"&gt;`owner`&lt;/span&gt; and &lt;span class="sb"&gt;`cost-center`&lt;/span&gt; labels on all InferenceService manifests
&lt;span class="p"&gt;2.&lt;/span&gt; Verify &lt;span class="sb"&gt;`resources.requests`&lt;/span&gt; and &lt;span class="sb"&gt;`resources.limits`&lt;/span&gt; are set
&lt;span class="p"&gt;3.&lt;/span&gt; Flag any image tag that is &lt;span class="sb"&gt;`latest`&lt;/span&gt; or missing
&lt;span class="p"&gt;4.&lt;/span&gt; Verify &lt;span class="sb"&gt;`securityContext.runAsNonRoot: true`&lt;/span&gt;
&lt;span class="gu"&gt;## Known False Positives&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; ClusterServingRuntime objects are exempt from label requirements
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's not a prompt. It's a procedural memory document — loaded on-demand, zero tokens until needed, self-improving based on new violations it discovers.&lt;/p&gt;
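&lt;p&gt;The four procedure steps are mechanical enough to express as code. What follows is a rough Python sketch, assuming a simplified manifest dictionary with a top-level &lt;code&gt;containers&lt;/code&gt; list (a real InferenceService nests these fields differently). The helpers &lt;code&gt;precheck&lt;/code&gt; and &lt;code&gt;image_tag&lt;/code&gt; are hypothetical names used only for illustration.&lt;/p&gt;

```python
# Hypothetical sketch of the four pre-check steps. The manifest layout is a
# simplified assumption, not NeuroScale's real schema.

REQUIRED_LABELS = ("owner", "cost-center")
EXEMPT_KINDS = ("ClusterServingRuntime",)  # known false positive from the skill


def image_tag(image):
    """Return the tag of an image reference, or None if it has no tag."""
    last_segment = image.rsplit("/", 1)[-1]  # ignore a registry host:port prefix
    if ":" not in last_segment:
        return None
    return last_segment.rsplit(":", 1)[1]


def precheck(manifest):
    """Return a list of human-readable violations for one manifest dict."""
    violations = []
    labels = manifest.get("metadata", {}).get("labels", {})
    # Step 1: required labels, unless the kind is exempt.
    if manifest.get("kind") not in EXEMPT_KINDS:
        for name in REQUIRED_LABELS:
            if name not in labels:
                violations.append("missing label: " + name)
    for container in manifest.get("containers", []):
        cname = container.get("name", "unnamed")
        # Step 2: both resource requests and limits must be set.
        resources = container.get("resources", {})
        for field in ("requests", "limits"):
            if not resources.get(field):
                violations.append(cname + ": resources." + field + " not set")
        # Step 3: the image tag must exist and must not be "latest".
        tag = image_tag(container.get("image", ""))
        if tag is None or tag == "latest":
            violations.append(cname + ": image tag is latest or missing")
        # Step 4: the container must explicitly run as non-root.
        if container.get("securityContext", {}).get("runAsNonRoot") is not True:
            violations.append(cname + ": runAsNonRoot is not true")
    return violations
```

&lt;p&gt;A check like this covers only the rules written down in the skill. The value of the Skill Document is that the list of rules and exemptions can grow without touching the code that applies them.&lt;/p&gt;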

&lt;h3&gt;3. Incident Response That Compounds&lt;/h3&gt;

&lt;p&gt;Real scenario from NeuroScale development: Backstage went into a CrashLoop. Root cause was a token refresh issue with the Kubernetes service account. I documented it in &lt;code&gt;INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With Hermes running persistently — which you can do on a $5 VPS or a serverless backend that hibernates when idle — it would have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detected the CrashLoop via scheduled health check&lt;/li&gt;
&lt;li&gt;Correlated with recent changes (cert rotation? secret update?)&lt;/li&gt;
&lt;li&gt;Checked its Skill Documents for prior Backstage incidents&lt;/li&gt;
&lt;li&gt;Either resolved it or escalated with a structured diagnosis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next time a similar issue occurs, it resolves faster because the skill from incident #1 already exists. That's the compounding effect that makes experienced SREs more valuable over time — now encoded in an agent's memory.&lt;/p&gt;




&lt;h2&gt;What Hermes Gets Right That Other Frameworks Don't&lt;/h2&gt;

&lt;p&gt;I've looked at the landscape — LangChain, CrewAI, AutoGen. Here's what Hermes gets structurally right for infrastructure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local-first data residency.&lt;/strong&gt; Everything lives in a local SQLite database. For platform engineers working with cluster credentials and deployment configs, this isn't a feature — it's a prerequisite. I'm not sending my policy violations through someone else's API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal backends that work.&lt;/strong&gt; Seven backends including local, Docker, SSH, and serverless options. SSH means Hermes runs commands on your actual infrastructure. Docker means you can sandbox it. Serverless means it hibernates when idle, wakes on demand. This is infrastructure-native thinking, not "here's a chat UI that can run Python."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in cron scheduling.&lt;/strong&gt; Natural-language-configured scheduled tasks with delivery to Telegram, Discord, Slack, or Signal. For infrastructure monitoring, this is table stakes — and Hermes is one of the few agent frameworks that ships it natively, no external cron daemon or YAML required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;200+ model support.&lt;/strong&gt; Switch between cheap models for routine audits and powerful ones for complex diagnosis with a single command. No code changes. Operational flexibility that platform engineers actually need.&lt;/p&gt;




&lt;h2&gt;What Hermes Doesn't Solve (Yet)&lt;/h2&gt;

&lt;p&gt;Honesty about limitations matters more than hype when we're talking about tools that touch production infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain reasoning is shallow.&lt;/strong&gt; Hermes can follow procedures and build skill documents, but it can't replace a senior engineer's intuition about why a particular autoscaler configuration causes cascading latency under specific traffic patterns. The skill system captures &lt;em&gt;what to do&lt;/em&gt;, not &lt;em&gt;why it works&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-cluster coordination is manual.&lt;/strong&gt; NeuroScale runs on a single cluster. For federated infrastructure across regions, Hermes' per-instance memory doesn't federate. Each agent builds its own skill library independently. There's no skill-sharing protocol between agents yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approval workflows need hardening.&lt;/strong&gt; The &lt;code&gt;--yolo&lt;/code&gt; flag bypasses all approval prompts. For infrastructure work, that's terrifying. The approval system needs declarative rules about what the agent can and cannot do — something like Kyverno's admission policies, not just per-command approve/deny. The &lt;code&gt;tools/&lt;/code&gt; directory has approval pinning in progress, but it's not production-ready for high-stakes operations.&lt;/p&gt;




&lt;h2&gt;The Bigger Picture: Agents as Infrastructure Primitives&lt;/h2&gt;

&lt;p&gt;Here's the perspective I haven't seen anyone else articulate.&lt;/p&gt;

&lt;p&gt;Hermes Agent isn't just a tool for platform engineers. It's a new kind of infrastructure primitive.&lt;/p&gt;

&lt;p&gt;Think about the trajectory: manual server management → configuration management → infrastructure as code → GitOps → platform engineering. Each layer abstracted the layer below and added intelligence.&lt;/p&gt;

&lt;p&gt;Hermes represents the next step: &lt;strong&gt;infrastructure as conversation&lt;/strong&gt;. Not in the shallow "chat with your cluster" sense. In the sense that an agent with persistent memory, self-improving procedures, and scheduled automation can become a layer in your control plane.&lt;/p&gt;

&lt;p&gt;A layer that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observes continuously (cron + terminal access)&lt;/li&gt;
&lt;li&gt;Learns from incidents (GEPA → Skill Documents)&lt;/li&gt;
&lt;li&gt;Enforces patterns (skill-driven validation)&lt;/li&gt;
&lt;li&gt;Communicates across channels (Telegram/Slack/Discord)&lt;/li&gt;
&lt;li&gt;Costs almost nothing when idle (serverless backends)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's not a chatbot. That's an &lt;em&gt;operator&lt;/em&gt; — in the Kubernetes sense of the word.&lt;/p&gt;




&lt;h2&gt;Why I'm Betting on Hermes&lt;/h2&gt;

&lt;p&gt;The tools that win aren't the ones with the most features. They're the ones with the right architecture for compounding.&lt;/p&gt;

&lt;p&gt;ArgoCD won over manual deploys because GitOps compounds — every deployment is auditable, reproducible, reversible. Kyverno won over manual policy checks because admission policies compound — every new policy protects every future deployment.&lt;/p&gt;

&lt;p&gt;Hermes Agent's architecture compounds the same way. Every task makes it better at the next one. Every incident resolution becomes a skill document. Every audit pattern becomes a scheduled automation.&lt;/p&gt;

&lt;p&gt;164,000 GitHub stars in under three months. MIT licensed. Runs on a $5 VPS. Data stays on your machine.&lt;/p&gt;

&lt;p&gt;For platform engineers who've spent years building systems that self-heal, self-monitor, and self-govern — Hermes Agent is the first AI framework that actually speaks our language.&lt;/p&gt;




&lt;p&gt;I'm Sodiq, and I build ML infrastructure platforms. NeuroScale is open source: &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt; — PRs welcome. If you want to see how Hermes Agent could fit into a real Kubernetes-based ML platform, that's where I'd start.&lt;/p&gt;

&lt;p&gt;Star the repo if this perspective was useful. And if you've tried Hermes against your own infrastructure — what broke first? I want to know.&lt;/p&gt;

</description>
      <category>hermesagentchallenge</category>
      <category>devchallenge</category>
      <category>agents</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>I Spent 108 Commits Building Infrastructure. Google I/O 2026 Shipped It as One API Call.</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Sat, 23 May 2026 21:25:59 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/i-spent-108-commits-building-infrastructure-google-io-2026-shipped-it-as-one-api-call-bp9</link>
      <guid>https://forem.com/sodiqjimoh/i-spent-108-commits-building-infrastructure-google-io-2026-shipped-it-as-one-api-call-bp9</guid>
      <description>&lt;p&gt;Six days before Google I/O, I pushed the final commit to my open-source project. Two days later, Google announced they'd productized the hardest parts of it.&lt;/p&gt;

&lt;p&gt;That's the story. And it's not a complaint — it's the most useful thing I can share about what I/O 2026 actually means for how we build software.&lt;/p&gt;




&lt;h2&gt;The Setup: What I Spent Six Months Building&lt;/h2&gt;

&lt;p&gt;NeuroScale is a self-service platform that lets developers deploy AI models to production without understanding the infrastructure underneath. A developer fills in a form, a pull request gets created automatically, automated checks validate it, it deploys itself, and a working AI endpoint goes live.&lt;/p&gt;

&lt;p&gt;The developer sees a form and an endpoint. They don't see the 13 things that happen in between.&lt;/p&gt;

&lt;p&gt;Building those 13 invisible things is where the 108 commits went.&lt;/p&gt;

&lt;p&gt;Here's what "safe AI deployment" actually requires — and why each piece exists:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An isolation layer.&lt;/strong&gt; When your AI model runs, it needs to run in its own contained environment. If it crashes, it shouldn't take anything else down. If it leaks data, it shouldn't reach other teams' workloads. Building and managing these containers is non-trivial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A policy enforcement layer.&lt;/strong&gt; Left to their own devices, developers will deploy AI models with no memory limits, no ownership labels, no cost attribution. In a shared environment, one badly configured model can exhaust the entire cluster's memory. You need something that physically blocks bad deployments before they happen: not a warning, a block.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A cost attribution layer.&lt;/strong&gt; "AI is expensive" is not useful information. "Team Alpha's recommendation model cost $47 this month and Team Beta's classifier cost $112" is. Someone has to plumb the labeling, the metrics collection, and the cost analysis to make that possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A drift correction and validation layer.&lt;/strong&gt; Production systems get manually modified: someone "just quickly" adjusts a setting directly. Without continuous watching, your infrastructure drifts from what you think it is. And before any configuration reaches production, it should be checked against the organization's policies in a pull request, which is far cheaper than catching problems in production.&lt;/p&gt;

&lt;p&gt;Each of those layers generated its own postmortem.&lt;/p&gt;

&lt;p&gt;The isolation layer? I spent three hours debugging why no AI model could deploy at all. The cause: a single configuration flag, &lt;code&gt;disableIstioVirtualHost: true&lt;/code&gt;, that wasn't in any documentation. Found it by reading the source code of the deployment tool. One boolean. Three hours.&lt;/p&gt;

&lt;p&gt;The policy layer? For two weeks, our automated checks reported "all policies passed" on every pull request — a green checkmark, every time. They were lying. The command-line tool we used exits with a success code even when it finds policy violations. We had to rewrite the check to parse the tool's text output because the exit code couldn't be trusted. Two weeks of false confidence.&lt;/p&gt;

&lt;p&gt;These aren't edge cases. They're what platform engineering actually looks like. Every layer that developers never see was built by someone who got paged at 2 AM because it didn't exist yet.&lt;/p&gt;
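&lt;p&gt;The fix for the exit-code problem in the policy layer comes down to failing closed. Here is a minimal Python sketch, assuming the tool prints a summary containing text like &lt;code&gt;fail: 1&lt;/code&gt;. That output format and the function name &lt;code&gt;policy_check_passed&lt;/code&gt; are assumptions for illustration, so adjust the pattern to whatever your tool really prints.&lt;/p&gt;

```python
import re

# Assumed summary format, for example "pass: 3, fail: 1". This is a
# hypothetical pattern, not the documented output of any specific tool.
FAIL_PATTERN = re.compile(r"fail:\s*(\d+)")


def policy_check_passed(exit_code, output_text):
    """Pass only when the exit code and the parsed summary both say so."""
    if exit_code != 0:
        return False
    match = FAIL_PATTERN.search(output_text)
    if match is None:
        # No recognizable summary: fail closed instead of assuming success.
        return False
    return int(match.group(1)) == 0
```

&lt;p&gt;The important design choice is the middle branch: when the output cannot be parsed, the check fails. A green checkmark should require positive evidence, not the absence of an error.&lt;/p&gt;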




&lt;h2&gt;The I/O Announcement That Stopped Me Cold&lt;/h2&gt;

&lt;p&gt;At Google I/O 2026, Google announced Managed Agents in the Gemini API. Here's how they described it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A single API call provisions a remote Linux environment where the agent can reason, plan and call tools; execute code and manage files in an isolated sandbox; and browse the web to fetch and process live data."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The parallels to what I'd built were immediate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolation layer?&lt;/strong&gt; Each agent interaction gets its own sandboxed environment — similar problem, radically simpler solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Policy layer?&lt;/strong&gt; Antigravity has "cross-platform terminal sandboxing, credential masking, and hardened Git policies." Not identical to organizational policy enforcement, but covering the same ground at the infrastructure level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State persistence?&lt;/strong&gt; Environments carry state between calls. It's not the same as GitOps-style drift reconciliation — Google isn't watching your production cluster and reverting unauthorized changes — but it solves the narrower problem of agents losing context between interactions.&lt;/p&gt;

&lt;p&gt;And here's the code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;interaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;antigravity-preview-05-2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deploy this model and run inference on the test dataset.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No configuration file. No 13-step pipeline. No three-hour debugging sessions hunting boolean flags in source code.&lt;/p&gt;

&lt;p&gt;The infrastructure abstraction I spent six months building — Google offered a dramatically simpler path to the same developer outcome.&lt;/p&gt;




&lt;h2&gt;What Google Actually Got Right&lt;/h2&gt;

&lt;p&gt;I don't mean this as criticism of what I built. The NeuroScale platform works. But Google's announcement revealed something important about where I was solving the problem.&lt;/p&gt;

&lt;p&gt;I solved it at the infrastructure layer — Kubernetes, container orchestration, policy engines, GitOps controllers. Powerful. Flexible. Also: requires a developer to know what all of those words mean before they can onboard.&lt;/p&gt;

&lt;p&gt;Google solved it at the interface layer. "Write a sentence describing what you want. We handle everything underneath."&lt;/p&gt;

&lt;p&gt;The difference isn't capability. It's entry cost.&lt;/p&gt;

&lt;p&gt;My platform asks developers to trust a form that maps to a manifest that feeds a policy engine that connects to a deployment controller. Three layers of abstraction they can't see, maintained by someone (me) they have to trust.&lt;/p&gt;

&lt;p&gt;Google's Managed Agents ask developers to write a sentence.&lt;/p&gt;

&lt;p&gt;There's also something deeper in how Google defined agents. They introduced &lt;code&gt;AGENTS.md&lt;/code&gt; and &lt;code&gt;SKILL.md&lt;/code&gt; — plain markdown files that define what an agent knows how to do and how it should behave. No special syntax. No proprietary format. Just files.&lt;/p&gt;

&lt;p&gt;Compare that to my platform's equivalent: a Backstage plugin with a custom template schema, connected to a Kubernetes custom resource definition, validated by a policy engine configured in YAML. Same goal — "define how deployments should behave" — wildly different complexity.&lt;/p&gt;

&lt;p&gt;Google's insight is that infrastructure behavior should be readable by the people who define it, not just by the engineers who build the platforms that enforce it.&lt;/p&gt;




&lt;h2&gt;What Managed Agents Still Can't Do&lt;/h2&gt;

&lt;p&gt;Here's where I stop nodding and start pushing back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The organizational layer is still a human problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Managed Agents provision environments. They don't model the organizational reality that surrounds those environments.&lt;/p&gt;

&lt;p&gt;When NeuroScale blocks a deployment, it's not because a technical constraint was violated. It's because &lt;em&gt;this organization&lt;/em&gt; decided that every AI workload must carry an owner label and a cost-center label before it can run. That policy came from a finance conversation, not a technical specification. It was encoded into the platform as a business rule.&lt;/p&gt;

&lt;p&gt;Google's Managed Agents don't know about your finance team's requirements. The Antigravity SDK (also announced at I/O) gives you the infrastructure to build that enforcement — but you still have to build it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared infrastructure still needs shared accounting.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On NeuroScale, every AI workload carries labels. Those labels flow into a cost monitoring tool. At the end of the month, I can tell you exactly which team's models cost how much. That's not a technical feature — it's an organizational requirement that someone had to translate into a technical implementation.&lt;/p&gt;

&lt;p&gt;Managed Agents bill at the API level. You know what your API calls cost. You don't automatically know which business unit those costs should be attributed to. That mapping is still manual.&lt;/p&gt;
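&lt;p&gt;One way to picture the label-to-cost mapping is a minimal Python sketch, assuming each workload record carries a &lt;code&gt;labels&lt;/code&gt; dictionary and a &lt;code&gt;cost_usd&lt;/code&gt; figure. The record shape, the function name &lt;code&gt;cost_by_team&lt;/code&gt;, and the &lt;code&gt;unattributed&lt;/code&gt; bucket are assumptions, not NeuroScale's actual cost pipeline.&lt;/p&gt;

```python
from collections import defaultdict


def cost_by_team(workloads):
    """Sum cost per owner label; unlabeled workloads land in 'unattributed'."""
    totals = defaultdict(float)
    for workload in workloads:
        team = workload.get("labels", {}).get("owner", "unattributed")
        totals[team] += workload["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        {"labels": {"owner": "team-alpha"}, "cost_usd": 47.0},
        {"labels": {"owner": "team-beta"}, "cost_usd": 112.0},
        {"labels": {}, "cost_usd": 5.0},
    ]
    for team, total in sorted(cost_by_team(records).items()):
        print(team, total)
```

&lt;p&gt;The &lt;code&gt;unattributed&lt;/code&gt; bucket is the interesting part. With API-level billing and no enforced labels, every cost starts there until someone maps it by hand.&lt;/p&gt;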

&lt;p&gt;&lt;strong&gt;Compliance environments don't get to outsource their audit trails.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Healthcare, finance, government. For these industries, "who deployed what, when, with what configuration, and who approved it" is not a preference — it's a requirement. That audit trail has to live somewhere you control, in a format your auditors accept.&lt;/p&gt;

&lt;p&gt;Google's infrastructure is excellent. It's not &lt;em&gt;your&lt;/em&gt; infrastructure. There's a difference that matters in regulated industries, and Managed Agents don't close it.&lt;/p&gt;




&lt;h2&gt;The Broader Signal Most I/O Coverage Missed&lt;/h2&gt;

&lt;p&gt;Everyone wrote about the flashy announcements. Gemini Omni. Universal Cart. Gemini Spark doing your tasks while your laptop is off.&lt;/p&gt;

&lt;p&gt;But for developers who build things for other developers — the platform engineers, the DevOps practitioners, the infrastructure leads — the signal at I/O 2026 was quieter and more consequential.&lt;/p&gt;

&lt;p&gt;Google announced a complete rethinking of what "developer infrastructure" means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Managed Agents:&lt;/strong&gt; The unit of deployment is no longer a container. It's an agent interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Antigravity SDK:&lt;/strong&gt; Google's own agent harness, self-hostable on your infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AGENTS.md / SKILL.md:&lt;/strong&gt; Infrastructure behavior defined as readable markdown files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome DevTools for Agents:&lt;/strong&gt; Your debugging tools follow your agents into the browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebMCP:&lt;/strong&gt; A proposed standard for AI agents to interact with any website as a tool.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't separate products. They're an architecture. Google is describing a world where:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You describe what you want in natural language or markdown.&lt;/li&gt;
&lt;li&gt;An agent figures out how to do it in an isolated, managed environment.&lt;/li&gt;
&lt;li&gt;The infrastructure is someone else's problem.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For solo developers and small teams, this is unambiguously good. Stop building platforms. Start building products.&lt;/p&gt;

&lt;p&gt;For platform engineers at organizations with compliance requirements, cost accountability, and multi-team governance — the work isn't going away. But the tools are getting better, and the layer where the work happens is shifting upward.&lt;/p&gt;




&lt;h2&gt;A Framework for Every Developer Reading This&lt;/h2&gt;

&lt;p&gt;The most useful thing I can offer from six months of platform building:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You do NOT need custom infrastructure if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're one developer or a small team&lt;/li&gt;
&lt;li&gt;You're prototyping or experimenting with a new idea&lt;/li&gt;
&lt;li&gt;Your data and compliance requirements are standard&lt;/li&gt;
&lt;li&gt;Speed to working product matters more than organizational control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Use Managed Agents. Use Google AI Studio. Ship your product, not your platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You still need custom infrastructure if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're in a regulated industry with specific audit requirements&lt;/li&gt;
&lt;li&gt;Multiple teams share compute and need separate cost accountability&lt;/li&gt;
&lt;li&gt;Your organization has security or data residency constraints that preclude managed services&lt;/li&gt;
&lt;li&gt;You need to enforce business rules at the infrastructure layer — not just as guidelines, but as technical blocks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ The Antigravity SDK exists for you. Take Google's agent harness, host it yourself, build your organizational layer on top.&lt;/p&gt;

&lt;p&gt;The honest summary: most developers building most things should use what Google shipped at I/O. The exceptions are real, they're specific, and they're usually organizational rather than technical.&lt;/p&gt;

&lt;p&gt;Want to feel the difference yourself? Install the SDK (&lt;code&gt;pip install google-genai&lt;/code&gt;), grab an API key from Google AI Studio, and run the code block above. In under a minute, you'll have an agent running in an isolated sandbox — doing what took me weeks of Kubernetes configuration to set up. That gap between "weeks" and "one minute" is the entire argument of this post.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'm Actually Taking Away From I/O 2026
&lt;/h2&gt;

&lt;p&gt;NeuroScale is staying up. The organizational requirements it solves are real.&lt;/p&gt;

&lt;p&gt;But the part of it I'm most proud of — the developer experience, where someone fills in a form and infrastructure appears — is now a solved problem. Google solved it. One API call.&lt;/p&gt;

&lt;p&gt;That frees me to focus on the part that isn't solved: the organizational layer that every company has to build for itself, because it encodes its specific rules, its specific accountability structures, its specific compliance requirements.&lt;/p&gt;

&lt;p&gt;Platform engineering used to be 80% "build the scaffolding so developers don't have to" and 20% "encode the organization's requirements."&lt;/p&gt;

&lt;p&gt;After I/O 2026, I think those numbers flip.&lt;/p&gt;

&lt;p&gt;The scaffolding is increasingly a service you call. The organization's requirements are still yours to build. And that's the more interesting work anyway — it requires understanding the business, not just the toolchain.&lt;/p&gt;

&lt;p&gt;108 commits taught me what the invisible infrastructure actually contains. Google I/O 2026 told me where the next 108 should go.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;NeuroScale is open source: &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt;. Every failure, every fix, every postmortem is documented. The platform still works. The smoke tests still pass. The world just got a shortcut to where I started.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>googleiochallenge</category>
    </item>
    <item>
      <title>Gemma 4 Broke My Kubernetes Resource Model. Here's What I Measured.</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Sat, 23 May 2026 20:21:24 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/i-build-the-infrastructure-that-serves-ai-models-gemma-4-just-made-my-job-existential-4cek</link>
      <guid>https://forem.com/sodiqjimoh/i-build-the-infrastructure-that-serves-ai-models-gemma-4-just-made-my-job-existential-4cek</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Gemma 4 Challenge: Write about Gemma 4 Submission&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a submission for the Gemma 4 Challenge: Write About Gemma 4&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I build a self-service AI inference platform called NeuroScale. Developers fill in a Backstage form, the platform generates KServe manifests via PR, ArgoCD deploys them, and a production inference endpoint goes live. 108 commits, 21 smoke tests, 6 milestone postmortems.&lt;/p&gt;

&lt;p&gt;When I wrote Gemma 4's InferenceService manifest, the resource requests looked identical to those of a dense model of the same size. The numbers said otherwise: 48 GB of VRAM holding a model that activates 3.8B of 25.2B parameters per token, GPU compute utilization at 40% while p95 latency doubled, and OpenCost billing two teams the same rate for workloads whose per-token costs differ by 2×. Every broken assumption traces back to one architectural decision: Gemma 4 doesn't replace its dense FFN with experts; it runs both.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture That Breaks the Assumptions
&lt;/h2&gt;

&lt;p&gt;You need to see why Gemma 4's MoE breaks things that Mixtral and DeepSeek don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standard MoE (Mixtral, DeepSeek, Qwen):&lt;/strong&gt; the dense FFN is replaced by sparse experts. Router picks a subset per token. Dense path is gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemma 4 26B keeps three pathways running in parallel:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4 26B MoE block (actual architecture):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                 Input
                   │
                   ▼
              [Attention]
                   │
       ┌───────────┼───────────┐
       ▼           ▼           ▼
  [Dense FFN] [Shared Exp]  [Router]
  (always on)  (always on)     │
       │           │    ┌──────┼──────┐
       │           │    ▼      ▼      ▼
       │           │ [Exp 1] [Exp 2]...[Exp 128]
       │           │    │      │       │
       │           │    └──────┴───────┘
       │           │       (8 fire)
       └───────────┴───────────┘
                   │
           [Sum all three]
                   │
                   ▼
                Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;128 routed experts, 8 active per token, one shared expert, plus a dense FFN — all summed together. Total parameters: ~25.2B. Active per token: ~3.8B. Active-to-total ratio: 0.15.&lt;/p&gt;

&lt;p&gt;For comparison: Mixtral 8x7B activates 13B of 47B (ratio: 0.28). DeepSeek-V2 activates 21B of 236B (ratio: 0.09, but across a far larger total). Gemma 4's ratio is the lowest among sub-30B MoE models because the always-on dense FFN and shared expert consume parameter budget without contributing to the "active" count in the way you'd expect. The dense FFN is a structural safety net — if the router picks wrong experts, the always-on pathways carry the signal. This is why Gemma 4 trains more stably and degrades gracefully under quantization.&lt;/p&gt;
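
&lt;p&gt;The ratios above are plain division. A minimal sketch that recomputes them from the parameter counts quoted in this post (the dictionary and function names are mine, for illustration only):&lt;/p&gt;

```python
# Parameter counts in billions as quoted in this post: (active per token, total).
MODELS = {
    "Gemma 4 26B MoE": (3.8, 25.2),
    "Mixtral 8x7B": (13.0, 47.0),
    "DeepSeek-V2": (21.0, 236.0),
}


def active_ratio(active_b, total_b):
    """Fraction of the loaded parameters that do arithmetic for each token."""
    return active_b / total_b


if __name__ == "__main__":
    for name, (active_b, total_b) in MODELS.items():
        print(f"{name}: active/total = {active_ratio(active_b, total_b):.2f}")
```

&lt;p&gt;Swap in the counts for any other model to see where it lands relative to these three.&lt;/p&gt;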

&lt;p&gt;For platform engineers, the consequence: the gap between "parameters in memory" and "parameters doing compute" is wider than any previous MoE at this scale. That gap breaks three things in Kubernetes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Consequence 1: The Numbers That Broke My Cost Model
&lt;/h2&gt;

&lt;p&gt;On NeuroScale, every InferenceService requires a cost-center label. Our Kyverno admission policy blocks any deployment without one, and OpenCost reads these labels to attribute GPU-hours to teams.&lt;/p&gt;

&lt;p&gt;Here's what the numbers look like when you deploy the 26B MoE and the 31B Dense on equivalent hardware via vLLM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;26B MoE (A4B)&lt;/th&gt;
&lt;th&gt;31B Dense&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VRAM consumed (BF16)&lt;/td&gt;
&lt;td&gt;48.1 GB&lt;/td&gt;
&lt;td&gt;62 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active params per token&lt;/td&gt;
&lt;td&gt;3.8B&lt;/td&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode throughput (single user, H100)&lt;/td&gt;
&lt;td&gt;177.1 tok/s&lt;/td&gt;
&lt;td&gt;40.3 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode throughput (concurrency 16, H100)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;375 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTFT (concurrency 1, H100)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;67.7 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-token cost (cloud API)&lt;/td&gt;
&lt;td&gt;$0.06 / 1M tokens&lt;/td&gt;
&lt;td&gt;$0.12 / 1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU allocated&lt;/td&gt;
&lt;td&gt;1× A100&lt;/td&gt;
&lt;td&gt;1× A100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;(H100 throughput: JarvisLabs SPEED-Bench on vLLM 0.8.5; VRAM: controlled A6000 BF16 benchmark; API pricing: Google Cloud Vertex AI. Reproduce the throughput test: &lt;code&gt;vllm serve google/gemma-4-26B-A4B-it --dtype bfloat16 --max-model-len 8192&lt;/code&gt; then hit &lt;code&gt;/v1/completions&lt;/code&gt; with a 512-token prompt.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OpenCost attributes cost based on resource allocation, not utilization. Both models allocate 1× A100. Both get billed identically per GPU-hour. But the MoE delivers 4.4× higher single-user throughput and 2× lower per-token cost.&lt;/p&gt;

&lt;p&gt;In a multi-tenant cluster where Team A runs the 26B MoE and Team B runs the 31B Dense, OpenCost bills them the same rate. Team A gets 4.4× the throughput on identical hardware. The fair billing unit for MoE isn't GPU-hours — it's &lt;strong&gt;GPU-hours weighted by active parameter ratio&lt;/strong&gt;. No Kubernetes cost attribution tool I've found supports this.&lt;/p&gt;

&lt;p&gt;The math: Gemma 4's total-to-active ratio is 6.6:1 (25.2B / 3.8B). Mixtral's is 3.6:1 (47B / 13B). The cost attribution error for Gemma 4 is almost double Mixtral's — a direct consequence of the 3-pathway design keeping more parameters loaded but inactive.&lt;/p&gt;
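
&lt;p&gt;To make the attribution gap concrete, here is a rough sketch of the two quantities involved: GPU-hours scaled by the active parameter ratio, and cost per million tokens derived from decode throughput. The GPU-hour price is a hypothetical placeholder, the function names are mine, and how a weight like this should feed into an actual bill is a policy choice the sketch does not make.&lt;/p&gt;

```python
def weighted_gpu_hours(gpu_hours, active_ratio):
    """Scale allocated GPU-hours by the share of parameters active per token."""
    return gpu_hours * active_ratio


def cost_per_million_tokens(price_per_gpu_hour, tokens_per_second):
    """Convert a GPU-hour price and a sustained decode rate into cost per 1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_gpu_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    PRICE = 2.00  # hypothetical GPU-hour price in dollars, not a quoted rate
    teams = {
        # Throughput figures are the single-user numbers from the table above.
        "team-a (26B MoE)": {"tok_s": 177.1, "active_ratio": 3.8 / 25.2},
        "team-b (31B Dense)": {"tok_s": 40.3, "active_ratio": 1.0},
    }
    for name, t in teams.items():
        weighted = weighted_gpu_hours(1.0, t["active_ratio"])
        per_million = cost_per_million_tokens(PRICE, t["tok_s"])
        print(f"{name}: weighted GPU-hours per allocated hour = {weighted:.2f}, "
              f"cost per 1M tokens = {per_million:.2f} USD")
```

&lt;p&gt;Both teams are billed the same per allocated hour, yet the per-token figures differ by the same 4.4× factor as the throughput.&lt;/p&gt;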




&lt;h2&gt;
  
  
  Consequence 2: GPU Utilization Lies to the Autoscaler
&lt;/h2&gt;

&lt;p&gt;KServe supports HPA-based autoscaling. The standard setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serving.kserve.io/autoscalerClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hpa"&lt;/span&gt;
    &lt;span class="na"&gt;serving.kserve.io/targetUtilizationPercentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;70"&lt;/span&gt;
    &lt;span class="na"&gt;serving.kserve.io/metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For dense models, utilization roughly tracks inference load, whether the HPA watches CPU as in the snippet above or a GPU compute metric. More requests → more compute → higher utilization. Scaling at 70% is a reasonable proxy.&lt;/p&gt;

&lt;p&gt;For the MoE, this proxy breaks. The causal chain:&lt;/p&gt;

&lt;p&gt;LLM decode is memory-bandwidth-bound, not compute-bound. Each output token reads model weights from VRAM. Google Cloud's analysis found a cluster pushing 1M tokens/second showed only 4.4% GPU FLOPS utilization — tensor cores finish in microseconds, then wait for data.&lt;/p&gt;

&lt;p&gt;Gemma 4 amplifies this: 25.2B of weights sit in VRAM, but only 3.8B do arithmetic per token. The memory bus shuttles expert weights that may not even fire.&lt;/p&gt;

&lt;p&gt;Result: memory bandwidth saturates — throughput degrades, latency climbs — while GPU compute utilization reads 30–40%.&lt;/p&gt;

&lt;p&gt;HPA sees 40% against a 70% target. Decision: &lt;em&gt;don't scale.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The correct autoscaling metric for MoE isn't utilization — it's request latency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-moe-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-moe&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pods&lt;/span&gt;
      &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm_request_p95_latency_seconds&lt;/span&gt;
        &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AverageValue&lt;/span&gt;
          &lt;span class="na"&gt;averageValue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.0"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



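&lt;p&gt;This requires a custom metrics pipeline (Prometheus adapter → HPA custom metrics API). The metric name in the manifest is one you define in the adapter, for example a p95 derived from vLLM's request-latency histogram; it is not something the HPA can read out of the box. That is significantly more infrastructure than built-in CPU scaling. But without it, the default autoscaling path that works for every dense model silently fails for MoE.&lt;/p&gt;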
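
&lt;p&gt;Why the utilization-based HPA sits still is easier to see with the scaling rule written out. The sketch below follows the ratio rule from the Kubernetes HPA documentation (desired = ceil(current replicas × current value / target value), with a default tolerance of 0.1). The 4.0 s latency reading is a hypothetical value chosen to represent a doubled p95, and the function is a simplification that ignores stabilization windows and missing-metric handling.&lt;/p&gt;

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=4, tolerance=0.1):
    """Apply the HPA ratio rule: ceil(current_replicas * current / target)."""
    ratio = current_value / target_value
    # Within the tolerance band around 1.0 the HPA leaves the replica count alone.
    if math.isclose(ratio, 1.0, abs_tol=tolerance):
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the bounds set in the HPA spec (minReplicas / maxReplicas).
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Utilization view: 40% observed against the 70% target.
    print("utilization:", desired_replicas(1, 40.0, 70.0))
    # Latency view: a hypothetical 4.0 s p95 against the 2.0 s target in the manifest.
    print("p95 latency:", desired_replicas(1, 4.0, 2.0))
```

&lt;p&gt;With one replica running, the utilization view keeps one replica while the latency view asks for two, which is the whole disagreement in miniature.&lt;/p&gt;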




&lt;h2&gt;
  
  
  Consequence 3: Expert Parallelism Crashes Under Data Parallelism
&lt;/h2&gt;

&lt;p&gt;vLLM supports expert parallelism for MoE models — distributing individual experts across GPUs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve google/gemma-4-26B-A4B-it &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--enable-expert-parallel&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding &lt;code&gt;--data-parallel-size 2&lt;/code&gt; to scale horizontally: weights load, CUDA graphs capture, API servers start — then it crashes on the first inference request.&lt;/p&gt;

&lt;p&gt;Reproduce it yourself (2× GPU required):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve google/gemma-4-26B-A4B-it &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--data-parallel-size&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 32768
&lt;span class="c"&gt;# Send any request → crash&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exact error from vLLM issue #38999:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vllm/distributed/device_communicators/cuda_communicator.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_all_gather_single&lt;/span&gt;
&lt;span class="nb"&gt;AssertionError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;36&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Root cause: The MoE fused expert layer assumes that multiple GPUs means expert parallelism, triggering inter-GPU &lt;code&gt;all_gather&lt;/code&gt; operations for expert routing. Under data parallelism, each GPU should run an independent full copy — no cross-GPU expert communication. The DP workers initialize with EP-style communication, causing a tensor shape mismatch.&lt;/p&gt;

&lt;p&gt;Dense models (31B, E4B) work fine with &lt;code&gt;--data-parallel-size &amp;gt; 1&lt;/code&gt;. The crash is specific to MoE + DP.&lt;/p&gt;

&lt;p&gt;A vLLM collaborator confirmed: you must pass &lt;code&gt;--enable-expert-parallel&lt;/code&gt; when deploying MoE in DP mode. Without it, DP mode crashes. With it, the deployment runs, but you get expert parallelism semantics (experts distributed across GPUs) instead of data parallelism semantics (full model copies per GPU).&lt;/p&gt;

&lt;p&gt;For a platform that lets developers self-serve model deployments, this means: &lt;strong&gt;you cannot expose "number of GPUs" as a user-facing knob for MoE models.&lt;/strong&gt; The parallelism strategy determines whether the deployment runs or crashes, and the correct choice depends on the model architecture — not the model size.&lt;/p&gt;
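
&lt;p&gt;One way a platform could take that knob away from users is to derive the flags from the model config. This is a sketch under assumptions: the expert-count keys listed are ones used by some Hugging Face MoE configs, I have not confirmed which key Gemma 4 uses, and the sample config dictionaries are hypothetical. Only the three vLLM flags already shown in this post are emitted.&lt;/p&gt;

```python
# Expert-count keys seen in some Hugging Face MoE configs. Which key a given
# model family uses is an assumption to verify against its config.json.
MOE_CONFIG_KEYS = ("num_local_experts", "num_experts", "n_routed_experts")


def expert_count(model_config):
    """Largest expert count found under any known key, or 0 if none is set."""
    return max(int(model_config.get(key) or 0) for key in MOE_CONFIG_KEYS)


def is_moe(model_config):
    """Treat the model as MoE when it declares more than one expert."""
    return expert_count(model_config) not in (0, 1)


def build_vllm_args(model_id, model_config, tensor_parallel=1, data_parallel=1):
    """Build the serve command, adding the EP flag only for MoE under DP."""
    args = ["vllm", "serve", model_id,
            "--tensor-parallel-size", str(tensor_parallel)]
    if data_parallel != 1:
        args += ["--data-parallel-size", str(data_parallel)]
        if is_moe(model_config):
            # Follows the guidance quoted above: MoE with DP needs this flag.
            args.append("--enable-expert-parallel")
    return args


if __name__ == "__main__":
    moe_config = {"num_experts": 128}  # hypothetical config contents
    print(" ".join(build_vllm_args("google/gemma-4-26B-A4B-it", moe_config, 2, 2)))
```

&lt;p&gt;The user still picks a GPU count; the platform decides what that count means for the architecture in front of it.&lt;/p&gt;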




&lt;h2&gt;
  
  
  The Outage That Taught Me to Distrust Defaults
&lt;/h2&gt;

&lt;p&gt;I wouldn't have found any of this without getting burned first.&lt;/p&gt;

&lt;p&gt;KServe's default config assumes Istio. We ran Kourier (~100 MB vs Istio's ~1 GB) because our k3d dev cluster didn't have the RAM. Result: all InferenceService creation blocked cluster-wide. Three hours of downtime. The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;disableIstioVirtualHost&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One boolean, not in the getting-started docs. Found it in the controller source.&lt;/p&gt;

&lt;p&gt;The lesson applies directly: when the YAML looks right and everything deploys green, check what the YAML doesn't say. The MoE manifest is valid. The pod starts. The cost model, autoscaler, and parallelism strategy are all silently wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CI That Lied for Two Weeks
&lt;/h2&gt;

&lt;p&gt;For two weeks, Kyverno policy checks showed green on every PR. Policies weren't enforced at all.&lt;/p&gt;

&lt;p&gt;The bug: &lt;code&gt;kyverno-cli apply&lt;/code&gt; exits code 0 even when policies are violated. Violations print to stdout; the exit code — what CI checks — says "success." The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OUTPUT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;kyverno-cli apply ./policies/ &lt;span class="nt"&gt;--resource&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$manifest&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;&amp;amp;1&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$OUTPUT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qi&lt;/span&gt; &lt;span class="s2"&gt;"fail&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;violation&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;denied"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Policy violation detected"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any developer could have deployed an InferenceService without resource limits, cost labels, or ownership metadata. CI was green. Governance was broken.&lt;/p&gt;

&lt;p&gt;Same pattern as the MoE manifest: the configuration is valid, the deployment succeeds, and the assumptions underneath are wrong.&lt;/p&gt;
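
&lt;p&gt;A caveat on text matching in CI: if the CLI prints a summary such as &lt;code&gt;pass: 1, fail: 0&lt;/code&gt;, a case-insensitive search for "fail" also matches the zero count. A sketch that parses the counts instead, assuming that summary format (an assumption to check against the output of your Kyverno CLI version; the function names are mine):&lt;/p&gt;

```python
import re

# Assumed summary shape: "pass: 1, fail: 0, warn: 0, error: 0, skip: 0".
SUMMARY_PATTERN = re.compile(r"\b(pass|fail|warn|error|skip):\s*(\d+)")


def summary_counts(cli_output):
    """Collect summary counts from the output; later matches override earlier ones."""
    counts = {}
    for key, value in SUMMARY_PATTERN.findall(cli_output):
        counts[key] = int(value)
    return counts


def should_fail_ci(cli_output):
    """Return True when the run reports failures or errors, or has no summary."""
    counts = summary_counts(cli_output)
    if not counts:
        # No recognizable summary: fail closed rather than report a false green.
        return True
    return counts.get("fail", 0) != 0 or counts.get("error", 0) != 0


if __name__ == "__main__":
    sample = "pass: 2, fail: 1, warn: 0, error: 0, skip: 0"  # hypothetical output
    print("block the PR:", should_fail_ci(sample))
```

&lt;p&gt;Failing closed on unrecognized output is deliberate: a format change in the tool should turn CI red, not quietly green.&lt;/p&gt;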




&lt;h2&gt;
  
  
  What Platform Engineers Should Do
&lt;/h2&gt;

&lt;p&gt;After measuring all of this, here's the playbook:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tag InferenceServices with architecture metadata.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;model-architecture&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;moe"&lt;/span&gt;
  &lt;span class="na"&gt;active-param-ratio&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.15"&lt;/span&gt;    &lt;span class="c1"&gt;# 3.8B / 25.2B&lt;/span&gt;
  &lt;span class="na"&gt;total-params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;25.2B"&lt;/span&gt;
  &lt;span class="na"&gt;cost-center&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cc-ml-inference"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your cost attribution, alerting, and capacity planning all need to know that this 48 GB model computes like a 4B model. Gemma 4's 3-pathway design (dense FFN + shared expert + routed experts) makes the total-to-active ratio higher than other MoE architectures — the metadata must capture this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Autoscale on latency, not utilization.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPU compute utilization is a lagging, misleading indicator for MoE. Memory bandwidth saturation hits first. Use a latency or throughput signal as your HPA metric, exposed through your metrics adapter under a name such as &lt;code&gt;vllm_request_p95_latency_seconds&lt;/code&gt; or &lt;code&gt;vllm_tokens_per_second&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Don't expose parallelism strategy as a user knob.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your platform should detect MoE vs Dense from the model config and set &lt;code&gt;--enable-expert-parallel&lt;/code&gt; accordingly. Self-serve GPU count selection will produce runtime crashes for MoE models if the parallelism strategy is wrong — the model loads, CUDA graphs capture, then it crashes on the first request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Budget VRAM for total params, compute for active params.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Resource requests = total parameters (scheduling). Capacity planning = active parameters (throughput). In the single-user H100 benchmark above, the Gemma 4 MoE decodes 177 tok/s and the 31B Dense decodes 40 tok/s. Same GPU request, 4.4× throughput difference. Your capacity model must account for this or you'll over-provision MoE by 4×.&lt;/p&gt;
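
&lt;p&gt;A back-of-the-envelope version of that capacity model, as a sketch only: it treats the single-user decode figures from the table as per-replica capacity, which is a simplification (batched serving behaves differently), and the 1,000 tok/s target is a hypothetical load.&lt;/p&gt;

```python
import math


def replicas_needed(target_tokens_per_second, per_replica_tokens_per_second):
    """Smallest replica count whose combined decode rate meets the target."""
    return math.ceil(target_tokens_per_second / per_replica_tokens_per_second)


if __name__ == "__main__":
    TARGET = 1000.0  # hypothetical aggregate decode target in tok/s
    print("26B MoE replicas:", replicas_needed(TARGET, 177.1))
    print("31B Dense replicas:", replicas_needed(TARGET, 40.3))
```

&lt;p&gt;Sizing the MoE fleet with the dense model's per-replica figure would request roughly four times the GPUs the workload needs.&lt;/p&gt;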




&lt;h2&gt;
  
  
  The Deeper Pattern: Architecture-Aware Scheduling
&lt;/h2&gt;

&lt;p&gt;These three failures (cost, scaling, parallelism) point to a structural gap in Kubernetes: &lt;strong&gt;the scheduler is architecture-blind.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;resources.requests.nvidia.com/gpu: 1&lt;/code&gt; tells the scheduler to find a node with a free GPU. It says nothing about whether that GPU will be memory-bandwidth-bound or compute-bound, whether the model activates 15% or 100% of its parameters, or whether horizontal scaling requires expert parallelism flags. The scheduler treats a 26B MoE and a 31B Dense as identical workloads because they request the same resource.&lt;/p&gt;

&lt;p&gt;This was fine when every model was dense. With Gemma 4's 3-pathway MoE entering production, the abstraction leaks. The fix isn't just labels and custom metrics — it's treating model architecture as a first-class scheduling dimension, the same way we treat GPU type, memory, and topology today.&lt;/p&gt;

&lt;p&gt;Every claim in this post — cost attribution errors, autoscaler blindness, DP crashes — is reproducible. The vLLM issue is public. The benchmarks are from published controlled tests (JarvisLabs, Google Cloud). The KServe outage and Kyverno bug happened on a real platform with 108 commits of history.&lt;/p&gt;

&lt;p&gt;The model works. The YAML looks right. The CI is green.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The governance is broken. And now I can prove it.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;NeuroScale is open source: &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt;. The &lt;code&gt;BEFORE.md&lt;/code&gt; and &lt;code&gt;AFTER.md&lt;/code&gt; in each milestone directory tell the full story of what went wrong and what I learned.&lt;/p&gt;




</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>I Revived a Broken MLOps Platform — Now It's Self-Service, Policy-Guarded, and Operationally Credible</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Sat, 23 May 2026 00:42:39 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/i-revived-a-broken-mlops-platform-now-its-self-service-policy-guarded-and-operationally-55nj</link>
      <guid>https://forem.com/sodiqjimoh/i-revived-a-broken-mlops-platform-now-its-self-service-policy-guarded-and-operationally-55nj</guid>
      <description>&lt;p&gt;&lt;em&gt;Submitted for the &lt;a href="https://dev.to/challenges/github"&gt;GitHub Copilot Challenge&lt;/a&gt; — deadline June 7, 2026. Built with GitHub Copilot as an active architectural and debugging partner.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;On April 4th, I abandoned this platform. Backstage crashed. ArgoCD was broken. KServe couldn't serve a single model. I walked away and left it for dead.&lt;/p&gt;

&lt;p&gt;48 days later, I came back and rebuilt it into a self-service AI inference platform with GitOps, policy enforcement, and deterministic recovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;21 checks. 0 failures. Reproducible on any machine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkdrn4kkuvezuiyub4qu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkdrn4kkuvezuiyub4qu.png" alt="Before vs After — CrashLoopBackOff and kubectl apply → hope on the left. Backstage form → ArgoCD synced → PASS 21/FAIL 0 on the right. 48 Days Later." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Watch It Run: 21 Checks, 0 Failures
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jybvasne6iahluim73r.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jybvasne6iahluim73r.gif" alt="NeuroScale smoke test — 21 automated checks running live: ArgoCD drift self-heal, KServe inference returning predictions, Kyverno blocking non-compliant manifests, all green" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not a claim. The video below shows every check running live against a real k3d cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://storage.googleapis.com/runable-templates/cli-uploads%2FKOX2Ek1YgxEESzcJKMx3OuH8Kfvn0qwn%2FZzLPz_kcSP_ofvCz1mZ77%2Fsmoke-test-demo_ppdIdW.mp4" rel="noopener noreferrer"&gt;▶ Watch the full smoke test demo — 21 checks, 0 failures&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qxe1c6k6ng94ib3upod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qxe1c6k6ng94ib3upod.png" alt="Real terminal: Final smoke test results — PASS 21, FAIL 0, SKIP 1" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;━━━ Milestone A — GitOps Spine (ArgoCD) ━━━
  [✓ PASS] All ArgoCD pods are Running
  [✓ PASS] ArgoCD Applications: 7 Healthy, 0 Progressing, 7 total
  [✓ PASS] ArgoCD sync visibility: no Unknown states (7/7 Synced)
  [✓ PASS] Drift self-heal: nginx-test recreated and Ready in ~20s

━━━ Milestone B — AI Serving Baseline (KServe) ━━━
  [✓ PASS] KServe controller-manager: 1 replica available
  [✓ PASS] InferenceServices: 2/2 Ready=True
  [✓ PASS] Inference request: demo-iris-2 returned predictions
           ↳ Response: {"predictions":[1,1]}

━━━ Milestone C — Golden Path (Backstage) ━━━
  [✓ PASS] Backstage deployment: 1 replica available
  [✓ PASS] demo-iris-2 InferenceService exists (scaffolder output)
  [✓ PASS] demo-iris-2 ArgoCD Application exists (ApplicationSet output)

━━━ Milestone D — Guardrails (Kyverno + CI) ━━━
  [✓ PASS] Kyverno pods running: 3
  [✓ PASS] Kyverno ClusterPolicies installed: 5 policies
  [✓ PASS] Admission block: non-compliant InferenceService correctly denied

━━━ Milestone F — Production Hardening ━━━
  [✓ PASS] ApplicationSet neuroscale-model-endpoints exists
  [✓ PASS] ArgoCD has 7 Applications (ApplicationSet + static)
  [✓ PASS] ResourceQuota exists in namespace default
  [✓ PASS] LimitRange exists in namespace default
  [✓ PASS] Non-root admission block: root-container Deployment denied
  [✓ PASS] OpenCost deployment healthy: 1 replica available

  PASS 21 / FAIL 0 / SKIP 1
  ✓ All checks passed. Platform is healthy and ready to demo.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The single SKIP is the drift self-heal pre-condition check — normal after a previous test run. The drift self-heal itself passed, visible in the output above and in the video.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Reproducible on any machine:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash scripts/bootstrap.sh     &lt;span class="c"&gt;# ~5 minutes — requires Docker + k3d&lt;/span&gt;
bash scripts/smoke-test.sh    &lt;span class="c"&gt;# 21 checks, all green&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Problem: A Platform That Was Abandoned and Dangerous
&lt;/h2&gt;

&lt;p&gt;NeuroScale started in February 2026 as an AI inference platform on Kubernetes. By early April it was abandoned — Backstage crashing, ArgoCD broken, KServe unable to serve a single model. The last commit before this challenge was April 4th. Then 48 days of silence.&lt;/p&gt;

&lt;p&gt;Here's what I found when I came back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backstage:&lt;/strong&gt; &lt;code&gt;CrashLoopBackOff&lt;/code&gt; — 14 restarts. A Helm values nesting bug caused probe timings to be silently ignored.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ArgoCD repo-server:&lt;/strong&gt; &lt;code&gt;CrashLoopBackOff&lt;/code&gt; — every application showed &lt;code&gt;Unknown&lt;/code&gt;, meaning ArgoCD couldn't even evaluate their state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KServe:&lt;/strong&gt; &lt;code&gt;READY=False&lt;/code&gt; — default config assumed Istio for ingress, but the cluster ran Kourier. Error: &lt;code&gt;"virtual service not found"&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy enforcement:&lt;/strong&gt; None. Root containers, no resource limits, &lt;code&gt;:latest&lt;/code&gt; tags — deployed freely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift detection:&lt;/strong&gt; None. Manual &lt;code&gt;kubectl&lt;/code&gt; changes accumulated silently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The deployment process was &lt;code&gt;vim&lt;/code&gt; → &lt;code&gt;kubectl apply&lt;/code&gt; → hope. Developers feared deploying models. The platform was technically worse than not having one.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built: Five Enforcement Layers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn7wmg21nd1g3442ptce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn7wmg21nd1g3442ptce.png" alt="NeuroScale Platform Architecture — Backstage → GitHub PR + CI → ArgoCD → Kyverno → KServe + Kourier → OpenCost" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Self-Service Golden Path
&lt;/h3&gt;

&lt;p&gt;A developer fills in a Backstage form. The platform does everything else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Backstage form → PR created → CI validates → Merge → ArgoCD syncs
  → ApplicationSet discovers → KServe endpoint live → Predictions working
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;kubectl&lt;/code&gt;. No YAML editing. No tribal knowledge. The &lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/backstage/templates/model-endpoint/template.yaml" rel="noopener noreferrer"&gt;template.yaml&lt;/a&gt; generates a compliant &lt;code&gt;InferenceService&lt;/code&gt; manifest, opens a PR, and the &lt;code&gt;neuroscale-model-endpoints&lt;/code&gt; ApplicationSet auto-discovers it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxx9ux47cvg7y4d283dl0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxx9ux47cvg7y4d283dl0.png" alt="Backstage Scaffolder UI — KServe model endpoint form: endpoint name (DNS pattern enforced), model format dropdown, storage URI, owner label, cost center label" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Five fields. No kubectl. No YAML. DNS pattern enforced client-side, cost center required. Click Next and Backstage does the rest.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7l83qosyh2nqzhvlqng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7l83qosyh2nqzhvlqng.png" alt="Backstage scaffolder execution log — " width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Two steps, 9 seconds total. PR opened, ApplicationSet picks it up on next ArgoCD sync.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: GitOps Drift Control
&lt;/h3&gt;

&lt;p&gt;Git is the source of truth. Drift is auto-corrected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl delete deploy nginx-test &lt;span class="nt"&gt;-n&lt;/span&gt; default
&lt;span class="c"&gt;# 20 seconds later...&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get deploy nginx-test &lt;span class="nt"&gt;-n&lt;/span&gt; default
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
nginx-test   1/1     1            1           8s   &lt;span class="c"&gt;# Auto-recreated by ArgoCD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;selfHeal: true&lt;/code&gt; and &lt;code&gt;prune: true&lt;/code&gt;. Manual cluster changes cannot persist.&lt;/p&gt;
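
&lt;p&gt;The demo waits a fixed 20 seconds before checking. In a script, polling until the resource returns is sturdier than a fixed sleep. A minimal sketch, assuming a hypothetical helper named &lt;code&gt;wait_until&lt;/code&gt; that is not part of the repo:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or the timeout elapses.
# Returns 0 on success, 1 on timeout. (Hypothetical helper name.)
wait_until() {
  local timeout="$1" interval="$2"; shift 2
  local waited=0
  until "$@" >/dev/null 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Example (needs a cluster, so it is left commented out):
# kubectl delete deploy nginx-test -n default
# wait_until 60 2 kubectl get deploy nginx-test -n default
```

&lt;p&gt;The timeout keeps a broken self-heal from hanging the test forever: the helper gives up and returns 1 instead.&lt;/p&gt;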

&lt;h3&gt;
  
  
  Layer 3: Policy Guardrails — Shift-Left + Shift-Down
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;At PR time (CI):&lt;/strong&gt; kubeconform validates schemas. &lt;code&gt;kyverno-cli&lt;/code&gt; simulates all 5 policies against rendered manifests with a dual exit-code + stdout check to guard against false-greens. Full pipeline at &lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;.github/workflows/guardrails-checks.yaml&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At admission time (cluster):&lt;/strong&gt; Kyverno rejects non-compliant resources at the API server, before they are persisted to the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7udagtvnfhvsbbtyfrdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7udagtvnfhvsbbtyfrdx.png" alt="Real terminal: Kyverno admission denial — " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Five enforced policies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Policy&lt;/th&gt;
&lt;th&gt;What It Blocks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;require-standard-labels-inferenceservice&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Missing &lt;code&gt;owner&lt;/code&gt; + &lt;code&gt;cost-center&lt;/code&gt; labels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;require-standard-labels-deployment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Missing ownership labels on Deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;require-resource-requests-limits&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No CPU/memory requests or limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disallow-latest-image-tag&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Floating &lt;code&gt;:latest&lt;/code&gt; image tags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disallow-root-containers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Containers without &lt;code&gt;runAsNonRoot: true&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
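
&lt;p&gt;A rule like &lt;code&gt;disallow-latest-image-tag&lt;/code&gt; can also be approximated locally for faster feedback. The sketch below is plain text matching, not Kyverno's own logic and not a YAML parser; &lt;code&gt;has_floating_tag&lt;/code&gt; is a hypothetical name, and it only looks for an explicit &lt;code&gt;:latest&lt;/code&gt; suffix.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# has_floating_tag FILE: succeed (exit 0) when FILE has an `image:` line
# whose reference ends in :latest. Quotes are stripped first so that
# image: "nginx:latest" is treated the same as image: nginx:latest.
# Text matching only, so treat it as a quick hint, not a policy engine.
has_floating_tag() {
  sed "s/[\"']//g" "$1" |
    grep -Eq '^[[:space:]-]*image:[[:space:]]*[^[:space:]]+:latest[[:space:]]*$'
}

# Example:
# if has_floating_tag deployment.yaml; then echo "pin the image tag"; fi
```

&lt;p&gt;A tag such as &lt;code&gt;latest-alpine&lt;/code&gt; does not match, because the pattern requires the line to end right after &lt;code&gt;:latest&lt;/code&gt;.&lt;/p&gt;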

&lt;h3&gt;
  
  
  Layer 4: Cost Attribution
&lt;/h3&gt;

&lt;p&gt;Every workload carries &lt;code&gt;owner&lt;/code&gt; and &lt;code&gt;cost-center&lt;/code&gt; labels enforced by Kyverno — you can't deploy without them. OpenCost reads these via Prometheus for per-team cost breakdowns. The CI pipeline also &lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml" rel="noopener noreferrer"&gt;comments on PRs&lt;/a&gt; with CPU/memory deltas and flags workloads exceeding thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5: Operational Recovery
&lt;/h3&gt;

&lt;p&gt;Documented runbooks for every failure mode encountered. 3-command, 2-minute recovery procedures. Full runbook at &lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/runbook.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Copilot Partnership: Three Moments That Mattered
&lt;/h2&gt;

&lt;p&gt;Copilot didn't write this platform. It functioned as a senior infrastructure advisor at three exact moments where I could have stayed stuck for days.&lt;/p&gt;

&lt;h3&gt;
  
  
  Moment 1: The Architectural Decision — Kourier vs Istio
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; KServe stuck at &lt;code&gt;READY=False&lt;/code&gt;. Error: &lt;code&gt;"virtual service not found"&lt;/code&gt; — an Istio concept on a cluster running Kourier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrkhxtzoaogji2kqgcgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrkhxtzoaogji2kqgcgq.png" alt="Copilot Chat — KServe virtual service not found. Copilot searches repo files, identifies Kourier/Istio mismatch, delivers 5-step fix with exact ConfigMap keys" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnusbvfcva6ly79rlb5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnusbvfcva6ly79rlb5d.png" alt="Copilot Chat continued — exact kubectl commands: reapply serving-stack overlay, verify ConfigMap values, restart kserve-controller-manager and knative-serving controller" width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copilot searched the actual repo files, confirmed the non-Istio setup was already correct, and identified the root cause: stale cached config. Critical tradeoff it surfaced — Istio adds ~1GB memory overhead; Kourier is under 200MB. On a shared 8GB dev node, Istio would have killed reproducibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Reapply the serving-stack overlay, verify &lt;code&gt;disableIstioVirtualHost=true&lt;/code&gt; in ConfigMaps, restart control plane pods. Result: working inference, 800MB freed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Moment 2: The Silent Bug — CI Guardrails That Can't False-Green
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; &lt;code&gt;kyverno-cli apply&lt;/code&gt; looked green in CI. Then I tested with a deliberately non-compliant manifest. It still passed. The guardrail was checking nothing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3yr9aa8t36d9czqqtlg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3yr9aa8t36d9czqqtlg.png" alt="Copilot Chat — " width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5ypvtj485rmu7gleyrz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5ypvtj485rmu7gleyrz.png" alt="Copilot Chat continued — delivers the per-resource fan-out pattern with dual exit-status capture and independent violation detection" width="782" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two undocumented &lt;code&gt;kyverno-cli&lt;/code&gt; behaviors Copilot surfaced:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A single &lt;code&gt;--resource&lt;/code&gt; flag with multiple paths silently ignores every path after the first.&lt;/li&gt;
&lt;li&gt;Exit code is &lt;code&gt;0&lt;/code&gt; even when violations are printed to stdout.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fix (live in &lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;.github/workflows/guardrails-checks.yaml&lt;/code&gt;&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Per-resource fan-out — one file per --resource flag&lt;/span&gt;
&lt;span class="nb"&gt;mapfile&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; app_files &amp;lt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;find apps &lt;span class="nt"&gt;-type&lt;/span&gt; f &lt;span class="se"&gt;\(&lt;/span&gt; &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'*.yaml'&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'*.yml'&lt;/span&gt; &lt;span class="se"&gt;\)&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;failed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; pipefail  &lt;span class="c"&gt;# report the docker exit status instead of tee's&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;resource &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;app_files&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;log&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;&lt;span class="s2"&gt;:/work"&lt;/span&gt; &lt;span class="nt"&gt;-w&lt;/span&gt; /work ghcr.io/kyverno/kyverno-cli:v1.12.5 &lt;span class="se"&gt;\&lt;/span&gt;
      apply infrastructure/kyverno/policies/&lt;span class="k"&gt;*&lt;/span&gt;.yaml &lt;span class="nt"&gt;--resource&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$resource&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;tee&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$log&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nv"&gt;failed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
  &lt;span class="k"&gt;fi&lt;/span&gt;
  &lt;span class="c"&gt;# Dual check: exit code AND stdout — never trust one signal alone&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qiE&lt;/span&gt; &lt;span class="s1"&gt;'denied|violat|fail|error'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$log&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nv"&gt;failed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
  &lt;span class="k"&gt;fi
done

&lt;/span&gt;&lt;span class="nb"&gt;exit&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$failed&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A guardrail that silently passes is worse than no guardrail.&lt;/strong&gt; Every team using &lt;code&gt;kyverno-cli&lt;/code&gt; in CI without this pattern has a potential false-green.&lt;/p&gt;
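
&lt;p&gt;The false-green only surfaced because a deliberately bad manifest was fed through the check. That experiment can become a permanent CI step: run the checker against a known-bad fixture and fail the build if the checker passes. A minimal sketch, assuming a hypothetical helper &lt;code&gt;assert_guardrail_rejects&lt;/code&gt; and hypothetical paths in the usage comment:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# assert_guardrail_rejects FIXTURE CHECKER [ARGS...]
# Runs CHECKER with FIXTURE appended as its last argument. A healthy
# guardrail must exit non-zero for a known-bad fixture, so a zero exit
# here is reported as a failure of the guardrail itself.
assert_guardrail_rejects() {
  local fixture="$1"; shift
  if "$@" "$fixture" >/dev/null 2>/dev/null; then
    echo "guardrail accepted known-bad fixture: $fixture"
    return 1
  fi
  echo "guardrail rejected known-bad fixture: $fixture"
  return 0
}

# Example (both paths are placeholders, not files from the repo):
# assert_guardrail_rejects tests/fixtures/missing-labels.yaml ./policy-check.sh
```

&lt;p&gt;This inverts the usual assertion on purpose: the step goes red when the policy check goes green on input that should never pass.&lt;/p&gt;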

&lt;h3&gt;
  
  
  Moment 3: The Recurring Crash — CrashLoopBackOff and the Runbook
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Backstage in &lt;code&gt;CrashLoopBackOff&lt;/code&gt; — 7 restarts. Pod failing health checks it never properly configured.&lt;/p&gt;

&lt;p&gt;I pasted the raw &lt;code&gt;kubectl get pods -n backstage -w&lt;/code&gt; output directly into Copilot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxrjvlp1sod0now78yuw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxrjvlp1sod0now78yuw.png" alt="Copilot Chat — kubectl output showing CrashLoopBackOff 7 restarts. Copilot plans diagnosis, checks pod termination cause" width="799" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbv7r10sojcjfghibq3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbv7r10sojcjfghibq3u.png" alt="Copilot Chat — root cause found: wrong Helm values nesting backstage.backend/backstage.appConfig → must be backstage.backstage.*. Startup probe missing, probes using aggressive defaults." width="799" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Root cause: Backstage is a dependency chart — app settings must be nested under &lt;code&gt;backstage.backstage.*&lt;/code&gt;. Keys at the wrong level meant Helm silently used defaults, including &lt;code&gt;failureThreshold: 3&lt;/code&gt; with aggressive timings and no &lt;code&gt;startupProbe&lt;/code&gt;. The container kept failing before the plugin system finished initializing.&lt;/p&gt;

&lt;p&gt;ArgoCD hit the same pattern: Kyverno installation disrupted the repo-server's gRPC channel, which doesn't auto-reconnect — causing all apps to show &lt;code&gt;Unknown&lt;/code&gt;. Copilot identified both as the same root pattern and helped write a deterministic recovery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd rollout restart deploy/argocd-repo-server
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd rollout status deploy/argocd-repo-server &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;120s
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get applications  &lt;span class="c"&gt;# All Synced/Healthy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That became &lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/runbook.md&lt;/code&gt;&lt;/a&gt;. The platform doesn't just work — it's recoverable. That's the difference between a demo and a real platform.&lt;/p&gt;




&lt;h2&gt;
  
  
  Before vs After
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For Developers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Edit YAML by hand, &lt;code&gt;kubectl apply&lt;/code&gt;, hope&lt;/td&gt;
&lt;td&gt;Fill a Backstage form, review a PR, merge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No policy feedback until deployment fails&lt;/td&gt;
&lt;td&gt;CI blocks non-compliant manifests before merge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No cost visibility&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml" rel="noopener noreferrer"&gt;PR comment shows CPU/memory delta&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tribal knowledge required&lt;/td&gt;
&lt;td&gt;Anyone can deploy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  For Operators
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Manual cluster inspection for drift&lt;/td&gt;
&lt;td&gt;ArgoCD self-heals in ~20 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No runbooks — "ask the person who built it"&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;Documented recovery&lt;/a&gt; for every failure mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No smoke tests&lt;/td&gt;
&lt;td&gt;21-check automated verification, any machine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No namespace governance&lt;/td&gt;
&lt;td&gt;ResourceQuota + LimitRange enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  For the Platform
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Abandoned since April 4th&lt;/td&gt;
&lt;td&gt;Finished, documented, reproducible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Collection of broken parts&lt;/td&gt;
&lt;td&gt;6 milestones, 21 verified checks, 0 failures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual and error-prone&lt;/td&gt;
&lt;td&gt;Automated and policy-enforced end-to-end&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What Made This Real: The Failures
&lt;/h2&gt;

&lt;p&gt;This was not built on the happy path. Every milestone hit real failures:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Milestone&lt;/th&gt;
&lt;th&gt;Key Failure&lt;/th&gt;
&lt;th&gt;What It Taught Me&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A — GitOps Spine&lt;/td&gt;
&lt;td&gt;ArgoCD &lt;code&gt;Unknown&lt;/code&gt; ≠ &lt;code&gt;Error&lt;/code&gt; — comparison engine couldn't run&lt;/td&gt;
&lt;td&gt;Don't confuse UI status with root cause&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B — KServe Serving&lt;/td&gt;
&lt;td&gt;Istio/Kourier mismatch — undocumented KServe default&lt;/td&gt;
&lt;td&gt;Always verify infrastructure defaults on constrained clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C — Golden Path&lt;/td&gt;
&lt;td&gt;Backstage CrashLoopBackOff from Helm mis-nesting — probes silently ignored&lt;/td&gt;
&lt;td&gt;CI must validate rendered manifests, not just source YAML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D — Guardrails&lt;/td&gt;
&lt;td&gt;Kyverno webhook disrupts all ArgoCD apps during install window&lt;/td&gt;
&lt;td&gt;Admission controllers need deployment ordering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E — Cost &amp;amp; CI&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;kyverno-cli&lt;/code&gt; false-green: exit 0 with actual violations&lt;/td&gt;
&lt;td&gt;Dual-check exit code AND stdout — never trust one signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F — Hardening&lt;/td&gt;
&lt;td&gt;ApplicationSet replaced per-app files — requires skeleton alignment&lt;/td&gt;
&lt;td&gt;Scaffolder templates must match GitOps discovery patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
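
&lt;p&gt;The Milestone C lesson, validating rendered output rather than source values, can be sketched as a filter over whatever the templating step prints. This is an illustration under stated assumptions: &lt;code&gt;require_rendered_key&lt;/code&gt; is a hypothetical helper, and the chart path and values file in the usage comment are placeholders.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# require_rendered_key KEY: read rendered YAML on stdin and succeed only
# if some line defines KEY (for example startupProbe). Text matching
# only: it shows the key survived templating, not that its values are sane.
require_rendered_key() {
  local key="$1"
  if grep -Eq "^[[:space:]-]*${key}:"; then
    return 0
  fi
  echo "rendered output has no '${key}:' entry; check values nesting"
  return 1
}

# Example (chart path and values file are placeholders):
# helm template backstage ./chart -f values.yaml | require_rendered_key startupProbe
```

&lt;p&gt;A mis-nested values key renders without error, so only a check on the rendered manifest can notice that the probe settings never arrived.&lt;/p&gt;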

&lt;p&gt;Full failure log and recovery steps in &lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/runbook.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and bootstrap (requires Docker + k3d + kubectl + helm)&lt;/span&gt;
git clone https://github.com/sodiq-code/neuroscale-platform.git
&lt;span class="nb"&gt;cd &lt;/span&gt;neuroscale-platform
bash scripts/bootstrap.sh     &lt;span class="c"&gt;# ~5 minutes&lt;/span&gt;

&lt;span class="c"&gt;# Verify everything works&lt;/span&gt;
bash scripts/smoke-test.sh    &lt;span class="c"&gt;# 21 checks, 0 failures&lt;/span&gt;

&lt;span class="c"&gt;# Open all UIs&lt;/span&gt;
bash scripts/port-forward-all.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After &lt;code&gt;port-forward-all.sh&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backstage&lt;/strong&gt; at &lt;code&gt;http://localhost:7010&lt;/code&gt; — developer portal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ArgoCD&lt;/strong&gt; at &lt;code&gt;http://localhost:8080&lt;/code&gt; — 7 synced applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenCost&lt;/strong&gt; at &lt;code&gt;http://localhost:9090&lt;/code&gt; — per-workload cost attribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;About 5 minutes from &lt;code&gt;git clone&lt;/code&gt; to a fully working platform. The smoke test verifies each layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Copilot helped at the exact points where strong engineering judgment mattered most: an architectural tradeoff that saved 800MB of memory, a silent CI bug that any &lt;code&gt;kyverno-cli&lt;/code&gt; user can hit, and operational recovery that turns a 2-hour outage into a 2-minute runbook.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;21 checks. 0 failures. Reproducible on any machine.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's one abandoned project you wish you had finished? Drop it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>githubchallenge</category>
      <category>githubcopilot</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Your Kyverno CI Is Lying to You: Why kyverno-cli Exits 0 on Policy Violations</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Tue, 14 Apr 2026 06:08:38 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/your-kyverno-ci-is-lying-to-you-why-kyverno-cli-exits-0-on-policy-violations-1gg9</link>
      <guid>https://forem.com/sodiqjimoh/your-kyverno-ci-is-lying-to-you-why-kyverno-cli-exits-0-on-policy-violations-1gg9</guid>
      <description>&lt;p&gt;This article is about a CI system that looked healthy for two weeks while enforcing absolutely nothing.&lt;/p&gt;

&lt;p&gt;The symptom was simple: a PR with a deliberately non-compliant Kubernetes manifest was passing CI. The Kyverno policy check step showed green. The manifest was missing a required label. Kyverno should have blocked it. It did not.&lt;/p&gt;

&lt;p&gt;This is the story of why, and the exact fix.&lt;/p&gt;

&lt;p&gt;This is part of a series on building a production-hardened AI inference platform on Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Part 1: &lt;a href="https://dev.to/sodiqjimoh/why-your-kserve-inferenceservice-wontbecome-readyfour-production-failures-and-fixes-nei"&gt;Why Your KServe InferenceService Won't Become Ready&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Part 2: &lt;a href="https://dev.to/sodiqjimoh/beyond-inferenceservice-readiness-5-gitops-failure-modes-that-break-kserve-deployments-14fb"&gt;5 GitOps Failure Modes That Break KServe Deployments&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Part 3: &lt;a href="https://dev.to/sodiqjimoh/nine-ways-backstage-breaks-before-your-developer-portal-works-4eo1"&gt;Nine Ways Backstage Breaks Before Your Developer Portal Works&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Project repo:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The context: what I was enforcing
&lt;/h2&gt;

&lt;p&gt;The NeuroScale platform requires every &lt;code&gt;InferenceService&lt;/code&gt; and &lt;code&gt;Deployment&lt;/code&gt; in the default namespace to carry two labels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;owner&lt;/code&gt; — which team owns the workload&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cost-center&lt;/code&gt; — which budget the resource consumption is charged to&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These labels feed directly into OpenCost for cost attribution. Without them, workloads appear as uncategorised spend. With Kyverno admission policies enforcing them at the cluster level, every resource is guaranteed to carry cost attribution metadata.&lt;/p&gt;

&lt;p&gt;The Kyverno ClusterPolicy looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/kyverno/policies/&lt;/span&gt;
&lt;span class="c1"&gt;#   require-standard-labels-inferenceservice.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-standard-labels-inferenceservice&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check-owner-and-cost-center-on-isvc&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
      &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="s"&gt;InferenceService resources must set&lt;/span&gt;
          &lt;span class="s"&gt;metadata.labels.owner and metadata.labels.cost-center.&lt;/span&gt;
        &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;
              &lt;span class="na"&gt;cost-center&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Admission enforcement works. Apply an InferenceService without those labels&lt;br&gt;
and Kyverno blocks it at the API server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; bad-model.yaml
Error from server: admission webhook
  &lt;span class="s2"&gt;"validate.kyverno.svc-fail"&lt;/span&gt; denied the request:
  resource InferenceService/default/bad-model was blocked
  due to the following policies
  require-standard-labels-inferenceservice:
    check-owner-and-cost-center-on-isvc:
      &lt;span class="s1"&gt;'validation error: InferenceService resources must set
      metadata.labels.owner and metadata.labels.cost-center.'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That part worked correctly. The CI part did not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The false-green: what it looked like
&lt;/h2&gt;

&lt;p&gt;The CI workflow ran &lt;code&gt;kyverno-cli&lt;/code&gt; against every changed manifest in &lt;code&gt;apps/&lt;/code&gt;&lt;br&gt;
on every pull request. The intent was to catch non-compliant manifests before&lt;br&gt;
merge — shift-left enforcement before the manifest ever reached the cluster.&lt;/p&gt;

&lt;p&gt;The original CI command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;&lt;span class="s2"&gt;:/work"&lt;/span&gt; &lt;span class="nt"&gt;-w&lt;/span&gt; /work &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/kyverno/kyverno-cli:v1.12.5 &lt;span class="se"&gt;\&lt;/span&gt;
  apply infrastructure/kyverno/policies/&lt;span class="k"&gt;*&lt;/span&gt;.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;app_files&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A test PR was created with a manifest that deliberately lacked &lt;code&gt;cost-center&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/test-bad-model/inference-service.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.kserve.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-bad-model&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-team&lt;/span&gt;
    &lt;span class="c1"&gt;# cost-center intentionally missing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected result: CI fails.&lt;br&gt;
Actual result: CI passed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;git push origin feature/test-bad-policy
&lt;span class="c"&gt;# ... CI runs ...&lt;/span&gt;
&lt;span class="c"&gt;# Result: validate-policies-against-app-manifests: ✅ PASSED&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The violation was printed to stdout. The job still showed green.&lt;/p&gt;




&lt;h2&gt;
  
  
  Root Cause Part 1: kyverno-cli apply exits 0 on violations
&lt;/h2&gt;

&lt;p&gt;This is the core issue. &lt;code&gt;kyverno-cli apply&lt;/code&gt; in the version used (v1.12.x)&lt;br&gt;
exits with code &lt;code&gt;0&lt;/code&gt; when it finds policy violations. It prints them to&lt;br&gt;
stdout, but returns a success exit code.&lt;/p&gt;

&lt;p&gt;You can verify this directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;&lt;span class="s2"&gt;:/work"&lt;/span&gt; &lt;span class="nt"&gt;-w&lt;/span&gt; /work &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/kyverno/kyverno-cli:v1.12.5 &lt;span class="se"&gt;\&lt;/span&gt;
  apply infrastructure/kyverno/policies/&lt;span class="k"&gt;*&lt;/span&gt;.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource&lt;/span&gt; apps/test-bad-model/inference-service.yaml

&lt;span class="c"&gt;# Output:&lt;/span&gt;
&lt;span class="c"&gt;# pass: 0, fail: 1, warn: 0, error: 0, skip: 0&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# policy require-standard-labels-inferenceservice -&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;#   resource default/InferenceService/test-bad-model&lt;/span&gt;
&lt;span class="c"&gt;#   FAIL: check-owner-and-cost-center-on-isvc&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$?&lt;/span&gt;
0   &lt;span class="c"&gt;# &amp;lt;-- exits 0 despite the FAIL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The violation is visible in stdout. The exit code is 0. Any CI step that&lt;br&gt;
only checks the exit code will report success.&lt;/p&gt;
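&lt;p&gt;To make the failure mode concrete, here is a minimal sketch that needs neither Docker nor Kyverno. The &lt;code&gt;simulated_policy_check&lt;/code&gt; function is a hypothetical stand-in that mimics the behavior described above (a failure in the output, exit code 0); its output text is illustrative, not captured from a real run.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the policy check: it reports a failure in its
# output but still returns 0, as described above.
simulated_policy_check() {
  echo "fail: 1"
  return 0
}

# A gate that trusts only the exit code.
if simulated_policy_check; then
  gate_result="passed"
else
  gate_result="failed"
fi

echo "exit-code-only gate: ${gate_result}"
```

&lt;p&gt;The gate reports "passed" even though the output contains a failure, which is the same shape as the false-green in CI.&lt;/p&gt;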


&lt;h2&gt;
  
  
  Root Cause Part 2: $? captures tee, not kyverno
&lt;/h2&gt;

&lt;p&gt;The CI command piped output through &lt;code&gt;tee&lt;/code&gt; to capture it for logging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run ... kyverno-cli apply ... &lt;span class="se"&gt;\&lt;/span&gt;
  2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;tee&lt;/span&gt; /tmp/kyverno-output.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even if &lt;code&gt;kyverno-cli&lt;/code&gt; had exited non-zero, &lt;code&gt;$?&lt;/code&gt; after a pipeline holds the exit&lt;br&gt;
code of the &lt;strong&gt;last command in the pipe&lt;/strong&gt; (unless &lt;code&gt;set -o pipefail&lt;/code&gt; is enabled),&lt;br&gt;
which here is &lt;code&gt;tee&lt;/code&gt;. &lt;code&gt;tee&lt;/code&gt; exits 0 as long as it can write to the file.&lt;/p&gt;

&lt;p&gt;This means two separate problems were stacked:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;kyverno-cli apply&lt;/code&gt; exits 0 on violations (kyverno behavior)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$?&lt;/code&gt; captures &lt;code&gt;tee&lt;/code&gt; exit code, not kyverno exit code (bash pipe behavior)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Either problem alone would have caused the false-green.&lt;br&gt;
Together they made the enforcement completely invisible.&lt;/p&gt;
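&lt;p&gt;The pipe behavior can be reproduced without Kyverno. In this sketch, &lt;code&gt;failing_check&lt;/code&gt; is a made-up stand-in for any command that exits non-zero, and the exit code 3 is arbitrary. It assumes bash without &lt;code&gt;pipefail&lt;/code&gt;.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a check that exits non-zero.
failing_check() {
  echo "simulated violation"
  return 3
}

failing_check | tee /dev/null

# Both expansions are evaluated before this assignment resets the status
# values, so the two variables describe the pipeline above.
pipeline_status=$? first_status=${PIPESTATUS[0]}

echo "pipeline status (\$?): ${pipeline_status}"
echo "first command status: ${first_status}"
```

&lt;p&gt;The pipeline status is the one a CI runner sees by default; the status of the first command is the one that actually carries the failure.&lt;/p&gt;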


&lt;h2&gt;
  
  
  The fix: dual check with ${PIPESTATUS[0]}
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;${PIPESTATUS[0]}&lt;/code&gt; captures the exit code of the &lt;strong&gt;first command&lt;/strong&gt; in a&lt;br&gt;
pipe, regardless of what the subsequent commands return. Combined with&lt;br&gt;
stdout parsing for violation markers, this creates a reliable enforcement&lt;br&gt;
check.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;set&lt;/span&gt; +e
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;&lt;span class="s2"&gt;:/work"&lt;/span&gt; &lt;span class="nt"&gt;-w&lt;/span&gt; /work &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/kyverno/kyverno-cli:v1.12.5 &lt;span class="se"&gt;\&lt;/span&gt;
  apply infrastructure/kyverno/policies/&lt;span class="k"&gt;*&lt;/span&gt;.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;app_files&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;tee&lt;/span&gt; /tmp/kyverno-output.txt
&lt;span class="nv"&gt;kyverno_exit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PIPESTATUS&lt;/span&gt;&lt;span class="p"&gt;[0]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;kyverno_exit&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-ne&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qE&lt;/span&gt; &lt;span class="s2"&gt;"^FAIL"&lt;/span&gt; /tmp/kyverno-output.txt &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qE&lt;/span&gt; &lt;span class="s2"&gt;"fail: [1-9][0-9]*"&lt;/span&gt; /tmp/kyverno-output.txt&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Kyverno policy violations detected. Failing CI."&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why two checks instead of one:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;${PIPESTATUS[0]}&lt;/code&gt; check handles cases where &lt;code&gt;kyverno-cli&lt;/code&gt; itself&lt;br&gt;
exits non-zero — which may happen in future versions or on error conditions.&lt;br&gt;
The stdout grep checks handle the current v1.12.x behavior where violations&lt;br&gt;
print &lt;code&gt;FAIL&lt;/code&gt; to stdout but exit 0. Together they cover both current behavior&lt;br&gt;
and future behavior changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;set +e&lt;/code&gt; before the command:&lt;/strong&gt;&lt;/p&gt;

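&lt;p&gt;When the step runs with errexit and &lt;code&gt;pipefail&lt;/code&gt; (GitHub Actions uses&lt;br&gt;
&lt;code&gt;bash -eo pipefail&lt;/code&gt; when a step sets &lt;code&gt;shell: bash&lt;/code&gt; explicitly), a non-zero exit&lt;br&gt;
from kyverno would abort the script before &lt;code&gt;${PIPESTATUS[0]}&lt;/code&gt; could be captured.&lt;br&gt;
&lt;code&gt;set +e&lt;/code&gt; disables errexit temporarily so we can capture and evaluate the exit&lt;br&gt;
code explicitly.&lt;/p&gt;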
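&lt;p&gt;One way to exercise the dual check in isolation is to wrap it in a function and feed it sample inputs. This is a sketch, not the NeuroScale implementation: &lt;code&gt;has_violations&lt;/code&gt; and &lt;code&gt;report&lt;/code&gt; are hypothetical helper names, and the summary lines are illustrative samples rather than captured output.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Returns 0 (true) when either signal indicates a violation:
# a non-zero exit code, or failure markers in the captured output.
has_violations() {
  local exit_code="$1" output_file="$2"
  [ "${exit_code}" -ne 0 ] \
    || grep -qE "^FAIL" "${output_file}" \
    || grep -qE "fail: [1-9][0-9]*" "${output_file}"
}

# Small wrapper so each case prints a readable verdict.
report() {
  if has_violations "$1" "$2"; then echo "violations"; else echo "clean"; fi
}

work_dir="$(mktemp -d)"
# Illustrative summary lines, not captured from a real run.
echo "pass: 1, fail: 0, warn: 0, error: 0, skip: 0" > "${work_dir}/clean.txt"
echo "pass: 0, fail: 2, warn: 0, error: 0, skip: 0" > "${work_dir}/bad.txt"

clean_result="$(report 0 "${work_dir}/clean.txt")"   # exit 0, no markers
marker_result="$(report 0 "${work_dir}/bad.txt")"    # exit 0, markers present
exit_result="$(report 1 "${work_dir}/clean.txt")"    # non-zero exit, no markers

echo "${clean_result} ${marker_result} ${exit_result}"
rm -rf "${work_dir}"
```

&lt;p&gt;Each signal trips the gate on its own, so a change in either the exit code behavior or the output format still leaves one check standing.&lt;/p&gt;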


&lt;h2&gt;
  
  
  Verification: proving the fix works
&lt;/h2&gt;

&lt;p&gt;After the fix, the same test PR with the missing &lt;code&gt;cost-center&lt;/code&gt; label now&lt;br&gt;
fails CI correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;git push origin feature/test-bad-policy

&lt;span class="c"&gt;# CI output:&lt;/span&gt;
validate-policies-against-app-manifests
  Running kyverno policy check...

  pass: 0, fail: 1, warn: 0, error: 0, skip: 0
  policy require-standard-labels-inferenceservice -&amp;gt;
    resource default/InferenceService/test-bad-model
    FAIL: check-owner-and-cost-center-on-isvc

  Kyverno policy violations detected. Failing CI.

&lt;span class="c"&gt;# Result: validate-policies-against-app-manifests: ❌ FAILED&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A compliant manifest with both labels passes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; apps/demo-iris-2/inference-service.yaml
&lt;span class="c"&gt;# CI result: validate-policies-against-app-manifests: ✅ PASSED&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The complete GitHub Actions workflow step
&lt;/h2&gt;

&lt;p&gt;Here is the full implementation used in the NeuroScale platform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/guardrails-checks.yaml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Validate policies against app manifests&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;app_files=()&lt;/span&gt;
    &lt;span class="s"&gt;while IFS= read -r -d '' f; do&lt;/span&gt;
      &lt;span class="s"&gt;app_files+=("--resource" "$f")&lt;/span&gt;
    &lt;span class="s"&gt;done &amp;lt; &amp;lt;(find apps/ -name "*.yaml" -print0)&lt;/span&gt;

    &lt;span class="s"&gt;if [ ${#app_files[@]} -eq 0 ]; then&lt;/span&gt;
      &lt;span class="s"&gt;echo "No app manifests found. Skipping policy check."&lt;/span&gt;
      &lt;span class="s"&gt;exit 0&lt;/span&gt;
    &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="s"&gt;set +e&lt;/span&gt;
    &lt;span class="s"&gt;docker run --rm -v "$PWD:/work" -w /work \&lt;/span&gt;
      &lt;span class="s"&gt;ghcr.io/kyverno/kyverno-cli:v1.12.5 \&lt;/span&gt;
      &lt;span class="s"&gt;apply infrastructure/kyverno/policies/*.yaml \&lt;/span&gt;
      &lt;span class="s"&gt;"${app_files[@]}" \&lt;/span&gt;
      &lt;span class="s"&gt;2&amp;gt;&amp;amp;1 | tee /tmp/kyverno-output.txt&lt;/span&gt;
    &lt;span class="s"&gt;kyverno_exit="${PIPESTATUS[0]}"&lt;/span&gt;
    &lt;span class="s"&gt;set -e&lt;/span&gt;

    &lt;span class="s"&gt;echo "--- Kyverno output ---"&lt;/span&gt;
    &lt;span class="s"&gt;cat /tmp/kyverno-output.txt&lt;/span&gt;
    &lt;span class="s"&gt;echo "--- Exit code: ${kyverno_exit} ---"&lt;/span&gt;

    &lt;span class="s"&gt;if [ "${kyverno_exit}" -ne 0 ] \&lt;/span&gt;
        &lt;span class="s"&gt;|| grep -qE "^FAIL" /tmp/kyverno-output.txt \&lt;/span&gt;
        &lt;span class="s"&gt;|| grep -qE "fail: [1-9][0-9]*" /tmp/kyverno-output.txt; then&lt;/span&gt;
      &lt;span class="s"&gt;echo "Kyverno policy violations detected. Failing CI."&lt;/span&gt;
      &lt;span class="s"&gt;exit 1&lt;/span&gt;
    &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="s"&gt;echo "All manifests passed policy checks."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why this matters beyond a single platform
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;kyverno-cli apply&lt;/code&gt; exit code behavior is not a bug — it is documented&lt;br&gt;
behavior for the &lt;code&gt;apply&lt;/code&gt; subcommand. But it is not prominently surfaced in&lt;br&gt;
the getting-started documentation, and most CI examples in the wild use&lt;br&gt;
the exit code check alone.&lt;/p&gt;

&lt;p&gt;If your team is using Kyverno for compliance or security enforcement in CI,&lt;br&gt;
and your CI step looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kyverno apply policies/ &lt;span class="nt"&gt;--resource&lt;/span&gt; manifests/
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$?&lt;/span&gt; &lt;span class="nt"&gt;-ne&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Policy violation detected"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then your CI gate is silently enforcing nothing. The admission webhook at the&lt;br&gt;
cluster level still blocks violations, but your shift-left CI gate does not.&lt;br&gt;
Developers will only discover policy violations after merging and watching&lt;br&gt;
the ArgoCD sync fail, not before.&lt;/p&gt;

&lt;p&gt;The distinction between "guardrails exist" and "guardrails enforce" is&lt;br&gt;
exactly what separates platform engineering from platform theater.&lt;/p&gt;


&lt;h2&gt;
  
  
  The two-layer enforcement model
&lt;/h2&gt;

&lt;p&gt;The fix is not just about the CI step. The NeuroScale platform uses two&lt;br&gt;
enforcement layers that work together:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — PR time (shift-left):&lt;/strong&gt; &lt;code&gt;kyverno-cli&lt;/code&gt; in CI catches violations&lt;br&gt;
before merge. Developers get fast feedback without needing a running cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 — Admission time (shift-down):&lt;/strong&gt; Kyverno admission webhook blocks&lt;br&gt;
non-compliant resources at the Kubernetes API server. Even if CI is bypassed&lt;br&gt;
or misconfigured, nothing non-compliant reaches the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PR opened
    ↓
CI: kyverno-cli apply + ${PIPESTATUS[0]} check
    ↓ (blocks here if non-compliant)
PR merged
    ↓
ArgoCD sync
    ↓
Kyverno admission webhook
    ↓ (blocks here as second layer)
Resource created in cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Layer 1 gives fast developer feedback. Layer 2 is the safety net.&lt;br&gt;
Both are required. Neither alone is sufficient.&lt;/p&gt;




&lt;h2&gt;
  
  
  Debugging Commands Reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test a policy manually against a manifest&lt;/span&gt;
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;&lt;span class="s2"&gt;:/work"&lt;/span&gt; &lt;span class="nt"&gt;-w&lt;/span&gt; /work &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/kyverno/kyverno-cli:v1.12.5 &lt;span class="se"&gt;\&lt;/span&gt;
  apply infrastructure/kyverno/policies/require-standard-labels-inferenceservice.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource&lt;/span&gt; apps/demo-iris-2/inference-service.yaml

&lt;span class="c"&gt;# Check Kyverno admission webhook registrations&lt;/span&gt;
kubectl get validatingwebhookconfigurations | &lt;span class="nb"&gt;grep &lt;/span&gt;kyverno
kubectl get mutatingwebhookconfigurations | &lt;span class="nb"&gt;grep &lt;/span&gt;kyverno

&lt;span class="c"&gt;# Verify Kyverno pods are healthy&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kyverno get pods
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kyverno get endpoints kyverno-svc

&lt;span class="c"&gt;# List all installed ClusterPolicies and their enforcement mode&lt;/span&gt;
kubectl get clusterpolicies &lt;span class="nt"&gt;-o&lt;/span&gt; wide

&lt;span class="c"&gt;# Review Kyverno admission decisions in controller logs&lt;/span&gt;
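&lt;span class="c"&gt;# (Kyverno 1.10+ names the Deployment kyverno-admission-controller; older installs use deploy/kyverno)&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kyverno logs deploy/kyverno-admission-controller &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50 | &lt;span class="se"&gt;\&lt;/span&gt;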
  &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"admit&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;deny&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;block"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I Would Add Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;kyverno test&lt;/code&gt; subcommand integration in CI for cases that require
explicit pass/fail test fixtures rather than live policy simulation&lt;/li&gt;
&lt;li&gt;Background scan results surfaced as PR comments using the Kyverno
policy report API&lt;/li&gt;
&lt;li&gt;Separate policy validation for &lt;code&gt;Deployment&lt;/code&gt; and &lt;code&gt;InferenceService&lt;/code&gt;
resources to give more specific failure messages per resource type&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  See Also
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/tree/main/infrastructure/kyverno/policies" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/kyverno/policies/&lt;/code&gt;&lt;/a&gt; — all ClusterPolicy definitions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;.github/workflows/guardrails-checks.yaml&lt;/code&gt;&lt;/a&gt; — complete CI workflow&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md&lt;/code&gt;&lt;/a&gt; — full failure documentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_5_COST_PROXY.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_5_COST_PROXY.md&lt;/code&gt;&lt;/a&gt; — where the fix was implemented&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Jimoh Sodiq Bolaji&lt;/strong&gt; | Platform Engineer | Technical Content Engineer&lt;br&gt;
| Abuja, Nigeria&lt;br&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;NeuroScale Platform&lt;/a&gt;&lt;br&gt;
· &lt;a href="https://dev.to/sodiqjimoh"&gt;Dev.to&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>kyverno</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>Deploying Backstage on Kubernetes with the Helm Chart: The Infrastructure-First Guide</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Mon, 06 Apr 2026 02:02:26 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/deploying-backstage-on-kubernetes-with-the-helm-chart-the-infrastructure-first-guide-mf3</link>
      <guid>https://forem.com/sodiqjimoh/deploying-backstage-on-kubernetes-with-the-helm-chart-the-infrastructure-first-guide-mf3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for:&lt;/strong&gt; Engineers deploying Backstage on Kubernetes via the&lt;br&gt;
official Helm chart who want a working portal, not just a running pod.&lt;br&gt;
This guide starts where most tutorials end — after &lt;code&gt;helm install&lt;/code&gt; succeeds&lt;br&gt;
but before anything actually works.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A few weeks ago I published an article called&lt;br&gt;
&lt;a href="https://dev.to/sodiqjimoh/nine-ways-backstage-breaks-before-your-developer-portal-works-4eo1"&gt;"Nine Ways Backstage Breaks Before Your Developer Portal Works"&lt;/a&gt;.&lt;br&gt;
A Backstage maintainer read it and gave me structured feedback. The core of&lt;br&gt;
it was this: several of the failures I documented were caused by not&lt;br&gt;
following the official getting-started documentation before using the Helm&lt;br&gt;
chart, and by using the demo image as if it were a production-ready base.&lt;/p&gt;

&lt;p&gt;They were right. This article is the follow-up they suggested — and the one&lt;br&gt;
I should have written first.&lt;/p&gt;

&lt;p&gt;It does not repeat the previous article. It starts earlier, goes deeper on&lt;br&gt;
Helm-specific configuration, and correctly attributes failures to their&lt;br&gt;
actual causes rather than blaming Backstage for things that are ArgoCD,&lt;br&gt;
Traefik, or operator error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Official resources you should read alongside this guide:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://backstage.io/docs/getting-started/" rel="noopener noreferrer"&gt;Backstage getting started documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/backstage/charts/blob/main/charts/backstage/README.md" rel="noopener noreferrer"&gt;Backstage Helm chart README&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/backstage/charts?tab=readme-ov-file#backstage-helm-chart" rel="noopener noreferrer"&gt;Backstage Helm chart disclaimer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://backstage.io/docs/features/software-catalog/configuration#catalog-rules" rel="noopener noreferrer"&gt;Catalog rules documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://backstage.io/docs/features/software-templates/adding-templates" rel="noopener noreferrer"&gt;Adding templates documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Project repo referenced throughout:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The one thing you must understand before installing the Helm chart
&lt;/h2&gt;

&lt;p&gt;The Backstage Helm chart uses a demo image by default. The chart README&lt;br&gt;
contains this explicit warning:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Backstage chart is not an official Backstage project and is not&lt;br&gt;
supported by the Backstage core team. The default image used in this chart&lt;br&gt;
is for demo purposes only.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This single fact explains most of the configuration friction you will&lt;br&gt;
encounter. The demo image does not behave like a real Backstage application&lt;br&gt;
built with &lt;code&gt;npx @backstage/create-app@latest&lt;/code&gt;. It has different startup characteristics,&lt;br&gt;
different configuration defaults, and different failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this means practically:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are building a real developer portal — not just running a demo — you&lt;br&gt;
should follow the &lt;a href="https://backstage.io/docs/getting-started/" rel="noopener noreferrer"&gt;official getting started guide&lt;/a&gt;&lt;br&gt;
to create your own Backstage application first, build a custom Docker image&lt;br&gt;
from it, and then use the Helm chart to deploy that image. The chart's&lt;br&gt;
&lt;code&gt;image.repository&lt;/code&gt; and &lt;code&gt;image.tag&lt;/code&gt; values are where you point to your&lt;br&gt;
own image.&lt;/p&gt;

&lt;p&gt;If you are experimenting, learning, or building an integration platform&lt;br&gt;
where Backstage is one component (as in the NeuroScale project), the demo&lt;br&gt;
image path is workable — but you need to understand its limitations and&lt;br&gt;
configure it correctly.&lt;/p&gt;

&lt;p&gt;This guide covers the Helm chart path specifically, with the official docs&lt;br&gt;
as the reference point throughout.&lt;/p&gt;


&lt;h2&gt;
  
  
  The values hierarchy that breaks everything silently
&lt;/h2&gt;

&lt;p&gt;This is the most important configuration concept in the entire Helm chart.&lt;br&gt;
Get this wrong and every override you write will be silently ignored.&lt;/p&gt;

&lt;p&gt;The Backstage Helm chart is a &lt;strong&gt;wrapper chart&lt;/strong&gt; — Backstage itself is a&lt;br&gt;
dependency inside it. The dependency is named &lt;code&gt;backstage&lt;/code&gt;. This means&lt;br&gt;
configuration for the Backstage application container must be nested under&lt;br&gt;
&lt;code&gt;backstage.backstage.*&lt;/code&gt;, not &lt;code&gt;backstage.*&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrong — values are silently ignored:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This looks correct but is placed at the wrong hierarchy level&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;My Platform&lt;/span&gt;
  &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Correct — values reach the Backstage container:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;           &lt;span class="c1"&gt;# &amp;lt;-- this second level is required&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;My Platform&lt;/span&gt;
    &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Helm chart processes the outer &lt;code&gt;backstage&lt;/code&gt; key as the dependency name.&lt;br&gt;
Values placed directly under &lt;code&gt;backstage.*&lt;/code&gt; are interpreted as chart-level&lt;br&gt;
configuration, not as container configuration. Helm then renders the chart&lt;br&gt;
defaults, including probe timings, rather than your overrides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to verify your values are actually applied:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Render the Helm chart before applying it and inspect the output Deployment&lt;br&gt;
spec directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm template neuroscale-backstage backstage/backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/backstage/values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 30 &lt;span class="s2"&gt;"startupProbe"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see &lt;code&gt;initialDelaySeconds: 120&lt;/code&gt; in the output, your probe override&lt;br&gt;
reached the container. If you see &lt;code&gt;initialDelaySeconds: 5&lt;/code&gt; or a very small&lt;br&gt;
number, your values are at the wrong nesting level.&lt;/p&gt;

&lt;p&gt;This verification step should be part of your CI pipeline. In the NeuroScale&lt;br&gt;
platform, &lt;code&gt;scripts/ci/render_backstage.sh&lt;/code&gt; runs this check on every PR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# scripts/ci/render_backstage.sh&lt;/span&gt;
helm template neuroscale-backstage backstage/backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/backstage/values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 10 &lt;span class="s2"&gt;"startupProbe"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"initialDelaySeconds: 120"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ERROR: startupProbe initialDelaySeconds not set correctly"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Helm values nesting verified"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Required configuration keys for the demo image
&lt;/h2&gt;

&lt;p&gt;The demo image requires specific configuration keys to be present at&lt;br&gt;
startup. Missing any of them causes the frontend to crash on load with a&lt;br&gt;
JavaScript error that is only visible in browser developer tools — the page&lt;br&gt;
itself shows a blank white screen with no visible error.&lt;/p&gt;

&lt;p&gt;The minimum required &lt;code&gt;appConfig&lt;/code&gt; block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Your Platform Name&lt;/span&gt;    &lt;span class="c1"&gt;# required — crash if absent&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
        &lt;span class="na"&gt;cors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;origin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
        &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;better-sqlite3&lt;/span&gt;
          &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:memory:'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;baseUrl&lt;/code&gt; matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;app.baseUrl&lt;/code&gt; and &lt;code&gt;backend.baseUrl&lt;/code&gt; values must match the URL you are&lt;br&gt;
actually using to access Backstage. If you port-forward on port 7010 but&lt;br&gt;
the config says port 7007, the frontend React app loads but all API calls&lt;br&gt;
fail — the UI appears to work while the backend connection is broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;better-sqlite3&lt;/code&gt; for local deployments:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The demo image ships with SQLite support. For local Kubernetes deployments&lt;br&gt;
where you want zero external dependencies, the in-memory SQLite connection&lt;br&gt;
is sufficient, with the caveat that all data is lost whenever the pod&lt;br&gt;
restarts. For production, replace this with a PostgreSQL connection&lt;br&gt;
pointing at a managed database service. The chart includes an optional&lt;br&gt;
PostgreSQL deployment; see&lt;br&gt;
&lt;a href="https://github.com/backstage/charts/blob/main/charts/backstage/README.md" rel="noopener noreferrer"&gt;the chart's database configuration docs&lt;/a&gt;.&lt;/p&gt;
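
&lt;p&gt;A minimal sketch of what the production override could look like, assuming the connection details reach the pod as environment variables. The variable names and the &lt;code&gt;backstage-postgres&lt;/code&gt; Secret name are assumptions made for illustration, not something the chart requires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pg&lt;/span&gt;
          &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_HOST}&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_PORT}&lt;/span&gt;
            &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_USER}&lt;/span&gt;
            &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_PASSWORD}&lt;/span&gt;
    &lt;span class="na"&gt;extraEnvVarsSecrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backstage-postgres&lt;/span&gt;    &lt;span class="c1"&gt;# hypothetical Secret holding the four variables&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the credentials in a Secret rather than in &lt;code&gt;values.yaml&lt;/code&gt; means the values file can stay in version control.&lt;/p&gt;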


&lt;h2&gt;
  
  
  Probe timings: the demo image starts slowly
&lt;/h2&gt;

&lt;p&gt;Backstage is a Node.js application. The demo image takes approximately 60&lt;br&gt;
to 90 seconds to complete startup on a typical Kubernetes node. Kubernetes&lt;br&gt;
probe defaults are far tighter than that: an initial delay of 0 seconds, a&lt;br&gt;
10-second period, and a failure threshold of 3, which leaves roughly 30&lt;br&gt;
seconds before the container is restarted. The result is&lt;br&gt;
predictable: the startup probe fires before the application is ready, the&lt;br&gt;
pod fails the probe, Kubernetes kills it, and the pod enters&lt;br&gt;
&lt;code&gt;CrashLoopBackOff&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is not a Backstage bug. It is a configuration requirement that the&lt;br&gt;
Helm chart does not prominently surface. The correct probe settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;    &lt;span class="c1"&gt;# give Node.js time to start&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;        &lt;span class="c1"&gt;# 30 × 10s = 5 minutes maximum wait&lt;/span&gt;
    &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;    &lt;span class="c1"&gt;# only check liveness after 5 minutes&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How to diagnose probe failures:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Watch pod status in real time&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="nt"&gt;-w&lt;/span&gt;

&lt;span class="c"&gt;# When you see CrashLoopBackOff, describe the pod&lt;/span&gt;
kubectl describe pod &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# Look for this in Events:&lt;/span&gt;
&lt;span class="c"&gt;# Warning  Unhealthy  kubelet  Startup probe failed: connection refused&lt;/span&gt;

&lt;span class="c"&gt;# Check logs from the previous container instance&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;--previous&lt;/span&gt; &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see &lt;code&gt;Startup probe failed: connection refused&lt;/code&gt; in events but the&lt;br&gt;
previous container logs show normal Node.js startup messages, the&lt;br&gt;
application is starting correctly — the probe is just firing too early.&lt;br&gt;
Increase &lt;code&gt;initialDelaySeconds&lt;/code&gt;.&lt;/p&gt;
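
&lt;p&gt;Raising &lt;code&gt;initialDelaySeconds&lt;/code&gt; is not the only way to buy startup time. The budget a startup probe grants is roughly &lt;code&gt;initialDelaySeconds + failureThreshold × periodSeconds&lt;/code&gt;, so the settings above allow about 120 + 30 × 10 = 420 seconds. A sketch of an alternative, assuming the same 420-second budget is wanted, moves the whole allowance into &lt;code&gt;failureThreshold&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;      &lt;span class="c1"&gt;# start probing immediately&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;42&lt;/span&gt;        &lt;span class="c1"&gt;# 42 × 10s = 420s, the same total budget&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With no initial delay the probe can succeed as soon as &lt;code&gt;/healthcheck&lt;/code&gt; responds, so a pod that starts in 70 seconds is not held back until the 120-second mark. The trade-off is a run of expected probe failures in the pod events while the application boots.&lt;/p&gt;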

&lt;p&gt;A full incident postmortem for this specific failure, including the exact&lt;br&gt;
Kubernetes events and the Helm values diff before and after the fix, is in&lt;br&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md" rel="noopener noreferrer"&gt;infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md&lt;/a&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Authentication: local dev vs production
&lt;/h2&gt;

&lt;p&gt;Backstage's new backend architecture (introduced in version 1.x) includes&lt;br&gt;
a default authentication policy that rejects any request to a backend plugin,&lt;br&gt;
including service-to-service calls, unless it carries a valid Backstage token.&lt;br&gt;
This affects how the scaffolder frontend talks to the scaffolder backend, a&lt;br&gt;
call that was unauthenticated in older versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For local development only&lt;/strong&gt;, the quickest fix is to use the guest auth&lt;br&gt;
provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;guest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;dangerouslyAllowOutsideDevelopment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps the auth subsystem active and provides a real&lt;br&gt;
&lt;code&gt;user:default/guest&lt;/code&gt; identity to all plugins — which is safer than&lt;br&gt;
disabling auth entirely with &lt;code&gt;dangerouslyDisableDefaultAuthPolicy: true&lt;/code&gt;.&lt;br&gt;
Plugins that assume a user context will behave correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For production&lt;/strong&gt;, use the GitHub OAuth provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values-prod.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
        &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;production&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;clientId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_CLIENT_ID}&lt;/span&gt;
              &lt;span class="na"&gt;clientSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_CLIENT_SECRET}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store &lt;code&gt;GITHUB_CLIENT_ID&lt;/code&gt; and &lt;code&gt;GITHUB_CLIENT_SECRET&lt;/code&gt; as Kubernetes secrets,&lt;br&gt;
not in &lt;code&gt;values.yaml&lt;/code&gt;. The Helm chart's &lt;code&gt;extraEnvVarsSecrets&lt;/code&gt; field handles&lt;br&gt;
this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;extraEnvVarsSecrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backstage-secrets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create the secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret generic backstage-secrets &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;GITHUB_CLIENT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-client-id"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;GITHUB_CLIENT_SECRET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-client-secret"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How to verify auth is configured correctly:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check the scaffolder actions API directly&lt;/span&gt;
curl http://localhost:7010/api/scaffolder/v2/actions

&lt;span class="c"&gt;# If you get 401: auth is not configured for your environment&lt;/span&gt;
&lt;span class="c"&gt;# If you get 200 with a JSON list of actions: auth is working&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get a 401 with &lt;code&gt;{"error":{"name":"AuthenticationError","message":"Missing credentials"}}&lt;/code&gt;,&lt;br&gt;
the scaffolder form will load but render blank — the page returns HTTP 200&lt;br&gt;
but has no data to display. This is only visible in browser developer tools.&lt;/p&gt;


&lt;h2&gt;
  
  
  Catalog configuration: registering templates
&lt;/h2&gt;

&lt;p&gt;The Backstage catalog applies security rules to what entity kinds are&lt;br&gt;
accepted from each registered location. The default allow list for&lt;br&gt;
repository-based locations does not include &lt;code&gt;Template&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is documented in&lt;br&gt;
&lt;a href="https://backstage.io/docs/features/software-catalog/configuration#catalog-rules" rel="noopener noreferrer"&gt;the catalog rules documentation&lt;/a&gt;&lt;br&gt;
and the&lt;br&gt;
&lt;a href="https://backstage.io/docs/features/software-templates/adding-templates" rel="noopener noreferrer"&gt;adding templates documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The registration pattern that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;catalog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;locations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;
            &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/your-org/your-repo/blob/main/backstage/templates/your-template/template.yaml&lt;/span&gt;
            &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without the &lt;code&gt;rules: - allow: [Template]&lt;/code&gt; block, the entity is silently&lt;br&gt;
rejected at ingestion time. The only signal is a warning in the Backstage&lt;br&gt;
server logs — nothing appears in the UI.&lt;/p&gt;
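
&lt;p&gt;If templates live in many locations, repeating the rule on each one gets tedious. Backstage also accepts a catalog-wide &lt;code&gt;rules&lt;/code&gt; list. The sketch below assumes you are comfortable accepting &lt;code&gt;Template&lt;/code&gt; entities from every registered location. The list of kinds is illustrative, and because a custom list takes the place of the default one, it should name every kind you rely on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;catalog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;               &lt;span class="c1"&gt;# applies to all locations, not just one&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Component&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;API&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;System&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Domain&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Location&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The per-location form is the narrower of the two and is usually the safer default when only one repository is trusted to publish templates.&lt;/p&gt;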

&lt;p&gt;&lt;strong&gt;How to diagnose catalog ingestion failures:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100 &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"warn&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;error&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;forbidden&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;NotAllowedError"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for &lt;code&gt;NotAllowedError: Forbidden: entity of kind Template is not allowed from that location&lt;/code&gt;.&lt;br&gt;
If you see this, your rules block is missing&lt;br&gt;
or at the wrong YAML nesting level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After updating the config, restart Backstage to re-ingest:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout restart deploy/backstage &lt;span class="nt"&gt;-n&lt;/span&gt; backstage
kubectl rollout status deploy/backstage &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;300s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The template should appear in &lt;code&gt;/create&lt;/code&gt; within 60 seconds of the pod&lt;br&gt;
becoming ready.&lt;/p&gt;

&lt;p&gt;You can validate your &lt;code&gt;app-config.yaml&lt;/code&gt; structure using the Backstage CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @backstage/cli config:check &lt;span class="nt"&gt;--config&lt;/span&gt; app-config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  GitHub integration: the token secret
&lt;/h2&gt;

&lt;p&gt;The scaffolder requires a GitHub token to open pull requests. The token&lt;br&gt;
must be present as an environment variable in the running Backstage pod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;integrations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.com&lt;/span&gt;
            &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_TOKEN}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store the token as a Kubernetes secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create or update the secret&lt;/span&gt;
kubectl create secret generic backstage-github-token &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ghp_your_token_here"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;client &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; -

&lt;span class="c"&gt;# Restart to reload the environment variable&lt;/span&gt;
kubectl rollout restart deploy/backstage &lt;span class="nt"&gt;-n&lt;/span&gt; backstage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt; environment variables from Kubernetes secrets are injected at&lt;br&gt;
pod start time. Updating the secret does not update the running pod. You&lt;br&gt;
must restart the deployment after updating the secret for the new value&lt;br&gt;
to take effect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to verify the token is present without exposing the value:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check character length — a valid GitHub token is 40+ characters&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo "Token length: ${#GITHUB_TOKEN}"'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this returns &lt;code&gt;Token length: 0&lt;/code&gt; or &lt;code&gt;Token length: 17&lt;/code&gt; (the length of a&lt;br&gt;
placeholder like &lt;code&gt;&amp;lt;YOUR_TOKEN_HERE&amp;gt;&lt;/code&gt;), the secret was not updated correctly&lt;br&gt;
or the pod was not restarted after the update.&lt;/p&gt;


&lt;h2&gt;
  
  
  A working minimal values.yaml for local development
&lt;/h2&gt;

&lt;p&gt;This is the minimum configuration that produces a functioning Backstage&lt;br&gt;
portal on a local Kubernetes cluster with the demo image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io&lt;/span&gt;
      &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage/backstage&lt;/span&gt;
      &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latest&lt;/span&gt;           &lt;span class="c1"&gt;# pin to a specific version in production&lt;/span&gt;

    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Your Platform&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;

      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
        &lt;span class="na"&gt;cors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;origin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
        &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;better-sqlite3&lt;/span&gt;
          &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:memory:'&lt;/span&gt;

      &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;guest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;dangerouslyAllowOutsideDevelopment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

      &lt;span class="na"&gt;integrations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.com&lt;/span&gt;
            &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_TOKEN}&lt;/span&gt;

      &lt;span class="na"&gt;catalog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;locations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;
            &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/your-org/your-repo/blob/main/backstage/templates/your-template/template.yaml&lt;/span&gt;
            &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

    &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

    &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

    &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;

    &lt;span class="na"&gt;extraEnvVarsSecrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backstage-github-token&lt;/span&gt;

  &lt;span class="na"&gt;postgresql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;    &lt;span class="c1"&gt;# using in-memory SQLite for local dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Deploying and verifying
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add backstage https://backstage.github.io/charts
helm repo update

kubectl create namespace backstage

&lt;span class="c"&gt;# Create the GitHub token secret first&lt;/span&gt;
kubectl create secret generic backstage-github-token &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-token"&lt;/span&gt;

&lt;span class="c"&gt;# Install&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;backstage backstage/backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/backstage/values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Watch the startup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expect the pod to stay in &lt;code&gt;Running 0/1&lt;/code&gt; for 60–120 seconds while Node.js&lt;br&gt;
starts. Do not interpret this as a failure. The startup probe will not&lt;br&gt;
pass until the application is ready.&lt;/p&gt;
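
&lt;p&gt;If you would rather block until the pod is ready than watch the pod list, a minimal sketch follows. It assumes the release creates a Deployment named &lt;code&gt;backstage&lt;/code&gt; (the name used by the log commands in this guide). The 300-second timeout is an arbitrary choice and should be longer than your startup probe budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Returns once the Deployment reports all replicas ready, or fails after the timeout.
# Assumption: the Deployment is named "backstage"; adjust if your release name differs.
kubectl -n backstage rollout status deploy/backstage --timeout=300s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;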

&lt;p&gt;&lt;strong&gt;Access the portal:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage port-forward svc/backstage 7010:7007
&lt;span class="c"&gt;# Open: http://localhost:7010&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify the backend is responding:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:7010/healthcheck
&lt;span class="c"&gt;# Expected: {"status":"ok"}&lt;/span&gt;

curl http://localhost:7010/api/scaffolder/v2/actions
&lt;span class="c"&gt;# Expected: JSON list of available scaffolder actions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify catalog ingestion:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50 &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"processed&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;warn&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;error"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for &lt;code&gt;Processed N entities&lt;/code&gt; with no &lt;code&gt;NotAllowedError&lt;/code&gt; lines.&lt;/p&gt;




&lt;h2&gt;
  
  
  The production values profile
&lt;/h2&gt;

&lt;p&gt;Separate your dev and prod configuration into two files. The difference is&lt;br&gt;
significant enough that sharing a single file creates dangerous defaults&lt;br&gt;
in production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values-prod.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io&lt;/span&gt;
      &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your-org/your-backstage-app&lt;/span&gt;   &lt;span class="c1"&gt;# your own image&lt;/span&gt;
      &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.2.3"&lt;/span&gt;                               &lt;span class="c1"&gt;# pinned, never latest&lt;/span&gt;

    &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://backstage.your-domain.com&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://backstage.your-domain.com&lt;/span&gt;
        &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pg&lt;/span&gt;
          &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_HOST}&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;
            &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_USER}&lt;/span&gt;
            &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_PASSWORD}&lt;/span&gt;
            &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage&lt;/span&gt;

      &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
        &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;production&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;clientId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_CLIENT_ID}&lt;/span&gt;
              &lt;span class="na"&gt;clientSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_CLIENT_SECRET}&lt;/span&gt;

    &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;     &lt;span class="c1"&gt;# your own image starts faster&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;18&lt;/span&gt;        &lt;span class="c1"&gt;# 3 minutes maximum&lt;/span&gt;

    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;200m&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
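
&lt;p&gt;The "3 minutes maximum" comment can be sanity-checked with quick arithmetic. The sketch below is an approximation, not the kubelet's exact timing: it assumes &lt;code&gt;periodSeconds: 10&lt;/code&gt; is inherited from the base values file (the prod profile does not set it), and &lt;code&gt;probe_budget&lt;/code&gt; is a hypothetical helper name. The result suggests the 3 minutes of probing come on top of the 60-second initial delay, so a container that never becomes healthy is restarted after roughly 4 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Rough worst-case seconds before a container that never passes its
# startup probe is restarted: initial delay + failureThreshold * periodSeconds.
probe_budget() {
  delay="$1"; period="$2"; threshold="$3"
  echo $(( delay + threshold * period ))
}

# Prod profile: initialDelaySeconds=60, failureThreshold=18.
# periodSeconds=10 is an ASSUMPTION (inherited from the base values.yaml).
probe_budget 60 10 18
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;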



&lt;p&gt;Apply both files together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade backstage backstage/backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/backstage/values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/backstage/values-prod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Helm merges &lt;code&gt;-f&lt;/code&gt; files left to right, so the prod values file overrides only&lt;br&gt;
the keys it specifies. Everything else comes from the base &lt;code&gt;values.yaml&lt;/code&gt;.&lt;/p&gt;
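
&lt;p&gt;To confirm the merge produced what you expect before upgrading, one option is to render the chart with both files and inspect the probe block. This is a sketch that reuses the &lt;code&gt;helm template&lt;/code&gt; pattern from the diagnostic reference in this guide; the &lt;code&gt;-A 8&lt;/code&gt; context size is an arbitrary choice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Render locally with base + prod values; nothing is applied to the cluster.
# The later -f file wins wherever both files set the same key.
helm template backstage backstage/backstage \
  --namespace backstage \
  -f infrastructure/backstage/values.yaml \
  -f infrastructure/backstage/values-prod.yaml \
  | grep -A 8 "startupProbe"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the rendered probe still shows the dev timings, the prod file is most likely nested at the wrong level or was not passed last.&lt;/p&gt;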




&lt;h2&gt;
  
  
  Diagnostic command reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pod status and events&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; backstage
kubectl describe pod &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# Application logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--previous&lt;/span&gt; &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100

&lt;span class="c"&gt;# Catalog ingestion errors&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200 &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"warn&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;error&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;forbidden"&lt;/span&gt;

&lt;span class="c"&gt;# Verify rendered Helm values&lt;/span&gt;
helm template backstage backstage/backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/backstage/values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 5 &lt;span class="s2"&gt;"startupProbe"&lt;/span&gt;

&lt;span class="c"&gt;# Verify token is loaded in the running container&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo "GITHUB_TOKEN length: ${#GITHUB_TOKEN}"'&lt;/span&gt;

&lt;span class="c"&gt;# Health check endpoints&lt;/span&gt;
curl http://localhost:7010/healthcheck
curl &lt;span class="s2"&gt;"http://localhost:7010/api/catalog/entities?limit=1"&lt;/span&gt;
curl http://localhost:7010/api/scaffolder/v2/actions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What this guide does not cover
&lt;/h2&gt;

&lt;p&gt;This guide covers the Helm chart deployment path specifically. It does not&lt;br&gt;
cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Building your own Backstage application&lt;/strong&gt; — start with the
&lt;a href="https://backstage.io/docs/getting-started/" rel="noopener noreferrer"&gt;official getting started guide&lt;/a&gt;
and &lt;code&gt;npx @backstage/create-app@latest&lt;/code&gt; for a real production portal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing custom plugins&lt;/strong&gt; — see
&lt;a href="https://backstage.io/docs/plugins/" rel="noopener noreferrer"&gt;plugin development docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TechDocs integration&lt;/strong&gt; — covered separately in
&lt;a href="https://backstage.io/docs/features/techdocs/" rel="noopener noreferrer"&gt;the TechDocs docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production ingress and TLS&lt;/strong&gt; — specific to your cloud provider and
ingress controller&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  See Also
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://backstage.io/docs/getting-started/" rel="noopener noreferrer"&gt;Official Backstage getting started&lt;/a&gt; — start here before using the Helm chart&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/backstage/charts" rel="noopener noreferrer"&gt;Backstage Helm chart source&lt;/a&gt; — the canonical reference for all chart configuration&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://backstage.io/docs/features/software-catalog/configuration#catalog-rules" rel="noopener noreferrer"&gt;Catalog rules documentation&lt;/a&gt; — required reading before registering templates&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/backstage/values.yaml" rel="noopener noreferrer"&gt;infrastructure/backstage/values.yaml&lt;/a&gt; — working dev configuration from the NeuroScale platform&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/backstage/values-prod.yaml" rel="noopener noreferrer"&gt;infrastructure/backstage/values-prod.yaml&lt;/a&gt; — production profile with GitHub OAuth&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md" rel="noopener noreferrer"&gt;infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md&lt;/a&gt; — full postmortem for the CrashLoopBackOff probe failure&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Jimoh Sodiq Bolaji&lt;/strong&gt; | Platform Engineer | Technical Content Engineer&lt;br&gt;
| Abuja, Nigeria&lt;br&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;NeuroScale Platform&lt;/a&gt;&lt;br&gt;
· &lt;a href="https://dev.to/sodiqjimoh"&gt;Dev.to&lt;/a&gt;&lt;/p&gt;

</description>
      <category>backstage</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>9 Failures That Hit Me Building a Backstage Golden Path for KServe — Every Error, Every Fix</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Mon, 30 Mar 2026 23:12:36 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/nine-ways-backstage-breaks-before-your-developer-portal-works-4eo1</link>
      <guid>https://forem.com/sodiqjimoh/nine-ways-backstage-breaks-before-your-developer-portal-works-4eo1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Edit (Apr 2026):&lt;/strong&gt; Updated title and added framing context based on community feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Series context:&lt;/strong&gt; This is Part 3 of building a production-hardened AI inference platform.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Part 1: &lt;a href="https://dev.to/sodiqjimoh/why-your-kserve-inferenceservice-wontbecome-readyfour-production-failures-and-fixes-nei"&gt;Why Your KServe InferenceService Won't Become Ready&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Part 2: 5 GitOps Failure Modes That Break KServe Deployments
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Project repo:&lt;/strong&gt; &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;If you have ever deployed Backstage and stared at a blank &lt;code&gt;/create&lt;/code&gt; page wondering what went wrong, this article is for you.&lt;/p&gt;

&lt;p&gt;Most Backstage tutorials end at "the portal is running." This one starts there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One important framing note before we begin:&lt;/strong&gt; this article documents the path starting from the official Backstage Helm chart, not from an app scaffolded with &lt;code&gt;npx @backstage/create-app@latest&lt;/code&gt;. If you're building a real Backstage application from source, some of these failures won't apply. But if you're doing what a lot of platform engineers do, reaching for &lt;code&gt;helm install&lt;/code&gt; first, then every single one of these will.&lt;/p&gt;

&lt;p&gt;This is a complete production failure log from implementing a Backstage Golden Path that deploys KServe model inference endpoints on Kubernetes. Nine distinct failures. Every one with exact error output, root cause, and the fix that worked.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The goal:&lt;/strong&gt; a developer fills a Backstage form, a GitHub PR opens, the PR merges, ArgoCD deploys a KServe InferenceService, and the endpoint responds to predictions.&lt;br&gt;
Getting there took nine failures across three days.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What I was trying to build
&lt;/h2&gt;

&lt;p&gt;The Golden Path demo contract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Backstage form → PR opened → merge → ArgoCD sync → InferenceService Ready=True → curl returns {"predictions":[1,1]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backstage (Helm chart, self-hosted on k3d)&lt;/li&gt;
&lt;li&gt;ArgoCD (GitOps reconciliation)&lt;/li&gt;
&lt;li&gt;KServe (model inference endpoints)&lt;/li&gt;
&lt;li&gt;GitHub (scaffolder target)&lt;/li&gt;
&lt;li&gt;Kyverno (admission policies)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Failure 1: Template Not Visible in Catalog — Silent Rejection With No UI Error
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 30 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After adding the template file and registering it in &lt;code&gt;infrastructure/backstage/values.yaml&lt;/code&gt;, the template did not appear in Backstage's &lt;code&gt;/create&lt;/code&gt; page. No error was visible in the UI. The page simply showed an empty catalog.&lt;/p&gt;

&lt;h3&gt;
  
  
  Digging In
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage logs deploy/neuroscale-backstage &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50
...
&lt;span class="o"&gt;[&lt;/span&gt;backstage] warn  Failed to process location
  &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"location"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"type"&lt;/span&gt;:&lt;span class="s2"&gt;"url"&lt;/span&gt;,&lt;span class="s2"&gt;"target"&lt;/span&gt;:&lt;span class="s2"&gt;"https://github.com/sodiq-code/
  neuroscale-platform/blob/main/backstage/templates/model-endpoint/
  template.yaml"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"error"&lt;/span&gt;:&lt;span class="s2"&gt;"NotAllowedError: Forbidden: entity of kind Template
  is not allowed from that location"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error only appears in server logs. The UI shows nothing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;Backstage's catalog configuration allows only specific entity kinds from each registered location. The default allow list for repository-based locations does not include &lt;code&gt;Template&lt;/code&gt;. Without an explicit &lt;code&gt;allow: [Template]&lt;/code&gt; rule, entities of kind &lt;code&gt;Template&lt;/code&gt; are silently rejected. This is security-by-default behavior — but the complete silence in the UI makes it look like a misconfiguration rather than a permission issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;catalog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;locations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;
            &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/sodiq-code/neuroscale-platform/blob/main/backstage/templates/model-endpoint/template.yaml&lt;/span&gt;
            &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After rolling out the updated Backstage deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage rollout restart deploy/neuroscale-backstage
&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage rollout status deploy/neuroscale-backstage &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;300s
deployment &lt;span class="s2"&gt;"neuroscale-backstage"&lt;/span&gt; successfully rolled out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The template appeared in &lt;code&gt;/create&lt;/code&gt; within 60 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;For a platform team deploying Backstage for internal users, this silent failure means developers see an empty template catalog and assume the platform is broken — not that a config rule is missing. Always check server logs, not just the UI, when Backstage catalog ingestion seems to fail.&lt;/p&gt;
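
&lt;p&gt;A quick way to apply that lesson is to filter the server logs for the rejection itself. The sketch below reuses the deployment name from the commands above; the 200-line tail is an arbitrary window, so widen it if the pod has been running for a while:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Prints any catalog rejections found in the last 200 log lines.
# No output means no NotAllowedError was logged in that window.
kubectl -n backstage logs deploy/neuroscale-backstage --tail=200 \
  | grep -i "NotAllowedError"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;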




&lt;h2&gt;
  
  
  Failure 2: Scaffolder /create Page Loads Blank — 401 on Actions API
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 45 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After the template was visible, clicking into it showed a blank form. The browser developer console revealed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /api/scaffolder/v2/actions &lt;/span&gt;&lt;span class="k"&gt;HTTP&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt; &lt;span class="m"&gt;401&lt;/span&gt; &lt;span class="ne"&gt;Unauthorized&lt;/span&gt;
&lt;span class="na"&gt;{"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;{"name":"AuthenticationError","message":"Missing credentials"}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The page route returned HTTP 200 — the React app loaded — but the actions API returned 401, so the form had no data to render.&lt;/p&gt;
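
&lt;p&gt;The same split can be reproduced from a terminal without opening the browser. This is a sketch that assumes the portal is port-forwarded to &lt;code&gt;localhost:7010&lt;/code&gt;, as elsewhere in this article. In the failing state described above, the first request would report 200 and the second 401:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print only the HTTP status code for the page route and for the actions API.
# -s silences progress output, -o /dev/null discards the body, -w prints the code.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7010/create
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7010/api/scaffolder/v2/actions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;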

&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;Backstage's new backend architecture (introduced in 1.x) adds a default authentication policy: requests to backend plugins must carry valid Backstage credentials, whether they come from a signed-in user or from another service. The scaffolder frontend calls the backend to list available actions. Because no auth provider was configured for local development, that call carried no credentials and was rejected. This is a breaking change from older Backstage versions where the actions endpoint was unauthenticated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;dangerouslyDisableDefaultAuthPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Production note:&lt;/strong&gt; &lt;code&gt;dangerouslyDisableDefaultAuthPolicy: true&lt;/code&gt; is acceptable for local development only. For production, configure GitHub OAuth via &lt;code&gt;values-prod.yaml&lt;/code&gt; with a proper sign-in policy. The production profile uses &lt;code&gt;auth.providers.guest.dangerouslyAllowOutsideDevelopment: true&lt;/code&gt; instead — which keeps the auth subsystem active and provides a real &lt;code&gt;user:default/guest&lt;/code&gt; identity, rather than disabling auth entirely.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;An empty scaffolder form is indistinguishable from a misconfigured form to an end user. The 401 error is only visible in browser developer tools. This is the second failure in this series that generated zero visible error in the UI.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 3: Frontend Crashes With Blank White Screen — Missing Required Config Key
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 20 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After the auth policy fix, reloading Backstage showed a blank white screen. The browser console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Uncaught&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Missing&lt;/span&gt; &lt;span class="nx"&gt;required&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="nx"&gt;at&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;app.title&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;app&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="nx"&gt;at&lt;/span&gt; &lt;span class="nf"&gt;validateConfigSchema &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;esm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;js&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;234&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;at&lt;/span&gt; &lt;span class="nx"&gt;BackstageApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;render &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;esm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;js&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;891&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The Backstage frontend requires &lt;code&gt;app.title&lt;/code&gt; to be present in the runtime configuration. This key was absent from the &lt;code&gt;appConfig&lt;/code&gt; section of &lt;code&gt;values.yaml&lt;/code&gt;, so the React application crashed on initialization before any content could render. The key is required, but the documentation does not prominently flag it as required on first boot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NeuroScale Platform&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
        &lt;span class="na"&gt;cors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;origin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: &lt;code&gt;app.baseUrl&lt;/code&gt; and &lt;code&gt;backend.baseUrl&lt;/code&gt; also needed to match the port used for port-forwarding (7010), not the default 7007.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;A blank white screen with no network errors means the JavaScript runtime crashed before rendering. Always check the browser console — not just network requests — for Backstage frontend failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 4: Backstage CrashLoopBackOff — Helm Dependency Values Mis-Nesting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 2 hours | &lt;strong&gt;Impact:&lt;/strong&gt; Developer portal completely unavailable&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="nt"&gt;-w&lt;/span&gt;
NAME                                    READY   STATUS             RESTARTS   AGE
neuroscale-backstage-7d9f5b8c4-xqr2m   0/1     CrashLoopBackOff   8          12m

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl describe pod neuroscale-backstage-7d9f5b8c4-xqr2m &lt;span class="nt"&gt;-n&lt;/span&gt; backstage
...
Events:
  Warning  Unhealthy  30s  kubelet
    Startup probe failed: connect: connection refused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The Backstage Helm chart is a wrapper chart with &lt;code&gt;backstage&lt;/code&gt; as a dependency. Configuration for the Backstage container itself must be nested under &lt;code&gt;backstage.backstage.*&lt;/code&gt;, not &lt;code&gt;backstage.*&lt;/code&gt;. Because the values sat one level too high, the probe settings and resource requests were silently ignored, and the Deployment was rendered with the default probe timings (a 2-second initial delay), which were far too aggressive for Backstage's ~90-second startup time.&lt;/p&gt;

&lt;p&gt;The pod was killed before it could become healthy, triggering CrashLoopBackOff.&lt;/p&gt;

&lt;p&gt;Backstage requires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default gives it 2 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Correct the values hierarchy and harden probe timings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;           &lt;span class="c1"&gt;# &amp;lt;-- must be nested here, not at backstage.*&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s"&gt;...&lt;/span&gt;
    &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
    &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
    &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;If a Helm chart is a wrapper with a dependency, configuration for the dependency must be nested under the dependency's alias key. Values placed at the wrong hierarchy level are silently ignored: Helm renders the chart defaults, not your overrides. This incident directly motivated adding CI validation for rendered Helm manifests. Had the final Deployment spec been checked in CI, the wrong probe values would have been caught before deployment. Full RCA: &lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
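
&lt;p&gt;A CI check of that kind could look roughly like the sketch below. It is an illustration, not the repository's actual validation: the function name &lt;code&gt;check_startup_probe&lt;/code&gt; is hypothetical, and it assumes the rendered manifest arrives on stdin and that &lt;code&gt;initialDelaySeconds&lt;/code&gt; appears inside the &lt;code&gt;startupProbe&lt;/code&gt; block before any other probe key.&lt;/p&gt;

```shell
# Hypothetical helper: reads a rendered manifest on stdin and fails unless
# startupProbe.initialDelaySeconds equals the expected value.
check_startup_probe() {
  expected="$1"
  # Track whether we are inside the startupProbe block; leave it again
  # when another probe key starts, then print the first delay value seen.
  actual="$(awk '
    /startupProbe:/                   { in_probe = 1; next }
    /readinessProbe:|livenessProbe:/  { in_probe = 0 }
    in_probe == 1 { if ($1 == "initialDelaySeconds:") { print $2; exit } }
  ')"
  if [ "$actual" = "$expected" ]; then
    echo "ok: startupProbe.initialDelaySeconds=$actual"
  else
    echo "FAIL: expected $expected, rendered ${actual:-nothing}"
    return 1
  fi
}
```

&lt;p&gt;In CI this would sit behind &lt;code&gt;helm template&lt;/code&gt;, for example &lt;code&gt;helm template neuroscale-backstage infrastructure/backstage | check_startup_probe 120&lt;/code&gt;. The release name and chart path in that command are assumptions.&lt;/p&gt;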




&lt;h2&gt;
  
  
  Failure 5: PR Creation Fails — GitHub Token Secret Contains Placeholder Value
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 30 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After the portal was stable, the scaffolder's "Open pull request" step spun for 30 seconds then failed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Request failed with status 401: Bad credentials
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No PR was created in GitHub.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The Kubernetes Secret &lt;code&gt;neuroscale-backstage-secrets&lt;/code&gt; contained a placeholder &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; value, literally &lt;code&gt;&amp;lt;YOUR_TOKEN_HERE&amp;gt;&lt;/code&gt;. The key was present and non-empty, so &lt;code&gt;kubectl describe secret&lt;/code&gt; output looked correct, but the value was not a valid token.&lt;/p&gt;

&lt;p&gt;A secondary issue: after updating the secret with the correct token, the running pod did not pick up the change. Environment variables from Secrets are injected at pod start time, not dynamically. The pod needed a restart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update the secret with a valid token&lt;/span&gt;
&lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; GITHUB_TOKEN
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage create secret generic neuroscale-backstage-secrets &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GITHUB_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;client &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; -

&lt;span class="c"&gt;# Restart to reload env vars&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage rollout restart deploy/neuroscale-backstage
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage rollout status deploy/neuroscale-backstage &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;300s

&lt;span class="c"&gt;# Verify token is present — check length, never the value&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="nb"&gt;exec &lt;/span&gt;deploy/neuroscale-backstage &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo ${#GITHUB_TOKEN} chars'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kubectl describe secret&lt;/code&gt; shows the key exists and has bytes. It does not show whether the value is a valid token or a placeholder string. Always verify token presence by checking character length in the running container, never by reading the secret value directly.&lt;/p&gt;
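
&lt;p&gt;A length check can be paired with a shape check that still never prints the value. The sketch below is one way to do that. &lt;code&gt;token_shape&lt;/code&gt; is a hypothetical helper, and the placeholder pattern is an assumption to adapt. The &lt;code&gt;ghp_&lt;/code&gt; and &lt;code&gt;github_pat_&lt;/code&gt; prefixes are the ones GitHub uses for classic and fine-grained personal access tokens.&lt;/p&gt;

```shell
# Hypothetical helper: classifies a token by shape without echoing its value.
token_shape() {
  case "$1" in
    "")                  echo "empty" ;;
    *YOUR_TOKEN*)        echo "placeholder" ;;
    ghp_*|github_pat_*)  echo "github-pat (${#1} chars)" ;;
    *)                   echo "unknown (${#1} chars)" ;;
  esac
}
```

&lt;p&gt;Inside the container, the function can be pasted into the &lt;code&gt;sh -c&lt;/code&gt; script and called as &lt;code&gt;token_shape "$GITHUB_TOKEN"&lt;/code&gt;.&lt;/p&gt;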




&lt;h2&gt;
  
  
  Failure 6: PR Merged But ArgoCD Stays OutOfSync — Fix Not Committed to Git
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 1 hour of confusion&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;The Backstage scaffolder created the PR correctly. CI passed. The PR was merged. ArgoCD detected the new application. But the child app immediately showed &lt;code&gt;OutOfSync/Degraded&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync     Degraded

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd describe application demo-iris-2
...
Message: Internal error occurred: failed calling webhook
  &lt;span class="s2"&gt;"inferenceservice.kserve-webhook-server.validator.webhook"&lt;/span&gt;:
  no endpoints available &lt;span class="k"&gt;for &lt;/span&gt;service &lt;span class="s2"&gt;"kserve-webhook-server-service"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;This was the &lt;code&gt;kube-rbac-proxy&lt;/code&gt; ImagePullBackOff failure from earlier — reappearing after a cluster restart. The fix had been applied with &lt;code&gt;kubectl patch&lt;/code&gt; directly, not committed to Git. ArgoCD's &lt;code&gt;selfHeal: true&lt;/code&gt; reverted it on the next sync cycle. The cluster restart exposed that the fix was never persisted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify the patch is in kustomization.yaml&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A2&lt;/span&gt; patches infrastructure/serving-stack/kustomization.yaml

&lt;span class="c"&gt;# Commit and push&lt;/span&gt;
git add infrastructure/serving-stack/
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"serving-stack: persist kube-rbac-proxy removal patch"&lt;/span&gt;
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ArgoCD picked up the change within 3 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;Any fix applied with &lt;code&gt;kubectl&lt;/code&gt; directly in a GitOps-managed cluster is temporary. The next sync cycle will revert it. Every fix must be committed to Git to survive. The PR-merged-but-nothing-deployed experience is the worst possible failure for a Golden Path demo — the developer did everything correctly and the platform failed silently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 7: Inference Endpoint Returns HTTP 307 Redirect — Traefik Intercepts Before Kourier
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 45 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After &lt;code&gt;demo-iris-2&lt;/code&gt; became &lt;code&gt;Ready=True&lt;/code&gt;, the inference test returned an unexpected redirect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[6.8,2.8,4.8,1.4]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://172.20.0.3/v1/models/demo-iris-2:predict

&amp;lt; HTTP/1.1 307 Temporary Redirect
&amp;lt; Location: https://172.20.0.3/v1/models/demo-iris-2:predict
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;k3d's built-in Traefik ingress was intercepting the request and applying an HTTP-to-HTTPS redirect before it reached Kourier. The request never reached the Knative routing layer at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Use direct pod port-forward for canonical local verification, bypassing Traefik and Kourier entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find predictor pod&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get pods &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-l&lt;/span&gt; serving.knative.dev/revision&lt;span class="o"&gt;=&lt;/span&gt;demo-iris-2-predictor-00001 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.items[0].metadata.name}'&lt;/span&gt;

&lt;span class="c"&gt;# Port-forward directly to the pod&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default port-forward &lt;span class="se"&gt;\&lt;/span&gt;
  pod/demo-iris-2-predictor-00001-deployment-&amp;lt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 18080:8080

&lt;span class="c"&gt;# Predict&lt;/span&gt;
curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[6.8,2.8,4.8,1.4],[6.0,3.4,4.5,1.6]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://127.0.0.1:18080/v1/models/demo-iris-2:predict

&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"predictions"&lt;/span&gt;:[1,1]&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;A healthy inference endpoint can look completely broken if your test path hits an unexpected intermediary. For local k3d clusters, disable Traefik at cluster creation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k3d cluster create neuroscale &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--k3s-arg&lt;/span&gt; &lt;span class="s2"&gt;"--disable=traefik@server:0"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Failure 8: Catalog Ingestion Silently Rejects Template After Values Update
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 20 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After updating &lt;code&gt;values.yaml&lt;/code&gt; and rolling out a new Backstage deployment, the template disappeared from &lt;code&gt;/create&lt;/code&gt; again — the same symptom as Failure 1, but after it had been working.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The rolling update caused a brief period where the new pod was starting and the old pod was terminating. During this window, the catalog re-ingested all locations. The updated &lt;code&gt;values.yaml&lt;/code&gt; had a YAML indentation error in the &lt;code&gt;catalog.locations&lt;/code&gt; block, which caused the allow rule for &lt;code&gt;Template&lt;/code&gt; to be silently dropped during parsing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check catalog ingestion in the new pod logs immediately after rollout&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage logs deploy/neuroscale-backstage &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100 | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"warn&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;error&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;fail&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;forbidden"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fixed the YAML indentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Correct indentation&lt;/span&gt;
&lt;span class="na"&gt;catalog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;locations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/...&lt;/span&gt;
      &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# must be under rules:, not misaligned&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;A YAML indentation error in Backstage config values that still parses as valid YAML is not surfaced as an error; the misplaced field is simply ignored. After every Backstage rollout that touches &lt;code&gt;appConfig&lt;/code&gt;, immediately verify catalog ingestion by checking server logs and confirming the template appears in &lt;code&gt;/create&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 9: Scaffolder Task Hangs Then Fails — Port-Forward Session Died Mid-Task
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 15 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;The scaffolder task started successfully, the progress spinner ran for 60 seconds, and then the task failed with a network error. The Backstage UI showed the task as failed with no specific error message. A second attempt worked immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;kubectl port-forward&lt;/code&gt; session for Backstage had silently died between opening the browser and submitting the scaffolder form. The React app was loaded from cache — so the page appeared fully functional — but all API calls were failing because the backend was unreachable. The scaffolder task started, sent the first API call, and failed on the network layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before running any Backstage scaffolder task, verify the port-forward is alive&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"http://localhost:7010/api/catalog/entities?limit=1"&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; 100

&lt;span class="c"&gt;# If it returns nothing or errors, restart the port-forward&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage port-forward svc/neuroscale-backstage 7010:7007
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;scripts/port-forward-all.sh&lt;/code&gt; from the repository, which starts all required port-forwards as background processes with clean shutdown handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;A React app loaded from browser cache looks fully functional even when the backend is unreachable. Always verify the backend API is responding before running a scaffolder task, not just that the UI loaded.&lt;/p&gt;
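
&lt;p&gt;That pre-flight check can be wrapped in a small retry helper, so a port-forward that is still coming up is not mistaken for a dead one. This is a sketch only: &lt;code&gt;wait_until&lt;/code&gt; is a hypothetical name, and the attempt count and delay are arbitrary.&lt;/p&gt;

```shell
# Hypothetical helper: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      echo "ready after $n attempt(s)"
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  echo "not ready after $attempts attempt(s)"
  return 1
}
```

&lt;p&gt;For the Backstage backend this might be called as &lt;code&gt;wait_until 5 2 curl -sf -o /dev/null "http://localhost:7010/api/catalog/entities?limit=1"&lt;/code&gt; before opening the scaffolder form.&lt;/p&gt;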




&lt;h2&gt;
  
  
  What the Golden Path Actually Proves After Nine Failures
&lt;/h2&gt;

&lt;p&gt;Final working state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservice demo-iris-2
NAME          URL                                       READY   AGE
demo-iris-2   http://demo-iris-2.default.example.com   True    25m

&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[6.8,2.8,4.8,1.4],[6.0,3.4,4.5,1.6]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://127.0.0.1:18080/v1/models/demo-iris-2:predict

&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"predictions"&lt;/span&gt;:[1,1]&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Golden Path demo is a chain of seven moving parts: Backstage config, GitHub auth, ArgoCD app-of-apps, KServe controller, Knative routing, Kourier gateway, and the predictor pod. In production, any link in that chain can fail independently.&lt;/p&gt;

&lt;p&gt;The debugging process for these nine failures maps directly onto what a platform SRE does on an on-call shift.&lt;/p&gt;




&lt;h2&gt;
  
  
  Debugging Commands Reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Backstage catalog ingestion errors&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage logs deploy/neuroscale-backstage | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"warn&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;error&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;fail&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;forbidden"&lt;/span&gt;

&lt;span class="c"&gt;# Backstage runtime config&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage describe configmap neuroscale-backstage-app-config

&lt;span class="c"&gt;# Verify GitHub token is present (check length only)&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="nb"&gt;exec &lt;/span&gt;deploy/neuroscale-backstage &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo ${#GITHUB_TOKEN} chars'&lt;/span&gt;

&lt;span class="c"&gt;# ArgoCD child app sync status&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get applications
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd describe application demo-iris-2

&lt;span class="c"&gt;# InferenceService conditions&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default describe inferenceservice demo-iris-2

&lt;span class="c"&gt;# Admission webhook endpoints&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get endpoints kserve-webhook-server-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Pattern Across All Nine Failures
&lt;/h2&gt;

&lt;p&gt;Looking back at the nine failures, they fall into three categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silent failures (no UI error, log only):&lt;/strong&gt;&lt;br&gt;
Failures 1, 2, 8: catalog ingestion rejections and auth failures that show nothing in the UI. Rule: always check server logs, not just the browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration hierarchy failures:&lt;/strong&gt;&lt;br&gt;
Failures 3, 4: missing required keys and wrong Helm nesting. Rule: validate rendered manifests in CI before applying them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State and dependency failures:&lt;/strong&gt;&lt;br&gt;
Failures 5, 6, 7, 9: stale secrets, uncommitted fixes, intercepting proxies, dead sessions. Rule: verify the complete dependency chain before debugging the thing that appears broken.&lt;/p&gt;




&lt;h2&gt;
  
  
  See Also
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md&lt;/code&gt;&lt;/a&gt; — full 12-section RCA for Failure 4&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_3_GOLDEN_PATH.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_3_GOLDEN_PATH.md&lt;/code&gt;&lt;/a&gt; — complete implementation record&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/backstage/values.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/backstage/values.yaml&lt;/code&gt;&lt;/a&gt; — working dev Backstage configuration&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/backstage/values-prod.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/backstage/values-prod.yaml&lt;/code&gt;&lt;/a&gt; — production profile with GitHub OAuth&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/scripts/smoke-test.sh" rel="noopener noreferrer"&gt;&lt;code&gt;scripts/smoke-test.sh&lt;/code&gt;&lt;/a&gt; — automated end-to-end verification&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Jimoh Sodiq Bolaji&lt;/strong&gt; | Platform Engineer | Technical Content Engineer | Abuja, Nigeria&lt;br&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;NeuroScale Platform&lt;/a&gt; · &lt;a href="https://dev.to/sodiqjimoh"&gt;Dev.to&lt;/a&gt;&lt;/p&gt;

</description>
      <category>backstage</category>
      <category>kubernetes</category>
      <category>kserve</category>
      <category>gitops</category>
    </item>
    <item>
      <title>Beyond InferenceService Readiness: 5 GitOps Failure Modes That Break KServe Deployments</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Mon, 30 Mar 2026 22:52:16 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/beyond-inferenceservice-readiness-5-gitops-failure-modes-that-break-kserve-deployments-14fb</link>
      <guid>https://forem.com/sodiqjimoh/beyond-inferenceservice-readiness-5-gitops-failure-modes-that-break-kserve-deployments-14fb</guid>
      <description>&lt;p&gt;&lt;strong&gt;A sequel to my KServe readiness post — five GitOps control-plane failure modes with exact terminal output, diagnostics, and repeatable fixes for ArgoCD + KServe stacks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This post is a follow-up to my earlier KServe piece on endpoint readiness:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://dev.to/sodiqjimoh/why-your-kserve-inferenceservice-wontbecome-readyfour-production-failures-and-fixes-nei"&gt;Why Your KServe InferenceService Won't Become Ready: Four Production Failures and Fixes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That article focused on why an &lt;code&gt;InferenceService&lt;/code&gt; may not become &lt;code&gt;Ready&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This one zooms out to a broader question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What breaks when the GitOps control plane itself is unstable?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most GitOps + AI serving tutorials still focus on the happy path — install ArgoCD, apply KServe, deploy InferenceService, done. But in real platform work, the happy path is the easy part.&lt;/p&gt;

&lt;p&gt;The hard part is when your app is &lt;code&gt;OutOfSync&lt;/code&gt;, the webhook has no endpoints, and everything looks healthy except the thing you actually need.&lt;/p&gt;

&lt;p&gt;This post covers the &lt;strong&gt;five failure modes&lt;/strong&gt; that repeatedly broke KServe deployments in a real production-grade platform build, with exact terminal output, root causes, and the fixes that worked.&lt;/p&gt;

&lt;p&gt;All failures come from hands-on implementation work documented here:&lt;br&gt;
&lt;strong&gt;Project repo:&lt;/strong&gt; &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The platform context
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ArgoCD&lt;/strong&gt; — GitOps reconciliation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KServe&lt;/strong&gt; — model serving (&lt;code&gt;InferenceService&lt;/code&gt;, runtimes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knative + Kourier&lt;/strong&gt; — serving networking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kyverno&lt;/strong&gt; — policy guardrails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backstage&lt;/strong&gt; — self-service PR generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitOps root app:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;neuroscale-infrastructure&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/sodiq-code/neuroscale-platform.git&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infrastructure/apps&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Failure Mode 1: Webhook Has No Endpoints — Sync Fails Cluster-Wide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; ~1 hour | &lt;strong&gt;Impact:&lt;/strong&gt; All InferenceService operations blocked&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;ArgoCD syncs child apps and hits this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd describe application ai-model-alpha
...
Message: admission webhook
  &lt;span class="s2"&gt;"inferenceservice.kserve-webhook-server.validator.webhook"&lt;/span&gt;
  denied the request: Internal error occurred:
  no endpoints available &lt;span class="k"&gt;for &lt;/span&gt;service &lt;span class="s2"&gt;"kserve-webhook-server-service"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meanwhile the KServe controller pod shows only 1 of 2 containers ready:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get pods
NAME                                        READY   STATUS
kserve-controller-manager-8d7c5b9f4-xr2lm  1/2     Running

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve describe pod kserve-controller-manager-8d7c5b9f4-xr2lm
...
  kube-rbac-proxy:
    State:   Waiting
    Reason:  ImagePullBackOff
    Image:   gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
Events:
  Warning  Failed  kubelet
    Failed to pull image: unexpected status code 403 Forbidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;kube-rbac-proxy&lt;/code&gt; sidecar inside &lt;code&gt;kserve-controller-manager&lt;/code&gt; was pulling from &lt;code&gt;gcr.io/kubebuilder/&lt;/code&gt;, a registry that restricted access in late 2025. The manager container was healthy, but because the sidecar never started, the pod never became Ready (1/2), so Kubernetes kept it out of the endpoints for &lt;code&gt;kserve-webhook-server-service&lt;/code&gt;. Result: every &lt;code&gt;InferenceService&lt;/code&gt; create or update was rejected cluster-wide.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Remove the sidecar with a Kustomize strategic merge patch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/serving-stack/patches/&lt;/span&gt;
&lt;span class="c1"&gt;#   kserve-controller-kube-rbac-proxy-image.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-controller-manager&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-rbac-proxy&lt;/span&gt;
          &lt;span class="na"&gt;$patch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify webhook endpoints are restored after re-sync:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get endpoints kserve-webhook-server-service
NAME                           ENDPOINTS          AGE
kserve-webhook-server-service  10.42.0.23:9443    45s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;When webhook endpoints are missing, your app YAML is never the real problem. Diagnose the controller first. An external registry access change can silently kill your entire admission layer cluster-wide with no obvious error in the app itself.&lt;/p&gt;
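&lt;p&gt;The "diagnose the controller first" check can be scripted. What follows is a minimal Python sketch, assuming the Endpoints object was fetched with &lt;code&gt;kubectl -n kserve get endpoints kserve-webhook-server-service -o json&lt;/code&gt;. The helper name &lt;code&gt;has_ready_endpoints&lt;/code&gt; and the sample payloads are hypothetical and not part of the NeuroScale repo:&lt;/p&gt;

```python
import json


def has_ready_endpoints(endpoints: dict) -> bool:
    """Return True if the Endpoints object lists at least one ready address."""
    # "subsets" is absent or empty when no pod backs the Service.
    for subset in endpoints.get("subsets") or []:
        # Ready pods appear under "addresses"; unready ones under "notReadyAddresses".
        if subset.get("addresses"):
            return True
    return False


# Hypothetical payload shaped like the broken state: the only pod is not Ready.
broken = json.loads('{"subsets": [{"notReadyAddresses": [{"ip": "10.42.0.23"}]}]}')
# Hypothetical payload shaped like the restored state shown above.
healthy = json.loads('{"subsets": [{"addresses": [{"ip": "10.42.0.23"}]}]}')

print(has_ready_endpoints(broken))   # False
print(has_ready_endpoints(healthy))  # True
```

&lt;p&gt;A check like this could gate a sync script so that app manifests are only touched once the webhook Service has a ready backend.&lt;/p&gt;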




&lt;h2&gt;
  
  
  Failure Mode 2: CRD Deleted by a Misapplied Patch — All Endpoints Gone Instantly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 4 minutes recovery | &lt;strong&gt;Impact:&lt;/strong&gt; SEV-1 equivalent — all InferenceServices deleted&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;All InferenceService objects disappeared silently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservices
No resources found &lt;span class="k"&gt;in &lt;/span&gt;default namespace.

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync      Missing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;A Kustomize patch file named &lt;code&gt;remove-inferenceservice-crd.yaml&lt;/code&gt; was mistakenly applied directly with &lt;code&gt;kubectl apply -f&lt;/code&gt; instead of being used as a build-time patch inside &lt;code&gt;kustomization.yaml&lt;/code&gt;. The file contained a &lt;code&gt;$patch: delete&lt;/code&gt; directive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apiextensions.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CustomResourceDefinition&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inferenceservices.serving.kserve.io&lt;/span&gt;
&lt;span class="na"&gt;$patch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When applied directly, it deleted the actual CRD from Kubernetes. When a CRD is deleted, Kubernetes immediately garbage-collects every custom resource of that type. Every InferenceService was gone within seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Restore the CRD immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/kserve/kserve/releases/download/v0.12.1/kserve.yaml

kubectl &lt;span class="nb"&gt;wait &lt;/span&gt;crd/inferenceservices.serving.kserve.io &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Established &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60s

kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd patch application demo-iris-2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; merge &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s1"&gt;'{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$patch: delete&lt;/code&gt; in a Kustomize file is a build-time instruction: it tells &lt;code&gt;kustomize build&lt;/code&gt; to omit that resource from its output. It must never be applied directly with &lt;code&gt;kubectl apply -f&lt;/code&gt;. Ambiguous filenames like &lt;code&gt;remove-inferenceservice-crd.yaml&lt;/code&gt; are footguns. In a production cluster with 50 deployed models, this would be a full SEV-1.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Rule:&lt;/strong&gt; Any file containing &lt;code&gt;$patch: delete&lt;/code&gt; must only ever be referenced inside a &lt;code&gt;kustomization.yaml&lt;/code&gt; patches block, never applied directly.&lt;/p&gt;
&lt;/blockquote&gt;
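&lt;p&gt;One way to enforce that rule mechanically is to scan a file before it ever reaches &lt;code&gt;kubectl apply -f&lt;/code&gt;. The Python sketch below is an assumption about how such a guard might look, not something from the NeuroScale repo; the function name &lt;code&gt;is_kustomize_delete_patch&lt;/code&gt; is hypothetical, and it uses a line-based regular expression rather than a full YAML parser:&lt;/p&gt;

```python
import re

# Matches a "$patch: delete" key at any indentation, optionally as a list item.
PATCH_DELETE = re.compile(r"^\s*(-\s+)?\$patch:\s*delete\s*$", re.MULTILINE)


def is_kustomize_delete_patch(text: str) -> bool:
    """Return True if the YAML text contains a $patch: delete directive."""
    return PATCH_DELETE.search(text) is not None


# The patch file from this incident.
crd_patch = """\
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: inferenceservices.serving.kserve.io
$patch: delete
"""

# An ordinary manifest fragment with no delete directive.
plain_manifest = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kserve-controller-manager
"""

print(is_kustomize_delete_patch(crd_patch))       # True
print(is_kustomize_delete_patch(plain_manifest))  # False
```

&lt;p&gt;A wrapper around &lt;code&gt;kubectl apply&lt;/code&gt; or a pre-commit hook could refuse any file for which this returns true.&lt;/p&gt;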




&lt;h2&gt;
  
  
  Failure Mode 3: Permanent OutOfSync Due to Label Key Mismatch
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 2 weeks undetected | &lt;strong&gt;Impact:&lt;/strong&gt; CI was green while policy enforcement was silently broken&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;A PR is merged, ArgoCD syncs, but the application stays &lt;code&gt;OutOfSync/Degraded&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync      Degraded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kyverno denies the resource at admission:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Error from server: error when creating &lt;span class="s2"&gt;"STDIN"&lt;/span&gt;:
  admission webhook &lt;span class="s2"&gt;"clusterpolice.kyverno.svc"&lt;/span&gt; denied the request:
  resource InferenceService/default/test-model was blocked due to the following policies
  require-standard-labels-inferenceservice:
    check-owner-and-cost-center-on-isvc: &lt;span class="s1"&gt;'validation error:
    InferenceService resources must set metadata.labels.owner and
    metadata.labels.cost-center.
    rule check-owner-and-cost-center-on-isvc failed at path
    /metadata/labels/cost-center/'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But a cost-center label appears to be present on the resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservice demo-iris-2 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.metadata.labels}'&lt;/span&gt; | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"owner"&lt;/span&gt;: &lt;span class="s2"&gt;"platform-team"&lt;/span&gt;,
    &lt;span class="s2"&gt;"costCenter"&lt;/span&gt;: &lt;span class="s2"&gt;"ai-platform"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;costCenter&lt;/code&gt; (camelCase) and &lt;code&gt;cost-center&lt;/code&gt; (kebab-case) are completely different Kubernetes label keys. The Backstage template skeleton was generating &lt;code&gt;costCenter&lt;/code&gt;. The Kyverno policy required &lt;code&gt;cost-center&lt;/code&gt;. CI stayed green on these manifests, so the mismatch only surfaced at admission time.&lt;/p&gt;

&lt;p&gt;CI stayed green for two reasons. First, &lt;code&gt;kyverno-cli apply&lt;/code&gt; exits with code &lt;code&gt;0&lt;/code&gt; even when policy violations are found. Second, the step piped its output through &lt;code&gt;tee&lt;/code&gt; and checked &lt;code&gt;$?&lt;/code&gt; rather than &lt;code&gt;${PIPESTATUS[0]}&lt;/code&gt;, so it was reading the exit code of &lt;code&gt;tee&lt;/code&gt;. The CI step appeared green while enforcement was completely broken for two weeks.&lt;/p&gt;
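&lt;p&gt;Because label keys are compared as exact strings, the mismatch is easy to reproduce outside the cluster. A small Python sketch, assuming the labels have already been parsed into a dictionary; &lt;code&gt;missing_labels&lt;/code&gt; and &lt;code&gt;REQUIRED_LABELS&lt;/code&gt; are hypothetical names used only for illustration:&lt;/p&gt;

```python
# Keys required by the Kyverno policy quoted above.
REQUIRED_LABELS = ("owner", "cost-center")


def missing_labels(labels: dict, required=REQUIRED_LABELS) -> list:
    """Return required label keys absent from labels; keys are compared exactly."""
    return [key for key in required if key not in labels]


# What the Backstage skeleton generated (camelCase key).
generated = {"owner": "platform-team", "costCenter": "ai-platform"}
# What the policy expects (kebab-case key).
fixed = {"owner": "platform-team", "cost-center": "ai-platform"}

print(missing_labels(generated))  # ['cost-center']
print(missing_labels(fixed))      # []
```

&lt;p&gt;Running a check of this kind against rendered template output in CI would flag the key before admission does, independently of the Kyverno CLI's exit code.&lt;/p&gt;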

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Standardize on kebab-case throughout (Kubernetes convention):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Backstage template skeleton&lt;/span&gt;
&lt;span class="c"&gt;# apps/${{ values.name }}/inference-service.yaml&lt;/span&gt;
labels:
  owner: platform-team
  cost-center: ai-platform   &lt;span class="c"&gt;# was: costCenter&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fix the CI Kyverno check to catch actual violations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;set&lt;/span&gt; +e
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;&lt;span class="s2"&gt;:/work"&lt;/span&gt; &lt;span class="nt"&gt;-w&lt;/span&gt; /work ghcr.io/kyverno/kyverno-cli:v1.12.5 &lt;span class="se"&gt;\&lt;/span&gt;
  apply infrastructure/kyverno/policies/&lt;span class="k"&gt;*&lt;/span&gt;.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;app_files&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;tee&lt;/span&gt; /tmp/kyverno-output.txt
&lt;span class="nv"&gt;kyverno_exit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PIPESTATUS&lt;/span&gt;&lt;span class="p"&gt;[0]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;kyverno_exit&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-ne&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qE&lt;/span&gt; &lt;span class="s2"&gt;"^FAIL"&lt;/span&gt; /tmp/kyverno-output.txt &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qE&lt;/span&gt; &lt;span class="s2"&gt;"fail: [1-9][0-9]*"&lt;/span&gt; /tmp/kyverno-output.txt&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Kyverno policy violations detected. Failing CI."&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;After a pipeline, &lt;code&gt;$?&lt;/code&gt; holds the exit code of the last command (&lt;code&gt;tee&lt;/code&gt;), not of &lt;code&gt;kyverno&lt;/code&gt;. &lt;code&gt;${PIPESTATUS[0]}&lt;/code&gt; holds kyverno's actual exit code. "Guardrails exist" and "guardrails enforce" are different states. The most dangerous failure mode for a policy system is a silent false pass: everything looks green while nothing is actually being enforced.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure Mode 4: Kyverno Install Breaks ArgoCD Reconciliation Loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 2–5 minutes per cluster | &lt;strong&gt;Impact:&lt;/strong&gt; All ArgoCD apps enter Unknown state&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After adding Kyverno to the platform, previously healthy apps enter &lt;code&gt;Unknown&lt;/code&gt; state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get applications
NAME                       SYNC STATUS   HEALTH STATUS
neuroscale-infrastructure  Synced         Healthy
serving-stack              Unknown        Unknown    &lt;span class="c"&gt;# was Healthy 10 minutes ago&lt;/span&gt;
policy-guardrails          Synced         Healthy

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd describe application serving-stack
...
Message: rpc error: code &lt;span class="o"&gt;=&lt;/span&gt; Unavailable desc &lt;span class="o"&gt;=&lt;/span&gt; connection refused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;Kyverno registers its own &lt;code&gt;ValidatingWebhookConfiguration&lt;/code&gt; and &lt;code&gt;MutatingWebhookConfiguration&lt;/code&gt; during installation. While Kyverno is initializing, those configurations are already registered but point to endpoints that do not exist yet. During this window, any &lt;code&gt;kubectl apply&lt;/code&gt; operation, including ArgoCD's sync reconciliation loop, times out waiting for a response from the not-yet-running webhook server. This cascades into the ArgoCD repo-server losing its connection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Add a Kyverno &lt;code&gt;webhookAnnotations&lt;/code&gt; ConfigMap patch to suppress automatic webhook registration during the initialization window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/kyverno/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;patches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno&lt;/span&gt;
    &lt;span class="na"&gt;patch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
      &lt;span class="s"&gt;apiVersion: v1&lt;/span&gt;
      &lt;span class="s"&gt;kind: ConfigMap&lt;/span&gt;
      &lt;span class="s"&gt;metadata:&lt;/span&gt;
        &lt;span class="s"&gt;name: kyverno&lt;/span&gt;
        &lt;span class="s"&gt;namespace: kyverno&lt;/span&gt;
      &lt;span class="s"&gt;data:&lt;/span&gt;
        &lt;span class="s"&gt;webhookAnnotations: "{}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After Kyverno reaches &lt;code&gt;Running&lt;/code&gt; state, force a hard refresh:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd patch application serving-stack &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; merge &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s1"&gt;'{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;Adding a policy engine to an existing cluster disrupts all other ArgoCD-managed applications during the install window. In production, this requires a maintenance window or a canary install strategy. Kyverno must be fully healthy before any other component syncs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure Mode 5: Stale Admission Webhook Blocks All Resource Creation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 30+ minutes | &lt;strong&gt;Impact:&lt;/strong&gt; All Deployments in the namespace silently blocked&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After fixing the repo-server, apps sync but Deployments never appear:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get applications &lt;span class="nt"&gt;-n&lt;/span&gt; argocd
NAME                       SYNC STATUS   HEALTH STATUS
neuroscale-infrastructure  Synced         Healthy
test-app                   Synced         Progressing   &lt;span class="c"&gt;# stuck&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get deploy &lt;span class="nt"&gt;-n&lt;/span&gt; default
No resources found &lt;span class="k"&gt;in &lt;/span&gt;default namespace.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ArgoCD shows the Deployment as "synced" but it does not exist — a contradiction. Checking conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get application test-app &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 20 conditions
  conditions:
  - message: &lt;span class="s1"&gt;'Failed sync attempt: one or more objects failed to apply,
      reason: Internal error occurred: failed calling webhook
      "validate.nginx.ingress.kubernetes.io":
      failed to call webhook: Post
      "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/...":
      dial tcp 10.96.x.x:443: connect: connection refused'&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;: SyncError
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;ValidatingWebhookConfiguration&lt;/code&gt; from a previous cluster experiment was still registered but pointed to a service that no longer existed. Webhook configurations are cluster-scoped, so this one outlived the service behind it. The stale &lt;code&gt;ingress-nginx&lt;/code&gt; webhook was intercepting every resource creation attempt and failing it. The error appears only in ArgoCD's sync conditions and events, not on the Deployment itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Discover stale webhooks&lt;/span&gt;
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

&lt;span class="c"&gt;# Delete the stale one&lt;/span&gt;
kubectl delete validatingwebhookconfiguration ingress-nginx-admission

&lt;span class="c"&gt;# Force ArgoCD to retry&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd patch application test-app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; merge &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s1"&gt;'{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After deletion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get deploy &lt;span class="nt"&gt;-n&lt;/span&gt; default
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
nginx-test    1/1     1            1           23s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;A stale webhook left over from a previous workload can silently block resource creation for hours, with no obvious error on the resource itself. The admission error shows up only on the ArgoCD side, in the application's sync conditions and events. Always check for stale webhooks before blaming manifests.&lt;/p&gt;
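&lt;p&gt;Spotting a stale webhook by eye gets harder as the list of configurations grows. The Python sketch below is one possible way to cross-check them, assuming the inputs are the parsed JSON from &lt;code&gt;kubectl get validatingwebhookconfigurations -o json&lt;/code&gt; and &lt;code&gt;kubectl get svc -A -o json&lt;/code&gt;. The function names and sample data are hypothetical:&lt;/p&gt;

```python
def service_keys(service_list: dict) -> set:
    """Collect (namespace, name) pairs from a parsed Service list."""
    return {
        (item["metadata"]["namespace"], item["metadata"]["name"])
        for item in service_list.get("items", [])
    }


def stale_webhooks(webhook_configs: dict, existing: set) -> list:
    """Return (config name, webhook name) pairs whose backing Service is missing."""
    stale = []
    for config in webhook_configs.get("items", []):
        for hook in config.get("webhooks") or []:
            # URL-based webhooks have no "service" entry and are skipped here.
            svc = hook.get("clientConfig", {}).get("service")
            if svc and (svc["namespace"], svc["name"]) not in existing:
                stale.append((config["metadata"]["name"], hook["name"]))
    return stale


# Hypothetical sample shaped like the incident: the config exists, the Service does not.
configs = {"items": [{
    "metadata": {"name": "ingress-nginx-admission"},
    "webhooks": [{
        "name": "validate.nginx.ingress.kubernetes.io",
        "clientConfig": {"service": {
            "namespace": "ingress-nginx",
            "name": "ingress-nginx-controller-admission",
        }},
    }],
}]}
services = {"items": [
    {"metadata": {"namespace": "kserve", "name": "kserve-webhook-server-service"}},
]}

print(stale_webhooks(configs, service_keys(services)))
# [('ingress-nginx-admission', 'validate.nginx.ingress.kubernetes.io')]
```

&lt;p&gt;This only flags webhooks whose Service is gone entirely. A Service that exists but has no ready endpoints, as in Failure Mode 1, needs a separate endpoints check.&lt;/p&gt;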




&lt;h2&gt;
  
  
  The Triage Sequence That Saves Hours
&lt;/h2&gt;

&lt;p&gt;When a KServe app is failing in ArgoCD, run these checks in this exact order before touching any manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Environment gate — if this fails, stop and fix environment first&lt;/span&gt;
kubectl get nodes
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get applications

&lt;span class="c"&gt;# 2. Control-plane health&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get deploy,pods,svc,endpoints
kubectl get crd | &lt;span class="nb"&gt;grep &lt;/span&gt;serving.kserve.io

&lt;span class="c"&gt;# 3. Controller logs&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve logs deploy/kserve-controller-manager &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100

&lt;span class="c"&gt;# 4. Webhook availability&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get endpoints kserve-webhook-server-service

&lt;span class="c"&gt;# 5. Stale webhooks&lt;/span&gt;
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

&lt;span class="c"&gt;# 6. App-level sync error detail&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get application &amp;lt;app-name&amp;gt; &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 20 conditions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only after every step above passes should you edit app manifests.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters for Platform Teams
&lt;/h2&gt;

&lt;p&gt;A platform is credible when it supports both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-service delivery&lt;/strong&gt; — the Golden Path works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-service recovery&lt;/strong&gt; — failures are understandable and fixable without a platform expert&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams build the first and postpone the second. That creates operational debt fast.&lt;/p&gt;

&lt;p&gt;The fix is not more dashboards. It is better failure-model documentation, tighter GitOps guardrails, and the discipline to document what breaks — not just what works.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A platform is not "done" when the happy path works. It's done when the failure path is understandable and recoverable.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What I Would Improve Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Pre-merge CI assertions for probe and resource fields in rendered manifests&lt;/li&gt;
&lt;li&gt;Explicit dependency ordering using ArgoCD sync waves to prevent Kyverno install disruption&lt;/li&gt;
&lt;li&gt;Conformance checks for Helm dependency values nesting to catch silently ignored overrides&lt;/li&gt;
&lt;li&gt;Policy test fixtures that verify both pass and fail cases in CI&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  See Also
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_1_GITOPS_SPINE.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_1_GITOPS_SPINE.md&lt;/code&gt;&lt;/a&gt; — ArgoCD spine failures with exact terminal output&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md&lt;/code&gt;&lt;/a&gt; — Kyverno CI false-green and the &lt;code&gt;$PIPESTATUS[0]&lt;/code&gt; fix&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md&lt;/code&gt;&lt;/a&gt; — full incident postmortem with 12-section RCA&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_2_KSERVE_SERVING.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_2_KSERVE_SERVING.md&lt;/code&gt;&lt;/a&gt; — the kube-rbac-proxy failure in full detail&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Jimoh Sodiq Bolaji&lt;/strong&gt; | Platform Engineer | Technical Content Engineer | Abuja, Nigeria&lt;br&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;NeuroScale Platform&lt;/a&gt; · &lt;a href="https://dev.to/sodiqjimoh"&gt;Dev.to&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>gitops</category>
      <category>devops</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why Your KServe InferenceService Won't Become Ready: Four Production Failures and Fixes</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Mon, 30 Mar 2026 03:37:24 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/why-your-kserve-inferenceservice-wontbecome-readyfour-production-failures-and-fixes-nei</link>
      <guid>https://forem.com/sodiqjimoh/why-your-kserve-inferenceservice-wontbecome-readyfour-production-failures-and-fixes-nei</guid>
      <description>&lt;p&gt;A practitioner's account of the errors the KServe getting-started documentation doesn't tell you about — with exact terminal output, root causes, and working Kustomize patches.&lt;/p&gt;

&lt;p&gt;This article documents four production failures I encountered while deploying KServe on a local k3d cluster as part of building NeuroScale — a self-service AI inference platform. None of these failures appear in the official KServe getting-started documentation. If you are deploying KServe without Istio, this will save you several hours of debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Was Building
&lt;/h2&gt;

&lt;p&gt;NeuroScale is a self-service AI inference platform on Kubernetes. The goal was simple: one InferenceService named &lt;code&gt;sklearn-iris&lt;/code&gt; reaches &lt;code&gt;Ready=True&lt;/code&gt; and responds to a prediction request.&lt;/p&gt;

&lt;p&gt;The install had to be GitOps-managed via ArgoCD — not "I ran some scripts." Getting there took two days and four distinct failures. Here is every one of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; k3d (local Kubernetes) · KServe 0.12.1 · ArgoCD · Kourier (no Istio) · Knative Serving&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📝 &lt;strong&gt;Author's Note:&lt;/strong&gt; This article was originally documented in the NeuroScale platform repository.&lt;br&gt;
&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;docs/REALITY_CHECK_MILESTONE_2_KSERVE_SERVING.md&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Failure 1: KServe InferenceService Stuck Not Ready — Istio vs Kourier Ingress Mismatch Causes ReconcileError Loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; ~3 hours&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After applying the KServe installation via ArgoCD (serving-stack app), the InferenceService was created but never became Ready:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservice sklearn-iris
NAME           URL   READY   PREV   LATEST   AGE
sklearn-iris         False          100      8m

&lt;span class="c"&gt;# READY=False with no URL = KServe controller did not complete ingress setup.&lt;/span&gt;
&lt;span class="c"&gt;# No Knative Route was created. No external URL was assigned.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Digging In
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default describe inferenceservice sklearn-iris
...
Status:
  Conditions:
    Message: Failed to reconcile ingress
    Reason:  ReconcileError
    Status:  False
    Type:    IngressReady

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve logs deploy/kserve-controller-manager &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50
...
ERROR controller.inferenceservice Failed to reconcile ingress
  &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;: &lt;span class="s2"&gt;"virtual service not found: sklearn-iris.default.svc.cluster.local"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error referenced a &lt;em&gt;virtual service&lt;/em&gt; — that is an Istio concept. But we were running Kourier. The KServe controller was attempting to create an Istio VirtualService in a cluster that had no Istio control plane.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause: Default KServe Ingress Mode Assumes Istio
&lt;/h3&gt;

&lt;p&gt;KServe's default &lt;code&gt;inferenceservice-config&lt;/code&gt; ConfigMap expects Istio as the ingress provider. It sets &lt;code&gt;ingressClassName: istio&lt;/code&gt; and the key &lt;code&gt;disableIstioVirtualHost&lt;/code&gt; defaults to &lt;code&gt;false&lt;/code&gt;. When Istio is absent, the controller enters an error loop trying to create resources that will never exist.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;disableIstioVirtualHost: true&lt;/code&gt; tells KServe to skip Istio and fall back to Knative route objects that Kourier can handle.&lt;/p&gt;
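
&lt;p&gt;One way to confirm which mode a cluster is in is to inspect the &lt;code&gt;ingress&lt;/code&gt; key of the ConfigMap. The following is a minimal sketch rather than anything from the NeuroScale repository: the function name &lt;code&gt;istio_virtual_host_disabled&lt;/code&gt; is hypothetical, and it assumes the value is plain JSON containing a literal &lt;code&gt;"disableIstioVirtualHost": true&lt;/code&gt; entry.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of KServe or the NeuroScale repo):
# reads the ConfigMap's ingress JSON on stdin and succeeds only if
# disableIstioVirtualHost is explicitly set to true.
istio_virtual_host_disabled() {
  # Strip all whitespace so '"key": true' and '"key":true' both match.
  tr -d '[:space:]' | grep -q '"disableIstioVirtualHost":true'
}

# Example against a live cluster (assumes kubectl access to the kserve namespace):
#   kubectl -n kserve get configmap inferenceservice-config \
#     -o jsonpath='{.data.ingress}' | istio_virtual_host_disabled \
#     || echo "KServe will still try to use Istio VirtualServices"
```

&lt;p&gt;A text match like this is deliberately crude. It is enough for a smoke check, but a JSON-aware tool would be the safer choice if the ConfigMap is edited by hand.&lt;/p&gt;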

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why Kourier instead of Istio:&lt;/strong&gt; Istio adds ~1 GB of memory overhead. On a local k3d cluster shared with Docker Desktop, Backstage, and the KServe controller, that exhausts available RAM. Kourier's entire footprint is under 200 MB.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Fix: ConfigMap Patch in serving-stack
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/serving-stack/patches/inferenceservice-config-ingress.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inferenceservice-config&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
    &lt;span class="s"&gt;{&lt;/span&gt;
      &lt;span class="s"&gt;"ingressGateway": "knative-serving/knative-ingress-gateway",&lt;/span&gt;
      &lt;span class="s"&gt;"ingressDomain": "example.com",&lt;/span&gt;
      &lt;span class="s"&gt;"ingressClassName": "istio",&lt;/span&gt;
      &lt;span class="s"&gt;"urlScheme": "http",&lt;/span&gt;
      &lt;span class="s"&gt;"disableIstioVirtualHost": true,&lt;/span&gt;
      &lt;span class="s"&gt;"disableIngressCreation": false&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this patch was applied and the KServe controller restarted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservice sklearn-iris
NAME           URL                                       READY   AGE
sklearn-iris   http://sklearn-iris.default.example.com   True    2m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Business impact:&lt;/strong&gt; This failure cost approximately 3 hours. The KServe documentation does not prominently state that the default configuration requires Istio. The error message "virtual service not found" is Istio-specific vocabulary that only makes sense if you already know Istio is the default — a classic undocumented assumption in infrastructure tooling.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Failure 2: ArgoCD Serving-Stack Sync Fails — Duplicate Knative CRD Exceeds 256 KB Annotation Size Limit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; ~30 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get application serving-stack
NAME            SYNC STATUS   HEALTH STATUS
serving-stack   OutOfSync     Degraded

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd describe application serving-stack
...
Message: CustomResourceDefinition &lt;span class="s2"&gt;"services.serving.knative.dev"&lt;/span&gt;
  is invalid: metadata.annotations:
  Too long: may not be more than 262144 bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;ArgoCD's default client-side apply stores the full manifest in the &lt;code&gt;kubectl.kubernetes.io/last-applied-configuration&lt;/code&gt; annotation. For large CRDs, that annotation pushes the object past Kubernetes' 256 KB (262144-byte) limit on total annotation size. The Knative CRD is approximately 400 KB as a YAML object.&lt;/p&gt;

&lt;p&gt;A rendering overlap compounded the issue: the &lt;code&gt;kserve.yaml&lt;/code&gt; bundle already includes its own version of the Knative Serving CRDs, and we were also referencing &lt;code&gt;serving-core.yaml&lt;/code&gt; directly. This created two attempts to manage the same CRDs, causing comparison instability.&lt;/p&gt;
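
&lt;p&gt;A rough pre-flight check for this limit can be sketched as follows. The helper name &lt;code&gt;exceeds_annotation_limit&lt;/code&gt; and the file name &lt;code&gt;crd.yaml&lt;/code&gt; are assumptions for illustration. The check is only an estimate, because the annotation holds the JSON form of the object rather than the YAML file on disk.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Kubernetes rejects objects whose annotations total more than 262144 bytes.
ANNOTATION_LIMIT_BYTES=262144

# Hypothetical helper: succeeds if the file at $1 is larger than the limit,
# meaning a last-applied-configuration copy of it is unlikely to fit.
exceeds_annotation_limit() {
  local size
  size=$(wc -c "$1" | awk '{ print $1 }')
  [ "$size" -gt "$ANNOTATION_LIMIT_BYTES" ]
}

# Example (assumes a single CRD has been saved to crd.yaml):
#   if exceeds_annotation_limit crd.yaml; then
#     echo "crd.yaml is too large for client-side apply"
#   fi
```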

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/serving-stack/kustomization.yaml&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Use server-side apply to bypass the annotation size limit&lt;/span&gt;
&lt;span class="na"&gt;commonAnnotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;argocd.argoproj.io/sync-options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServerSideApply=true&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Ignore runtime-mutated fields on Knative CRDs&lt;/span&gt;
&lt;span class="c1"&gt;#    (In ArgoCD Application spec)&lt;/span&gt;
&lt;span class="na"&gt;ignoreDifferences&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apiextensions.k8s.io&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CustomResourceDefinition&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;services.serving.knative.dev&lt;/span&gt;
    &lt;span class="na"&gt;jsonPointers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/spec/preserveUnknownFields&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Business impact:&lt;/strong&gt; ArgoCD's error says "Too long" but does not tell you which annotation or why it got too long. Debugging requires knowing that ArgoCD's default client-side apply copies the entire manifest into the &lt;code&gt;last-applied-configuration&lt;/code&gt; annotation.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Failure 3: kube-rbac-proxy ImagePullBackOff Blocks KServe Admission Webhook — gcr.io Access Restriction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; ~1 hour | Cluster-wide impact&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd describe application ai-model-alpha
...
Message: admission webhook
  &lt;span class="s2"&gt;"inferenceservice.kserve-webhook-server.validator.webhook"&lt;/span&gt;
  denied the request: no endpoints available &lt;span class="k"&gt;for
  &lt;/span&gt;service &lt;span class="s2"&gt;"kserve-webhook-server-service"&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get pods
NAME                            READY   STATUS
kserve-controller-manager-xxx   1/2     Running   &lt;span class="c"&gt;# only 1 of 2 ready&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve describe pod kserve-controller-manager-xxx
  kube-rbac-proxy:
    State:   Waiting
    Reason:  ImagePullBackOff
    Image:   gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
Events:
  Warning  Failed  kubelet
    Failed to pull image: unexpected status code 403 Forbidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;KServe 0.12.1's &lt;code&gt;kserve-controller-manager&lt;/code&gt; Deployment includes a &lt;code&gt;kube-rbac-proxy&lt;/code&gt; sidecar from &lt;code&gt;gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1&lt;/code&gt;. Google Container Registry restricted access to kubebuilder images in late 2025.&lt;/p&gt;

&lt;p&gt;The manager container itself was healthy (1 of 2 ready). But because the &lt;code&gt;kube-rbac-proxy&lt;/code&gt; sidecar never started, the pod as a whole never became Ready, so Kubernetes left it out of the webhook Service's endpoints and the admission webhook had no healthy endpoints. The alternative image, &lt;code&gt;registry.k8s.io/kube-rbac-proxy:v0.13.1&lt;/code&gt;, did not exist either.&lt;/p&gt;
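
&lt;p&gt;Because a Service only routes to pods that are fully Ready, a pod stuck at 1/2 is the signal to look for. As a minimal sketch, the hypothetical helper below (the name &lt;code&gt;partially_ready_pods&lt;/code&gt; is an assumption) filters the default &lt;code&gt;kubectl get pods&lt;/code&gt; table for pods with fewer ready containers than total containers.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods` table output on stdin and
# prints the name of every pod whose READY column (for example 1/2) shows
# fewer ready containers than total containers. The header row is skipped.
partially_ready_pods() {
  awk 'NR > 1 { split($2, c, "/"); if (c[2] + 0 > c[1] + 0) print $1 }'
}

# Example (assumes kubectl access to the kserve namespace):
#   kubectl -n kserve get pods | partially_ready_pods
```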

&lt;h3&gt;
  
  
  Fix: Remove the Sidecar via Kustomize Strategic Merge Patch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/serving-stack/patches/&lt;/span&gt;
&lt;span class="c1"&gt;#   kserve-controller-kube-rbac-proxy-image.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-controller-manager&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-rbac-proxy&lt;/span&gt;
          &lt;span class="na"&gt;$patch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this patch and a re-sync:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get pods
NAME                            READY   STATUS
kserve-controller-manager-yyy   1/1     Running   &lt;span class="c"&gt;# fixed&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get endpoints kserve-webhook-server-service
NAME                            ENDPOINTS          AGE
kserve-webhook-server-service   10.42.0.23:9443    45s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Known tradeoff:&lt;/strong&gt; Removing &lt;code&gt;kube-rbac-proxy&lt;/code&gt; disables the Prometheus metrics proxy endpoint for the KServe controller. In production, source a verified replacement image from an accessible registry before deploying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business impact:&lt;/strong&gt; An external registry access change cascaded into a complete admission webhook outage. Any InferenceService creation or update was blocked cluster-wide while the sidecar was failing. This class of failure has no good solution without upstream monitoring of your image dependencies.&lt;/p&gt;
&lt;/blockquote&gt;
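
&lt;p&gt;Monitoring image dependencies can start with something as simple as listing which images a rendered manifest pulls from a given registry. The sketch below is an illustration under stated assumptions: the helper name &lt;code&gt;images_from_registry&lt;/code&gt; is hypothetical, and it assumes each image appears on its own &lt;code&gt;image:&lt;/code&gt; line in the YAML.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads rendered manifests on stdin and prints each
# distinct container image whose reference starts with the prefix in $1.
images_from_registry() {
  local prefix="$1"
  grep -E '^[[:space:]-]*image:' \
    | awk '{ print $NF }' \
    | tr -d "\"'" \
    | awk -v p="$prefix" 'index($0, p) == 1' \
    | sort -u
}

# Example (assumes kustomize is installed and run from the repo root):
#   kustomize build infrastructure/serving-stack \
#     | images_from_registry gcr.io/kubebuilder/
```

&lt;p&gt;Running a check like this in CI would not prevent a registry from changing its access rules, but it would at least show which images sit behind a registry you do not control.&lt;/p&gt;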




&lt;h2&gt;
  
  
  Failure 4: Inference Request Returns HTTP 405 — IngressDomain Placeholder Resolves to Public Internet
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; ~1 hour&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservice sklearn-iris &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.status.url}'&lt;/span&gt;
http://sklearn-iris.default.example.com

&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[5.1,3.5,1.4,0.2]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://sklearn-iris.default.example.com/v1/models/sklearn-iris:predict
&amp;lt;html&amp;gt;&amp;lt;&lt;span class="nb"&gt;head&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;lt;title&amp;gt;405 Not Allowed&amp;lt;/title&amp;gt;&amp;lt;/head&amp;gt;...

&lt;span class="c"&gt;# The request hit the public example.com server, not our Kourier gateway.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;ingressDomain&lt;/code&gt; in the KServe ConfigMap was set to &lt;code&gt;example.com&lt;/code&gt; — a literal placeholder. The generated URL resolves publicly to Cloudflare/IANA servers, not the local cluster.&lt;/p&gt;

&lt;p&gt;Additionally, Kourier routes by Host header, not by IP. Just port-forwarding Kourier and hitting &lt;code&gt;127.0.0.1&lt;/code&gt; does not work without the correct Host header.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix: Direct Predictor Pod Port-Forward
&lt;/h3&gt;

&lt;p&gt;Bypass Knative routing and Kourier entirely for local verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Step 1: Get the predictor pod name&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get pods &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-l&lt;/span&gt; serving.knative.dev/revision&lt;span class="o"&gt;=&lt;/span&gt;sklearn-iris-predictor-00001

&lt;span class="c"&gt;# Step 2: Port-forward directly to the predictor container&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default port-forward &lt;span class="se"&gt;\&lt;/span&gt;
  pod/sklearn-iris-predictor-00001-deployment-&amp;lt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 18080:8080

&lt;span class="c"&gt;# Step 3: Predict (no Host header, no Kourier, no DNS needed)&lt;/span&gt;
curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[5.1,3.5,1.4,0.2],[6.2,3.4,5.4,2.3]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://127.0.0.1:18080/v1/models/sklearn-iris:predict

&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"predictions"&lt;/span&gt;:[0,2]&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the full Kourier routing path, always pass the Host header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kourier-system port-forward svc/kourier 18080:80

curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Host: sklearn-iris-predictor.default.127.0.0.1.sslip.io'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[5.1,3.5,1.4,0.2]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://127.0.0.1:18080/v1/models/sklearn-iris:predict
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Business impact:&lt;/strong&gt; False-negative inference verification. A healthy endpoint looked broken because the test URL resolved to the wrong server. Always verify the complete network path — DNS resolution, ingress routing, pod health — as separate steps rather than assuming a single curl test is conclusive.&lt;/p&gt;
&lt;/blockquote&gt;
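
&lt;p&gt;The DNS part of that check can be automated before any request is sent. The following is a minimal sketch with hypothetical helper names (&lt;code&gt;url_host&lt;/code&gt; and &lt;code&gt;is_placeholder_host&lt;/code&gt;): it extracts the host from a URL such as the InferenceService &lt;code&gt;status.url&lt;/code&gt; and flags hosts under the reserved documentation domains, which will never point at your cluster.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the host part of a URL, for example the
# value of .status.url on an InferenceService.
url_host() {
  local rest="${1#*://}"        # drop the scheme
  rest="${rest%%/*}"            # drop any path
  printf '%s\n' "${rest%%:*}"   # drop any port
}

# Hypothetical helper: succeeds if the host is one of the reserved
# documentation domains (example.com, example.net, example.org) or a
# subdomain of one of them.
is_placeholder_host() {
  case "$1" in
    example.com|example.net|example.org) return 0 ;;
    *.example.com|*.example.net|*.example.org) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (assumes kubectl access to the default namespace):
#   url=$(kubectl -n default get inferenceservice sklearn-iris \
#     -o jsonpath='{.status.url}')
#   if is_placeholder_host "$(url_host "$url")"; then
#     echo "ingressDomain is still a placeholder; curl would leave the cluster"
#   fi
```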




&lt;h2&gt;
  
  
  What This Proves After the Failures
&lt;/h2&gt;

&lt;p&gt;After working through the above failures, the inference baseline worked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservice sklearn-iris
NAME           URL                                       READY   AGE
sklearn-iris   http://sklearn-iris.default.example.com   True    45m

&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[5.1,3.5,1.4,0.2],[6.2,3.4,5.4,2.3]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://127.0.0.1:18080/v1/models/sklearn-iris:predict

&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"predictions"&lt;/span&gt;:[0,2]&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Istio/Kourier mismatch is the canonical example of why "default configuration" is dangerous in complex systems. KServe's default assumes a specific network topology that is not disclosed in the getting-started docs. Recognizing this class of failure — configuration that works in the tool author's environment but not yours — is a senior platform engineering competency.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Setup Does NOT Solve (Known Tradeoffs)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Istio service mesh:&lt;/strong&gt; No mTLS between services, no advanced traffic management. Acceptable for local dev; requires a replacement security layer in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kube-rbac-proxy removed:&lt;/strong&gt; Prometheus metrics from the KServe controller are unavailable. Re-add this sidecar from a working registry before any production deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port-forward for inference:&lt;/strong&gt; The Host-header workaround is local only. Cloud deployment requires a real ingress with DNS and TLS. On EKS, swap Kourier for an ALB and set &lt;code&gt;ingressDomain&lt;/code&gt; to your real domain. See the &lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/CLOUD_PROMOTION_GUIDE.md" rel="noopener noreferrer"&gt;Cloud Promotion Guide&lt;/a&gt; in the repository.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Debugging Commands Reference
&lt;/h2&gt;

&lt;p&gt;Run these in order when an InferenceService will not become Ready.&lt;/p&gt;

&lt;h3&gt;
  
  
  1 — InferenceService Conditions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default describe inferenceservice sklearn-iris
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve logs deploy/kserve-controller-manager &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve logs deploy/kserve-controller-manager &lt;span class="nt"&gt;-c&lt;/span&gt; manager &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2 — Webhook Endpoint Availability
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get endpoints kserve-webhook-server-service
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve describe endpoints kserve-webhook-server-service
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get ksvc
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get route
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3 — ConfigMap and Pod Status
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get configmap inferenceservice-config &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get pods &lt;span class="nt"&gt;-o&lt;/span&gt; wide
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve describe pod &amp;lt;pod-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The One Thing to Remember
&lt;/h2&gt;

&lt;p&gt;KServe's default configuration assumes Istio is installed. This assumption is not prominently stated in the getting-started documentation. Anyone running the default configuration on k3d, k3s, GKE Autopilot, or any other cluster without Istio is likely to hit the same ReconcileError and see error messages referencing "virtual services" (an Istio concept) with no obvious resolution path.&lt;/p&gt;

&lt;p&gt;The fix is one ConfigMap patch. It takes 30 seconds to apply. Finding it took three hours.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kube-rbac-proxy&lt;/code&gt; 403 from gcr.io is an external dependency failure that silently takes down your admission webhook cluster-wide. A Kustomize strategic merge patch with &lt;code&gt;$patch: delete&lt;/code&gt; is the fastest recovery path when no alternative registry image is available.&lt;/p&gt;

&lt;p&gt;Full platform source — all six Reality Check documents, Backstage Golden Path, Kyverno policy enforcement, cost attribution, and a Cloud Promotion Guide to EKS/GKE: &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;Check out the full NeuroScale repo here.&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  See Also
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/serving-stack/patches/inferenceservice-config-ingress.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/serving-stack/patches/inferenceservice-config-ingress.yaml&lt;/code&gt;&lt;/a&gt; — Kourier config patch&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/serving-stack/patches/kserve-controller-kube-rbac-proxy-image.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/serving-stack/patches/kserve-controller-kube-rbac-proxy-image.yaml&lt;/code&gt;&lt;/a&gt; — sidecar removal patch&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/kserve/sklearn-runtime.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/kserve/sklearn-runtime.yaml&lt;/code&gt;&lt;/a&gt; — ClusterServingRuntime definition&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/CLOUD_PROMOTION_GUIDE.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/CLOUD_PROMOTION_GUIDE.md&lt;/code&gt;&lt;/a&gt; — how to replace Kourier with ALB/NGINX on EKS/GKE&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_3_GOLDEN_PATH.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_3_GOLDEN_PATH.md&lt;/code&gt;&lt;/a&gt; — nine Backstage failures documented at the same depth&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md&lt;/code&gt;&lt;/a&gt; — how kyverno-cli exits 0 on violations and why &lt;code&gt;$PIPESTATUS[0]&lt;/code&gt; matters&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Jimoh Sodiq Bolaji&lt;/strong&gt; | Platform Engineer | Technical Content Engineer | Abuja, Nigeria | &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;NeuroScale Platform&lt;/a&gt;&lt;/p&gt;




</description>
      <category>kubernetes</category>
      <category>kserve</category>
      <category>mlops</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
