<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sodiq Jimoh</title>
    <description>The latest articles on Forem by Sodiq Jimoh (@sodiqjimoh).</description>
    <link>https://forem.com/sodiqjimoh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3850139%2Fd2edc9dc-4ca6-4299-8708-8fb3c454bd56.jpg</url>
      <title>Forem: Sodiq Jimoh</title>
      <link>https://forem.com/sodiqjimoh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sodiqjimoh"/>
    <language>en</language>
    <item>
      <title>Deploying Backstage on Kubernetes with the Helm Chart: The Infrastructure-First Guide</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Mon, 06 Apr 2026 02:02:26 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/deploying-backstage-on-kubernetes-with-the-helm-chart-the-infrastructure-first-guide-mf3</link>
      <guid>https://forem.com/sodiqjimoh/deploying-backstage-on-kubernetes-with-the-helm-chart-the-infrastructure-first-guide-mf3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for:&lt;/strong&gt; Engineers deploying Backstage on Kubernetes via the&lt;br&gt;
official Helm chart who want a working portal, not just a running pod.&lt;br&gt;
This guide starts where most tutorials end — after &lt;code&gt;helm install&lt;/code&gt; succeeds&lt;br&gt;
but before anything actually works.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A few weeks ago I published an article called&lt;br&gt;
&lt;a href="https://dev.to/sodiqjimoh/nine-ways-backstage-breaks-before-your-developer-portal-works-4eo1"&gt;"Nine Ways Backstage Breaks Before Your Developer Portal Works"&lt;/a&gt;.&lt;br&gt;
A Backstage maintainer read it and gave me structured feedback. The core of&lt;br&gt;
it was this: several of the failures I documented were caused by not&lt;br&gt;
following the official getting-started documentation before using the Helm&lt;br&gt;
chart, and by using the demo image as if it were a production-ready base.&lt;/p&gt;

&lt;p&gt;They were right. This article is the follow-up they suggested — and the one&lt;br&gt;
I should have written first.&lt;/p&gt;

&lt;p&gt;It does not repeat the previous article. It starts earlier, goes deeper on&lt;br&gt;
Helm-specific configuration, and correctly attributes failures to their&lt;br&gt;
actual causes rather than blaming Backstage for things that are ArgoCD,&lt;br&gt;
Traefik, or operator error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Official resources you should read alongside this guide:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://backstage.io/docs/getting-started/" rel="noopener noreferrer"&gt;Backstage getting started documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/backstage/charts/blob/main/charts/backstage/README.md" rel="noopener noreferrer"&gt;Backstage Helm chart README&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/backstage/charts?tab=readme-ov-file#backstage-helm-chart" rel="noopener noreferrer"&gt;Backstage Helm chart disclaimer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://backstage.io/docs/features/software-catalog/configuration#catalog-rules" rel="noopener noreferrer"&gt;Catalog rules documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://backstage.io/docs/features/software-templates/adding-templates" rel="noopener noreferrer"&gt;Adding templates documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Project repo referenced throughout:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;The one thing you must understand before installing the Helm chart&lt;/h2&gt;

&lt;p&gt;The Backstage Helm chart uses a demo image by default. The chart README&lt;br&gt;
contains this explicit warning:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Backstage chart is not an official Backstage project and is not&lt;br&gt;
supported by the Backstage core team. The default image used in this chart&lt;br&gt;
is for demo purposes only.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This single fact explains most of the configuration friction you will&lt;br&gt;
encounter. The demo image does not behave like a real Backstage application&lt;br&gt;
built with &lt;code&gt;npx @backstage/create-app&lt;/code&gt;. It has different startup characteristics,&lt;br&gt;
different configuration defaults, and different failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this means practically:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are building a real developer portal — not just running a demo — you&lt;br&gt;
should follow the &lt;a href="https://backstage.io/docs/getting-started/" rel="noopener noreferrer"&gt;official getting started guide&lt;/a&gt;&lt;br&gt;
to create your own Backstage application first, build a custom Docker image&lt;br&gt;
from it, and then use the Helm chart to deploy that image. The chart's&lt;br&gt;
&lt;code&gt;image.repository&lt;/code&gt; and &lt;code&gt;image.tag&lt;/code&gt; values are where you point to your&lt;br&gt;
own image.&lt;/p&gt;
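&lt;p&gt;A minimal sketch of what that override looks like, using the nesting convention this guide follows; the registry path and tag below are placeholders for your own image, and &lt;code&gt;pullPolicy&lt;/code&gt; is an optional extra:&lt;/p&gt;

```yaml
# Hypothetical values fragment: point the chart at a custom image
# built from your own Backstage app instead of the demo image.
backstage:
  backstage:
    image:
      repository: ghcr.io/your-org/your-backstage-app   # placeholder
      tag: "1.0.0"                                      # placeholder
      pullPolicy: IfNotPresent
```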

&lt;p&gt;If you are experimenting, learning, or building an integration platform&lt;br&gt;
where Backstage is one component (as in the NeuroScale project), the demo&lt;br&gt;
image path is workable — but you need to understand its limitations and&lt;br&gt;
configure it correctly.&lt;/p&gt;

&lt;p&gt;This guide covers the Helm chart path specifically, with the official docs&lt;br&gt;
as the reference point throughout.&lt;/p&gt;
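&lt;p&gt;For reference, the install itself follows the standard chart flow. The repository URL is the one published in the chart README; the release name and values path mirror the NeuroScale examples used later in this guide:&lt;/p&gt;

```shell
# Add the official chart repository and install with your own values file.
helm repo add backstage https://backstage.github.io/charts
helm repo update
helm install neuroscale-backstage backstage/backstage \
  -f infrastructure/backstage/values.yaml \
  --namespace backstage --create-namespace
```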


&lt;h2&gt;The values hierarchy that breaks everything silently&lt;/h2&gt;

&lt;p&gt;This is the most important configuration concept in the entire Helm chart.&lt;br&gt;
Get this wrong and every override you write will be silently ignored.&lt;/p&gt;

&lt;p&gt;The Backstage Helm chart is a &lt;strong&gt;wrapper chart&lt;/strong&gt; — Backstage itself is a&lt;br&gt;
dependency inside it. The dependency is named &lt;code&gt;backstage&lt;/code&gt;. This means&lt;br&gt;
configuration for the Backstage application container must be nested under&lt;br&gt;
&lt;code&gt;backstage.backstage.*&lt;/code&gt;, not &lt;code&gt;backstage.*&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrong — values are silently ignored:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This looks correct but is placed at the wrong hierarchy level&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;My Platform&lt;/span&gt;
  &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Correct — values reach the Backstage container:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;           &lt;span class="c1"&gt;# &amp;lt;-- this second level is required&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;My Platform&lt;/span&gt;
    &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Helm chart processes the outer &lt;code&gt;backstage&lt;/code&gt; key as the dependency name.&lt;br&gt;
Values placed directly under &lt;code&gt;backstage.*&lt;/code&gt; are interpreted as chart-level&lt;br&gt;
configuration, not as container configuration. Kubernetes then uses chart&lt;br&gt;
defaults — including probe timings — rather than your overrides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to verify your values are actually applied:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Render the Helm chart before applying it and inspect the output Deployment&lt;br&gt;
spec directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm template neuroscale-backstage backstage/backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/backstage/values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 30 &lt;span class="s2"&gt;"startupProbe"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see &lt;code&gt;initialDelaySeconds: 120&lt;/code&gt; in the output, your probe override&lt;br&gt;
reached the container. If you see &lt;code&gt;initialDelaySeconds: 5&lt;/code&gt; or a very small&lt;br&gt;
number, your values are at the wrong nesting level.&lt;/p&gt;

&lt;p&gt;This verification step should be part of your CI pipeline. In the NeuroScale&lt;br&gt;
platform, &lt;code&gt;scripts/ci/render_backstage.sh&lt;/code&gt; runs this check on every PR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# scripts/ci/render_backstage.sh&lt;/span&gt;
helm template neuroscale-backstage backstage/backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/backstage/values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"initialDelaySeconds"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"120"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ERROR: startupProbe initialDelaySeconds not set correctly"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Helm values nesting verified"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
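&lt;p&gt;When the chart repository is not reachable (for example in a restricted CI runner), a cluster-free sanity check on the values file itself catches the most common nesting mistake. A minimal sketch, assuming the &lt;code&gt;backstage.backstage.appConfig&lt;/code&gt; layout from above; the generated sample file stands in for your real values file:&lt;/p&gt;

```shell
# Create a sample values file (stand-in for infrastructure/backstage/values.yaml).
printf 'backstage:\n  backstage:\n    appConfig:\n      app:\n        title: My Platform\n' > /tmp/values-check.yaml

# Pass only if appConfig sits under two levels of "backstage:",
# i.e. backstage.backstage.appConfig, the nesting the chart requires.
awk '
  /^backstage:/     { outer = 1 }
  /^  backstage:/   { if (outer) inner = 1 }
  /^    appConfig:/ { if (inner) { print "nesting OK"; exit } }
' /tmp/values-check.yaml
```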






&lt;h2&gt;Required configuration keys for the demo image&lt;/h2&gt;

&lt;p&gt;The demo image requires specific configuration keys to be present at&lt;br&gt;
startup. Missing any of them causes the frontend to crash on load with a&lt;br&gt;
JavaScript error that is only visible in browser developer tools — the page&lt;br&gt;
itself shows a blank white screen with no visible error.&lt;/p&gt;

&lt;p&gt;The minimum required &lt;code&gt;appConfig&lt;/code&gt; block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Your Platform Name&lt;/span&gt;    &lt;span class="c1"&gt;# required — crash if absent&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
        &lt;span class="na"&gt;cors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;origin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
        &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;better-sqlite3&lt;/span&gt;
          &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:memory:'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;baseUrl&lt;/code&gt; matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;app.baseUrl&lt;/code&gt; and &lt;code&gt;backend.baseUrl&lt;/code&gt; values must match the URL you are&lt;br&gt;
actually using to access Backstage. If you port-forward on port 7010 but&lt;br&gt;
the config says port 7007, the frontend React app loads but all API calls&lt;br&gt;
fail — the UI appears to work while the backend connection is broken.&lt;/p&gt;
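&lt;p&gt;The examples in this guide use host port 7010 mapped to the container's 7007. A sketch of the matching port-forward; the service name depends on your release name and is an assumption here, so list the services first:&lt;/p&gt;

```shell
# Confirm the actual service name created by your release.
kubectl get svc -n backstage

# Forward host port 7010 to service port 7007 so the configured
# baseUrl (http://localhost:7010) matches the URL you actually use.
kubectl port-forward svc/neuroscale-backstage -n backstage 7010:7007
```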

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;better-sqlite3&lt;/code&gt; for local deployments:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The demo image ships with SQLite support. For local Kubernetes deployments&lt;br&gt;
where you want zero external dependencies, the in-memory SQLite connection&lt;br&gt;
is sufficient. For production, replace this with a PostgreSQL connection&lt;br&gt;
pointing at a managed database service. The chart includes optional&lt;br&gt;
PostgreSQL deployment — see&lt;br&gt;
&lt;a href="https://github.com/backstage/charts/blob/main/charts/backstage/README.md" rel="noopener noreferrer"&gt;the chart's database configuration docs&lt;/a&gt;.&lt;/p&gt;


&lt;h2&gt;Probe timings: the demo image starts slowly&lt;/h2&gt;

&lt;p&gt;Backstage is a Node.js application. The demo image takes approximately 60&lt;br&gt;
to 90 seconds to complete startup on a typical Kubernetes node. The&lt;br&gt;
default probe timings assume an application that is ready within a few&lt;br&gt;
seconds. The result is&lt;br&gt;
predictable: the startup probe fires before the application is ready, the&lt;br&gt;
pod fails the probe, Kubernetes kills it, and the pod enters&lt;br&gt;
&lt;code&gt;CrashLoopBackOff&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is not a Backstage bug. It is a configuration requirement that the&lt;br&gt;
Helm chart does not prominently surface. The correct probe settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;    &lt;span class="c1"&gt;# give Node.js time to start&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;        &lt;span class="c1"&gt;# 30 × 10s = 5 minutes maximum wait&lt;/span&gt;
    &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;    &lt;span class="c1"&gt;# only check liveness after 5 minutes&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How to diagnose probe failures:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Watch pod status in real time&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="nt"&gt;-w&lt;/span&gt;

&lt;span class="c"&gt;# When you see CrashLoopBackOff, describe the pod&lt;/span&gt;
kubectl describe pod &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# Look for this in Events:&lt;/span&gt;
&lt;span class="c"&gt;# Warning  Unhealthy  kubelet  Startup probe failed: connection refused&lt;/span&gt;

&lt;span class="c"&gt;# Check logs from the previous container instance&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;--previous&lt;/span&gt; &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see &lt;code&gt;Startup probe failed: connection refused&lt;/code&gt; in events but the&lt;br&gt;
previous container logs show normal Node.js startup messages, the&lt;br&gt;
application is starting correctly — the probe is just firing too early.&lt;br&gt;
Increase &lt;code&gt;initialDelaySeconds&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A full incident postmortem for this specific failure, including the exact&lt;br&gt;
Kubernetes events and the Helm values diff before and after the fix, is in&lt;br&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md" rel="noopener noreferrer"&gt;infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md&lt;/a&gt;.&lt;/p&gt;


&lt;h2&gt;Authentication: local dev vs production&lt;/h2&gt;

&lt;p&gt;The Backstage new backend system (the default in current 1.x releases) includes&lt;br&gt;
an internal authentication policy that requires all service-to-service calls&lt;br&gt;
to include a valid Backstage token. This affects how the scaffolder frontend&lt;br&gt;
talks to the scaffolder backend — a call that was unauthenticated in older&lt;br&gt;
versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For local development only&lt;/strong&gt;, the quickest fix is to use the guest auth&lt;br&gt;
provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;guest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;dangerouslyAllowOutsideDevelopment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps the auth subsystem active and provides a real&lt;br&gt;
&lt;code&gt;user:default/guest&lt;/code&gt; identity to all plugins — which is safer than&lt;br&gt;
disabling auth entirely with &lt;code&gt;dangerouslyDisableDefaultAuthPolicy: true&lt;/code&gt;.&lt;br&gt;
Plugins that assume a user context will behave correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For production&lt;/strong&gt;, use the GitHub OAuth provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values-prod.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
        &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;production&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;clientId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_CLIENT_ID}&lt;/span&gt;
              &lt;span class="na"&gt;clientSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_CLIENT_SECRET}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store &lt;code&gt;GITHUB_CLIENT_ID&lt;/code&gt; and &lt;code&gt;GITHUB_CLIENT_SECRET&lt;/code&gt; as Kubernetes secrets,&lt;br&gt;
not in &lt;code&gt;values.yaml&lt;/code&gt;. The Helm chart's &lt;code&gt;extraEnvVarsSecrets&lt;/code&gt; field handles&lt;br&gt;
this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;extraEnvVarsSecrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backstage-secrets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create the secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret generic backstage-secrets &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;GITHUB_CLIENT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-client-id"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;GITHUB_CLIENT_SECRET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-client-secret"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How to verify auth is configured correctly:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check the scaffolder actions API directly&lt;/span&gt;
curl http://localhost:7010/api/scaffolder/v2/actions

&lt;span class="c"&gt;# If you get 401: auth is not configured for your environment&lt;/span&gt;
&lt;span class="c"&gt;# If you get 200 with a JSON list of actions: auth is working&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get a 401 with &lt;code&gt;{"error":{"name":"AuthenticationError","message":"Missing credentials"}}&lt;/code&gt;,&lt;br&gt;
the scaffolder form will load but render blank — the page returns HTTP 200&lt;br&gt;
but has no data to display, so the underlying error is only visible in&lt;br&gt;
browser developer tools.&lt;/p&gt;


&lt;h2&gt;Catalog configuration: registering templates&lt;/h2&gt;

&lt;p&gt;The Backstage catalog applies security rules to what entity kinds are&lt;br&gt;
accepted from each registered location. The default allow list for&lt;br&gt;
repository-based locations does not include &lt;code&gt;Template&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is documented in&lt;br&gt;
&lt;a href="https://backstage.io/docs/features/software-catalog/configuration#catalog-rules" rel="noopener noreferrer"&gt;the catalog rules documentation&lt;/a&gt;&lt;br&gt;
and the&lt;br&gt;
&lt;a href="https://backstage.io/docs/features/software-templates/adding-templates" rel="noopener noreferrer"&gt;adding templates documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The registration pattern that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;catalog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;locations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;
            &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/your-org/your-repo/blob/main/backstage/templates/your-template/template.yaml&lt;/span&gt;
            &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without the &lt;code&gt;rules: - allow: [Template]&lt;/code&gt; block, the entity is silently&lt;br&gt;
rejected at ingestion time. The only signal is a warning in the Backstage&lt;br&gt;
server logs — nothing appears in the UI.&lt;/p&gt;
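&lt;p&gt;An alternative covered in the catalog rules documentation is to widen the global allow list instead of adding per-location rules. A sketch using the same nesting as above; note that the global &lt;code&gt;catalog.rules&lt;/code&gt; list replaces the defaults, so the default kinds must be listed explicitly alongside &lt;code&gt;Template&lt;/code&gt;, and this applies to every registered location:&lt;/p&gt;

```yaml
backstage:
  backstage:
    appConfig:
      catalog:
        rules:
          - allow: [Component, System, API, Resource, Location, Template]
```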

&lt;p&gt;&lt;strong&gt;How to diagnose catalog ingestion failures:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100 &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"warn&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;error&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;forbidden&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;NotAllowedError"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for &lt;code&gt;NotAllowedError: Forbidden: entity of kind Template is not&lt;br&gt;
allowed from that location&lt;/code&gt;. If you see this, your rules block is missing&lt;br&gt;
or at the wrong YAML nesting level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After updating the config, restart Backstage to re-ingest:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout restart deploy/backstage &lt;span class="nt"&gt;-n&lt;/span&gt; backstage
kubectl rollout status deploy/backstage &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;300s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The template should appear in &lt;code&gt;/create&lt;/code&gt; within 60 seconds of the pod&lt;br&gt;
becoming ready.&lt;/p&gt;

&lt;p&gt;You can validate your &lt;code&gt;app-config.yaml&lt;/code&gt; structure using the Backstage CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @backstage/cli config:check &lt;span class="nt"&gt;--config&lt;/span&gt; app-config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;GitHub integration: the token secret&lt;/h2&gt;

&lt;p&gt;The scaffolder requires a GitHub token to open pull requests. The token&lt;br&gt;
must be present as an environment variable in the running Backstage pod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;integrations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.com&lt;/span&gt;
            &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_TOKEN}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store the token as a Kubernetes secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create or update the secret&lt;/span&gt;
kubectl create secret generic backstage-github-token &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ghp_your_token_here"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;client &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; -

&lt;span class="c"&gt;# Restart to reload the environment variable&lt;/span&gt;
kubectl rollout restart deploy/backstage &lt;span class="nt"&gt;-n&lt;/span&gt; backstage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt; environment variables from Kubernetes secrets are injected at&lt;br&gt;
pod start time. Updating the secret does not update the running pod. You&lt;br&gt;
must restart the deployment after updating the secret for the new value&lt;br&gt;
to take effect.&lt;/p&gt;
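
&lt;p&gt;A quick way to confirm the secret itself holds the new value, independent of&lt;br&gt;
what the pod sees, is to decode it and measure its length. These are standard&lt;br&gt;
kubectl and shell tools; nothing here prints the token value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Length of the value currently stored in the secret
kubectl get secret backstage-github-token -n backstage \
  -o jsonpath='{.data.GITHUB_TOKEN}' | base64 -d | wc -c

# If this differs from the length reported inside the pod,
# the pod is still running with the old value; restart it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;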

&lt;p&gt;&lt;strong&gt;How to verify the token is present without exposing the value:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check character length — a valid GitHub token is 40+ characters&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo "Token length: ${#GITHUB_TOKEN}"'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this returns &lt;code&gt;Token length: 0&lt;/code&gt; or &lt;code&gt;Token length: 17&lt;/code&gt; (the length of a&lt;br&gt;
placeholder like &lt;code&gt;&amp;lt;YOUR_TOKEN_HERE&amp;gt;&lt;/code&gt;), the secret was not updated correctly&lt;br&gt;
or the pod was not restarted after the update.&lt;/p&gt;


&lt;h2&gt;
  
  
  A working minimal values.yaml for local development
&lt;/h2&gt;

&lt;p&gt;This is the minimum configuration that produces a functioning Backstage&lt;br&gt;
portal on a local Kubernetes cluster with the demo image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io&lt;/span&gt;
      &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage/backstage&lt;/span&gt;
      &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latest&lt;/span&gt;           &lt;span class="c1"&gt;# pin to a specific version in production&lt;/span&gt;

    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Your Platform&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;

      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
        &lt;span class="na"&gt;cors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;origin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
        &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;better-sqlite3&lt;/span&gt;
          &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:memory:'&lt;/span&gt;

      &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;guest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;dangerouslyAllowOutsideDevelopment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

      &lt;span class="na"&gt;integrations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.com&lt;/span&gt;
            &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_TOKEN}&lt;/span&gt;

      &lt;span class="na"&gt;catalog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;locations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;
            &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/your-org/your-repo/blob/main/backstage/templates/your-template/template.yaml&lt;/span&gt;
            &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

    &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

    &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

    &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthcheck&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7007&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;

    &lt;span class="na"&gt;extraEnvVarsSecrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backstage-github-token&lt;/span&gt;

  &lt;span class="na"&gt;postgresql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;    &lt;span class="c1"&gt;# using in-memory SQLite for local dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Deploying and verifying
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add backstage https://backstage.github.io/charts
helm repo update

kubectl create namespace backstage

&lt;span class="c"&gt;# Create the GitHub token secret first&lt;/span&gt;
kubectl create secret generic backstage-github-token &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-token"&lt;/span&gt;

&lt;span class="c"&gt;# Install&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;backstage backstage/backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/backstage/values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Watch the startup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expect the pod to sit at &lt;code&gt;0/1 Running&lt;/code&gt; for 60–120 seconds while Node.js&lt;br&gt;
starts. This is not a failure: the startup probe simply will not pass until the&lt;br&gt;
application is ready.&lt;/p&gt;
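
&lt;p&gt;If you want evidence that the wait is the startup probe and not a crash, watch&lt;br&gt;
the namespace events while the pod sits at &lt;code&gt;0/1&lt;/code&gt;. These are standard kubectl&lt;br&gt;
commands; the label selector assumes the chart's default labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Probe failures show up as Unhealthy events; a crash shows Back-off instead
kubectl get events -n backstage --sort-by=.lastTimestamp | tail -20

# Confirm which probe is still failing
kubectl describe pod -n backstage -l app.kubernetes.io/name=backstage | grep -A 5 "Events:"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;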

&lt;p&gt;&lt;strong&gt;Access the portal:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage port-forward svc/backstage 7010:7007
&lt;span class="c"&gt;# Open: http://localhost:7010&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify the backend is responding:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:7010/healthcheck
&lt;span class="c"&gt;# Expected: {"status":"ok"}&lt;/span&gt;

curl http://localhost:7010/api/scaffolder/v2/actions
&lt;span class="c"&gt;# Expected: JSON list of available scaffolder actions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify catalog ingestion:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50 &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"processed&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;warn&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;error"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for &lt;code&gt;Processed N entities&lt;/code&gt; with no &lt;code&gt;NotAllowedError&lt;/code&gt; lines.&lt;/p&gt;




&lt;h2&gt;
  
  
  The production values profile
&lt;/h2&gt;

&lt;p&gt;Keep your dev and prod configuration in two separate files. The differences&lt;br&gt;
are significant enough that sharing a single file invites dangerous defaults&lt;br&gt;
into production: a &lt;code&gt;latest&lt;/code&gt; image tag, an in-memory database, guest authentication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values-prod.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io&lt;/span&gt;
      &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your-org/your-backstage-app&lt;/span&gt;   &lt;span class="c1"&gt;# your own image&lt;/span&gt;
      &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.2.3"&lt;/span&gt;                               &lt;span class="c1"&gt;# pinned, never latest&lt;/span&gt;

    &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://backstage.your-domain.com&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://backstage.your-domain.com&lt;/span&gt;
        &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pg&lt;/span&gt;
          &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_HOST}&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;
            &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_USER}&lt;/span&gt;
            &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_PASSWORD}&lt;/span&gt;
            &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage&lt;/span&gt;

      &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
        &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;production&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;clientId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_CLIENT_ID}&lt;/span&gt;
              &lt;span class="na"&gt;clientSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_CLIENT_SECRET}&lt;/span&gt;

    &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;     &lt;span class="c1"&gt;# your own image starts faster&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;18&lt;/span&gt;        &lt;span class="c1"&gt;# 3 minutes maximum&lt;/span&gt;

    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;200m&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply both files together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade backstage backstage/backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/backstage/values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/backstage/values-prod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prod values file overrides only what it specifies. Everything else&lt;br&gt;
comes from the base &lt;code&gt;values.yaml&lt;/code&gt;.&lt;/p&gt;
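
&lt;p&gt;To see exactly what the layered files produced, ask Helm for the values it&lt;br&gt;
actually deployed. Both are standard Helm commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# User-supplied values after both -f files were merged
helm get values backstage -n backstage

# Merged with the chart's own defaults
helm get values backstage -n backstage --all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;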




&lt;h2&gt;
  
  
  Diagnostic command reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pod status and events&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; backstage
kubectl describe pod &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# Application logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--previous&lt;/span&gt; &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100

&lt;span class="c"&gt;# Catalog ingestion errors&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200 &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"warn&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;error&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;forbidden"&lt;/span&gt;

&lt;span class="c"&gt;# Verify rendered Helm values&lt;/span&gt;
helm template backstage backstage/backstage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; infrastructure/backstage/values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; backstage &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 5 &lt;span class="s2"&gt;"startupProbe"&lt;/span&gt;

&lt;span class="c"&gt;# Verify token is loaded in the running container&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; backstage deploy/backstage &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo "GITHUB_TOKEN length: ${#GITHUB_TOKEN}"'&lt;/span&gt;

&lt;span class="c"&gt;# Health check endpoints&lt;/span&gt;
curl http://localhost:7010/healthcheck
curl http://localhost:7010/api/catalog/entities?limit&lt;span class="o"&gt;=&lt;/span&gt;1
curl http://localhost:7010/api/scaffolder/v2/actions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What this guide does not cover
&lt;/h2&gt;

&lt;p&gt;This guide covers the Helm chart deployment path specifically. It does not&lt;br&gt;
cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Building your own Backstage application&lt;/strong&gt; — start with the
&lt;a href="https://backstage.io/docs/getting-started/" rel="noopener noreferrer"&gt;official getting started guide&lt;/a&gt;
and &lt;code&gt;npx @backstage/create-app&lt;/code&gt; for a real production portal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing custom plugins&lt;/strong&gt; — see
&lt;a href="https://backstage.io/docs/plugins/" rel="noopener noreferrer"&gt;plugin development docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TechDocs integration&lt;/strong&gt; — covered separately in
&lt;a href="https://backstage.io/docs/features/techdocs/" rel="noopener noreferrer"&gt;the TechDocs docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production ingress and TLS&lt;/strong&gt; — specific to your cloud provider and
ingress controller&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  See Also
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://backstage.io/docs/getting-started/" rel="noopener noreferrer"&gt;Official Backstage getting started&lt;/a&gt; — start here before using the Helm chart&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/backstage/charts" rel="noopener noreferrer"&gt;Backstage Helm chart source&lt;/a&gt; — the canonical reference for all chart configuration&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://backstage.io/docs/features/software-catalog/configuration#catalog-rules" rel="noopener noreferrer"&gt;Catalog rules documentation&lt;/a&gt; — required reading before registering templates&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/backstage/values.yaml" rel="noopener noreferrer"&gt;infrastructure/backstage/values.yaml&lt;/a&gt; — working dev configuration from the NeuroScale platform&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/backstage/values-prod.yaml" rel="noopener noreferrer"&gt;infrastructure/backstage/values-prod.yaml&lt;/a&gt; — production profile with GitHub OAuth&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md" rel="noopener noreferrer"&gt;infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md&lt;/a&gt; — full postmortem for the CrashLoopBackOff probe failure&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Jimoh Sodiq Bolaji&lt;/strong&gt; | Platform Engineer | Technical Content Engineer&lt;br&gt;
| Abuja, Nigeria&lt;br&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;NeuroScale Platform&lt;/a&gt;&lt;br&gt;
· &lt;a href="https://dev.to/sodiqjimoh"&gt;Dev.to&lt;/a&gt;&lt;/p&gt;

</description>
      <category>backstage</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>9 Failures That Hit Me Building a Backstage Golden Path for KServe — Every Error, Every Fix</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Mon, 30 Mar 2026 23:12:36 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/nine-ways-backstage-breaks-before-your-developer-portal-works-4eo1</link>
      <guid>https://forem.com/sodiqjimoh/nine-ways-backstage-breaks-before-your-developer-portal-works-4eo1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Edit (Apr 2026):&lt;/strong&gt; Updated title and added framing context based on community feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Series context:&lt;/strong&gt; This is Part 3 of building a production-hardened AI inference platform.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Part 1: &lt;a href="https://dev.to/sodiqjimoh/why-your-kserve-inferenceservice-wontbecome-readyfour-production-failures-and-fixes-nei"&gt;Why Your KServe InferenceService Won't Become Ready&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Part 2: 5 GitOps Failure Modes That Break KServe Deployments
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Project repo:&lt;/strong&gt; &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;If you have ever deployed Backstage and stared at a blank &lt;code&gt;/create&lt;/code&gt; page wondering what went wrong, this article is for you.&lt;/p&gt;

&lt;p&gt;Most Backstage tutorials end at "the portal is running." This one starts there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One important framing note before we begin:&lt;/strong&gt; this article documents the path starting from the official Backstage Helm chart, not from &lt;code&gt;npx @backstage/create-app&lt;/code&gt;. If you're building a real Backstage application from source, some of these failures won't apply. But if you do what a lot of platform engineers do and reach for &lt;code&gt;helm install&lt;/code&gt; first, every single one of these will.&lt;/p&gt;

&lt;p&gt;This is a complete production failure log from implementing a Backstage Golden Path that deploys KServe model inference endpoints on Kubernetes. Nine distinct failures. Every one with exact error output, root cause, and the fix that worked.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The goal:&lt;/strong&gt; a developer fills a Backstage form, a GitHub PR opens, the PR merges, ArgoCD deploys a KServe InferenceService, and the endpoint responds to predictions.&lt;br&gt;
Getting there took nine failures across three days.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What I was trying to build
&lt;/h2&gt;

&lt;p&gt;The Golden Path demo contract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Backstage form → PR opened → merge → ArgoCD sync → InferenceService Ready=True → curl returns {"predictions":[1,1]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backstage (Helm chart, self-hosted on k3d)&lt;/li&gt;
&lt;li&gt;ArgoCD (GitOps reconciliation)&lt;/li&gt;
&lt;li&gt;KServe (model inference endpoints)&lt;/li&gt;
&lt;li&gt;GitHub (scaffolder target)&lt;/li&gt;
&lt;li&gt;Kyverno (admission policies)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Failure 1: Template Not Visible in Catalog — Silent Rejection With No UI Error
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 30 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After adding the template file and registering it in &lt;code&gt;infrastructure/backstage/values.yaml&lt;/code&gt;, the template did not appear in Backstage's &lt;code&gt;/create&lt;/code&gt; page. No error was visible in the UI. The page simply showed an empty catalog.&lt;/p&gt;

&lt;h3&gt;
  
  
  Digging In
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage logs deploy/neuroscale-backstage &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50
...
&lt;span class="o"&gt;[&lt;/span&gt;backstage] warn  Failed to process location
  &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"location"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"type"&lt;/span&gt;:&lt;span class="s2"&gt;"url"&lt;/span&gt;,&lt;span class="s2"&gt;"target"&lt;/span&gt;:&lt;span class="s2"&gt;"https://github.com/sodiq-code/
  neuroscale-platform/blob/main/backstage/templates/model-endpoint/
  template.yaml"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"error"&lt;/span&gt;:&lt;span class="s2"&gt;"NotAllowedError: Forbidden: entity of kind Template
  is not allowed from that location"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error only appears in server logs. The UI shows nothing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;Backstage's catalog configuration allows only specific entity kinds from each registered location. The default allow list for repository-based locations does not include &lt;code&gt;Template&lt;/code&gt;. Without an explicit &lt;code&gt;allow: [Template]&lt;/code&gt; rule, entities of kind &lt;code&gt;Template&lt;/code&gt; are silently rejected. This is security-by-default behavior — but the complete silence in the UI makes it look like a misconfiguration rather than a permission issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;catalog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;locations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;
            &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/sodiq-code/neuroscale-platform/blob/main/backstage/templates/model-endpoint/template.yaml&lt;/span&gt;
            &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After rolling out the updated Backstage deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage rollout restart deploy/neuroscale-backstage
&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage rollout status deploy/neuroscale-backstage &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;300s
deployment &lt;span class="s2"&gt;"neuroscale-backstage"&lt;/span&gt; successfully rolled out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The template appeared in &lt;code&gt;/create&lt;/code&gt; within 60 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;For a platform team deploying Backstage for internal users, this silent failure means developers see an empty template catalog and assume the platform is broken — not that a config rule is missing. Always check server logs, not just the UI, when Backstage catalog ingestion seems to fail.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 2: Scaffolder /create Page Loads Blank — 401 on Actions API
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 45 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After the template was visible, clicking into it showed a blank form. The browser developer console revealed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /api/scaffolder/v2/actions &lt;/span&gt;&lt;span class="k"&gt;HTTP&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt; &lt;span class="m"&gt;401&lt;/span&gt; &lt;span class="ne"&gt;Unauthorized&lt;/span&gt;
&lt;span class="na"&gt;{"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;{"name":"AuthenticationError","message":"Missing credentials"}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The page route returned HTTP 200 — the React app loaded — but the actions API returned 401, so the form had no data to render.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;Backstage's new backend architecture (introduced in 1.x) adds an internal authentication policy requiring all service-to-service calls to include a valid Backstage token. The scaffolder frontend makes an internal API call to list available actions. Because no auth provider was configured for local development, this internal call was rejected. This is a breaking change from older Backstage versions where the actions endpoint was unauthenticated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;dangerouslyDisableDefaultAuthPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Production note:&lt;/strong&gt; &lt;code&gt;dangerouslyDisableDefaultAuthPolicy: true&lt;/code&gt; is acceptable for local development only. For production, configure a real sign-in provider such as GitHub OAuth via &lt;code&gt;values-prod.yaml&lt;/code&gt;. Until that is wired up, this project's production profile uses &lt;code&gt;auth.providers.guest.dangerouslyAllowOutsideDevelopment: true&lt;/code&gt; instead — which keeps the auth subsystem active and provides a real &lt;code&gt;user:default/guest&lt;/code&gt; identity, rather than disabling auth entirely.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;An empty scaffolder form is indistinguishable from a misconfigured form to an end user. The 401 error is only visible in browser developer tools. This is the second failure in this series that generated zero visible error in the UI.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 3: Frontend Crashes With Blank White Screen — Missing Required Config Key
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 20 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After the auth policy fix, reloading Backstage showed a blank white screen. The browser console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Uncaught&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Missing&lt;/span&gt; &lt;span class="nx"&gt;required&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="nx"&gt;at&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;app.title&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;app&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="nx"&gt;at&lt;/span&gt; &lt;span class="nf"&gt;validateConfigSchema &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;esm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;js&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;234&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;at&lt;/span&gt; &lt;span class="nx"&gt;BackstageApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;render &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;esm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;js&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;891&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The Backstage frontend requires &lt;code&gt;app.title&lt;/code&gt; to be present in the runtime configuration. The key was absent from the &lt;code&gt;appConfig&lt;/code&gt; section of &lt;code&gt;values.yaml&lt;/code&gt;, so the React application crashed during initialization, before any content could render. The key is required on first boot, but the documentation does not flag it prominently as such.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NeuroScale Platform&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
        &lt;span class="na"&gt;cors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;origin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:7010&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: &lt;code&gt;app.baseUrl&lt;/code&gt; and &lt;code&gt;backend.baseUrl&lt;/code&gt; also needed to match the port used for port-forwarding (7010), not the default 7007.&lt;/p&gt;
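&lt;p&gt;A mismatch between these three URLs produces its own class of silent CORS and fetch failures, so the consistency is worth a mechanical check. A minimal sketch that greps the ports out of the config — the inline &lt;code&gt;CONFIG&lt;/code&gt; variable stands in for your real &lt;code&gt;values.yaml&lt;/code&gt;:&lt;/p&gt;

```shell
# CONFIG simulates the appConfig section of values.yaml; in practice,
# read the real file instead of an inline variable.
CONFIG='app:
  baseUrl: http://localhost:7010
backend:
  baseUrl: http://localhost:7010
  cors:
    origin: http://localhost:7010'

# Every baseUrl/origin must use the same host:port you port-forward to.
PORTS=$(printf '%s\n' "$CONFIG" | grep -o 'localhost:[0-9]*' | sort -u)
COUNT=$(printf '%s\n' "$PORTS" | wc -l)
if [ "$COUNT" -eq 1 ]; then
  echo "baseUrl ports consistent: $PORTS"
else
  echo "baseUrl ports mismatched:"
  printf '%s\n' "$PORTS"
fi
```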

&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;A blank white screen with no network errors means the JavaScript runtime crashed before rendering. Always check the browser console — not just network requests — for Backstage frontend failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 4: Backstage CrashLoopBackOff — Helm Dependency Values Mis-Nesting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 2 hours | &lt;strong&gt;Impact:&lt;/strong&gt; Developer portal completely unavailable&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="nt"&gt;-w&lt;/span&gt;
NAME                                   READY   STATUS             RESTARTS   AGE
neuroscale-backstage-7d9f5b8c4-xqr2m   0/1     CrashLoopBackOff   8          12m

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl describe pod neuroscale-backstage-7d9f5b8c4-xqr2m &lt;span class="nt"&gt;-n&lt;/span&gt; backstage
...
Events:
  Warning  Unhealthy  30s  kubelet
    Startup probe failed: connect: connection refused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The Backstage Helm chart is a wrapper chart with &lt;code&gt;backstage&lt;/code&gt; as a dependency. Configuration for the Backstage container itself must be nested under &lt;code&gt;backstage.backstage.*&lt;/code&gt;, not &lt;code&gt;backstage.*&lt;/code&gt;. The misconfiguration meant probe settings and resource requests were silently ignored, so Kubernetes used default probe timings — a 2-second initial delay — that were far too aggressive for Backstage's ~90-second startup time.&lt;/p&gt;

&lt;p&gt;The pod was killed before it could become healthy, triggering CrashLoopBackOff.&lt;/p&gt;

&lt;p&gt;Backstage requires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default gives it 2 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Correct the values hierarchy and harden probe timings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/backstage/values.yaml&lt;/span&gt;
&lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backstage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;           &lt;span class="c1"&gt;# &amp;lt;-- must be nested here, not at backstage.*&lt;/span&gt;
    &lt;span class="na"&gt;appConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s"&gt;...&lt;/span&gt;
    &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
    &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
    &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;If a Helm chart is a wrapper with a dependency, configuration for the dependency must be nested under the dependency's alias key. Values placed at the wrong hierarchy level are silently ignored — Kubernetes uses chart defaults, not your overrides. This incident directly motivated adding CI validation for rendered Helm manifests: if the final Deployment spec had been checked in CI, the wrong probe values would have been caught before deployment. Full RCA: &lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
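&lt;p&gt;That CI gate can be small: render the chart, then assert the override actually landed in the Deployment spec. A minimal sketch of the assertion, with the &lt;code&gt;helm template&lt;/code&gt; output simulated by a variable so the example is self-contained — in CI you would pipe the real rendered manifest instead:&lt;/p&gt;

```shell
# RENDERED simulates a fragment of `helm template -f values.yaml` output;
# replace the variable with the real pipeline in CI.
RENDERED='
        startupProbe:
          failureThreshold: 30
          initialDelaySeconds: 120'

# If values were nested at the wrong level, the override never reaches
# the rendered Deployment and this check fails loudly before deploy.
if printf '%s' "$RENDERED" | grep -q 'initialDelaySeconds: 120'; then
  STATUS="present"
else
  STATUS="missing"
fi
echo "probe override $STATUS in the rendered Deployment"
```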




&lt;h2&gt;
  
  
  Failure 5: PR Creation Fails — GitHub Token Secret Contains Placeholder Value
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 30 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After the portal was stable, the scaffolder's "Open pull request" step spun for 30 seconds then failed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Request failed with status 401: Bad credentials
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No PR was created in GitHub.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The Kubernetes Secret &lt;code&gt;neuroscale-backstage-secrets&lt;/code&gt; contained a placeholder &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; value — literally &lt;code&gt;&amp;lt;YOUR_TOKEN_HERE&amp;gt;&lt;/code&gt;. The environment variable existed, and &lt;code&gt;kubectl describe secret&lt;/code&gt; showed the key as populated, but the value was not a valid token.&lt;/p&gt;

&lt;p&gt;A secondary issue: after updating the secret with the correct token, the running pod did not pick up the change. Environment variables from Secrets are injected at pod start time, not dynamically. The pod needed a restart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update the secret with a valid token&lt;/span&gt;
&lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; GITHUB_TOKEN
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage create secret generic neuroscale-backstage-secrets &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GITHUB_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;client &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; -

&lt;span class="c"&gt;# Restart to reload env vars&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage rollout restart deploy/neuroscale-backstage
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage rollout status deploy/neuroscale-backstage &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;300s

&lt;span class="c"&gt;# Verify token is present — check length, never the value&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="nb"&gt;exec &lt;/span&gt;deploy/neuroscale-backstage &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo ${#GITHUB_TOKEN} chars'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kubectl describe secret&lt;/code&gt; shows the key exists and has bytes. It does not show whether the value is a valid token or a placeholder string. Always verify token presence by checking character length in the running container, never by reading the secret value directly.&lt;/p&gt;
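&lt;p&gt;One way to catch placeholders before they ever reach a pod, sketched with the base64 value inlined: the &lt;code&gt;ENCODED&lt;/code&gt; variable stands in for the output of &lt;code&gt;kubectl get secret ... -o jsonpath&lt;/code&gt;, and the placeholder patterns are illustrative. The raw value is never printed, only a verdict:&lt;/p&gt;

```shell
# ENCODED simulates what kubectl hands back for .data.GITHUB_TOKEN;
# this one decodes to the literal YOUR_TOKEN_HERE placeholder.
ENCODED="PFlPVVJfVE9LRU5fSEVSRT4="
TOKEN=$(printf '%s' "$ENCODED" | base64 -d)

# Bytes exist, so describe-level checks pass; pattern-match the content
# too. The decoded value itself is never echoed.
if printf '%s' "$TOKEN" | grep -qi 'your_token\|changeme\|placeholder'; then
  VERDICT="placeholder"
else
  VERDICT="plausible (${#TOKEN} chars)"
fi
echo "GITHUB_TOKEN value looks: $VERDICT"
```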




&lt;h2&gt;
  
  
  Failure 6: PR Merged But ArgoCD Stays OutOfSync — Fix Not Committed to Git
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 1 hour of confusion&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;The Backstage scaffolder created the PR correctly. CI passed. The PR was merged. ArgoCD detected the new application. But the child app immediately showed &lt;code&gt;OutOfSync/Degraded&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync      Degraded

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd describe application demo-iris-2
...
Message: Internal error occurred: failed calling webhook
  &lt;span class="s2"&gt;"inferenceservice.kserve-webhook-server.validator.webhook"&lt;/span&gt;:
  no endpoints available &lt;span class="k"&gt;for &lt;/span&gt;service &lt;span class="s2"&gt;"kserve-webhook-server-service"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;This was the &lt;code&gt;kube-rbac-proxy&lt;/code&gt; ImagePullBackOff failure from earlier — reappearing after a cluster restart. The fix had been applied with &lt;code&gt;kubectl patch&lt;/code&gt; directly, not committed to Git. ArgoCD's &lt;code&gt;selfHeal: true&lt;/code&gt; reverted it on the next sync cycle. The cluster restart exposed that the fix was never persisted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify the patch is in kustomization.yaml&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;infrastructure/serving-stack/kustomization.yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A2&lt;/span&gt; patches

&lt;span class="c"&gt;# Commit and push&lt;/span&gt;
git add infrastructure/serving-stack/
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"serving-stack: persist kube-rbac-proxy removal patch"&lt;/span&gt;
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ArgoCD picked up the change within 3 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;Any fix applied with &lt;code&gt;kubectl&lt;/code&gt; directly in a GitOps-managed cluster is temporary. The next sync cycle will revert it. Every fix must be committed to Git to survive. The PR-merged-but-nothing-deployed experience is the worst possible failure for a Golden Path demo — the developer did everything correctly and the platform failed silently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 7: Inference Endpoint Returns HTTP 307 Redirect — Traefik Intercepts Before Kourier
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 45 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After &lt;code&gt;demo-iris-2&lt;/code&gt; became &lt;code&gt;Ready=True&lt;/code&gt;, the inference test returned an unexpected redirect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[6.8,2.8,4.8,1.4]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://172.20.0.3/v1/models/demo-iris-2:predict

&amp;lt; HTTP/1.1 307 Temporary Redirect
&amp;lt; Location: https://172.20.0.3/v1/models/demo-iris-2:predict
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;k3d's built-in Traefik ingress was intercepting the request and applying an HTTP-to-HTTPS redirect before it reached Kourier. The request never reached the Knative routing layer at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Use direct pod port-forward for canonical local verification, bypassing Traefik and Kourier entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find predictor pod&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get pods &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-l&lt;/span&gt; serving.knative.dev/revision&lt;span class="o"&gt;=&lt;/span&gt;demo-iris-2-predictor-00001 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.items[0].metadata.name}'&lt;/span&gt;

&lt;span class="c"&gt;# Port-forward directly to the pod&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default port-forward &lt;span class="se"&gt;\&lt;/span&gt;
  pod/demo-iris-2-predictor-00001-deployment-&amp;lt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 18080:8080

&lt;span class="c"&gt;# Predict&lt;/span&gt;
curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[6.8,2.8,4.8,1.4],[6.0,3.4,4.5,1.6]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://127.0.0.1:18080/v1/models/demo-iris-2:predict

&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"predictions"&lt;/span&gt;:[1,1]&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;A healthy inference endpoint can look completely broken if your test path hits an unexpected intermediary. For local k3d clusters, disable Traefik at cluster creation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k3d cluster create neuroscale &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--k3s-arg&lt;/span&gt; &lt;span class="s2"&gt;"--disable=traefik@server:0"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Failure 8: Catalog Ingestion Silently Rejects Template After Values Update
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 20 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After updating &lt;code&gt;values.yaml&lt;/code&gt; and rolling out a new Backstage deployment, the template disappeared from &lt;code&gt;/create&lt;/code&gt; again — the same symptom as Failure 1, but after it had been working.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;When the new pod came up, the catalog re-ingested every registered location. The updated &lt;code&gt;values.yaml&lt;/code&gt; contained a YAML indentation error in the &lt;code&gt;catalog.locations&lt;/code&gt; block: the &lt;code&gt;rules:&lt;/code&gt; key was no longer nested under the location entry, so the &lt;code&gt;allow: [Template]&lt;/code&gt; rule silently stopped applying and the location fell back to the default allow list, which excludes &lt;code&gt;Template&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check catalog ingestion in the new pod logs immediately after rollout&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage logs deploy/neuroscale-backstage &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100 | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"warn&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;error&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;fail&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;forbidden"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fixed the YAML indentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Correct indentation&lt;/span&gt;
&lt;span class="na"&gt;catalog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;locations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/...&lt;/span&gt;
      &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# must be under rules:, not misaligned&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;YAML indentation errors in Backstage config values are never surfaced as errors — the field is simply ignored. After every Backstage rollout that touches &lt;code&gt;appConfig&lt;/code&gt;, immediately verify catalog ingestion by checking server logs and confirming the template appears in &lt;code&gt;/create&lt;/code&gt;.&lt;/p&gt;
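&lt;p&gt;A crude indentation guard would have flagged this before rollout: the &lt;code&gt;rules:&lt;/code&gt; key must sit deeper than its parent list item. A sketch over an inline snippet — point it at your real &lt;code&gt;values.yaml&lt;/code&gt; in practice, and note that a dedicated linter such as &lt;code&gt;yamllint&lt;/code&gt; is the sturdier option:&lt;/p&gt;

```shell
# SNIPPET simulates the catalog block of values.yaml; read the real
# file in practice.
SNIPPET='catalog:
  locations:
    - type: url
      target: https://github.com/example
      rules:
        - allow: [Template]'

# Measure leading-whitespace width of the list item vs. the rules key.
loc_col=$(printf '%s\n' "$SNIPPET" | grep -m1 -- '- type:' | sed 's/[^ ].*//' | wc -c)
rules_col=$(printf '%s\n' "$SNIPPET" | grep -m1 'rules:' | sed 's/[^ ].*//' | wc -c)
if [ "$rules_col" -gt "$loc_col" ]; then
  echo "ok: rules is nested under the location entry"
else
  echo "warning: rules is misaligned and will be silently ignored"
fi
```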




&lt;h2&gt;
  
  
  Failure 9: Scaffolder Task Hangs Then Fails — Port-Forward Session Died Mid-Task
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 15 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;The scaffolder task started successfully, the progress spinner ran for 60 seconds, and then the task failed with a network error. The Backstage UI marked the task as failed with no specific error message. A second attempt worked immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;kubectl port-forward&lt;/code&gt; session for Backstage had silently died between opening the browser and submitting the scaffolder form. The React app was loaded from cache — so the page appeared fully functional — but all API calls were failing because the backend was unreachable. The scaffolder task started, sent the first API call, and failed on the network layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before running any Backstage scaffolder task, verify the port-forward is alive&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s1"&gt;'http://localhost:7010/api/catalog/entities?limit=1'&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; 100

&lt;span class="c"&gt;# If it returns nothing or errors, restart the port-forward&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage port-forward svc/neuroscale-backstage 7010:7007
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;scripts/port-forward-all.sh&lt;/code&gt; from the repository, which starts all required port-forwards as background processes with clean shutdown handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;A React app loaded from browser cache looks fully functional even when the backend is unreachable. Always verify the backend API is responding before running a scaffolder task, not just that the UI loaded.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Golden Path Actually Proves After Nine Failures
&lt;/h2&gt;

&lt;p&gt;Final working state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservice demo-iris-2
NAME          URL                                       READY   AGE
demo-iris-2   http://demo-iris-2.default.example.com   True    25m

&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[6.8,2.8,4.8,1.4],[6.0,3.4,4.5,1.6]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://127.0.0.1:18080/v1/models/demo-iris-2:predict

&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"predictions"&lt;/span&gt;:[1,1]&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Golden Path demo is a chain of seven moving parts: Backstage config, GitHub auth, ArgoCD app-of-apps, KServe controller, Knative routing, Kourier gateway, and the predictor pod. In production, any link in that chain can fail independently.&lt;/p&gt;

&lt;p&gt;The debugging process for these nine failures maps directly onto what a platform SRE does on an on-call shift.&lt;/p&gt;




&lt;h2&gt;
  
  
  Debugging Commands Reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Backstage catalog ingestion errors&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage logs deploy/neuroscale-backstage | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"warn&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;error&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;fail&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;forbidden"&lt;/span&gt;

&lt;span class="c"&gt;# Backstage runtime config&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage describe configmap neuroscale-backstage-app-config

&lt;span class="c"&gt;# Verify GitHub token is present (check length only)&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; backstage &lt;span class="nb"&gt;exec &lt;/span&gt;deploy/neuroscale-backstage &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'echo ${#GITHUB_TOKEN} chars'&lt;/span&gt;

&lt;span class="c"&gt;# ArgoCD child app sync status&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get applications
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd describe application demo-iris-2

&lt;span class="c"&gt;# InferenceService conditions&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default describe inferenceservice demo-iris-2

&lt;span class="c"&gt;# Admission webhook endpoints&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get endpoints kserve-webhook-server-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Pattern Across All Nine Failures
&lt;/h2&gt;

&lt;p&gt;Looking back at the nine failures, they fall into three categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silent failures (no UI error, log only):&lt;/strong&gt;&lt;br&gt;
Failures 1, 2, 8 — catalog ingestion rejections and auth failures that show nothing in the UI. Rule: always check server logs, not just the browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration hierarchy failures:&lt;/strong&gt;&lt;br&gt;
Failures 3, 4 — missing required keys and wrong Helm nesting. Rule: validate rendered manifests in CI before applying them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State and dependency failures:&lt;/strong&gt;&lt;br&gt;
Failures 5, 6, 7, 9 — stale secrets, unversioned fixes, intercepting proxies, dead sessions. Rule: verify the complete dependency chain before debugging the thing that appears broken.&lt;/p&gt;




&lt;h2&gt;
  
  
  See Also
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md&lt;/code&gt;&lt;/a&gt; — full 12-section RCA for Failure 4&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_3_GOLDEN_PATH.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_3_GOLDEN_PATH.md&lt;/code&gt;&lt;/a&gt; — complete implementation record&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/backstage/values.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/backstage/values.yaml&lt;/code&gt;&lt;/a&gt; — working dev Backstage configuration&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/backstage/values-prod.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/backstage/values-prod.yaml&lt;/code&gt;&lt;/a&gt; — production profile with GitHub OAuth&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/scripts/smoke-test.sh" rel="noopener noreferrer"&gt;&lt;code&gt;scripts/smoke-test.sh&lt;/code&gt;&lt;/a&gt; — automated end-to-end verification&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Jimoh Sodiq Bolaji&lt;/strong&gt; | Platform Engineer | Technical Content Engineer | Abuja, Nigeria&lt;br&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;NeuroScale Platform&lt;/a&gt; · &lt;a href="https://dev.to/sodiqjimoh"&gt;Dev.to&lt;/a&gt;&lt;/p&gt;

</description>
      <category>backstage</category>
      <category>kubernetes</category>
      <category>kserve</category>
      <category>gitops</category>
    </item>
    <item>
      <title>Beyond InferenceService Readiness: 5 GitOps Failure Modes That Break KServe Deployments</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Mon, 30 Mar 2026 22:52:16 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/beyond-inferenceservice-readiness-5-gitops-failure-modes-that-break-kserve-deployments-14fb</link>
      <guid>https://forem.com/sodiqjimoh/beyond-inferenceservice-readiness-5-gitops-failure-modes-that-break-kserve-deployments-14fb</guid>
      <description>&lt;p&gt;&lt;strong&gt;A sequel to my KServe readiness post — five GitOps control-plane failure modes with exact terminal output, diagnostics, and repeatable fixes for ArgoCD + KServe stacks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This post is a follow-up to my earlier KServe piece on endpoint readiness:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://dev.to/sodiqjimoh/why-your-kserve-inferenceservice-wontbecome-readyfour-production-failures-and-fixes-nei"&gt;Why Your KServe InferenceService Won't Become Ready: Four Production Failures and Fixes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That article focused on why an &lt;code&gt;InferenceService&lt;/code&gt; may not become &lt;code&gt;Ready&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This one zooms out to a broader question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What breaks when the GitOps control plane itself is unstable?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most GitOps + AI serving tutorials still focus on the happy path — install ArgoCD, apply KServe, deploy InferenceService, done. But in real platform work, the happy path is the easy part.&lt;/p&gt;

&lt;p&gt;The hard part is when your app is &lt;code&gt;OutOfSync&lt;/code&gt;, the webhook has no endpoints, and everything looks healthy except the thing you actually need.&lt;/p&gt;

&lt;p&gt;This post covers the &lt;strong&gt;five failure modes&lt;/strong&gt; that repeatedly broke KServe deployments in a real production-grade platform build, with exact terminal output, root causes, and the fixes that worked.&lt;/p&gt;

&lt;p&gt;All failures come from hands-on implementation work documented here:&lt;br&gt;
&lt;strong&gt;Project repo:&lt;/strong&gt; &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The platform context
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ArgoCD&lt;/strong&gt; — GitOps reconciliation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KServe&lt;/strong&gt; — model serving (&lt;code&gt;InferenceService&lt;/code&gt;, runtimes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knative + Kourier&lt;/strong&gt; — serving networking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kyverno&lt;/strong&gt; — policy guardrails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backstage&lt;/strong&gt; — self-service PR generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitOps root app:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;neuroscale-infrastructure&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/sodiq-code/neuroscale-platform.git&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infrastructure/apps&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Failure Mode 1: Webhook Has No Endpoints — Sync Fails Cluster-Wide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; ~1 hour | &lt;strong&gt;Impact:&lt;/strong&gt; All InferenceService operations blocked&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;ArgoCD syncs child apps and hits this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd describe application ai-model-alpha
...
Message: admission webhook
  &lt;span class="s2"&gt;"inferenceservice.kserve-webhook-server.validator.webhook"&lt;/span&gt;
  denied the request: Internal error occurred:
  no endpoints available &lt;span class="k"&gt;for &lt;/span&gt;service &lt;span class="s2"&gt;"kserve-webhook-server-service"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meanwhile the KServe controller pod shows only 1 of 2 containers ready:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get pods
NAME                                        READY   STATUS
kserve-controller-manager-8d7c5b9f4-xr2lm  1/2     Running

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve describe pod kserve-controller-manager-8d7c5b9f4-xr2lm
...
  kube-rbac-proxy:
    State:   Waiting
    Reason:  ImagePullBackOff
    Image:   gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
Events:
  Warning  Failed  kubelet
    Failed to pull image: unexpected status code 403 Forbidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;kube-rbac-proxy&lt;/code&gt; sidecar inside &lt;code&gt;kserve-controller-manager&lt;/code&gt; was pulling from &lt;code&gt;gcr.io/kubebuilder/&lt;/code&gt; — a registry that restricted access in late 2025. The manager container itself was healthy, but because the sidecar never started, the pod never became Ready, so the webhook Service had no endpoints to route to. Result: every &lt;code&gt;InferenceService&lt;/code&gt; apply or update was blocked cluster-wide.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Remove the sidecar via Kustomize strategic merge patch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/serving-stack/patches/&lt;/span&gt;
&lt;span class="c1"&gt;#   kserve-controller-kube-rbac-proxy-image.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-controller-manager&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-rbac-proxy&lt;/span&gt;
          &lt;span class="na"&gt;$patch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify webhook endpoints are restored after re-sync:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get endpoints kserve-webhook-server-service
NAME                           ENDPOINTS          AGE
kserve-webhook-server-service  10.42.0.23:9443    45s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;When webhook endpoints are missing, your app YAML is never the real problem. Diagnose the controller first. An external registry access change can silently kill your entire admission layer cluster-wide with no obvious error in the app itself.&lt;/p&gt;
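
&lt;p&gt;If removing the sidecar feels too invasive, an alternative sketch is to repoint the image at a mirror: the kube-rbac-proxy project also publishes images on &lt;code&gt;quay.io&lt;/code&gt;. Treat the exact registry path and tag as assumptions to verify against your own mirror before relying on this:&lt;/p&gt;

```yaml
# Alternative strategic merge patch: keep the sidecar but pull it from
# a mirror instead of the restricted gcr.io/kubebuilder registry.
# (Image path and tag are illustrative -- confirm they exist first.)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kserve-controller-manager
  namespace: kserve
spec:
  template:
    spec:
      containers:
        - name: kube-rbac-proxy
          image: quay.io/brancz/kube-rbac-proxy:v0.13.1
```

&lt;p&gt;This keeps the RBAC-filtered metrics endpoint intact, at the cost of taking a dependency on another external registry.&lt;/p&gt;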




&lt;h2&gt;
  
  
  Failure Mode 2: CRD Deleted by a Misapplied Patch — All Endpoints Gone Instantly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 4 minutes recovery | &lt;strong&gt;Impact:&lt;/strong&gt; SEV-1 equivalent — all InferenceServices deleted&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;All InferenceService objects disappeared silently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservices
No resources found &lt;span class="k"&gt;in &lt;/span&gt;default namespace.

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync      Missing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;A Kustomize patch file named &lt;code&gt;remove-inferenceservice-crd.yaml&lt;/code&gt; was mistakenly applied directly with &lt;code&gt;kubectl apply -f&lt;/code&gt; instead of being used as a build-time patch inside &lt;code&gt;kustomization.yaml&lt;/code&gt;. The file contained a &lt;code&gt;$patch: delete&lt;/code&gt; directive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apiextensions.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CustomResourceDefinition&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inferenceservices.serving.kserve.io&lt;/span&gt;
&lt;span class="na"&gt;$patch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When applied directly, it deleted the actual CRD from Kubernetes. When a CRD is deleted, Kubernetes immediately garbage-collects every custom resource of that type. Every InferenceService was gone within seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Restore the CRD immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/kserve/kserve/releases/download/v0.12.1/kserve.yaml

kubectl &lt;span class="nb"&gt;wait &lt;/span&gt;crd/inferenceservices.serving.kserve.io &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Established &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60s

kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd patch application demo-iris-2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; merge &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s1"&gt;'{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$patch: delete&lt;/code&gt; in a Kustomize file is a build-time instruction — it tells &lt;code&gt;kustomize build&lt;/code&gt; to omit that resource from output. It must never be applied directly with &lt;code&gt;kubectl apply -f&lt;/code&gt;. Ambiguous filenames like &lt;code&gt;remove-inferenceservice-crd.yaml&lt;/code&gt; are dangerous footguns. In a production cluster with 50 deployed models this would be a full SEV-1.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Rule:&lt;/strong&gt; Any file containing &lt;code&gt;$patch: delete&lt;/code&gt; must only ever be referenced inside a &lt;code&gt;kustomization.yaml&lt;/code&gt; patches block, never applied directly.&lt;/p&gt;
&lt;/blockquote&gt;
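
&lt;p&gt;For contrast, the only legitimate home for such a file is a &lt;code&gt;patches&lt;/code&gt; entry in the overlay's &lt;code&gt;kustomization.yaml&lt;/code&gt;, where it is consumed at build time. A sketch (directory layout illustrative):&lt;/p&gt;

```yaml
# kustomization.yaml -- the delete patch is referenced here and only
# ever processed by `kustomize build`, never applied on its own.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - base
patches:
  - path: remove-inferenceservice-crd.yaml
```

&lt;p&gt;Built this way, &lt;code&gt;kustomize build&lt;/code&gt; simply omits the CRD from its rendered output; nothing is ever deleted from the cluster.&lt;/p&gt;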




&lt;h2&gt;
  
  
  Failure Mode 3: Permanent OutOfSync Due to Label Key Mismatch
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 2 weeks undetected | &lt;strong&gt;Impact:&lt;/strong&gt; CI was green while policy enforcement was silently broken&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;A PR is merged, ArgoCD syncs, but the InferenceService stays &lt;code&gt;OutOfSync/Degraded&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync      Degraded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kyverno denies the resource at admission:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Error from server: error when creating &lt;span class="s2"&gt;"STDIN"&lt;/span&gt;:
  admission webhook &lt;span class="s2"&gt;"clusterpolicy.kyverno.svc"&lt;/span&gt; denied the request:
  resource InferenceService/default/test-model was blocked due to the following policies
  require-standard-labels-inferenceservice:
    check-owner-and-cost-center-on-isvc: &lt;span class="s1"&gt;'validation error:
    InferenceService resources must set metadata.labels.owner and
    metadata.labels.cost-center.
    rule check-owner-and-cost-center-on-isvc failed at path
    /metadata/labels/cost-center/'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the label is present in the manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservice demo-iris-2 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.metadata.labels}'&lt;/span&gt; | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"owner"&lt;/span&gt;: &lt;span class="s2"&gt;"platform-team"&lt;/span&gt;,
    &lt;span class="s2"&gt;"costCenter"&lt;/span&gt;: &lt;span class="s2"&gt;"ai-platform"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;costCenter&lt;/code&gt; (camelCase) and &lt;code&gt;cost-center&lt;/code&gt; (kebab-case) are completely different Kubernetes label keys. The Backstage template skeleton was generating &lt;code&gt;costCenter&lt;/code&gt;. The Kyverno policy required &lt;code&gt;cost-center&lt;/code&gt;. CI stayed green because of a second bug in the check itself, covered next, so the mismatch only surfaced at admission time.&lt;/p&gt;

&lt;p&gt;Additionally, &lt;code&gt;kyverno-cli apply&lt;/code&gt; exits with code &lt;code&gt;0&lt;/code&gt; even when policy violations are found. CI was checking &lt;code&gt;$?&lt;/code&gt; rather than &lt;code&gt;${PIPESTATUS[0]}&lt;/code&gt;, so the CI step appeared green while enforcement was completely broken for two weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Standardize on kebab-case throughout (Kubernetes convention):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Backstage template skeleton&lt;/span&gt;
&lt;span class="c"&gt;# apps/${{ values.name }}/inference-service.yaml&lt;/span&gt;
labels:
  owner: platform-team
  cost-center: ai-platform   &lt;span class="c"&gt;# was: costCenter&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
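
&lt;p&gt;A cheap guard for the template side is a CI lint that rejects camelCase label keys before Kyverno ever sees the manifest. A minimal sketch in plain bash — the sample manifest is inline for demonstration; in a real pipeline you would point the grep at the rendered skeleton output:&lt;/p&gt;

```shell
# Illustrative pre-merge lint: count camelCase keys that would collide
# with a kebab-case Kyverno policy. Sample manifest is inline.
manifest='labels:
  owner: platform-team
  costCenter: ai-platform'

# An indented key containing an interior capital letter violates kebab-case.
bad_keys=$(printf '%s\n' "$manifest" | grep -cE '^[[:space:]]+[a-z]+[A-Z][a-zA-Z]*:')
echo "camelCase label keys found: $bad_keys"
```

&lt;p&gt;Failing the build when the count is nonzero catches the mismatch at PR time instead of at admission time.&lt;/p&gt;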



&lt;p&gt;Fix the CI Kyverno check to catch actual violations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;set&lt;/span&gt; +e
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;&lt;span class="s2"&gt;:/work"&lt;/span&gt; &lt;span class="nt"&gt;-w&lt;/span&gt; /work ghcr.io/kyverno/kyverno-cli:v1.12.5 &lt;span class="se"&gt;\&lt;/span&gt;
  apply infrastructure/kyverno/policies/&lt;span class="k"&gt;*&lt;/span&gt;.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;app_files&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;tee&lt;/span&gt; /tmp/kyverno-output.txt
&lt;span class="nv"&gt;kyverno_exit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PIPESTATUS&lt;/span&gt;&lt;span class="p"&gt;[0]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;kyverno_exit&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-ne&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qE&lt;/span&gt; &lt;span class="s2"&gt;"^FAIL"&lt;/span&gt; /tmp/kyverno-output.txt &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qE&lt;/span&gt; &lt;span class="s2"&gt;"fail: [1-9][0-9]*"&lt;/span&gt; /tmp/kyverno-output.txt&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Kyverno policy violations detected. Failing CI."&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$?&lt;/code&gt; captures the exit code of &lt;code&gt;tee&lt;/code&gt;, not &lt;code&gt;kyverno&lt;/code&gt;. &lt;code&gt;${PIPESTATUS[0]}&lt;/code&gt; captures kyverno's actual exit code. "Guardrails exist" and "guardrails enforce" are different states. The most dangerous failure mode for a policy system is silent false positives — everything looks green while nothing is actually being enforced.&lt;/p&gt;
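
&lt;p&gt;The &lt;code&gt;$?&lt;/code&gt; trap reproduces in any bash shell, no cluster required:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Reproduce the CI bug: a failing command piped through tee looks
# successful if you only check $?.
false | tee /dev/null
naive=$?                   # exit code of tee, which succeeded: 0

false | tee /dev/null
real=${PIPESTATUS[0]}      # exit code of false, the real failure: 1

echo "naive=$naive real=$real"
```

&lt;p&gt;Running it prints &lt;code&gt;naive=0 real=1&lt;/code&gt;: the pipeline's reported status is the last command's, while &lt;code&gt;PIPESTATUS&lt;/code&gt; preserves every stage's exit code.&lt;/p&gt;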




&lt;h2&gt;
  
  
  Failure Mode 4: Kyverno Install Breaks ArgoCD Reconciliation Loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 2–5 minutes per cluster | &lt;strong&gt;Impact:&lt;/strong&gt; All ArgoCD apps enter Unknown state&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After adding Kyverno to the platform, previously healthy apps enter &lt;code&gt;Unknown&lt;/code&gt; state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get applications
NAME                       SYNC STATUS   HEALTH STATUS
neuroscale-infrastructure  Synced         Healthy
serving-stack              Unknown        Unknown    &lt;span class="c"&gt;# was Healthy 10 minutes ago&lt;/span&gt;
policy-guardrails          Synced         Healthy

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd describe application serving-stack
...
Message: rpc error: code &lt;span class="o"&gt;=&lt;/span&gt; Unavailable desc &lt;span class="o"&gt;=&lt;/span&gt; connection refused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;Kyverno installs its own &lt;code&gt;ValidatingWebhookConfiguration&lt;/code&gt; and &lt;code&gt;MutatingWebhookConfiguration&lt;/code&gt; during install. While Kyverno is initializing, the webhook configurations are registered but point to endpoints that do not exist yet. During this window, any &lt;code&gt;kubectl apply&lt;/code&gt; operation — including ArgoCD's sync reconciliation loop — times out waiting for a response from a not-yet-running webhook server. This cascades into the ArgoCD repo-server losing its connection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;

&lt;p&gt;Add a Kyverno &lt;code&gt;webhookAnnotations&lt;/code&gt; ConfigMap patch to suppress automatic webhook registration during the initialization window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/kyverno/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;patches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno&lt;/span&gt;
    &lt;span class="na"&gt;patch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
      &lt;span class="s"&gt;apiVersion: v1&lt;/span&gt;
      &lt;span class="s"&gt;kind: ConfigMap&lt;/span&gt;
      &lt;span class="s"&gt;metadata:&lt;/span&gt;
        &lt;span class="s"&gt;name: kyverno&lt;/span&gt;
        &lt;span class="s"&gt;namespace: kyverno&lt;/span&gt;
      &lt;span class="s"&gt;data:&lt;/span&gt;
        &lt;span class="s"&gt;webhookAnnotations: "{}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After Kyverno reaches &lt;code&gt;Running&lt;/code&gt; state, force a hard refresh:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd patch application serving-stack &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; merge &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s1"&gt;'{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;Adding a policy engine to an existing cluster can disrupt every other ArgoCD-managed application during the install window. In production this calls for a maintenance window or a canary install strategy, and Kyverno must be fully healthy before any other component syncs.&lt;/p&gt;
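
&lt;p&gt;One way to enforce that ordering declaratively is ArgoCD sync waves: give the Kyverno child &lt;code&gt;Application&lt;/code&gt; in the app-of-apps an earlier wave, so it must sync and report healthy before later waves start. A sketch (app name and wave value illustrative):&lt;/p&gt;

```yaml
# Illustrative: the policy engine syncs in wave -1, ahead of the
# default wave 0 where the serving stack and workloads live.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: policy-guardrails
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
```

&lt;p&gt;ArgoCD waits for each wave to become healthy before proceeding, which closes the window where Kyverno's webhooks are registered but unbacked.&lt;/p&gt;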




&lt;h2&gt;
  
  
  Failure Mode 5: Stale Admission Webhook Blocks All Resource Creation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; 30+ minutes | &lt;strong&gt;Impact:&lt;/strong&gt; All Deployments in the namespace silently blocked&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After fixing the repo-server, apps sync but Deployments never appear:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get applications &lt;span class="nt"&gt;-n&lt;/span&gt; argocd
NAME                       SYNC STATUS   HEALTH STATUS
neuroscale-infrastructure  Synced         Healthy
test-app                   Synced         Progressing   &lt;span class="c"&gt;# stuck&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get deploy &lt;span class="nt"&gt;-n&lt;/span&gt; default
No resources found &lt;span class="k"&gt;in &lt;/span&gt;default namespace.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ArgoCD shows the Deployment as "synced" but it does not exist — a contradiction. Checking conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get application test-app &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 20 conditions
  conditions:
  - message: &lt;span class="s1"&gt;'Failed sync attempt: one or more objects failed to apply,
      reason: Internal error occurred: failed calling webhook
      "validate.nginx.ingress.kubernetes.io":
      failed to call webhook: Post
      "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/...":
      dial tcp 10.96.x.x:443: connect: connection refused'&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;: SyncError
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;ValidatingWebhookConfiguration&lt;/code&gt; from a previous cluster experiment was still registered but pointing to a service that no longer existed. Kubernetes admission webhooks are cluster-scoped. The stale &lt;code&gt;ingress-nginx&lt;/code&gt; webhook was intercepting every resource creation attempt and failing them — the error only appears in ArgoCD events, not on the Deployment itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Discover stale webhooks&lt;/span&gt;
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

&lt;span class="c"&gt;# Delete the stale one&lt;/span&gt;
kubectl delete validatingwebhookconfiguration ingress-nginx-admission

&lt;span class="c"&gt;# Force ArgoCD to retry&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd patch application test-app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; merge &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s1"&gt;'{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After deletion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get deploy &lt;span class="nt"&gt;-n&lt;/span&gt; default
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
nginx-test    1/1     1            1           23s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson
&lt;/h3&gt;

&lt;p&gt;A stale webhook left behind by a previous workload can silently block every resource creation its rules match, for hours, with no obvious error message. The admission error surfaces only in the ArgoCD application conditions, not on the resource itself. Always check for stale webhooks before blaming manifests.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Triage Sequence That Saves Hours
&lt;/h2&gt;

&lt;p&gt;When a KServe app is failing in ArgoCD, run these checks in exactly this order before touching any manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Environment gate — if this fails, stop and fix environment first&lt;/span&gt;
kubectl get nodes
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get applications

&lt;span class="c"&gt;# 2. Control-plane health&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get deploy,pods,svc,endpoints
kubectl get crd | &lt;span class="nb"&gt;grep &lt;/span&gt;serving.kserve.io

&lt;span class="c"&gt;# 3. Controller logs&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve logs deploy/kserve-controller-manager &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100

&lt;span class="c"&gt;# 4. Webhook availability&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get endpoints kserve-webhook-server-service

&lt;span class="c"&gt;# 5. Stale webhooks&lt;/span&gt;
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

&lt;span class="c"&gt;# 6. App-level sync error detail&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get application &amp;lt;app-name&amp;gt; &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 20 conditions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only after every step above passes should you edit app manifests.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters for Platform Teams
&lt;/h2&gt;

&lt;p&gt;A platform is credible when it supports both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-service delivery&lt;/strong&gt; — the Golden Path works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-service recovery&lt;/strong&gt; — failures are understandable and fixable without a platform expert&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams build the first and postpone the second. That creates operational debt fast.&lt;/p&gt;

&lt;p&gt;The fix is not more dashboards. It is better failure-model documentation, tighter GitOps guardrails, and the discipline to document what breaks — not just what works.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A platform is not "done" when the happy path works. It's done when the failure path is understandable and recoverable.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What I Would Improve Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Pre-merge CI assertions for probe and resource fields in rendered manifests&lt;/li&gt;
&lt;li&gt;Explicit dependency ordering using ArgoCD sync waves to prevent Kyverno install disruption&lt;/li&gt;
&lt;li&gt;Conformance checks for Helm dependency values nesting to catch silently ignored overrides&lt;/li&gt;
&lt;li&gt;Policy test fixtures that verify both pass and fail cases in CI&lt;/li&gt;
&lt;/ul&gt;
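&lt;p&gt;The first item, pre-merge assertions on probe and resource fields, can be sketched in a few lines. This is a hypothetical CI check, not an existing tool; in a real pipeline the input would be the rendered output of &lt;code&gt;helm template&lt;/code&gt; parsed into dicts, stood in for here by an inline manifest:&lt;/p&gt;

```python
# Minimal sketch of a pre-merge gate: flag rendered Deployments whose
# containers are missing probes or resource specs. The `rendered` dict
# stands in for one document of parsed `helm template` output.

def missing_fields(deployment):
    """Return a problem string for each probe/resource field a container lacks."""
    problems = []
    containers = (deployment.get("spec", {})
                            .get("template", {})
                            .get("spec", {})
                            .get("containers", []))
    for c in containers:
        for field in ("livenessProbe", "readinessProbe", "resources"):
            if field not in c:
                problems.append(f"{deployment['metadata']['name']}/{c['name']}: missing {field}")
    return problems

rendered = {
    "kind": "Deployment",
    "metadata": {"name": "backstage"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "backstage", "resources": {"requests": {"memory": "512Mi"}}},
    ]}}},
}

for problem in missing_fields(rendered):
    print(problem)
# backstage/backstage: missing livenessProbe
# backstage/backstage: missing readinessProbe
```

&lt;p&gt;Failing the pipeline whenever this list is non-empty turns the probe-and-resources convention into an enforced contract instead of a review-time reminder.&lt;/p&gt;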




&lt;h2&gt;
  
  
  See Also
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_1_GITOPS_SPINE.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_1_GITOPS_SPINE.md&lt;/code&gt;&lt;/a&gt; — ArgoCD spine failures with exact terminal output&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md&lt;/code&gt;&lt;/a&gt; — Kyverno CI false-green and the &lt;code&gt;$PIPESTATUS[0]&lt;/code&gt; fix&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md&lt;/code&gt;&lt;/a&gt; — full incident postmortem with 12-section RCA&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_2_KSERVE_SERVING.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_2_KSERVE_SERVING.md&lt;/code&gt;&lt;/a&gt; — the kube-rbac-proxy failure in full detail&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Jimoh Sodiq Bolaji&lt;/strong&gt; | Platform Engineer | Technical Content Engineer | Abuja, Nigeria&lt;br&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;NeuroScale Platform&lt;/a&gt; · &lt;a href="https://dev.to/sodiqjimoh"&gt;Dev.to&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>gitops</category>
      <category>devops</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why Your KServe InferenceService Won't Become Ready: Four Production Failures and Fixes</title>
      <dc:creator>Sodiq Jimoh</dc:creator>
      <pubDate>Mon, 30 Mar 2026 03:37:24 +0000</pubDate>
      <link>https://forem.com/sodiqjimoh/why-your-kserve-inferenceservice-wontbecome-readyfour-production-failures-and-fixes-nei</link>
      <guid>https://forem.com/sodiqjimoh/why-your-kserve-inferenceservice-wontbecome-readyfour-production-failures-and-fixes-nei</guid>
      <description>&lt;p&gt;A practitioner's account of the errors the KServe getting-started documentation doesn't tell you about — with exact terminal output, root causes, and working Kustomize patches.&lt;/p&gt;

&lt;p&gt;This article documents four production failures I encountered while deploying KServe on a local k3d cluster as part of building NeuroScale — a self-service AI inference platform. None of these failures appear in the official KServe getting-started documentation. If you are deploying KServe without Istio, this will save you several hours of debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Was Building
&lt;/h2&gt;

&lt;p&gt;NeuroScale is a self-service AI inference platform on Kubernetes. The goal was simple: one InferenceService named &lt;code&gt;sklearn-iris&lt;/code&gt; reaches &lt;code&gt;Ready=True&lt;/code&gt; and responds to a prediction request.&lt;/p&gt;

&lt;p&gt;The install had to be GitOps-managed via ArgoCD — not "I ran some scripts." Getting there took two days and four distinct failures. Here is every one of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; k3d (local Kubernetes) · KServe 0.12.1 · ArgoCD · Kourier (no Istio) · Knative Serving&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📝 &lt;strong&gt;Author's Note:&lt;/strong&gt; This article was originally documented in the NeuroScale platform repository.&lt;br&gt;
&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;docs/REALITY_CHECK_MILESTONE_2_KSERVE_SERVING.md&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;github.com/sodiq-code/neuroscale-platform&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Failure 1: KServe InferenceService Stuck Not Ready — Istio vs Kourier Ingress Mismatch Causes ReconcileError Loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; ~3 hours&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;

&lt;p&gt;After applying the KServe installation via ArgoCD (serving-stack app), the InferenceService was created but never became Ready:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservice sklearn-iris
NAME           URL   READY   PREV   LATEST   AGE
sklearn-iris         False          100      8m

&lt;span class="c"&gt;# READY=False with no URL = KServe controller did not complete ingress setup.&lt;/span&gt;
&lt;span class="c"&gt;# No Knative Route was created. No external URL was assigned.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Digging In
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default describe inferenceservice sklearn-iris
...
Status:
  Conditions:
    Message: Failed to reconcile ingress
    Reason:  ReconcileError
    Status:  False
    Type:    IngressReady

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve logs deploy/kserve-controller-manager &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50
...
ERROR controller.inferenceservice Failed to reconcile ingress
  &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;: &lt;span class="s2"&gt;"virtual service not found: sklearn-iris.default.svc.cluster.local"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error referenced a &lt;em&gt;virtual service&lt;/em&gt; — that is an Istio concept. But we were running Kourier. The KServe controller was attempting to create an Istio VirtualService in a cluster that had no Istio control plane.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause: Default KServe Ingress Mode Assumes Istio
&lt;/h3&gt;

&lt;p&gt;KServe's default &lt;code&gt;inferenceservice-config&lt;/code&gt; ConfigMap expects Istio as the ingress provider. It sets &lt;code&gt;ingressClassName: istio&lt;/code&gt;, and &lt;code&gt;disableIstioVirtualHost&lt;/code&gt; defaults to &lt;code&gt;false&lt;/code&gt;. With no Istio control plane present, the controller loops on errors trying to create VirtualService resources that can never exist.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;disableIstioVirtualHost: true&lt;/code&gt; tells KServe to skip Istio and fall back to Knative route objects that Kourier can handle.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why Kourier instead of Istio:&lt;/strong&gt; Istio adds ~1 GB of memory overhead. On a local k3d cluster shared with Docker Desktop, Backstage, and the KServe controller, that exhausts available RAM. Kourier's entire footprint is under 200 MB.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Fix: ConfigMap Patch in serving-stack
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/serving-stack/patches/inferenceservice-config-ingress.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inferenceservice-config&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
    &lt;span class="s"&gt;{&lt;/span&gt;
      &lt;span class="s"&gt;"ingressGateway": "knative-serving/knative-ingress-gateway",&lt;/span&gt;
      &lt;span class="s"&gt;"ingressDomain": "example.com",&lt;/span&gt;
      &lt;span class="s"&gt;"ingressClassName": "istio",&lt;/span&gt;
      &lt;span class="s"&gt;"urlScheme": "http",&lt;/span&gt;
      &lt;span class="s"&gt;"disableIstioVirtualHost": true,&lt;/span&gt;
      &lt;span class="s"&gt;"disableIngressCreation": false&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this patch was applied and the KServe controller restarted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservice sklearn-iris
NAME           URL                                       READY   AGE
sklearn-iris   http://sklearn-iris.default.example.com   True    2m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Business impact:&lt;/strong&gt; This failure cost approximately 3 hours. The KServe documentation does not prominently state that the default configuration requires Istio. The error message "virtual service not found" is Istio-specific vocabulary that only makes sense if you already know Istio is the default — a classic undocumented assumption in infrastructure tooling.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Failure 2: ArgoCD Serving-Stack Sync Fails — Duplicate Knative CRD Exceeds 256 KB Annotation Size Limit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; ~30 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get application serving-stack
NAME            SYNC STATUS   HEALTH STATUS
serving-stack   OutOfSync     Degraded

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd describe application serving-stack
...
Message: CustomResourceDefinition &lt;span class="s2"&gt;"services.serving.knative.dev"&lt;/span&gt;
  is invalid: metadata.annotations:
  Too long: may not be more than 262144 bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;With client-side apply, the entire object manifest is stored in the &lt;code&gt;kubectl.kubernetes.io/last-applied-configuration&lt;/code&gt; annotation. For large CRDs this pushes &lt;code&gt;metadata.annotations&lt;/code&gt; past Kubernetes' 256 KB (262144-byte) size limit. The Knative Service CRD alone is approximately 400 KB as a YAML object.&lt;/p&gt;

&lt;p&gt;A rendering overlap compounded the issue: the &lt;code&gt;kserve.yaml&lt;/code&gt; bundle already includes its own version of the Knative Serving CRDs, and we were also referencing &lt;code&gt;serving-core.yaml&lt;/code&gt; directly. This created two attempts to manage the same CRDs, causing comparison instability.&lt;/p&gt;
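&lt;p&gt;The arithmetic behind the failure is easy to check. A minimal sketch, with an illustrative function name; in practice you would measure the manifest with something like &lt;code&gt;wc -c&lt;/code&gt; on the extracted CRD YAML:&lt;/p&gt;

```python
# Kubernetes caps the total size of metadata.annotations at 262144 bytes.
# Client-side apply must fit the whole manifest into one annotation, so any
# CRD at or above that size needs server-side apply instead.
ANNOTATION_LIMIT = 262144  # bytes

def fits_client_side_apply(manifest_bytes):
    """True when the manifest still fits inside the last-applied annotation."""
    # equivalent to: manifest_bytes is strictly below ANNOTATION_LIMIT
    return min(manifest_bytes, ANNOTATION_LIMIT - 1) == manifest_bytes

# The Knative Service CRD is roughly 400 KB of YAML:
print(fits_client_side_apply(400 * 1024))  # False
```

&lt;p&gt;That &lt;code&gt;False&lt;/code&gt; is exactly why the &lt;code&gt;ServerSideApply=true&lt;/code&gt; sync option below is not optional for the Knative CRDs.&lt;/p&gt;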

&lt;h3&gt;
  
  
  Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/serving-stack/kustomization.yaml&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Use server-side apply to bypass the annotation size limit&lt;/span&gt;
&lt;span class="na"&gt;commonAnnotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;argocd.argoproj.io/sync-options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServerSideApply=true&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Ignore runtime-mutated fields on Knative CRDs&lt;/span&gt;
&lt;span class="c1"&gt;#    (In ArgoCD Application spec)&lt;/span&gt;
&lt;span class="na"&gt;ignoreDifferences&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apiextensions.k8s.io&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CustomResourceDefinition&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;services.serving.knative.dev&lt;/span&gt;
    &lt;span class="na"&gt;jsonPointers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/spec/preserveUnknownFields&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Business impact:&lt;/strong&gt; ArgoCD's error says "Too long" but does not tell you which annotation or why it got too long. Debugging requires knowing ArgoCD's internal server-side apply mechanism.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Failure 3: kube-rbac-proxy ImagePullBackOff Blocks KServe Admission Webhook — gcr.io Access Restriction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; ~1 hour | Cluster-wide impact&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd describe application ai-model-alpha
...
Message: admission webhook
  &lt;span class="s2"&gt;"inferenceservice.kserve-webhook-server.validator.webhook"&lt;/span&gt;
  denied the request: no endpoints available &lt;span class="k"&gt;for
  &lt;/span&gt;service &lt;span class="s2"&gt;"kserve-webhook-server-service"&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get pods
NAME                            READY   STATUS
kserve-controller-manager-xxx   1/2     Running   &lt;span class="c"&gt;# only 1 of 2 ready&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve describe pod kserve-controller-manager-xxx
  kube-rbac-proxy:
    State:   Waiting
    Reason:  ImagePullBackOff
    Image:   gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
Events:
  Warning  Failed  kubelet
    Failed to pull image: unexpected status code 403 Forbidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;KServe 0.12.1's &lt;code&gt;kserve-controller-manager&lt;/code&gt; Deployment includes a &lt;code&gt;kube-rbac-proxy&lt;/code&gt; sidecar from &lt;code&gt;gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1&lt;/code&gt;. Google Container Registry restricted access to kubebuilder images in late 2025.&lt;/p&gt;

&lt;p&gt;The manager container itself was healthy (1 of 2 ready). But a pod with a failing sidecar never passes readiness, so the webhook Service had no healthy endpoints and every admission request was denied. The obvious substitute, &lt;code&gt;registry.k8s.io/kube-rbac-proxy:v0.13.1&lt;/code&gt;, did not exist at that location either.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix: Remove the Sidecar via Kustomize Strategic Merge Patch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/serving-stack/patches/&lt;/span&gt;
&lt;span class="c1"&gt;#   kserve-controller-kube-rbac-proxy-image.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve-controller-manager&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kserve&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-rbac-proxy&lt;/span&gt;
          &lt;span class="na"&gt;$patch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this patch and a re-sync:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get pods
NAME                            READY   STATUS
kserve-controller-manager-yyy   1/1     Running   &lt;span class="c"&gt;# fixed&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get endpoints kserve-webhook-server-service
NAME                            ENDPOINTS          AGE
kserve-webhook-server-service   10.42.0.23:9443    45s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Known tradeoff:&lt;/strong&gt; Removing &lt;code&gt;kube-rbac-proxy&lt;/code&gt; disables the Prometheus metrics proxy endpoint for the KServe controller. In production, source a verified replacement image from an accessible registry before deploying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business impact:&lt;/strong&gt; An external registry access change cascaded into a complete admission webhook outage. Any InferenceService creation or update was blocked cluster-wide while the sidecar was failing. This class of failure has no good solution without upstream monitoring of your image dependencies.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Failure 4: Inference Request Returns HTTP 405 — IngressDomain Placeholder Resolves to Public Internet
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time lost:&lt;/strong&gt; ~1 hour&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservice sklearn-iris &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.status.url}'&lt;/span&gt;
http://sklearn-iris.default.example.com

&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[5.1,3.5,1.4,0.2]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://sklearn-iris.default.example.com/v1/models/sklearn-iris:predict
&amp;lt;html&amp;gt;&amp;lt;&lt;span class="nb"&gt;head&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;lt;title&amp;gt;405 Not Allowed&amp;lt;/title&amp;gt;&amp;lt;/head&amp;gt;...

&lt;span class="c"&gt;# The request hit the public example.com server, not our Kourier gateway.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Root Cause
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;ingressDomain&lt;/code&gt; in the KServe ConfigMap was set to &lt;code&gt;example.com&lt;/code&gt; — a literal placeholder. The generated URL resolves publicly to Cloudflare/IANA servers, not the local cluster.&lt;/p&gt;

&lt;p&gt;Additionally, Kourier routes by Host header, not by IP. Just port-forwarding Kourier and hitting &lt;code&gt;127.0.0.1&lt;/code&gt; does not work without the correct Host header.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix: Direct Predictor Pod Port-Forward
&lt;/h3&gt;

&lt;p&gt;Bypass Knative routing and Kourier entirely for local verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Step 1: Get the predictor pod name&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get pods &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-l&lt;/span&gt; serving.knative.dev/revision&lt;span class="o"&gt;=&lt;/span&gt;sklearn-iris-predictor-00001

&lt;span class="c"&gt;# Step 2: Port-forward directly to the predictor container&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default port-forward &lt;span class="se"&gt;\&lt;/span&gt;
  pod/sklearn-iris-predictor-00001-deployment-&amp;lt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 18080:8080

&lt;span class="c"&gt;# Step 3: Predict (no Host header, no Kourier, no DNS needed)&lt;/span&gt;
curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[5.1,3.5,1.4,0.2],[6.2,3.4,5.4,2.3]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://127.0.0.1:18080/v1/models/sklearn-iris:predict

&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"predictions"&lt;/span&gt;:[0,2]&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the full Kourier routing path, always pass the Host header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kourier-system port-forward svc/kourier 18080:80

curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Host: sklearn-iris-predictor.default.127.0.0.1.sslip.io'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[5.1,3.5,1.4,0.2]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://127.0.0.1:18080/v1/models/sklearn-iris:predict
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Business impact:&lt;/strong&gt; False-negative inference verification. A healthy endpoint looked broken because the test URL resolved to the wrong server. Always verify the complete network path — DNS resolution, ingress routing, pod health — as separate steps rather than assuming a single curl test is conclusive.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What This Proves After the Failures
&lt;/h2&gt;

&lt;p&gt;After working through the above failures, the inference baseline worked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get inferenceservice sklearn-iris
NAME           URL                                       READY   AGE
sklearn-iris   http://sklearn-iris.default.example.com   True    45m

&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"instances":[[5.1,3.5,1.4,0.2],[6.2,3.4,5.4,2.3]]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://127.0.0.1:18080/v1/models/sklearn-iris:predict

&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"predictions"&lt;/span&gt;:[0,2]&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Istio/Kourier mismatch is the canonical example of why "default configuration" is dangerous in complex systems. KServe's default assumes a specific network topology that is not disclosed in the getting-started docs. Recognizing this class of failure — configuration that works in the tool author's environment but not yours — is a senior platform engineering competency.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Setup Does NOT Solve (Known Tradeoffs)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Istio service mesh:&lt;/strong&gt; No mTLS between services, no advanced traffic management. Acceptable for local dev; requires a replacement security layer in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kube-rbac-proxy removed:&lt;/strong&gt; Prometheus metrics from the KServe controller are unavailable. Re-add this sidecar from a working registry before any production deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port-forward for inference:&lt;/strong&gt; The Host-header workaround is local only. Cloud deployment requires a real ingress with DNS and TLS. On EKS, swap Kourier for an ALB and set &lt;code&gt;ingressDomain&lt;/code&gt; to your real domain. See the &lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/CLOUD_PROMOTION_GUIDE.md" rel="noopener noreferrer"&gt;Cloud Promotion Guide&lt;/a&gt; in the repository.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Debugging Commands Reference
&lt;/h2&gt;

&lt;p&gt;Run these in order when an InferenceService will not become Ready.&lt;/p&gt;

&lt;h3&gt;
  
  
  1 — InferenceService Conditions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default describe inferenceservice sklearn-iris
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve logs deploy/kserve-controller-manager &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve logs deploy/kserve-controller-manager &lt;span class="nt"&gt;-c&lt;/span&gt; manager &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2 — Webhook Endpoint Availability
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get endpoints kserve-webhook-server-service
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve describe endpoints kserve-webhook-server-service
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get ksvc
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get route
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3 — ConfigMap and Pod Status
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get configmap inferenceservice-config &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve get pods &lt;span class="nt"&gt;-o&lt;/span&gt; wide
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kserve describe pod &amp;lt;pod-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The One Thing to Remember
&lt;/h2&gt;

&lt;p&gt;KServe's default configuration assumes Istio is installed. This assumption is not prominently stated in the getting-started documentation. Every engineer running KServe on k3d, k3s, GKE Autopilot, or any non-Istio cluster will hit ReconcileError and see error messages referencing "virtual services" — an Istio concept — with no obvious resolution path.&lt;/p&gt;

&lt;p&gt;The fix is one ConfigMap patch. It takes 30 seconds to apply. Finding it took three hours.&lt;/p&gt;
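&lt;p&gt;A preflight check makes the hidden assumption visible before anything breaks. This hypothetical sketch parses the &lt;code&gt;ingress&lt;/code&gt; blob that &lt;code&gt;kubectl -n kserve get configmap inferenceservice-config -o jsonpath='{.data.ingress}'&lt;/code&gt; would return and reports whether KServe will try to create Istio VirtualServices:&lt;/p&gt;

```python
import json

# Inline stand-in for the ConfigMap's `ingress` data blob; on a real
# cluster this string comes from the kubectl jsonpath query above.
ingress_blob = """
{
  "ingressGateway": "knative-serving/knative-ingress-gateway",
  "ingressDomain": "example.com",
  "ingressClassName": "istio",
  "disableIstioVirtualHost": true
}
"""

cfg = json.loads(ingress_blob)

def needs_istio(ingress_cfg):
    """True when KServe will attempt to create Istio VirtualServices."""
    return not ingress_cfg.get("disableIstioVirtualHost", False)

print(needs_istio(cfg))  # False
```

&lt;p&gt;On a Kourier-only cluster, a &lt;code&gt;True&lt;/code&gt; here means you are about to hit the ReconcileError loop from Failure 1; run the check before creating your first InferenceService, not after.&lt;/p&gt;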

&lt;p&gt;The &lt;code&gt;kube-rbac-proxy&lt;/code&gt; 403 from gcr.io is an external dependency failure that silently kills your admission webhook cluster-wide. The &lt;code&gt;$patch: delete&lt;/code&gt; Kustomize strategy is the fastest recovery path when no alternative registry image is available.&lt;/p&gt;

&lt;p&gt;Full platform source — all six Reality Check documents, Backstage Golden Path, Kyverno policy enforcement, cost attribution, and a Cloud Promotion Guide to EKS/GKE: &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;Check out the full NeuroScale repo here.&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  See Also
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/serving-stack/patches/inferenceservice-config-ingress.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/serving-stack/patches/inferenceservice-config-ingress.yaml&lt;/code&gt;&lt;/a&gt; — Kourier config patch&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/serving-stack/patches/kserve-controller-kube-rbac-proxy-image.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/serving-stack/patches/kserve-controller-kube-rbac-proxy-image.yaml&lt;/code&gt;&lt;/a&gt; — sidecar removal patch&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/infrastructure/kserve/sklearn-runtime.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;infrastructure/kserve/sklearn-runtime.yaml&lt;/code&gt;&lt;/a&gt; — ClusterServingRuntime definition&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/CLOUD_PROMOTION_GUIDE.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/CLOUD_PROMOTION_GUIDE.md&lt;/code&gt;&lt;/a&gt; — how to replace Kourier with ALB/NGINX on EKS/GKE&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_3_GOLDEN_PATH.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_3_GOLDEN_PATH.md&lt;/code&gt;&lt;/a&gt; — nine Backstage failures documented at the same depth&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md&lt;/code&gt;&lt;/a&gt; — how kyverno-cli exits 0 on violations and why &lt;code&gt;${PIPESTATUS[0]}&lt;/code&gt; matters&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Jimoh Sodiq Bolaji&lt;/strong&gt; | Platform Engineer | Technical Content Engineer | Abuja, Nigeria | &lt;a href="https://github.com/sodiq-code/neuroscale-platform" rel="noopener noreferrer"&gt;NeuroScale Platform&lt;/a&gt;&lt;/p&gt;




</description>
      <category>kubernetes</category>
      <category>kserve</category>
      <category>mlops</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
