<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: DevOps Start</title>
    <description>The latest articles on Forem by DevOps Start (@devopsstart).</description>
    <link>https://forem.com/devopsstart</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862044%2F9672d1b5-f8fd-4473-998f-30a47c07608f.png</url>
      <title>Forem: DevOps Start</title>
      <link>https://forem.com/devopsstart</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/devopsstart"/>
    <language>en</language>
    <item>
      <title>How to Build a Developer Control Plane with Backstage</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Tue, 05 May 2026 14:25:21 +0000</pubDate>
      <link>https://forem.com/devopsstart/how-to-build-a-developer-control-plane-with-backstage-1g96</link>
      <guid>https://forem.com/devopsstart/how-to-build-a-developer-control-plane-with-backstage-1g96</guid>
      <description>&lt;p&gt;&lt;em&gt;Looking to reduce cognitive load for your engineering teams? This tutorial, originally published on devopsstart.com, walks you through building a developer control plane using Backstage.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An Internal Developer Platform (IDP) is a centralized control plane that gives your development teams a paved road for building, deploying and managing software. Instead of managing dozens of different tools and CLIs, developers get a single, curated interface for everything from creating a new microservice to checking its CI/CD status or viewing its documentation. This tutorial shows you how to build a foundational IDP using Backstage.io, the open-source framework for building developer portals created by Spotify and now hosted by the CNCF.&lt;/p&gt;

&lt;p&gt;You will learn to set up a Backstage application, populate its software catalog, integrate GitHub Actions to view pipeline runs and create a software template that lets developers scaffold new services in minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an Internal Developer Platform?
&lt;/h2&gt;

&lt;p&gt;An Internal Developer Platform (IDP) is a layer built on top of your existing DevOps toolchain that exposes your infrastructure and tooling through a simplified, self-service interface. It codifies best practices and organizational standards into "golden paths", enabling developers to create and manage applications without needing deep expertise in Kubernetes, Terraform or complex CI/CD configurations.&lt;/p&gt;

&lt;p&gt;Backstage is the leading open-source project for building IDPs. It provides a pluggable frontend and backend that act as a central hub. It's not a replacement for tools like Jenkins, Argo CD or Grafana. Instead, it integrates with them, presenting their information and actions within a unified system. This approach turns a complex, distributed toolchain into a cohesive and discoverable platform. An IDP reduces cognitive load on developers by abstracting away the underlying complexity of cloud-native infrastructure, letting them focus on writing code instead of fighting with tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;To follow this tutorial, you need a few tools installed on your local machine.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node.js:&lt;/strong&gt; Backstage is a TypeScript/JavaScript application. You need Node.js &lt;code&gt;v18.x&lt;/code&gt; or &lt;code&gt;v20.x&lt;/code&gt;. This guide uses &lt;code&gt;v20.11.1&lt;/code&gt;. You can use a tool like &lt;code&gt;nvm&lt;/code&gt; to manage Node versions.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Yarn:&lt;/strong&gt; Backstage uses Yarn &lt;code&gt;v1&lt;/code&gt; for package management. After installing Node.js, you can install it globally:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; yarn
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker:&lt;/strong&gt; Docker is used during local development to run supporting services, such as a PostgreSQL database for the Backstage backend. Ensure Docker Desktop or an equivalent is installed and running.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;npx&lt;/code&gt;:&lt;/strong&gt; This command-line tool is included with &lt;code&gt;npm&lt;/code&gt; (which comes with Node.js) and is used to run the Backstage app creation script without a global installation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A GitHub Account and Personal Access Token (PAT):&lt;/strong&gt; Backstage integrates with GitHub to discover components for the software catalog and display CI/CD information. You need a GitHub account and a PAT with the &lt;code&gt;repo&lt;/code&gt; scope so Backstage can read repository information and workflow runs. You can create a token in your GitHub settings under &lt;code&gt;Developer settings &amp;gt; Personal access tokens &amp;gt; Tokens (classic)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Scaffold a New Backstage App
&lt;/h2&gt;

&lt;p&gt;The fastest way to get started is with the Backstage CLI's &lt;code&gt;create-app&lt;/code&gt; command. This script scaffolds a complete monorepo with a frontend, a backend and all the necessary configuration to run locally.&lt;/p&gt;

&lt;p&gt;First, run the interactive creator using &lt;code&gt;npx&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @backstage/create-app@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script will prompt you for an application name. Let's call it &lt;code&gt;dev-control-plane&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;? Enter a name for the app [required] dev-control-plane
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This process takes 5-10 minutes depending on your network speed, as it clones the template, installs all npm dependencies and sets up the basic structure.&lt;/p&gt;

&lt;p&gt;Once it's finished, navigate into the new directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;dev-control-plane
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The directory structure looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── app-config.yaml         # Main configuration file for your app
├── catalog-info.yaml       # Registers this app in its own catalog
├── lerna.json
├── package.json            # Root package.json for the monorepo
├── packages/
│   ├── app/                # The frontend application (React)
│   └── backend/            # The backend application (Node.js/Express)
└── yarn.lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, start the application. The backend and frontend run as separate processes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;yarn dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command starts the backend on port &lt;code&gt;7007&lt;/code&gt; and the frontend on port &lt;code&gt;3000&lt;/code&gt;. After a minute or two of compilation, your web browser should automatically open to &lt;code&gt;http://localhost:3000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You now have a running Backstage application, populated only with sample data. The initial view shows an example catalog with a few components. The next step is to clear these examples and populate the catalog with your own services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Configure the Software Catalog
&lt;/h2&gt;

&lt;p&gt;The Software Catalog is the heart of Backstage. It's a centralized system for tracking ownership and metadata for all your software, including microservices, libraries, websites and machine learning models. Backstage discovers these components by ingesting &lt;code&gt;catalog-info.yaml&lt;/code&gt; files from your Git repositories.&lt;/p&gt;

&lt;p&gt;For this example, you will need a sample GitHub repository containing a &lt;code&gt;catalog-info.yaml&lt;/code&gt; file. You can create a new public repository named &lt;code&gt;sample-service&lt;/code&gt; or use one of your existing projects. Throughout this guide, replace &lt;code&gt;your-org&lt;/code&gt; with your actual GitHub username or organization name.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;catalog-info.yaml&lt;/code&gt; file in the root of that repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In your-org/sample-service/catalog-info.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Component&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sample-service&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A sample service for the Backstage catalog.&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;github.com/project-slug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your-org/sample-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service&lt;/span&gt;
  &lt;span class="na"&gt;lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;experimental&lt;/span&gt;
  &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user:guest&lt;/span&gt;
  &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;examples&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file contains several key fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;apiVersion&lt;/code&gt; and &lt;code&gt;kind&lt;/code&gt;:&lt;/strong&gt; Define the entity type. &lt;code&gt;Component&lt;/code&gt; is the most common kind, representing a piece of software.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;metadata.name&lt;/code&gt;:&lt;/strong&gt; A unique identifier for the component within Backstage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;metadata.annotations&lt;/code&gt;:&lt;/strong&gt; Provides external identifiers. The &lt;code&gt;github.com/project-slug&lt;/code&gt; annotation is crucial for plugins like GitHub Actions to find the correct repository.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;spec.type&lt;/code&gt;:&lt;/strong&gt; The type of component, for example, &lt;code&gt;service&lt;/code&gt;, &lt;code&gt;website&lt;/code&gt;, or &lt;code&gt;library&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;spec.lifecycle&lt;/code&gt;:&lt;/strong&gt; The current maturity stage, such as &lt;code&gt;experimental&lt;/code&gt;, &lt;code&gt;production&lt;/code&gt;, or &lt;code&gt;deprecated&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;spec.owner&lt;/code&gt;:&lt;/strong&gt; Specifies who owns this component. This is often a team or user group. For now, we'll use the default &lt;code&gt;guest&lt;/code&gt; user.&lt;/li&gt;
&lt;/ul&gt;
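
&lt;p&gt;When you outgrow the &lt;code&gt;guest&lt;/code&gt; owner, you can register a team as a &lt;code&gt;Group&lt;/code&gt; entity and reference it from &lt;code&gt;spec.owner&lt;/code&gt;. A minimal sketch (the group name &lt;code&gt;platform-team&lt;/code&gt; is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Example Group entity; register it like any other catalog file
apiVersion: backstage.io/v1alpha1
kind: Group
metadata:
  name: platform-team
spec:
  type: team
  children: []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this entity in the catalog, a component can declare &lt;code&gt;owner: group:platform-team&lt;/code&gt; and ownership is displayed consistently across Backstage.&lt;/p&gt;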

&lt;p&gt;Now, tell your Backstage application to find this file. Open &lt;code&gt;app-config.yaml&lt;/code&gt; in the root of your &lt;code&gt;dev-control-plane&lt;/code&gt; project and find the &lt;code&gt;catalog.locations&lt;/code&gt; section. Replace the example rules with a single entry pointing to your repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# in app-config.yaml&lt;/span&gt;

&lt;span class="na"&gt;catalog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;import&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;entityFilename&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;catalog-info.yaml&lt;/span&gt;
    &lt;span class="na"&gt;pullRequestBranchName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage-integration&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Component&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;API&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Group&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;User&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;System&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Domain&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Template&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Location&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;locations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Remove the example locations and add this one:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/your-org/sample-service/blob/main/catalog-info.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart your &lt;code&gt;yarn dev&lt;/code&gt; process for the changes to take effect. When Backstage starts up, it will fetch this YAML file, process it and add the &lt;code&gt;sample-service&lt;/code&gt; component to the catalog. You can now see it on the main page. This declarative, "as-code" approach to catalog management is powerful because the catalog stays in sync with your source code, and ownership information is version-controlled right alongside the service itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Integrate a CI/CD Plugin (GitHub Actions)
&lt;/h2&gt;

&lt;p&gt;Seeing a list of services is useful, but the real power of an IDP comes from integrating operational data. Let's add the GitHub Actions plugin to display CI/CD status directly on the component page in Backstage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add GitHub Integration Configuration
&lt;/h3&gt;

&lt;p&gt;First, configure Backstage to authenticate with the GitHub API. This requires the Personal Access Token (PAT) you created earlier.&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;app-config.yaml&lt;/code&gt; and add the following &lt;code&gt;integrations&lt;/code&gt; section. This lets Backstage read repository contents, such as your &lt;code&gt;catalog-info.yaml&lt;/code&gt; files, through the GitHub API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# in app-config.yaml&lt;/span&gt;

&lt;span class="na"&gt;integrations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.com&lt;/span&gt;
      &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GITHUB_TOKEN}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're using an environment variable &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; to avoid committing secrets to version control. When you run &lt;code&gt;yarn dev&lt;/code&gt;, you'll need to export this variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_classic_github_pat_here"&lt;/span&gt;
yarn dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Production Gotcha:&lt;/strong&gt; For a production deployment, you would use a secret management system like HashiCorp Vault or AWS Secrets Manager to inject this token, not an environment variable on your local machine. Proper secret handling is critical. Tools like &lt;a href="https://github.com/features/security" rel="noopener noreferrer"&gt;GitHub secret scanning&lt;/a&gt; can help you detect accidentally committed secrets. For a deep dive, check out our guide on how to &lt;a href="https://dev.to/blog/github-actions-security-how-to-stop-secret-leaks-in-cicd"&gt;stop secret leaks in CI/CD&lt;/a&gt;.&lt;/p&gt;
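
&lt;p&gt;If your deployment mounts secrets as files (as Kubernetes secrets typically are), Backstage's configuration loader can also read a value from a file instead of an environment variable. A minimal sketch, assuming the token is mounted at &lt;code&gt;/run/secrets/github-token&lt;/code&gt; (an example path):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# in app-config.production.yaml
integrations:
  github:
    - host: github.com
      token:
        $file: /run/secrets/github-token  # read at startup from the mounted secret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;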

&lt;h3&gt;
  
  
  Install and Configure the Plugin
&lt;/h3&gt;

&lt;p&gt;Next, install the GitHub Actions plugin package in your frontend app.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;packages/app
yarn add @backstage/plugin-github-actions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, you need to add the plugin's UI component to the entity page, which displays detailed information about a single component. Open the file &lt;code&gt;packages/app/src/components/catalog/EntityPage.tsx&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Import the plugin components, then modify the &lt;code&gt;cicdContent&lt;/code&gt; constant to conditionally render the GitHub Actions view.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// in packages/app/src/components/catalog/EntityPage.tsx&lt;/span&gt;

&lt;span class="c1"&gt;// ... other imports&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;EntityGithubActionsContent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;isGithubActionsAvailable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@backstage/plugin-github-actions&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Card&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;CardContent&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@material-ui/core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Ensure Grid is imported&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;EntitySwitch&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@backstage/plugin-catalog&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Ensure EntitySwitch is imported&lt;/span&gt;

&lt;span class="c1"&gt;// ...&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cicdContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Grid&lt;/span&gt; &lt;span class="na"&gt;container&lt;/span&gt; &lt;span class="na"&gt;spacing&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="na"&gt;alignItems&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"stretch"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;EntitySwitch&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;EntitySwitch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Case&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;isGithubActionsAvailable&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Grid&lt;/span&gt; &lt;span class="na"&gt;item&lt;/span&gt; &lt;span class="na"&gt;sm&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;EntityGithubActionsContent&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;Grid&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;EntitySwitch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Case&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;EntitySwitch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Grid&lt;/span&gt; &lt;span class="na"&gt;item&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Card&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;CardContent&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
              No CI/CD provider available for this entity.
            &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;CardContent&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;Card&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;Grid&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;EntitySwitch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;EntitySwitch&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;Grid&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code uses &lt;code&gt;EntitySwitch&lt;/code&gt; to conditionally render the GitHub Actions content only if the component has the necessary &lt;code&gt;github.com/project-slug&lt;/code&gt; annotation.&lt;/p&gt;

&lt;p&gt;After saving the file, the dev server should automatically reload. Navigate to your &lt;code&gt;sample-service&lt;/code&gt; component in the catalog. You should now see a "CI/CD" tab, and inside it, a view of the recent GitHub Actions workflow runs for that repository. A developer can now see if their last commit passed its tests without leaving Backstage.&lt;/p&gt;
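
&lt;p&gt;Note that the CI/CD tab only has data to show if the repository actually runs workflows. If your &lt;code&gt;sample-service&lt;/code&gt; repository has none yet, a minimal workflow like the following sketch (replace the placeholder step with your real build) is enough to produce runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# .github/workflows/ci.yaml in your-org/sample-service
name: CI
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "Replace this with your real build and test steps"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;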

&lt;h2&gt;
  
  
  Step 4: Create a Software Template with the Scaffolder
&lt;/h2&gt;

&lt;p&gt;One of the most powerful features of Backstage is the Software Scaffolder. It allows you to create templates for new projects, enforcing best practices and setting up everything a developer needs automatically.&lt;/p&gt;

&lt;p&gt;Let's create a template that scaffolds a new Node.js "hello world" service, complete with a &lt;code&gt;Dockerfile&lt;/code&gt;, a &lt;code&gt;catalog-info.yaml&lt;/code&gt; file and registration in a new GitHub repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the Template Definition
&lt;/h3&gt;

&lt;p&gt;First, create a new directory for your templates at the root of your &lt;code&gt;dev-control-plane&lt;/code&gt; project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; templates/nodejs-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside this directory, create a &lt;code&gt;template.yaml&lt;/code&gt; file. This file defines the template's metadata and the input parameters it requires from the user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# in templates/nodejs-service/template.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scaffolder.backstage.io/v1beta3&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Template&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nodejs-service-template&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Node.js Service&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Creates a simple Node.js service with Docker.&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user:guest&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service&lt;/span&gt;

  &lt;span class="c1"&gt;# These parameters are used to gather user-provided information.&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Component Details&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;component_id&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;owner&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;component_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Unique name of the component&lt;/span&gt;
          &lt;span class="na"&gt;ui:field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EntityNamePicker&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Description&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A description for this component&lt;/span&gt;
        &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Owner&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Owner of the component&lt;/span&gt;
          &lt;span class="na"&gt;ui:field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OwnerPicker&lt;/span&gt;
          &lt;span class="na"&gt;ui:options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;allowedKinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Group&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Repository Location&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;repoUrl&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;repoUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Repository Location&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;ui:field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RepoUrlPicker&lt;/span&gt;
          &lt;span class="na"&gt;ui:options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;allowedHosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;github.com&lt;/span&gt;

  &lt;span class="c1"&gt;# These steps are executed in order.&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetch-base&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Fetch Base&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetch:template&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./content&lt;/span&gt;
        &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;component_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.component_id }}&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.description }}&lt;/span&gt;
          &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.owner }}&lt;/span&gt;
          &lt;span class="na"&gt;repoUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.repoUrl }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;publish&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Publish&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;publish:github&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;allowedHosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;github.com'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;This is ${{ parameters.description }}&lt;/span&gt;
        &lt;span class="na"&gt;repoUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.repoUrl }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;register&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Register&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;catalog:register&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;repoContentsUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.publish.output.repoContentsUrl }}&lt;/span&gt;
        &lt;span class="na"&gt;catalogInfoPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/catalog-info.yaml'&lt;/span&gt;

  &lt;span class="c1"&gt;# The output of a successful template run.&lt;/span&gt;
  &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;links&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Repository&lt;/span&gt;
        &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.publish.output.remoteUrl }}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Open in catalog&lt;/span&gt;
        &lt;span class="na"&gt;icon&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;catalog&lt;/span&gt;
        &lt;span class="na"&gt;entityRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.register.output.entityRef }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create the Template Content
&lt;/h3&gt;

&lt;p&gt;Next, create a &lt;code&gt;content&lt;/code&gt; subdirectory within &lt;code&gt;templates/nodejs-service&lt;/code&gt;. This will hold the skeleton files for our new service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;templates/nodejs-service/content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside &lt;code&gt;templates/nodejs-service/content&lt;/code&gt;, create the following files. The Scaffolder renders these with the Nunjucks templating engine, replacing &lt;code&gt;${{ ... }}&lt;/code&gt; expressions with user-provided values (the &lt;code&gt;.hbs&lt;/code&gt; extension here is only a naming convention; Backstage no longer uses Handlebars).&lt;/p&gt;
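&lt;p&gt;Conceptually, the substitution works like a parameterized string render. The sketch below is illustrative only; it is not Backstage's real templating engine (which is Nunjucks with custom filters), and the &lt;code&gt;renderTemplate&lt;/code&gt; helper name is hypothetical:&lt;/p&gt;

```javascript
// Illustrative sketch: how `${{ values.key }}` placeholders get filled in.
// Backstage really uses Nunjucks; this toy renderer only handles simple lookups.
function renderTemplate(template, values) {
  return template.replace(/\$\{\{\s*values\.(\w+)\s*\}\}/g, (match, key) =>
    key in values ? String(values[key]) : match
  );
}

// Example: rendering the greeting from index.js.hbs
const rendered = renderTemplate(
  'Hello from ${{ values.component_id }}!',
  { component_id: 'my-service' }
);
console.log(rendered); // Hello from my-service!
```

&lt;p&gt;Placeholders with no matching value are left untouched, which is roughly what you see when a template parameter is misspelled.&lt;/p&gt;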

&lt;p&gt;&lt;strong&gt;&lt;code&gt;catalog-info.yaml.hbs&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# templates/nodejs-service/content/catalog-info.yaml.hbs&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Component&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ values.component_id | dump }}&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ values.description | dump }}&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;github.com/project-slug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ values.repoUrl | parseRepoUrl | pick('owner') }}/${{ values.repoUrl | parseRepoUrl | pick('repo') }}&lt;/span&gt;
    &lt;span class="na"&gt;backstage.io/techdocs-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dir:.&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service&lt;/span&gt;
  &lt;span class="na"&gt;lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;experimental&lt;/span&gt;
  &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ values.owner | dump }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;index.js.hbs&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// templates/nodejs-service/content/index.js.hbs&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello from ${{ values.component_id }}!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Example app listening at http://localhost:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;package.json.hbs&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;templates/nodejs-service/content/package.json.hbs&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${{ values.component_id }}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"main"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"index.js"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"express"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^4.18.2"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;Dockerfile.hbs&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
For an in-depth guide on creating efficient Dockerfiles, see our article on &lt;a href="https://dev.to/blog/docker-multi-stage-builds-smaller-secure-production-images"&gt;Docker multi-stage builds&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# templates/nodejs-service/content/Dockerfile.hbs&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:20-slim&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /usr/src/app&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; package*.json ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--omit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dev

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8080&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; [ "node", "index.js" ]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Register the Template
&lt;/h3&gt;

&lt;p&gt;Finally, add the template to your &lt;code&gt;app-config.yaml&lt;/code&gt; so Backstage can find it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# in app-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;catalog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;locations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ... your other locations&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;../../templates/nodejs-service/template.yaml&lt;/span&gt;
      &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;allow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart your &lt;code&gt;yarn dev&lt;/code&gt; server. Go to &lt;code&gt;http://localhost:3000/create&lt;/code&gt;. You should now see your "Node.js Service" template. Clicking "Choose" will take you to a form where you can enter the component name, owner and desired GitHub repository location.&lt;/p&gt;

&lt;p&gt;When you click "Create", the Scaffolder will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Render the template files with your inputs.&lt;/li&gt;
&lt;li&gt;Create a new repository in your GitHub account.&lt;/li&gt;
&lt;li&gt;Push the rendered files to the new repository.&lt;/li&gt;
&lt;li&gt;Register the new component in the Backstage catalog.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You've now automated the creation of new services, ensuring they all start from a standardized, compliant baseline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Integrate Documentation with TechDocs
&lt;/h2&gt;

&lt;p&gt;The final piece of our control plane is centralized documentation. Backstage's TechDocs feature renders Markdown documentation stored alongside your code directly within the Backstage UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure TechDocs
&lt;/h3&gt;

&lt;p&gt;TechDocs requires a backend plugin and a location to store the generated documentation site. For local development, it can use a local generator and storage directory. This configuration is usually present by default in new Backstage applications.&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;app-config.yaml&lt;/code&gt; and ensure the &lt;code&gt;techdocs&lt;/code&gt; section is configured for local development.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# in app-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;techdocs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;local'&lt;/span&gt; &lt;span class="c1"&gt;# Can be 'local' or 'external'&lt;/span&gt;
  &lt;span class="na"&gt;generator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runIn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;docker'&lt;/span&gt; &lt;span class="c1"&gt;# 'docker' or 'local'&lt;/span&gt;
  &lt;span class="na"&gt;publisher&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;local'&lt;/span&gt; &lt;span class="c1"&gt;# 'local' or 'googleGcs' or 'awsS3' or 'azureBlobStorage'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add Documentation to a Service
&lt;/h3&gt;

&lt;p&gt;Let's add documentation to the &lt;code&gt;sample-service&lt;/code&gt; we created earlier. In that service's repository, do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add the TechDocs annotation&lt;/strong&gt; to &lt;code&gt;catalog-info.yaml&lt;/code&gt;. This tells Backstage where to find the documentation source. The &lt;code&gt;dir:.&lt;/code&gt; value means "look in the current directory".&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# in your-org/sample-service/catalog-info.yaml&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ... other metadata&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ... other annotations&lt;/span&gt;
    &lt;span class="na"&gt;backstage.io/techdocs-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dir:.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
yaml

2. **Create an `mkdocs.yml` file** in the root of the repository. This is the configuration file for MkDocs, the static site generator TechDocs uses.



    ```yaml
    # in your-org/sample-service/mkdocs.yml
    site_name: 'Sample Service Documentation'
    nav:
      - Home: index.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a &lt;code&gt;/docs&lt;/code&gt; directory&lt;/strong&gt; and add an &lt;code&gt;index.md&lt;/code&gt; file inside it.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;docs
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"# Sample Service&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;This is the main documentation page."&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; docs/index.md
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
yaml

Commit and push these changes to your repository.

Navigate to your `sample-service` component in Backstage and click the "Docs" tab. The first time, you may need to wait a few minutes for Backstage to generate the documentation site. Once it's ready, you'll see your rendered Markdown file. You now have a single place where any developer can find up-to-date, version-controlled documentation for any service.

## Troubleshooting Common Issues

When setting up Backstage for the first time, you might run into a few common problems.

### CORS Errors

**Symptom:** The Backstage frontend fails to load data from the backend, and you see Cross-Origin Resource Sharing (CORS) errors in your browser's developer console.
**Fix:** Ensure your `app-config.yaml` has the correct `backend.cors.origin` setting for local development:



```yaml
# in app-config.yaml
backend:
  # ...
  cors:
    origin: http://localhost:3000
    methods: [GET, POST, PUT, DELETE, PATCH, OPTIONS]
    credentials: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GitHub Auth Fails
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; The GitHub Actions plugin shows an error, or the Scaffolder fails at the "publish" step with an authentication error.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verify the Token:&lt;/strong&gt; Double-check that you've exported the &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; environment variable in the same terminal session where you run &lt;code&gt;yarn dev&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check Scopes:&lt;/strong&gt; Ensure your GitHub Personal Access Token (classic) has the &lt;code&gt;repo&lt;/code&gt; scope. For creating new repositories via the Scaffolder, it may also need the &lt;code&gt;workflow&lt;/code&gt; scope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check Organization Settings:&lt;/strong&gt; If publishing to a GitHub organization, it may have settings that restrict PAT access or require third-party application approval.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Catalog Import Fails
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; A component you added to &lt;code&gt;catalog.locations&lt;/code&gt; doesn't appear in the UI.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check the URL:&lt;/strong&gt; Make sure the &lt;code&gt;target&lt;/code&gt; URL in &lt;code&gt;app-config.yaml&lt;/code&gt; points directly to the raw &lt;code&gt;catalog-info.yaml&lt;/code&gt; file on your Git provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate YAML:&lt;/strong&gt; Use a YAML linter to check for syntax errors in your &lt;code&gt;catalog-info.yaml&lt;/code&gt;. Indentation errors are common.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check Backend Logs:&lt;/strong&gt; The Backstage backend logs (from the &lt;code&gt;yarn dev&lt;/code&gt; command) will often show detailed error messages about why a location failed to be ingested. Look for lines containing &lt;code&gt;Catalog-Processor&lt;/code&gt; or &lt;code&gt;error&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You have now built the foundation of a powerful Internal Developer Platform. You've created a central place for service discovery, integrated real-time operational data, automated new service creation and centralized documentation. This is the core of a "developer control plane" that can significantly improve your team's productivity and standardize your engineering practices. From here, you can explore hundreds of other plugins for tools like &lt;a href="https://dev.to/tutorials/how-to-set-up-argo-cd-gitops-for-kubernetes-automation"&gt;Argo CD&lt;/a&gt;, Kubernetes and Grafana to build out a truly comprehensive platform.&lt;/p&gt;

</description>
      <category>backstagetutorial</category>
      <category>internaldeveloperplatform</category>
      <category>backstagescaffolder</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Supply Chain Security Proxy: Move Beyond Vulnerability Scanning</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Tue, 28 Apr 2026 14:15:14 +0000</pubDate>
      <link>https://forem.com/devopsstart/supply-chain-security-proxy-move-beyond-vulnerability-scanning-2oid</link>
      <guid>https://forem.com/devopsstart/supply-chain-security-proxy-move-beyond-vulnerability-scanning-2oid</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally published on devopsstart.com. Learn why relying solely on CVE scanning is a reactive strategy and how to implement a security proxy to proactively secure your software supply chain.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Vulnerability scanning is a reactive failure state, not a security strategy.&lt;/p&gt;

&lt;p&gt;Most organizations treat Software Composition Analysis (SCA) as their primary defense against supply chain attacks. They plug in a scanner, wait for it to find a known CVE, and then assign a Jira ticket to a developer to update a library. This approach assumes that the vulnerability is already known and indexed in a database. It ignores the window of time between a malicious package upload and its discovery, and it does nothing to prevent zero-day supply chain attacks like dependency confusion or typosquatting.&lt;/p&gt;

&lt;p&gt;If you rely solely on scanners, you are documenting how you were breached rather than preventing the attack. To secure the perimeter, you must implement a supply chain security proxy that controls the ingress of every byte of third party code before it touches your build server.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Detection Gap
&lt;/h2&gt;

&lt;p&gt;Reliance on scanning creates a dangerous detection gap. When a malicious actor uploads a package to npm or PyPI that mimics a popular library (typosquatting), there is often a several hour or even several day lag before scanners flag that specific version. In a modern CI/CD pipeline, that package is pulled, built, and deployed to production in minutes. Your secrets are exfiltrated before the scanner alerts you.&lt;/p&gt;

&lt;p&gt;Consider dependency confusion. An attacker discovers the name of an internal corporate package, such as &lt;code&gt;corp-auth-lib&lt;/code&gt;. They upload a malicious package with the same name but a higher version number to the public npm registry. Without a security proxy, the build agent sees the higher version on the public registry and pulls it instead of the internal one. A scanner won't stop this because the package isn't vulnerable in the CVE sense; it is performing exactly as the attacker intended.&lt;/p&gt;
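&lt;p&gt;On the client side, you can close part of this gap even before a proxy exists by scoping internal packages and pinning registries in &lt;code&gt;.npmrc&lt;/code&gt;. npm can only enforce this per scope (e.g. &lt;code&gt;@corp/*&lt;/code&gt;), which is one reason unscoped internal names like &lt;code&gt;corp-auth-lib&lt;/code&gt; are dangerous. The scope and registry URLs below are illustrative placeholders, not real endpoints:&lt;/p&gt;

```ini
# .npmrc (illustrative; substitute your own scope and registry hostnames)
# All @corp/* packages resolve ONLY from the internal registry.
@corp:registry=https://registry.corp.example.com/
# Everything else goes through the security proxy, never straight to npmjs.org.
registry=https://proxy.corp.example.com/npm/
```

&lt;p&gt;With this in place, the public registry is never even consulted for scoped internal names, regardless of what version numbers an attacker publishes.&lt;/p&gt;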

&lt;p&gt;I have seen this play out in environments with over 500 microservices where the scan and fix treadmill became a full time job for three engineers. They spent 40 hours a week chasing low severity CVEs while the actual architectural hole (direct internet access for build agents) remained open. By shifting focus from detecting a fire to controlling who enters the building, you eliminate entire classes of attacks. A security proxy acts as a mandatory checkpoint. If a package isn't on the allow list or fails a provenance check, it never enters the environment. This is the difference between a smoke detector and a locked door.&lt;/p&gt;

&lt;p&gt;For those managing complex pipelines, this shift is similar to how you might &lt;a href="https://dev.to/blog/secure-terraform-prs-with-an-architecture-firewall"&gt;secure Terraform PRs with an architecture firewall&lt;/a&gt; to prevent configuration drift. Instead of checking if the infrastructure is broken after the apply, you validate the intent before execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Balancing Velocity and Governance
&lt;/h2&gt;

&lt;p&gt;The most common pushback from developers is that a security proxy kills velocity. The "Request a Package" workflow is often viewed as a bureaucratic nightmare. Developers argue that forcing every dependency through a manual approval process slows down feature delivery, especially during the inner loop of development where &lt;code&gt;npm install&lt;/code&gt; is critical for prototyping.&lt;/p&gt;

&lt;p&gt;This argument is partially correct. If you implement a security proxy as a manual ticket system where a security officer must click Approve on every version bump, you create a bottleneck that developers will eventually bypass. They will use personal hotspots or tunnel out of the build environment just to get work done. The friction of a poorly implemented proxy is a security risk because it encourages shadow IT.&lt;/p&gt;

&lt;p&gt;The solution is to automate the governance. A modern security proxy should be a policy engine, not a manual gate. For example, you can set a policy that allows any package that has been public for more than 30 days, has more than 1,000 downloads, and is signed by a trusted vendor. This allows 95% of requests to pass through automatically while flagging high-risk, brand-new packages for a quick human review. The goal is to move from "Allow All" to "Automated Governance".&lt;/p&gt;
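&lt;p&gt;The policy described above can be sketched as a simple predicate. This is illustrative JavaScript, not the API of any specific proxy product; the field names (&lt;code&gt;ageDays&lt;/code&gt;, &lt;code&gt;downloads&lt;/code&gt;, &lt;code&gt;signedByTrustedVendor&lt;/code&gt;) are assumptions about the metadata your registry exposes:&lt;/p&gt;

```javascript
// Auto-approval policy: public for more than 30 days, more than 1,000 downloads,
// and signed by a trusted vendor. Anything that fails any check goes to a human
// review queue instead of being blocked outright.
function evaluatePackage(pkg) {
  const checks = [
    pkg.ageDays > 30,
    pkg.downloads > 1000,
    pkg.signedByTrustedVendor === true,
  ];
  return checks.every(Boolean) ? 'ALLOW' : 'REVIEW';
}
```

&lt;p&gt;A request for a week-old package returns &lt;code&gt;REVIEW&lt;/code&gt; and waits for a quick human check, while the vast majority of installs pass straight through with no friction.&lt;/p&gt;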

&lt;h2&gt;
  
  
  When Scanning Still Wins
&lt;/h2&gt;

&lt;p&gt;There are specific contexts where a security proxy is overkill. For very small teams (under 10 engineers) or early stage startups building a Proof of Concept (PoC), the operational overhead of maintaining a private registry like Artifactory or Nexus v3.x can outweigh the risk. At this scale, the attack surface is small and the priority is finding product market fit, not building a SLSA Level 4 compliant supply chain.&lt;/p&gt;

&lt;p&gt;Scanning also remains superior for identifying vulnerabilities in code you have already mirrored. A proxy prevents the ingress of bad code, but it cannot predict when a previously safe library is suddenly found to have a critical flaw. When Log4Shell hit, the problem wasn't that the library was newly introduced; it was that an existing, trusted library turned out to be exploitable. In that case, a proxy provides no protection for existing deployments. You still need a robust SCA tool to scan your current Software Bill of Materials (SBOM) and identify where the vulnerable version is running.&lt;/p&gt;

&lt;p&gt;For teams using fully managed serverless build environments where they have zero control over the network layer, a proxy is technically impossible to implement. These teams must rely on shift left scanning and strict dependency pinning in their lockfiles.&lt;/p&gt;
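&lt;p&gt;Strict pinning means no version ranges at all. A &lt;code&gt;package.json&lt;/code&gt; fragment with the caret removed looks like this; combine it with &lt;code&gt;npm ci&lt;/code&gt; in the pipeline so the build fails whenever the manifest and lockfile disagree:&lt;/p&gt;

```json
{
  "dependencies": {
    "express": "4.18.2"
  }
}
```

&lt;p&gt;Without the &lt;code&gt;^&lt;/code&gt;, a newly published (and potentially malicious) &lt;code&gt;4.18.3&lt;/code&gt; cannot silently enter the build.&lt;/p&gt;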

&lt;h2&gt;
  
  
  Implementing the Proxy Architecture
&lt;/h2&gt;

&lt;p&gt;To move beyond scanning, you need a centralized gateway that acts as a policy enforcement point.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architectural Pattern
&lt;/h3&gt;

&lt;p&gt;A supply chain security proxy sits between your build agents (GitHub Actions, GitLab Runners, Jenkins) and the public registries (Docker Hub, npm, PyPI). Instead of the build agent calling &lt;code&gt;docker pull&lt;/code&gt;, it calls &lt;code&gt;docker pull proxy.corp.com/image&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The proxy performs the following checks in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identity&lt;/strong&gt;: Is the request coming from an authenticated build agent?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allow-list&lt;/strong&gt;: Is this package/version approved for use in this project?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrity&lt;/strong&gt;: Does the checksum match the known good version?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provenance&lt;/strong&gt;: Is there a signed attestation proving this was built in a trusted environment?&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Hardening the Image Pipeline
&lt;/h3&gt;

&lt;p&gt;For container images, the proxy should integrate with Sigstore/Cosign. You don't trust the tag &lt;code&gt;latest&lt;/code&gt; or even a specific version; you verify the signature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verifying an image signature using Cosign v2.2.4&lt;/span&gt;
cosign verify &lt;span class="nt"&gt;--key&lt;/span&gt; cosign.pub ghcr.io/my-org/my-app:v1.2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If verification fails, the proxy blocks the pull. To take this further, enforce &lt;a href="https://slsa.dev/" rel="noopener noreferrer"&gt;SLSA framework&lt;/a&gt; requirements. A SLSA attestation is a signed piece of metadata that tells you how the artifact was built. If the attestation shows the image was built on a developer's laptop rather than a hardened CI runner, the proxy rejects it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stopping Dependency Confusion
&lt;/h3&gt;

&lt;p&gt;To kill dependency confusion, configure your proxy to use Virtual Repositories with strict resolution orders. In a tool like JFrog Artifactory v7.x, you create a virtual repository that aggregates a local (private) repo and a remote (public) repo.&lt;/p&gt;

&lt;p&gt;Configure the resolution order so that the local repository is searched first. More importantly, implement Exclusion Patterns. If a package starts with &lt;code&gt;corp-&lt;/code&gt;, the proxy must be configured to never check the public registry for that pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Conceptual Proxy Policy for Dependency Resolution&lt;/span&gt;
&lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;corp-*"&lt;/span&gt;
    &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_EXTERNAL"&lt;/span&gt;
    &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Internal&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;packages&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;must&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;never&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;be&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;resolved&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;public&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;registries"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
    &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALLOW_EXTERNAL"&lt;/span&gt;
    &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;30d&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;downloads&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1000"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
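
&lt;p&gt;On the client side, the same intent can be pinned in package manager configuration. A minimal &lt;code&gt;.npmrc&lt;/code&gt; sketch, assuming a hypothetical proxy host and that internal packages live under an npm scope such as &lt;code&gt;@corp&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

```ini
# All unscoped packages resolve through the virtual (proxy) repository
registry=https://proxy.corp.com/api/npm/npm-virtual/

# The internal scope is pinned to the private repository only, so a
# lookalike package on the public registry can never shadow it
@corp:registry=https://proxy.corp.com/api/npm/npm-local/
```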



&lt;h2&gt;
  
  
  The Chain of Trust: Proxy to Admission Controller
&lt;/h2&gt;

&lt;p&gt;The proxy is only the first half of the battle. The second half is ensuring that the Proxy-Approved status follows the artifact to the cluster. This is where the proxy integrates with a Kubernetes Admission Controller like Kyverno or OPA Gatekeeper.&lt;/p&gt;

&lt;p&gt;The workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingress&lt;/strong&gt;: Proxy pulls &lt;code&gt;node:18-alpine&lt;/code&gt;, verifies the signature, and caches it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attestation&lt;/strong&gt;: The proxy (or a separate CI step) signs the image with a Security-Approved key.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;: A developer tries to deploy the image to GKE or EKS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforcement&lt;/strong&gt;: The Admission Controller intercepts the request and checks for the Security-Approved signature.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a developer tries to bypass the proxy by pointing their deployment to a public image on Docker Hub, the Admission Controller blocks the pod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example Kyverno Policy to enforce proxy-signed images&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check-proxy-signature&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;verify-image-signature&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
      &lt;span class="na"&gt;verifyImages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;imageReferences&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proxy.corp.com/*"&lt;/span&gt;
          &lt;span class="na"&gt;attestors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;entries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;publicKeys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ssh-rsa&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AAAAB3Nza...&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;[your-proxy-public-key]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a complete chain of trust. The proxy ensures only vetted code enters the building, and the Admission Controller ensures only vetted code runs. If you see pods failing in your cluster, use guides on &lt;a href="https://dev.to/troubleshooting/kubernetes-troubleshooting-why-did-my-pod-die"&gt;Kubernetes troubleshooting&lt;/a&gt; to determine if it was a signature mismatch or a network failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Quarantine Zone
&lt;/h2&gt;

&lt;p&gt;Moving to a proxy requires a shift in how developers interact with dependencies. The most successful implementations use a Quarantine Zone. When a developer requests a new library, the proxy pulls it into a restricted, isolated mirror. It is then automatically scanned for malware and analyzed for suspicious signals, such as a package created 2 hours ago that tries to access &lt;code&gt;/etc/shadow&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If the package passes the automated gauntlet, it is promoted to the Approved repository. This allows developers to get tools quickly while keeping the production build environment sterile.&lt;/p&gt;

&lt;p&gt;Implement Dependency Pinning as a hard requirement. Using ranges like &lt;code&gt;^1.2.0&lt;/code&gt; in your &lt;code&gt;package.json&lt;/code&gt; or &lt;code&gt;&amp;gt;=1.2.0&lt;/code&gt; in your &lt;code&gt;requirements.txt&lt;/code&gt; is an invitation for disaster. The proxy should be configured to alert on or block builds that do not use strict version pinning (for example, &lt;code&gt;1.2.3&lt;/code&gt;). This prevents stealthy updates where an upstream maintainer, or an attacker who has compromised them, pushes a malicious version that fits within your range, bypassing your initial vetting.&lt;/p&gt;
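
&lt;p&gt;For illustration, the difference in a &lt;code&gt;package.json&lt;/code&gt; (package name hypothetical): the exact pin below is what the proxy should require, whereas &lt;code&gt;"^1.2.3"&lt;/code&gt; would silently accept any future &lt;code&gt;1.x&lt;/code&gt; release.&lt;br&gt;
&lt;/p&gt;

```json
{
  "dependencies": {
    "corp-logging": "1.2.3"
  }
}
```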

&lt;p&gt;To maintain this at scale, integrate the request process into an &lt;a href="https://dev.to/blog/build-an-internal-developer-platform-with-backstage-and"&gt;Internal Developer Platform (IDP) built with Backstage&lt;/a&gt;, allowing developers to Request a Package via a UI form that triggers the automated quarantine pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operationalizing the Proxy
&lt;/h2&gt;

&lt;p&gt;Do not flip the switch for the entire company at once. You will break every build in the organization. Instead, follow this three-step rollout:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Transparent Mode&lt;/strong&gt;: Deploy the proxy and configure build agents to use it, but set all policies to Log Only. This provides a baseline of every dependency currently used across the org. You will likely find thousands of dependencies you didn't know existed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching Mode&lt;/strong&gt;: Enable mirroring and caching. Ensure that if the public registry goes down, your builds still work. This provides immediate value to developers through faster builds and makes them allies in the security mission.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforcement Mode&lt;/strong&gt;: Start blocking the most dangerous patterns first (for example, dependency confusion patterns) before moving to strict signature verification and SLSA attestations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The operational cost of maintaining this infrastructure is non-trivial. You need high availability for your registry, as it is now a single point of failure for all deployments. Use a distributed storage backend and ensure your proxy is scaled horizontally across multiple availability zones.&lt;/p&gt;

&lt;p&gt;Scanning is a useful tool for auditing, but it is a weak defense mechanism. By implementing a supply chain security proxy, you stop reacting to CVEs and start controlling your perimeter. You move the security boundary from the end of the pipe (the cluster) to the start of the pipe (the registry). When you combine a proxy with signature verification and a Kubernetes admission controller, you create a hardened pipeline where untrusted code simply cannot execute.&lt;/p&gt;

</description>
      <category>supplychainsecurity</category>
      <category>artifactprovenance</category>
      <category>slsaframework</category>
      <category>devsecopspipeline</category>
    </item>
    <item>
      <title>Secure Terraform PRs with an Architecture Firewall</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:15:37 +0000</pubDate>
      <link>https://forem.com/devopsstart/secure-terraform-prs-with-an-architecture-firewall-2e4f</link>
      <guid>https://forem.com/devopsstart/secure-terraform-prs-with-an-architecture-firewall-2e4f</guid>
      <description>&lt;p&gt;&lt;em&gt;Stop the 'merge and pray' workflow! This guide was originally published on devopsstart.com and explores how to implement an automated architecture firewall for your Terraform PRs using OPA.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;An architecture firewall is a governance layer integrated into your CI/CD pipeline that automatically blocks infrastructure changes violating security or organizational standards before they reach your environment. Unlike a network firewall that filters packets, this firewall filters Pull Requests (PRs). It transforms your infrastructure requirements from passive documentation in a wiki into active, executable code that cannot be ignored.&lt;/p&gt;

&lt;p&gt;In this article, you will learn how to move beyond the "merge and pray" workflow by implementing Policy as Code (PaC). We will explore the technical bridge between a &lt;code&gt;terraform plan&lt;/code&gt; and automated validation using tools like Open Policy Agent (OPA) and Checkov. You'll discover how to create a pipeline that converts Terraform plans to JSON, evaluates them against strict guardrails and provides immediate feedback to developers via PR comments. By the end, you will have a strategy to enforce encryption, restrict public access and prevent accidental resource deletion without slowing down your engineering velocity. This approach aligns with modern &lt;a href="https://dev.to/blog/terraform-testing-best-practices-beyond-plan-and-pray"&gt;Terraform testing best practices&lt;/a&gt;, ensuring that your cloud footprint remains secure by design rather than by chance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Manual PR Reviews Fail the Architecture Test
&lt;/h2&gt;

&lt;p&gt;Relying solely on human peer reviews to catch security holes is a recipe for a production outage. In high-velocity environments, reviewers suffer from fatigue. When a developer submits a PR with 500 lines of HCL, a reviewer might miss a single &lt;code&gt;0.0.0.0/0&lt;/code&gt; in a security group or a missing &lt;code&gt;encryption_enabled = true&lt;/code&gt; flag on an S3 bucket. Humans are great at reviewing logic and intent, but they are terrible at consistently auditing thousands of lines of configuration against a 50-page security compliance PDF.&lt;/p&gt;

&lt;p&gt;The "Merge and Pray" workflow creates a dangerous gap where "architectural drift" occurs. This happens when the actual state of your cloud deviates from your intended security posture because a few "small" exceptions were merged over time. To solve this, you need an automated gate that operates on the &lt;code&gt;terraform plan&lt;/code&gt; output. This plan is the only source of truth because it represents exactly what Terraform intends to do, accounting for variables, modules and the current state of the cloud.&lt;/p&gt;

&lt;p&gt;For example, if you are using Terraform v1.9.0+, you can generate a machine-readable plan that your architecture firewall can analyze. This removes the ambiguity of reviewing the &lt;code&gt;.tf&lt;/code&gt; files alone, which doesn't show the final resolved values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate the binary plan file&lt;/span&gt;
terraform plan &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tfplan

&lt;span class="c"&gt;# Convert the binary plan to JSON for policy evaluation&lt;/span&gt;
terraform show &lt;span class="nt"&gt;-json&lt;/span&gt; tfplan &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; tfplan.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By shifting the audit from the code to the plan, you ensure that the firewall sees the final result, not just the intent. This is the foundation of a robust governance strategy.&lt;/p&gt;
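
&lt;p&gt;For reference, a heavily trimmed sketch of the structure that policy engines query inside &lt;code&gt;tfplan.json&lt;/code&gt; (values illustrative): each entry in &lt;code&gt;resource_changes&lt;/code&gt; carries the planned &lt;code&gt;actions&lt;/code&gt; and the fully resolved &lt;code&gt;after&lt;/code&gt; state.&lt;br&gt;
&lt;/p&gt;

```json
{
  "resource_changes": [
    {
      "address": "aws_s3_bucket.logs",
      "type": "aws_s3_bucket",
      "change": {
        "actions": ["create"],
        "after": {
          "bucket": "corp-logs-prod"
        }
      }
    }
  ]
}
```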

&lt;h2&gt;
  
  
  Implementing the Policy as Code Engine
&lt;/h2&gt;

&lt;p&gt;To build an architecture firewall, you must choose a Policy as Code (PaC) engine. For simple, industry-standard checks, tools like Checkov or TFLint are excellent because they come with hundreds of pre-built policies. However, for complex organizational logic (such as "Production databases must be deployed in three availability zones and have a specific naming convention"), you need a general-purpose policy engine like Open Policy Agent (OPA). OPA uses a language called Rego to query JSON data.&lt;/p&gt;

&lt;p&gt;The technical flow is straightforward: your CI pipeline runs the plan, converts it to JSON and pipes that JSON into OPA. If the Rego policy returns a "deny" result, the CI pipeline fails and the PR is blocked from merging. This turns your security requirements into a unit test for your infrastructure.&lt;/p&gt;

&lt;p&gt;Below is a practical example of a Rego policy that prevents any AWS S3 bucket from being created without server-side encryption.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;

&lt;span class="ow"&gt;import&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;if&lt;/span&gt;

&lt;span class="c1"&gt;# Default allow unless a violation is found&lt;/span&gt;
&lt;span class="ow"&gt;default&lt;/span&gt; &lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="c1"&gt;# Violation: S3 bucket without encryption&lt;/span&gt;
&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_changes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3_bucket"&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if server_side_encryption_configuration is missing or empty&lt;/span&gt;
    &lt;span class="c1"&gt;# In Terraform JSON, 'after' contains the planned state&lt;/span&gt;
    &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server_side_encryption_configuration&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Security Violation: S3 bucket %s must have encryption enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run this against your plan in a GitHub Action or GitLab CI runner using OPA v0.60.0, you would execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run OPA evaluation and capture the deny rules&lt;/span&gt;
opa &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; tfplan.json &lt;span class="nt"&gt;-d&lt;/span&gt; policy.rego &lt;span class="s2"&gt;"data.terraform.analysis.deny"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the output contains any messages, the firewall has triggered and the build should fail. This gives you a consistent, automated guarantee that no unencrypted bucket reaches production through this pipeline.&lt;/p&gt;
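
&lt;p&gt;Wiring this into CI is a one-step job. A hedged GitHub Actions sketch (file paths and policy package assumed from the commands above): &lt;code&gt;--fail-defined&lt;/code&gt; makes &lt;code&gt;opa eval&lt;/code&gt; exit non-zero whenever the query returns a result, which fails the job and blocks the merge.&lt;br&gt;
&lt;/p&gt;

```yaml
- name: Architecture firewall
  run: |
    terraform plan -out=tfplan
    terraform show -json tfplan > tfplan.json
    # Exits 1 if any deny message is produced, blocking the PR
    opa eval --fail-defined -i tfplan.json -d policy.rego \
      "data.terraform.analysis.deny[x]"
```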

&lt;h2&gt;
  
  
  Real-World Application: Preventing Production Catastrophes
&lt;/h2&gt;

&lt;p&gt;A common production nightmare is the accidental deletion of a critical resource, such as a primary database or a core VPC, due to a renaming error or a module refactor. A manual reviewer might not realize that changing a resource name in Terraform results in a "destroy and recreate" action. An architecture firewall can catch this by analyzing the &lt;code&gt;actions&lt;/code&gt; array in the &lt;code&gt;terraform plan&lt;/code&gt; JSON.&lt;/p&gt;

&lt;p&gt;By writing a policy that flags any &lt;code&gt;delete&lt;/code&gt; action on resources tagged as &lt;code&gt;critical&lt;/code&gt;, you create a safety net. This doesn't mean you can never delete things; it means you must explicitly acknowledge the risk, perhaps through a "break-glass" label on the PR or a manual override from a Lead Architect.&lt;/p&gt;

&lt;p&gt;Consider this scenario: a developer changes the name of an RDS instance to match a new naming convention. Terraform sees this as deleting the old DB and creating a new one. Without a firewall, the PR looks like a simple string change. With a firewall, the system sees a &lt;code&gt;delete&lt;/code&gt; action on an &lt;code&gt;aws_db_instance&lt;/code&gt; and blocks it.&lt;/p&gt;

&lt;p&gt;Here is how you would implement a "Protection" rule in Rego to block deletions of production databases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;

&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_changes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"aws_db_instance"&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if the action includes 'delete'&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"delete"&lt;/span&gt;

    &lt;span class="c1"&gt;# Only apply this to production environments by checking input variables&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"prod"&lt;/span&gt;

    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"CRITICAL FAILURE: Attempting to delete production database %s. This action is blocked by the Architecture Firewall."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Integrating this into your workflow requires a tight loop. You can use &lt;a href="https://dev.to/tutorials/how-to-automate-terraform-reviews-with-github-actions"&gt;automation for Terraform reviews&lt;/a&gt; to post these specific error messages directly as comments on the offending line of the PR. This transforms the "No" from the security team into a helpful, automated suggestion from the platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices for Architecture Guardrails
&lt;/h2&gt;

&lt;p&gt;Implementing a firewall can create friction if handled poorly. If every PR is blocked by 50 different warnings, developers will find ways to bypass the system. Use these strategies to balance security with velocity.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distinguish Between Warnings and Failures&lt;/strong&gt;. Not every policy should block a merge. Use "Advisory" levels for things like "missing cost-center tag" (warning) and "Critical" levels for "open SSH port" (hard fail). This prevents the firewall from becoming a nuisance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version Your Policies&lt;/strong&gt;. Treat your Rego or Checkov policies like application code. Store them in a separate Git repository, version them and test them against a suite of "known-bad" Terraform plans to ensure no regressions in your security posture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Provide Remediation Guidance&lt;/strong&gt;. A failure message like &lt;code&gt;Policy violation: SEC-01&lt;/code&gt; is useless. Your firewall should return &lt;code&gt;Security Violation: Port 22 is open to 0.0.0.0/0. Please restrict this to the corporate VPN range (10.x.x.x)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implement an Exception Process&lt;/strong&gt;. There will always be a legitimate reason to break a rule. Create a standardized way to grant exceptions, such as requiring a specific metadata tag (&lt;code&gt;exception_id = "SEC-123"&lt;/code&gt;) that the policy engine is programmed to ignore.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shift Left with Local Pre-commit Hooks&lt;/strong&gt;. Don't make the CI pipeline the first time a developer sees a failure. Provide a &lt;code&gt;pre-commit&lt;/code&gt; configuration using tools like &lt;code&gt;terraform-docs&lt;/code&gt; and &lt;code&gt;checkov&lt;/code&gt; so they can catch errors on their local machine.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
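
&lt;p&gt;For the shift-left step, a minimal &lt;code&gt;.pre-commit-config.yaml&lt;/code&gt; sketch (the &lt;code&gt;rev&lt;/code&gt; values are illustrative; pin them to versions you have vetted):&lt;br&gt;
&lt;/p&gt;

```yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.96.1
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
  - repo: https://github.com/bridgecrewio/checkov
    rev: 3.2.0
    hooks:
      - id: checkov
```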

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does the architecture firewall replace the need for manual peer reviews?
&lt;/h3&gt;

&lt;p&gt;No, it augments them. The firewall handles the "objective" checks (security, compliance, syntax) so that human reviewers can focus on the "subjective" checks (architecture design, business logic and efficiency). It removes the tedious parts of the review process, allowing engineers to have higher-level discussions about the implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which tool should I choose: OPA, Checkov, or Sentinel?
&lt;/h3&gt;

&lt;p&gt;If you are using Terraform Cloud/Enterprise, Sentinel is the native choice and offers the deepest integration. If you need a free, industry-standard scanner that works out-of-the-box with minimal configuration, go with Checkov. If you have complex, custom business logic that spans multiple cloud providers and requires a powerful query language, Open Policy Agent (OPA) is the gold standard. I have seen mature platform teams use Checkov for general security and OPA for custom organizational guardrails.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I prevent the firewall from slowing down my deployment pipeline?
&lt;/h3&gt;

&lt;p&gt;Running &lt;code&gt;terraform plan&lt;/code&gt; and &lt;code&gt;opa eval&lt;/code&gt; typically adds less than 60 seconds to a pipeline. To further optimize, you can run these checks in parallel with other tests. Additionally, by implementing local pre-commit hooks, you reduce the number of failed CI runs, meaning the pipeline only handles "clean" code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use this to manage costs?
&lt;/h3&gt;

&lt;p&gt;Yes, this is a powerful use case. You can write policies that analyze the &lt;code&gt;resource_changes&lt;/code&gt; for expensive instance types. For example, you can block any PR that attempts to spin up an &lt;code&gt;aws_instance&lt;/code&gt; of type &lt;code&gt;p4d.24xlarge&lt;/code&gt; unless the project has a specific "high-compute" approval tag. This prevents "bill shock" by catching expensive mistakes before the resources are actually provisioned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building an architecture firewall is about shifting your mindset from "trusting the reviewer" to "trusting the system." By implementing Policy as Code, you ensure that your security standards are consistently applied across every single PR, regardless of who is reviewing it. This creates a scalable governance model that allows your platform team to support hundreds of developers without becoming a bottleneck.&lt;/p&gt;

&lt;p&gt;To get started, don't try to automate your entire security handbook at once. Start with the "low hanging fruit": block public S3 buckets and open SSH ports. Once your team is comfortable with the automated feedback loop, gradually introduce more complex architectural rules.&lt;/p&gt;

&lt;p&gt;Your next steps are to install OPA or Checkov, integrate a &lt;code&gt;terraform show -json&lt;/code&gt; step into your GitHub Actions or GitLab CI and write your first "deny" rule. This transition from manual oversight to automated guardrails is the defining characteristic of a mature Platform Engineering organization.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>policyascode</category>
      <category>openpolicyagent</category>
      <category>devsecops</category>
    </item>
    <item>
      <title>Local LLM for Log Analysis: Privacy-First Debugging with Ollama</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:05:27 +0000</pubDate>
      <link>https://forem.com/devopsstart/local-llm-for-log-analysis-privacy-first-debugging-with-ollama-361o</link>
      <guid>https://forem.com/devopsstart/local-llm-for-log-analysis-privacy-first-debugging-with-ollama-361o</guid>
      <description>&lt;p&gt;&lt;em&gt;Stop sending sensitive production logs to the cloud. This guide, originally published on devopsstart.com, shows you how to build a privacy-first debugging stack using Ollama and Llama 3.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Sending production logs to cloud AI APIs is a non-starter for any serious SRE in a regulated industry. The answer to maintaining security while gaining AI capabilities is to shift the inference engine to your own hardware. By deploying a local LLM stack using Ollama and Llama 3, you can perform semantic log analysis and root cause diagnosis without a single byte of data leaving your secure perimeter.&lt;/p&gt;

&lt;p&gt;Whether you are in fintech, healthcare, or govtech, the "Compliance Wall" is real. You cannot risk leaking PII, session tokens, or internal IP addresses to a third party, even with "Enterprise" privacy agreements. You can find the fundamental concepts of managing these workloads in the official &lt;a href="https://ollama.com/library" rel="noopener noreferrer"&gt;Ollama documentation&lt;/a&gt;, which provides the framework for running open-source models locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this take
&lt;/h2&gt;

&lt;p&gt;Most organizations try to solve the privacy problem with PII Redaction scripts before sending logs to a cloud provider. This is a flawed strategy. Regular expressions and basic NER (Named Entity Recognition) models always miss something. A leaked credit card number or a proprietary internal URL in a stack trace can trigger a compliance audit that costs your company millions. The only way to guarantee zero leakage is to ensure the data never leaves the air-gapped environment or the VPC.&lt;/p&gt;

&lt;p&gt;In a production environment with over 500 microservices, the sheer volume of logs makes manual grepping impossible. I have seen teams spend six hours correlating logs across three different namespaces just to find a single timeout. A local LLM, when fed a curated slice of logs, can identify the behavioral pattern of a failure in seconds. For example, a sequence of 200 OK responses that occur in an impossible order often indicates a logic bug that regex-based monitors will never catch.&lt;/p&gt;

&lt;p&gt;Consider the operational reality of a CrashLoopBackOff. Instead of manually running &lt;code&gt;kubectl logs&lt;/code&gt; and &lt;code&gt;kubectl describe&lt;/code&gt; and trying to map them in your head, you can pipe the output directly into a local model. When you are &lt;a href="https://dev.to/blog/how-to-fix-kubernetes-crashloopbackoff-in-production"&gt;Fixing Kubernetes CrashLoopBackOff in Production&lt;/a&gt;, the bottleneck is usually the cognitive load of parsing verbose Java or Go stack traces. A local LLM reduces this load by summarizing the failure point immediately.&lt;/p&gt;

&lt;p&gt;The cost of cloud tokens for log analysis is astronomical. Logs are verbose. If you send 10MB of logs to a cloud LLM for every incident, your monthly bill will skyrocket. Running a 7B or 8B parameter model on a dedicated GPU node costs nothing but the electricity and the initial hardware investment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The strongest counter-argument
&lt;/h2&gt;

&lt;p&gt;The most common pushback against local LLMs is the "Hardware Tax." Critics argue that the VRAM requirements for acceptable performance are too high for a standard developer laptop or a typical DevOps jump box. It is true that running a 70B parameter model requires multiple A100s or H100s to be performant, which is an unreasonable ask for a local debugging setup. If you try to run a large model on a CPU with 16GB of RAM, the tokens per second will be so slow that you might as well go back to using &lt;code&gt;grep&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There is also the issue of context window limitations. A production log file can be several gigabytes, while most local models have a context window ranging from 8k to 128k tokens. You cannot simply upload a whole log file to Ollama and ask it what happened. You have to implement a pre-processing pipeline to slice the logs, filter out the noise, and feed the model only the relevant window surrounding the timestamp of the error. This adds architectural complexity that a simple API call to OpenAI does not have.&lt;/p&gt;
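
&lt;p&gt;The slicing step can be a helper as small as this (a sketch; the marker and window sizes are illustrative), which keeps only the lines around the first error so the prompt fits the context window:&lt;/p&gt;

```python
def slice_around_error(lines, marker="ERROR", before=50, after=20):
    """Keep only the window of lines surrounding the first line containing the marker."""
    for i, line in enumerate(lines):
        if marker in line:
            return lines[max(0, i - before): i + after + 1]
    return lines[-before:]  # no marker found: fall back to the tail

logs = [f"INFO request {n} handled" for n in range(1000)]
logs[700] = "ERROR upstream timeout after 30s"
window = slice_around_error(logs)
# 1,000 lines reduced to 71: 50 before the error, the error line itself, 20 after.
```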

&lt;p&gt;However, these arguments ignore the reality of model quantization. Using 4-bit quantization (GGUF format), you can run a Llama 3 8B model on a machine with as little as 8GB of VRAM with negligible loss in reasoning capability for log analysis. For DevOps tasks, you do not need the creative writing abilities of a 175B parameter model; you need a model that understands stack traces and Kubernetes events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exceptions where cloud LLMs still win
&lt;/h2&gt;

&lt;p&gt;There are specific scenarios where a local LLM is the wrong tool. If you are a tiny startup with zero regulatory constraints and no dedicated hardware, the overhead of managing an Ollama instance is a distraction. In those cases, the speed of onboarding a cloud API outweighs the privacy risks.&lt;/p&gt;

&lt;p&gt;Cloud LLMs also win when you need cross-domain knowledge at extreme scale. If your log error is caused by an obscure bug in a niche third-party library that was updated two weeks ago, a cloud model trained on a more recent web crawl might have the answer. A local model's knowledge is frozen at the time of its training.&lt;/p&gt;

&lt;p&gt;Additionally, if your team requires a collaborative, multi-user interface with complex permissioning and auditing for every single prompt, building that on top of Open WebUI requires more effort than using a managed SaaS platform. For the senior SRE who needs to diagnose a production outage in a secure environment, these advantages are irrelevant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing the Privacy-First Stack
&lt;/h2&gt;

&lt;p&gt;To move from theory to production, use Ollama for the backend, Llama 3 (8B) for the reasoning, and Open WebUI for the interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing the Engine
&lt;/h3&gt;

&lt;p&gt;On a Linux workstation with an NVIDIA GPU, install Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, pull the Llama 3 model. I recommend the 8B version for most log tasks as it balances speed and accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run llama3:8b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Log Pipeline Architecture
&lt;/h3&gt;

&lt;p&gt;You cannot dump a 1GB log file into the model. You must use a pipeline. The most effective flow is: &lt;code&gt;Log Source&lt;/code&gt; → &lt;code&gt;Grep/Awk Filter&lt;/code&gt; → &lt;code&gt;Local LLM&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example, if you are debugging an OOMKilled pod, first extract the relevant events. If you have already followed the steps to &lt;a href="https://dev.to/troubleshooting/how-to-debug-oomkilled-pods-in-kubernetes-a-step-by-step-gui"&gt;Debug OOMKilled Pods in Kubernetes&lt;/a&gt;, you know that the &lt;code&gt;describe&lt;/code&gt; output is more valuable than the application logs.&lt;/p&gt;

&lt;p&gt;Use this bash script to automate the extraction and analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Extract the last 100 lines of logs and the pod description&lt;/span&gt;
kubectl describe pod my-app-6f7d8-abc &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; pod_desc.txt
kubectl logs my-app-6f7d8-abc &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; pod_logs.txt

&lt;span class="c"&gt;# Combine them into a prompt file&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Act as a Senior SRE. Analyze the following Kubernetes pod description and logs to find the root cause of the failure. Focus on memory limits and exit codes."&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; prompt.txt
&lt;span class="nb"&gt;cat &lt;/span&gt;pod_desc.txt &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; prompt.txt
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"--- LOGS ---"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; prompt.txt
&lt;span class="nb"&gt;cat &lt;/span&gt;pod_logs.txt &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; prompt.txt

&lt;span class="c"&gt;# Pipe the prompt to Ollama&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;prompt.txt | ollama run llama3:8b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prompt Engineering for DevOps
&lt;/h3&gt;

&lt;p&gt;Generic prompts yield generic answers. To get production-ready insights, give the LLM a persona and a specific constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bad Prompt:&lt;/strong&gt; "What is wrong with these logs?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good Prompt:&lt;/strong&gt;&lt;br&gt;
"Act as a Site Reliability Engineer specializing in Java Spring Boot applications. I am providing a heap dump summary and the last 50 lines of the application log. Identify if this is a Memory Leak or a sudden spike in traffic. Provide the answer in a bulleted list: 1. Root Cause, 2. Evidence from logs, 3. Recommended fix."&lt;/p&gt;

&lt;p&gt;When dealing with complex orchestration issues, such as those found when you &lt;a href="https://dev.to/troubleshooting/crashloopbackoff-kubernetes"&gt;Fix CrashLoopBackOff in Kubernetes Pods&lt;/a&gt;, use this template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Persona: Kubernetes Expert
Context: Pod is in CrashLoopBackOff.
Task: Analyze the 'Last State' termination message and the current logs.
Constraint: Ignore health check failures; focus on application-level exceptions.
Logs: [Insert Logs Here]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hardware Requirements and Performance
&lt;/h2&gt;

&lt;p&gt;The sweet spot for local log analysis is a machine with 24GB of VRAM (like an RTX 3090 or 4090). This allows you to run the 8B model with a massive context window or even experiment with the 70B model using heavy quantization.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Minimum (Fast)&lt;/th&gt;
&lt;th&gt;Recommended (Pro)&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;NVIDIA RTX 3060 (12GB)&lt;/td&gt;
&lt;td&gt;NVIDIA RTX 4090 (24GB)&lt;/td&gt;
&lt;td&gt;VRAM is the primary metric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;16GB&lt;/td&gt;
&lt;td&gt;64GB&lt;/td&gt;
&lt;td&gt;Used for offloading if VRAM fills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;50GB SSD&lt;/td&gt;
&lt;td&gt;200GB NVMe&lt;/td&gt;
&lt;td&gt;Models are large (4GB to 40GB each)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Ubuntu 22.04&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04&lt;/td&gt;
&lt;td&gt;Best driver support for CUDA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you are forced to run on CPU (Apple Silicon M2/M3 machines are an exception and perform well, since their unified memory acts as VRAM), expect a drop from roughly 50 tokens per second to about 3 to 5. This is acceptable for asynchronous log analysis but frustrating for interactive chatting.&lt;/p&gt;
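
&lt;p&gt;The VRAM figures above follow from simple arithmetic. As a rough back-of-envelope heuristic (my own, not an official formula): quantized weights take parameters × bits ÷ 8 bytes, plus roughly 20% overhead for the KV cache and activations.&lt;/p&gt;

```python
def approx_vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weights plus ~20% for KV cache and activations."""
    weight_gb = params_billion * bits / 8  # 1B params at 8-bit is roughly 1 GB
    return round(weight_gb * overhead, 1)

llama3_8b_q4 = approx_vram_gb(8, 4)    # comfortably fits a 12GB card
llama3_70b_q4 = approx_vram_gb(70, 4)  # exceeds 24GB; needs heavier quantization or CPU offload
```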

&lt;h2&gt;
  
  
  Semantic Anomaly Detection vs. Regex
&lt;/h2&gt;

&lt;p&gt;Standard observability tools like Splunk or ELK rely on indices and keyword searches. If you search for "Error", you find errors. But what if the system is failing silently?&lt;/p&gt;

&lt;p&gt;Example: A payment gateway returns &lt;code&gt;200 OK&lt;/code&gt; for every request, but the response body says &lt;code&gt;{"status": "pending", "reason": "timeout"}&lt;/code&gt;. A regex monitor sees the &lt;code&gt;200&lt;/code&gt; and stays green. A local LLM can be prompted to look for logical contradictions:&lt;/p&gt;

&lt;p&gt;"Analyze these logs for 'silent failures'. Look for cases where the HTTP status is 200 but the response body indicates a failure or a timeout."&lt;/p&gt;

&lt;p&gt;This move from syntactic analysis (looking for patterns) to semantic analysis (understanding meaning) is the real power of the local LLM. It allows you to find the unknown unknowns that you didn't know to write a regex for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Log Streamlining and Noise Reduction
&lt;/h2&gt;

&lt;p&gt;One of the biggest costs in DevOps is Log Bloat. We store terabytes of &lt;code&gt;INFO&lt;/code&gt; logs that we never read. You can use a local LLM as a pre-processor to summarize logs before they are even archived.&lt;/p&gt;

&lt;p&gt;By running a small, fast model like Mistral v0.3, you can create a Log Summarizer that takes 1,000 lines of verbose debug logs and converts them into three sentences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The application started successfully.&lt;/li&gt;
&lt;li&gt;It attempted to connect to the database three times and failed.&lt;/li&gt;
&lt;li&gt;It entered a sleep state for 30 seconds.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This reduces the cognitive load on the human engineer and can potentially reduce storage costs if you only archive the summaries and a sampled percentage of the raw logs.&lt;/p&gt;
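
&lt;p&gt;Ollama exposes a local HTTP API on port 11434, so a summarizer can chunk the raw logs and post each chunk to it. A sketch (the chunk size and prompt wording are illustrative, and &lt;code&gt;summarize&lt;/code&gt; assumes a running Ollama instance with the model pulled):&lt;/p&gt;

```python
import json
import urllib.request

def chunk_lines(lines, size=200):
    """Split raw logs into fixed-size chunks that fit the model's context window."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def summarize(chunk, model="mistral"):
    """Send one chunk to the local Ollama generate endpoint and return its summary."""
    payload = json.dumps({
        "model": model,
        "prompt": "Summarize these logs in three sentences:\n" + "\n".join(chunk),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

chunks = chunk_lines([f"DEBUG step {n}" for n in range(1000)])
# summaries = [summarize(c) for c in chunks]  # requires Ollama running locally
```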

&lt;p&gt;Local LLMs are the only viable path for secure, privacy-first debugging in highly regulated environments. While the hardware requirements are higher than using a cloud API, the trade-off is a total elimination of PII leakage risk and the removal of per-token costs. Start by installing Ollama on a GPU-enabled jump box, select a 4-bit quantized Llama 3 model, and begin piping your &lt;code&gt;kubectl&lt;/code&gt; outputs into it to reduce your mean time to resolution (MTTR).&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>localllm</category>
      <category>loganalysis</category>
      <category>devopssecurity</category>
    </item>
    <item>
      <title>How to Build AI Agents for Kubernetes Deployments</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:15:34 +0000</pubDate>
      <link>https://forem.com/devopsstart/how-to-build-ai-agents-for-kubernetes-deployments-34m</link>
      <guid>https://forem.com/devopsstart/how-to-build-ai-agents-for-kubernetes-deployments-34m</guid>
      <description>&lt;p&gt;&lt;em&gt;Ever wanted an AI that doesn't just explain Kubernetes errors but actually helps you fix them? This guide, originally published on devopsstart.com, walks through building autonomous K8s agents using MCP, Kagent, and K8sGPT.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;AI agents for Kubernetes deployments are autonomous systems that follow an "Observe → Reason → Act" loop to resolve cluster issues without manual intervention. While a standard LLM can explain what a &lt;code&gt;CrashLoopBackOff&lt;/code&gt; is, a true agent can detect the error, pull the logs, analyze the stack trace, cross-reference it with recent Git commits, and propose a specific PR to fix the environment variable causing the crash.&lt;/p&gt;

&lt;p&gt;Building these agents requires moving beyond simple prompting and into "tool use" or "function calling." You are essentially giving an LLM a set of specialized skills (API wrappers) that allow it to interact with your cluster, your GitOps pipeline, and your observability stack. In this guide, you will learn how to architect these skills using the Model Context Protocol (MCP) and frameworks like Kagent and K8sGPT to automate the most tedious parts of Kubernetes operations.&lt;/p&gt;

&lt;p&gt;For a deep dive into the foundational concepts of managing the pods these agents will be monitoring, see the guide on &lt;a href="https://dev.to/blog/kubernetes-for-beginners-deploy-your-first-application"&gt;Kubernetes for Beginners: Deploy Your First Application&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting this tutorial, you need a functioning Kubernetes environment and the necessary API access for the LLM. I recommend a development cluster (Kind or Minikube) or a staging namespace in a cloud provider like GKE or EKS to avoid accidental production outages.&lt;/p&gt;

&lt;p&gt;You will need the following tools installed on your local machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kubectl v1.30+&lt;/strong&gt;: The standard Kubernetes CLI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Helm v3.14+&lt;/strong&gt;: For managing the agent's dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.11+&lt;/strong&gt;: Most agent frameworks, including Kagent and LangChain, require modern Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An OpenAI API Key (GPT-4o)&lt;/strong&gt; or &lt;strong&gt;Anthropic API Key (Claude 3.5 Sonnet)&lt;/strong&gt;: Agents require high-reasoning models to avoid hallucinations during tool selection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;K8sGPT v0.12+&lt;/strong&gt;: For the diagnostic skill set implementation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should also have a basic understanding of Kubernetes RBAC. Agents operate as identities within the cluster, and giving them &lt;code&gt;cluster-admin&lt;/code&gt; privileges is a security risk. You will need to be comfortable creating ServiceAccounts and RoleBindings to enforce the principle of least privilege.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we are building a "Deployment Guardian" agent. This isn't a monolithic script, but a modular system capable of three specific skills:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Automated Diagnostics&lt;/strong&gt;: Using K8sGPT to scan for misconfigurations and interpreting those errors using an LLM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Right-Sizing&lt;/strong&gt;: Analyzing pod resource usage and suggesting updates to the Horizontal Pod Autoscaler (HPA).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitOps Sync Validation&lt;/strong&gt;: Monitoring ArgoCD application health and triggering syncs when drifts are detected.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The core of this architecture relies on the Model Context Protocol (MCP). MCP is an open standard that decouples the LLM from the specific implementation of the tool. Instead of writing a custom wrapper for every single &lt;code&gt;kubectl&lt;/code&gt; command, MCP allows you to expose a standardized "server" that tells the LLM exactly what tools are available, what arguments they take, and what the expected output format is.&lt;/p&gt;

&lt;p&gt;By the end of this guide, you will have an agent that provides the root cause and the exact YAML change needed to fix a deployment, integrated directly into your operational workflow. For those managing the underlying infrastructure of these clusters, understanding how to &lt;a href="https://dev.to/tutorials/deploy-eks-cluster-with-terraform"&gt;Deploy an EKS Cluster with Terraform&lt;/a&gt; provides the necessary context for where these agents actually reside.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Architecting the Agent Loop
&lt;/h2&gt;

&lt;p&gt;Before writing code, you must understand how the agent thinks. A standard LLM request is a linear path: Prompt → Response. An agent loop is circular.&lt;/p&gt;

&lt;p&gt;When you ask an agent to "Fix the failing deployment in the staging namespace," it performs the following sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt;: The agent calls a tool (for example, &lt;code&gt;get_pod_status&lt;/code&gt;) to see which pods are failing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt;: It observes three pods in &lt;code&gt;CrashLoopBackOff&lt;/code&gt; and reasons that it needs logs to understand the root cause.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt;: It calls &lt;code&gt;get_pod_logs&lt;/code&gt; for one of the failing pods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt;: The logs show a &lt;code&gt;java.lang.NullPointerException&lt;/code&gt; related to a missing database URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt;: It checks the ConfigMap to see if the environment variable is defined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt;: It calls &lt;code&gt;get_configmap&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final Response&lt;/strong&gt;: It concludes the environment variable is missing and suggests the specific &lt;code&gt;kubectl patch&lt;/code&gt; command or Git PR.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To implement this, you can use a framework like Kagent, which is built on AutoGen. It treats the "DevOps Engineer" as one agent and the "Kubernetes Cluster" as a tool-providing environment.&lt;/p&gt;
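
&lt;p&gt;Stripped of frameworks, the loop itself is small. A toy sketch of the Observe → Reason → Act cycle (the tools and the scripted policy are stubs standing in for real &lt;code&gt;kubectl&lt;/code&gt; wrappers and the LLM; this is not Kagent's actual API):&lt;/p&gt;

```python
def run_agent(goal, tools, policy, max_steps=5):
    """Observe -> Reason -> Act loop: the policy picks the next tool from the last observation."""
    observation = goal
    for _ in range(max_steps):
        decision = policy(observation)          # Reason over the latest observation
        if decision["tool"] is None:
            return decision["answer"]           # Final response, loop terminates
        observation = tools[decision["tool"]](**decision["args"])  # Act, then observe
    return "max steps reached"

# Stubbed tools returning canned observations.
tools = {
    "get_pod_status": lambda ns: "pod my-app is CrashLoopBackOff",
    "get_pod_logs": lambda pod: "NullPointerException: DB_URL is null",
}

# Scripted stand-in for the LLM's reasoning step.
def policy(observation):
    if "staging" in observation:
        return {"tool": "get_pod_status", "args": {"ns": "staging"}}
    if "CrashLoopBackOff" in observation:
        return {"tool": "get_pod_logs", "args": {"pod": "my-app"}}
    return {"tool": None, "answer": "Root cause: missing DB_URL env var"}

result = run_agent("Fix the failing deployment in staging", tools, policy)
```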

&lt;h2&gt;
  
  
  Step 2: Implementing the Tooling Layer with MCP
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) is the primary mechanism for production-grade agents. Instead of hardcoding functions into your Python script, you run an MCP server that exposes your Kubernetes API.&lt;/p&gt;

&lt;p&gt;First, install the MCP SDK for Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, create a simple MCP server that provides a "skill" to get pod events. This is more efficient than giving the LLM raw &lt;code&gt;kubectl&lt;/code&gt; access because you can filter the output to only include errors, which reduces token usage and hallucination risk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# k8s_mcp_server.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;K8s-Guardian&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pod_errors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetches only Warning events for pods in a specific namespace.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--field-selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type=Warning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching events: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No warning events found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run this server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python k8s_mcp_server.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM now sees &lt;code&gt;get_pod_errors&lt;/code&gt; as a capability. When it encounters a deployment failure, it will autonomously decide to call this function rather than guessing. This architectural separation allows you to update the Python "skill" without changing the prompt of the LLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Configuring Least-Privilege RBAC
&lt;/h2&gt;

&lt;p&gt;Giving an AI agent a &lt;code&gt;kubeconfig&lt;/code&gt; with &lt;code&gt;cluster-admin&lt;/code&gt; is an unacceptable security risk. If the LLM hallucinates a command like &lt;code&gt;kubectl delete ns --all&lt;/code&gt;, the agent will execute it.&lt;/p&gt;

&lt;p&gt;You must create a dedicated &lt;code&gt;ServiceAccount&lt;/code&gt; with a restricted &lt;code&gt;Role&lt;/code&gt;. For our Deployment Guardian, the agent needs to read pods, events, and logs, but it should only be able to "patch" specific resources.&lt;/p&gt;

&lt;p&gt;Create a file named &lt;code&gt;agent-rbac.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;k8s-ai-agent&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ai-ops&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Role&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent-read-write-role&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods/log"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configmaps"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;apps"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployments"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replicasets"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RoleBinding&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent-read-write-binding&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
&lt;span class="na"&gt;subjects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;k8s-ai-agent&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ai-ops&lt;/span&gt;
&lt;span class="na"&gt;roleRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Role&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent-read-write-role&lt;/span&gt;
  &lt;span class="na"&gt;apiGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace ai-ops
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; agent-rbac.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To connect your agent to this identity, use a token-based approach or a projected volume if the agent runs inside the cluster. For local development, you can impersonate the ServiceAccount to verify permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; staging &lt;span class="nt"&gt;--as&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;system:serviceaccount:ai-ops:k8s-ai-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Integrating K8sGPT for Diagnostic Skills
&lt;/h2&gt;

&lt;p&gt;While custom MCP tools are great for specific tasks, K8sGPT provides a powerful set of pre-built diagnostic skills. It scans your cluster for common issues and uses an LLM to explain them.&lt;/p&gt;

&lt;p&gt;First, install the K8sGPT CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;k8sgpt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, authenticate it with your LLM provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k8sgpt auth add &lt;span class="nt"&gt;--backend&lt;/span&gt; openai &lt;span class="nt"&gt;--model&lt;/span&gt; gpt-4o
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To integrate K8sGPT into your agent's skill set, wrap the &lt;code&gt;k8sgpt analyze&lt;/code&gt; command into a tool. This allows the agent to trigger a full cluster scan and reason over the results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Adding K8sGPT as a tool in our MCP server
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_cluster_health&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Runs a K8sGPT analysis on the namespace to find errors.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k8sgpt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the agent invokes this tool (equivalent to running &lt;code&gt;k8sgpt analyze&lt;/code&gt; manually), the output provides a detailed analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ k8sgpt analyze --namespace staging
[!] Pod 'auth-service-6f7d' is in CrashLoopBackOff
Analysis: The pod is failing because the 'DB_PASSWORD' environment variable is missing.
The application expects this variable to be provided via a Secret.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent can now combine this high-level analysis with its own &lt;code&gt;get_configmap&lt;/code&gt; tool to find where the secret is missing. This creates a tiered diagnostic approach: K8sGPT finds the "what," and the custom MCP tools find the "how" and "where." If you see these errors frequently, check the &lt;a href="https://dev.to/troubleshooting/how-to-fix-kubernetes-crashloopbackoff-in-production"&gt;Fix Kubernetes CrashLoopBackOff in Production&lt;/a&gt; guide for manual remediation steps.&lt;/p&gt;
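&lt;p&gt;The tiered flow can be sketched in a few lines of Python. Both tools are stubbed here with canned outputs; in the agent they are the MCP tools described above:&lt;/p&gt;

```python
# Sketch of the tiered diagnostic flow: K8sGPT supplies the "what",
# a custom tool supplies the "where". Both functions are stubs.
def analyze_cluster_health(namespace: str) -> str:
    # Stubbed K8sGPT finding
    return "Pod 'auth-service-6f7d' is in CrashLoopBackOff: DB_PASSWORD is missing"

def get_configmap(name: str, namespace: str) -> dict:
    # Stubbed ConfigMap contents; DB_PASSWORD is indeed absent
    return {"DB_HOST": "postgres.staging.svc"}

def diagnose(namespace: str) -> str:
    finding = analyze_cluster_health(namespace)
    if "DB_PASSWORD" in finding:
        config = get_configmap("auth-service-config", namespace)
        if "DB_PASSWORD" not in config:
            return ("Root cause: DB_PASSWORD is absent from "
                    "'auth-service-config'; it should be supplied via a Secret.")
    return finding
```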

&lt;h2&gt;
  
  
  Step 5: Building the Resource Optimization Skill
&lt;/h2&gt;

&lt;p&gt;Resource optimization requires the agent to observe metrics (via Prometheus or Metrics Server) and act on the Horizontal Pod Autoscaler (HPA).&lt;/p&gt;

&lt;p&gt;To implement this, your agent needs a tool that can query the Metrics Server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pod_resource_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieves CPU and Memory usage for a specific pod.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent's reasoning logic for optimization follows this pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trigger&lt;/strong&gt;: The agent is asked to "Optimize the checkout-service."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt;: It calls &lt;code&gt;get_pod_resource_usage&lt;/code&gt; and sees the pod is consistently using 95% of its memory limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt;: It calls &lt;code&gt;kubectl get hpa&lt;/code&gt; and sees the HPA is targeting 50% CPU, but the bottleneck is actually memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt;: The agent realizes the HPA should be updated to include memory metrics or the memory limit should be increased.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt;: It proposes a YAML change to the HPA definition.&lt;/li&gt;
&lt;/ol&gt;
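&lt;p&gt;Steps 2 through 5 above boil down to a small decision rule. The 90% threshold and metric names below are illustrative assumptions, not fixed rules:&lt;/p&gt;

```python
# Minimal sketch of the optimization reasoning: flag a memory
# bottleneck that the HPA is not currently scaling on.
def recommend_hpa_change(memory_utilization: float, hpa_metrics: list) -> str:
    if memory_utilization >= 0.90 and "memory" not in hpa_metrics:
        return ("Memory bottleneck: add a memory metric to the HPA "
                "or raise the container memory limit.")
    return "No change proposed."
```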

&lt;p&gt;For a detailed explanation of how HPA works to better tune your agent's prompts, read the &lt;a href="https://dev.to/blog/kubernetes-hpa-deep-dive-autoscaling-explained"&gt;Kubernetes HPA Deep Dive&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Automating GitOps with ArgoCD Integration
&lt;/h2&gt;

&lt;p&gt;An agent that runs &lt;code&gt;kubectl patch&lt;/code&gt; directly creates "configuration drift." The source of truth must always be Git. Therefore, your agent's "Act" phase should target your GitOps tool.&lt;/p&gt;

&lt;p&gt;If you are using ArgoCD, give your agent tools to interact with the ArgoCD API or the Git repository. First, ensure you have ArgoCD installed; if not, follow the &lt;a href="https://dev.to/tutorials/how-to-install-argo-cd-gitops-deployment-on-kubernetes"&gt;How to Install Argo CD&lt;/a&gt; guide.&lt;/p&gt;

&lt;p&gt;Now, create a tool that allows the agent to check the sync status of an application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_argocd_app_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Checks if an ArgoCD application is Synced and Healthy.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;argocd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "GitOps Loop" for the agent is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Detect&lt;/strong&gt;: The agent sees a pod failing in the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diagnose&lt;/strong&gt;: It finds that the image tag &lt;code&gt;v1.2.0&lt;/code&gt; has a bug.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolve&lt;/strong&gt;: It searches for the latest stable image tag in the registry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act&lt;/strong&gt;: Instead of running &lt;code&gt;kubectl set image&lt;/code&gt;, it uses a GitHub API tool to create a Pull Request updating the image tag in the Git repository.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt;: It monitors ArgoCD until the app shows as &lt;code&gt;Synced&lt;/code&gt; and &lt;code&gt;Healthy&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
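&lt;p&gt;The "Verify" step deserves a poll budget so the agent cannot wait forever. Here is a minimal sketch; &lt;code&gt;get_status&lt;/code&gt; is injected so that in the agent it can be the &lt;code&gt;get_argocd_app_status&lt;/code&gt; tool (with a sleep between polls, omitted here):&lt;/p&gt;

```python
import itertools

# Poll the app status until it reports Synced and Healthy, or give up.
def wait_until_healthy(get_status, app_name: str, max_polls: int = 10) -> bool:
    for _ in range(max_polls):
        status = get_status(app_name)
        if "Synced" in status and "Healthy" in status:
            return True
    return False

# Stub: the app becomes healthy on the third poll
_states = itertools.chain(["OutOfSync", "Progressing"],
                          itertools.repeat("Synced Healthy"))
result = wait_until_healthy(lambda _app: next(_states), "checkout-service")
```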

&lt;p&gt;This workflow ensures the AI agent remains a part of the governed pipeline. You can learn more about managing these sync policies in the &lt;a href="https://dev.to/tutorials/how-to-configure-advanced-argo-cd-sync-policies-for-gitops"&gt;Advanced Argo CD Sync Policies&lt;/a&gt; tutorial.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Implementing Safety Rails and Human-in-the-Loop (HITL)
&lt;/h2&gt;

&lt;p&gt;To prevent "hallucination-driven outages," you must implement a safety layer between the agent's reasoning and the action.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Dry-Run Constraint
&lt;/h3&gt;

&lt;p&gt;Every tool that modifies the cluster should apply the &lt;code&gt;--dry-run=server&lt;/code&gt; flag by default. The agent first calls the tool in dry-run mode and presents the proposed change for review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;propose_deployment_patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deployment_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patch_yaml&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Proposes a change to a deployment using dry-run.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patch_yaml&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deployment_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--patch-file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--dry-run=server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Proposed Change: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. The Approval Gate (HITL)
&lt;/h3&gt;

&lt;p&gt;The agent must not execute a &lt;code&gt;patch&lt;/code&gt; or &lt;code&gt;delete&lt;/code&gt; command without manual approval from a human operator, typically via a Slack bot or CLI prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt;: "I've found that the &lt;code&gt;auth-service&lt;/code&gt; is OOMKilled. I propose increasing the memory limit from 256Mi to 512Mi. Should I apply this change? [Yes/No]"&lt;br&gt;
&lt;strong&gt;Human&lt;/strong&gt;: "Yes"&lt;br&gt;
&lt;strong&gt;Agent&lt;/strong&gt;: (Executes the actual patch command)&lt;/p&gt;
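&lt;p&gt;This gate can be a thin wrapper around any mutating action. In the hedged sketch below, &lt;code&gt;ask&lt;/code&gt; is injected so it can be a CLI prompt, a Slack interaction, or a stub in tests; &lt;code&gt;patch_memory_limit&lt;/code&gt; is a hypothetical stand-in for the real patch tool:&lt;/p&gt;

```python
# Approval gate: a mutating action only runs after ask() returns "yes".
def approval_gate(action, ask):
    def wrapper(*args, **kwargs):
        prompt = f"Apply {action.__name__}{args}? [yes/no] "
        if ask(prompt).strip().lower() != "yes":
            return "Aborted: human approval not given."
        return action(*args, **kwargs)
    return wrapper

def patch_memory_limit(deployment: str, new_limit: str) -> str:
    # In the real agent this would invoke the actual kubectl patch tool
    return f"Patched {deployment} memory limit to {new_limit}"

approved = approval_gate(patch_memory_limit, ask=lambda _p: "yes")
denied = approval_gate(patch_memory_limit, ask=lambda _p: "no")
```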
&lt;h3&gt;
  
  
  3. Policy-as-Code (Kyverno/OPA)
&lt;/h3&gt;

&lt;p&gt;A cluster-level policy engine like Kyverno or OPA Gatekeeper should be the final line of defense. For example, you can enforce a policy that prevents any resource from being deleted in the &lt;code&gt;production&lt;/code&gt; namespace, regardless of the requester's identity.&lt;/p&gt;
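&lt;p&gt;As an illustrative sketch, a Kyverno ClusterPolicy along these lines denies DELETE operations in the &lt;code&gt;production&lt;/code&gt; namespace. The field names follow Kyverno's &lt;code&gt;validate&lt;/code&gt;/&lt;code&gt;deny&lt;/code&gt; rule syntax; verify them against the Kyverno version you run:&lt;/p&gt;

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: block-deletes-in-production
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: deny-production-deletes
      match:
        any:
          - resources:
              kinds: ["*"]
              namespaces: ["production"]
      validate:
        message: "Deleting resources in the production namespace is not allowed."
        deny:
          conditions:
            any:
              - key: "{{ request.operation }}"
                operator: Equals
                value: DELETE
```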
&lt;h2&gt;
  
  
  Step 8: Testing and Validating Agent Performance
&lt;/h2&gt;

&lt;p&gt;Treat your agent's skills like production code.&lt;/p&gt;
&lt;h3&gt;
  
  
  Unit Testing Tools
&lt;/h3&gt;

&lt;p&gt;Test each MCP tool independently. If your &lt;code&gt;get_pod_errors&lt;/code&gt; tool fails to parse &lt;code&gt;kubectl&lt;/code&gt; output, the LLM will receive garbage and hallucinate a solution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Example test for the tool&lt;/span&gt;
python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"from k8s_mcp_server import get_pod_errors; print(get_pod_errors('staging'))"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
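&lt;p&gt;For a repeatable unit test, mock &lt;code&gt;subprocess.run&lt;/code&gt; so no cluster is required. The &lt;code&gt;get_pod_errors&lt;/code&gt; below is a stand-in with the same shape as the article's tool; a real test would import it from your MCP server module instead:&lt;/p&gt;

```python
import subprocess
from unittest import mock

def get_pod_errors(namespace: str) -> str:
    """Stand-in for the MCP tool under test."""
    cmd = ["kubectl", "get", "pods", "-n", namespace,
           "--field-selector=status.phase!=Running"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout

def test_get_pod_errors_surfaces_crashloop():
    # Fake kubectl output so the test runs without a cluster
    fake = subprocess.CompletedProcess(
        args=[], returncode=0,
        stdout="NAME       READY  STATUS            RESTARTS\n"
               "auth-6f7d  0/1    CrashLoopBackOff  12\n")
    with mock.patch("subprocess.run", return_value=fake):
        assert "CrashLoopBackOff" in get_pod_errors("staging")

test_get_pod_errors_surfaces_crashloop()
```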



&lt;h3&gt;
  
  
  Scenario-Based Validation (Chaos Engineering)
&lt;/h3&gt;

&lt;p&gt;Test your agent by intentionally breaking things in a sandbox:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inject a Failure&lt;/strong&gt;: Delete a Secret that a deployment needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger Agent&lt;/strong&gt;: Ask, "Why is the deployment failing?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Did it find the missing secret? (Correctness)&lt;/li&gt;
&lt;li&gt;Did it suggest the right fix? (Accuracy)&lt;/li&gt;
&lt;li&gt;Did it try to delete the namespace? (Safety)&lt;/li&gt;
&lt;li&gt;How many tool calls did it take? (Efficiency)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Token Cost and Latency Tracking
&lt;/h3&gt;

&lt;p&gt;Agents can be expensive. A complex diagnostic loop might call 10 different tools, sending significant context back to the LLM. Use tools like LangSmith or Arize Phoenix to trace the agent's thoughts. If the agent loops infinitely (calling the same tool repeatedly), refine the system prompt to include a "maximum tool call" limit.&lt;/p&gt;
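&lt;p&gt;Beyond the prompt-level instruction, it is worth enforcing the cap in the agent runtime itself. A minimal sketch:&lt;/p&gt;

```python
# Hard "maximum tool call" budget, enforced in code rather than only
# in the system prompt.
class ToolCallBudget:
    def __init__(self, limit: int = 10):
        self.limit = limit
        self.calls = 0

    def charge(self, tool_name: str) -> None:
        """Raise once the agent exceeds its per-task tool budget."""
        self.calls += 1
        if self.calls > self.limit:
            raise RuntimeError(
                f"Tool budget of {self.limit} exceeded (last call: {tool_name})")
```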

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Agent "Loops" Infinitely
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: The agent calls &lt;code&gt;get_pod_status&lt;/code&gt; repeatedly for 20 turns.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Update the system prompt: "If a tool returns the same result twice, do not call it again. Instead, try a different diagnostic tool or ask the user for more information."&lt;/p&gt;

&lt;h3&gt;
  
  
  RBAC "Forbidden" Errors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: &lt;code&gt;Error from server (Forbidden): pods "my-pod" is forbidden: User "system:serviceaccount:ai-ops:k8s-ai-agent" cannot get resource "pods/log"&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Check your &lt;code&gt;Role&lt;/code&gt; definition. &lt;code&gt;pods&lt;/code&gt; and &lt;code&gt;pods/log&lt;/code&gt; are different resources in Kubernetes. You must explicitly list &lt;code&gt;pods/log&lt;/code&gt; in the &lt;code&gt;resources&lt;/code&gt; section of your RBAC YAML.&lt;/p&gt;
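&lt;p&gt;The corrected &lt;code&gt;rules&lt;/code&gt; section looks like this; the subresource must be listed explicitly because granting &lt;code&gt;pods&lt;/code&gt; alone does not cover &lt;code&gt;pods/log&lt;/code&gt;:&lt;/p&gt;

```yaml
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
```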

&lt;h3&gt;
  
  
  Hallucinated CLI Flags
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: The agent tries to run &lt;code&gt;kubectl get pods --show-all-errors&lt;/code&gt;, which is not a real flag.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Be explicit in your MCP tool description. Instead of "Fetch pods," say "Fetch pods using the exact command &lt;code&gt;kubectl get pods -n {namespace}&lt;/code&gt;."&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Window Overflow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: The agent "forgets" the initial error after calling several tools.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Implement "summarization" in your tools. Instead of returning raw &lt;code&gt;kubectl&lt;/code&gt; output, filter for the top 5 most relevant errors before sending the text to the LLM.&lt;/p&gt;
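&lt;p&gt;A simple version of that filter keeps only lines matching a list of error markers. The marker list below is illustrative; extend it for your workloads:&lt;/p&gt;

```python
# Summarization fix: return only the most relevant error lines to the LLM
# instead of raw kubectl output, preserving context-window space.
ERROR_MARKERS = ("Error", "CrashLoopBackOff", "OOMKilled",
                 "ImagePullBackOff", "Failed")

def summarize_kubectl_output(raw: str, max_lines: int = 5) -> str:
    error_lines = [line for line in raw.splitlines()
                   if any(marker in line for marker in ERROR_MARKERS)]
    return "\n".join(error_lines[:max_lines]) or "No error lines found."
```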

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building AI agents for Kubernetes is a shift from "writing scripts" to "designing capabilities." By utilizing the Model Context Protocol (MCP), you decouple your agent's reasoning from the underlying API calls, allowing you to iterate on "skills" without breaking the agent's logic.&lt;/p&gt;

&lt;p&gt;We have moved from the basic "Observe → Reason → Act" loop to a production-ready architecture featuring least-privilege RBAC, GitOps integration via ArgoCD, and strict human-in-the-loop safety rails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actionable Next Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start Small&lt;/strong&gt;: Implement one "read-only" skill (like the &lt;code&gt;get_pod_errors&lt;/code&gt; tool) and run it in a local Kind cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure the Perimeter&lt;/strong&gt;: Apply the RBAC constraints before moving the agent to a shared development environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement the Gate&lt;/strong&gt;: Add a manual approval step for any tool that uses &lt;code&gt;kubectl patch&lt;/code&gt; or &lt;code&gt;kubectl delete&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor and Refine&lt;/strong&gt;: Use a tracing tool to see where your agent is hallucinating and refine your tool descriptions accordingly.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>kubernetesaiagents</category>
      <category>modelcontextprotocol</category>
      <category>k8sgpt</category>
      <category>gitopsautomation</category>
    </item>
    <item>
      <title>How to Manage Multiple Azure Subscriptions in Terraform</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Mon, 20 Apr 2026 22:01:41 +0000</pubDate>
      <link>https://forem.com/devopsstart/how-to-manage-multiple-azure-subscriptions-in-terraform-1bnf</link>
      <guid>https://forem.com/devopsstart/how-to-manage-multiple-azure-subscriptions-in-terraform-1bnf</guid>
      <description>&lt;p&gt;&lt;em&gt;Managing Hub-and-Spoke architectures in Azure can be a challenge when dealing with multiple subscriptions. This guide, originally published on devopsstart.com, explains how to use Terraform provider aliases to streamline your deployments.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Manage Multiple Azure Subscriptions in Terraform
&lt;/h2&gt;

&lt;p&gt;To deploy resources across multiple Azure subscriptions in a single Terraform configuration, you must use provider aliases. By default, the &lt;code&gt;azurerm&lt;/code&gt; provider targets only one subscription based on your authentication context. To override this, you define multiple provider blocks, assigning an &lt;code&gt;alias&lt;/code&gt; to each and specifying a unique &lt;code&gt;subscription_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This pattern is essential for Hub-and-Spoke network architectures. In these environments, central shared services (like Azure Firewall or ExpressRoute) live in a Hub subscription, while application workloads reside in separate Spoke subscriptions. Without aliases, you would be forced to run separate Terraform states and pipelines for every single subscription, which makes cross-subscription networking a manual nightmare.&lt;/p&gt;

&lt;p&gt;You can find the complete provider specification in the &lt;a href="https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs" rel="noopener noreferrer"&gt;official Terraform Azure Provider documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Provider Aliases
&lt;/h2&gt;

&lt;p&gt;To start, you need to configure your &lt;code&gt;providers.tf&lt;/code&gt; file. The provider without an alias becomes the default. Any provider with an &lt;code&gt;alias&lt;/code&gt; must be explicitly called when defining a resource using the &lt;code&gt;provider&lt;/code&gt; meta-argument.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# providers.tf&lt;/span&gt;

&lt;span class="c1"&gt;# Default provider (Spoke Subscription)&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"azurerm"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;features&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="nx"&gt;subscription_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"00000000-0000-0000-0000-000000000000"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Aliased provider (Hub Subscription)&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"azurerm"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;alias&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hub"&lt;/span&gt;
  &lt;span class="nx"&gt;features&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="nx"&gt;subscription_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"11111111-1111-1111-1111-111111111111"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you create a resource, use the &lt;code&gt;provider&lt;/code&gt; argument to tell Terraform which subscription to use. If you omit this, Terraform defaults to the primary provider.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Deploy a VNet in the Hub subscription&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"azurerm_virtual_network"&lt;/span&gt; &lt;span class="s2"&gt;"hub_vnet"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;provider&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hub&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hub-vnet"&lt;/span&gt;
  &lt;span class="nx"&gt;address_space&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"eastus"&lt;/span&gt;
  &lt;span class="nx"&gt;resource_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hub-rg"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Deploy a VNet in the Spoke subscription (default provider)&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"azurerm_virtual_network"&lt;/span&gt; &lt;span class="s2"&gt;"spoke_vnet"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"spoke-vnet"&lt;/span&gt;
  &lt;span class="nx"&gt;address_space&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.1.0.0/16"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"eastus"&lt;/span&gt;
  &lt;span class="nx"&gt;resource_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"spoke-rg"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cross-Subscription Data Referencing
&lt;/h2&gt;

&lt;p&gt;A common production scenario involves fetching an existing resource ID from a Hub subscription to use as a property in a Spoke resource, such as creating a VNet peering. In my experience, this is where most "Resource Not Found" errors occur because the data block defaults to the wrong subscription.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fetch Hub VNet ID from the Hub subscription&lt;/span&gt;
&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"azurerm_virtual_network"&lt;/span&gt; &lt;span class="s2"&gt;"hub_vnet_data"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;provider&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hub&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hub-vnet"&lt;/span&gt;
  &lt;span class="nx"&gt;resource_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hub-rg"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create peering in the Spoke subscription pointing to the Hub&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"azurerm_virtual_network_peering"&lt;/span&gt; &lt;span class="s2"&gt;"spoke_to_hub"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"spoke-to-hub"&lt;/span&gt;
  &lt;span class="nx"&gt;resource_group_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"spoke-rg"&lt;/span&gt;
  &lt;span class="nx"&gt;virtual_network_name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm_virtual_network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;spoke_vnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;remote_virtual_network_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;azurerm_virtual_network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hub_vnet_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By explicitly assigning &lt;code&gt;provider = azurerm.hub&lt;/code&gt; to the data block, Terraform authenticates against the Hub subscription to retrieve the ID before attempting to create the peering in the Spoke subscription.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Module Provider Gotcha
&lt;/h2&gt;

&lt;p&gt;The biggest mistake engineers make with multi-subscription setups is assuming modules inherit aliases automatically. They do not. If you call a module and it contains &lt;code&gt;azurerm&lt;/code&gt; resources, those resources will use the default provider regardless of where the module is called from.&lt;/p&gt;

&lt;p&gt;To fix this, you must explicitly pass the aliased provider into the module using the &lt;code&gt;providers&lt;/code&gt; map.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"spoke_workload"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/workload"&lt;/span&gt;

  &lt;span class="c1"&gt;# Map the module's internal 'azurerm' provider to the 'hub' alias&lt;/span&gt;
  &lt;span class="nx"&gt;providers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;azurerm&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hub&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;vnet_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;azurerm_virtual_network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hub_vnet_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the module code, do not define a &lt;code&gt;provider&lt;/code&gt; configuration block (declaring the provider in &lt;code&gt;required_providers&lt;/code&gt; is fine). Just use the standard &lt;code&gt;azurerm&lt;/code&gt; resource blocks; the mapping happens at the root level. This keeps your modules reusable across different environments. I have seen this go wrong in estates with dozens of subscriptions, where a missed provider mapping caused a production workload to be deployed into a development subscription, leading to significant security audit findings. To maintain high reliability, consider &lt;a href="https://dev.to/blog/terraform-testing-best-practices-beyond-plan-and-pray"&gt;testing your infrastructure as code&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices for Naming and Scale
&lt;/h2&gt;

&lt;p&gt;Avoid generic names like &lt;code&gt;azurerm.sub1&lt;/code&gt; or &lt;code&gt;azurerm.secondary&lt;/code&gt;. In a production environment with dozens of subscriptions, these names provide zero context and lead to configuration errors. Use functional names that describe the role of the subscription:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;azurerm.hub&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;azurerm.shared_services&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;azurerm.prod_workload&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;azurerm.identity_mgmt&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
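
&lt;p&gt;As a sketch, a &lt;code&gt;providers.tf&lt;/code&gt; using functional aliases might look like the following (the subscription IDs are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Default provider: the subscription this root module primarily manages
provider "azurerm" {
  features {}
  subscription_id = "00000000-0000-0000-0000-000000000001" # spoke workload
}

# Functional aliases for the other subscriptions
provider "azurerm" {
  alias           = "hub"
  features {}
  subscription_id = "00000000-0000-0000-0000-000000000002"
}

provider "azurerm" {
  alias           = "shared_services"
  features {}
  subscription_id = "00000000-0000-0000-0000-000000000003"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each alias maps one-to-one to a role in your landing zone, so a reviewer can tell at a glance which subscription a resource targets.&lt;/p&gt;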

&lt;p&gt;In environments with more than 50 subscriptions, managing these aliases in a single &lt;code&gt;providers.tf&lt;/code&gt; file becomes brittle. At that scale, I recommend splitting your state files by subscription or using a wrapper tool. This reduces the blast radius of a single &lt;code&gt;terraform apply&lt;/code&gt; and decreases the time spent in the "refreshing state" phase, which can otherwise take several minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Can I use the same Service Principal for multiple subscriptions?&lt;/strong&gt;&lt;br&gt;
Yes, as long as that Service Principal has the required RBAC roles (for example, Contributor) across all targeted subscriptions. Terraform handles the switching via the &lt;code&gt;subscription_id&lt;/code&gt; field in the provider block.&lt;/p&gt;
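
&lt;p&gt;For illustration, two provider blocks can share one Service Principal (authenticated via the standard &lt;code&gt;ARM_CLIENT_ID&lt;/code&gt; and &lt;code&gt;ARM_CLIENT_SECRET&lt;/code&gt; environment variables) and differ only in the subscription they target; the IDs below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Both blocks authenticate with the same SP from the environment;
# only the target subscription changes.
provider "azurerm" {
  features {}
  subscription_id = "00000000-0000-0000-0000-0000000000aa"
}

provider "azurerm" {
  alias           = "hub"
  features {}
  subscription_id = "00000000-0000-0000-0000-0000000000bb"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;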

&lt;p&gt;&lt;strong&gt;Do I need to run &lt;code&gt;az account set&lt;/code&gt; before running Terraform?&lt;/strong&gt;&lt;br&gt;
No. When you explicitly define &lt;code&gt;subscription_id&lt;/code&gt; in the provider block, Terraform ignores the current active subscription in your Azure CLI session and targets the ID specified in the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does using aliases increase the plan time?&lt;/strong&gt;&lt;br&gt;
Slightly. Terraform must establish separate API sessions for each provider instance. In very large environments, this can add 10 to 30 seconds to the refresh phase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Provider aliases are the standard, professional way to handle multi-subscription Azure deployments. By separating your Hub and Spoke configurations and explicitly passing providers to your modules, you eliminate the risk of deploying resources to the wrong environment.&lt;/p&gt;

&lt;p&gt;Your next steps should be to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Audit your current &lt;code&gt;providers.tf&lt;/code&gt; and rename any generic aliases to functional names.&lt;/li&gt;
&lt;li&gt;Check your module calls to ensure &lt;code&gt;providers = { ... }&lt;/code&gt; is being used for all non-default subscriptions.&lt;/li&gt;
&lt;li&gt;Implement data blocks to automate the linkage between Hub and Spoke resources.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>terraformazure</category>
      <category>azurermprovider</category>
      <category>infrastructureascode</category>
      <category>azuresubscriptionmanagement</category>
    </item>
    <item>
      <title>GitHub Actions Security: How to Stop Secret Leaks in CI/CD</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Mon, 20 Apr 2026 21:46:31 +0000</pubDate>
      <link>https://forem.com/devopsstart/github-actions-security-how-to-stop-secret-leaks-in-cicd-2nh5</link>
      <guid>https://forem.com/devopsstart/github-actions-security-how-to-stop-secret-leaks-in-cicd-2nh5</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on devopsstart.com, this guide explores how to eliminate static secrets and harden your GitHub Actions pipelines against credential theft.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The fastest way to compromise a production environment isn't by hacking a firewall; it's by stealing a long-lived AWS Access Key leaked in a GitHub Actions log. Secret leakage in CI/CD pipelines is a systemic risk because these pipelines possess the "keys to the kingdom", allowing them to provision infrastructure, modify databases and push code to production.&lt;/p&gt;

&lt;p&gt;When secrets leak, they typically happen through three vectors: accidental logging, compromised third-party actions or malicious pull requests from external contributors. To stop this, you must move from static secrets to identity-based authentication using OpenID Connect (OIDC) and implement a strict least-privilege model for your workflow permissions.&lt;/p&gt;

&lt;p&gt;In this guide, you will learn how to implement OIDC, the danger of mutable version tags, and how to defend against "pwn-request" attacks. For those managing complex infrastructure, combining these security practices with &lt;a href="https://dev.to/tutorials/how-to-automate-terraform-reviews-with-github-actions"&gt;how to automate terraform reviews with github actions&lt;/a&gt; ensures that security is baked into the code review process, not just the execution phase.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anatomy of a Secret Leak: Why Your Logs Aren't Safe
&lt;/h2&gt;

&lt;p&gt;GitHub provides a built-in masking feature that replaces known secrets with asterisks (&lt;code&gt;***&lt;/code&gt;) in the logs. However, this is a convenience feature, not a security boundary. Attackers can easily bypass masking by encoding the secret. If a developer runs &lt;code&gt;echo $SECRET | base64&lt;/code&gt;, the resulting string is no longer the original secret and will not be masked. Any user with read access to the action run can decode it instantly.&lt;/p&gt;

&lt;p&gt;Another common leak vector is the "debug dump". When a pipeline fails, developers often add &lt;code&gt;run: env&lt;/code&gt; or &lt;code&gt;run: printenv&lt;/code&gt; to debug the environment. This prints every single environment variable to the logs. While GitHub tries to mask the secrets, any variable that was dynamically generated or slightly modified during the build process will leak in plain text.&lt;/p&gt;
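
&lt;p&gt;One mitigation for dynamically generated values is to register them with the runner's log masker before they can appear in any output, using the &lt;code&gt;::add-mask::&lt;/code&gt; workflow command. The token-minting script below is hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      - name: Derive and mask a token
        run: |
          # Hypothetical example: a credential derived during the build
          DERIVED_TOKEN=$(./scripts/mint-token.sh)
          # Register it so the runner redacts it from all subsequent logs
          echo "::add-mask::$DERIVED_TOKEN"
          echo "DERIVED_TOKEN=$DERIVED_TOKEN" &gt;&gt; "$GITHUB_ENV"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Masking remains best-effort; treat it as defense in depth, not as a substitute for keeping secrets out of logs entirely.&lt;/p&gt;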

&lt;p&gt;The most dangerous leak comes from the supply chain. If you use a third-party action like &lt;code&gt;uses: some-random-user/setup-tool@v1&lt;/code&gt;, you are executing arbitrary code from that user's repository. If that account is compromised, the attacker can update the code in &lt;code&gt;@v1&lt;/code&gt; to &lt;code&gt;curl&lt;/code&gt; your environment variables to an external server. Because the action runs with the &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; and any secrets you passed to it, the attacker gains full access without leaving a trace in your logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving from Static Secrets to OIDC
&lt;/h2&gt;

&lt;p&gt;The industry standard for securing cloud access in CI/CD is OpenID Connect (OIDC). Long-lived IAM keys (the &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; pair) are liabilities because they never expire and are often stored as static GitHub Secrets. If these leak, they remain valid until you manually rotate them. OIDC replaces these static keys with short-lived, identity-based tokens.&lt;/p&gt;

&lt;p&gt;With OIDC, GitHub Actions acts as an Identity Provider (IdP). When a workflow runs, it requests a JWT (JSON Web Token) from GitHub. The workflow then presents this token to the cloud provider (AWS, Azure or GCP). The cloud provider verifies the token's signature and checks if the "claims" (such as the repository name or the branch) match a pre-defined trust relationship. If they match, the provider issues a temporary security token, typically valid for one hour.&lt;/p&gt;

&lt;p&gt;To implement this in AWS, you first create an IAM Role with a Trust Policy that trusts the GitHub OIDC provider. Then, use the official &lt;code&gt;aws-actions/configure-aws-credentials&lt;/code&gt; action (v4). You must specify &lt;code&gt;permissions: id-token: write&lt;/code&gt; in your YAML to allow the runner to request the JWT.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: OIDC Authentication for AWS&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secure Deploy&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;main&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt; &lt;span class="c1"&gt;# Required for requesting the JWT&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;  &lt;span class="c1"&gt;# Required for checkout&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout code&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS credentials&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/github-oidc-role&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify Identity&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws sts get-caller-identity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the last command shows the assumed role, not a static user. If this workflow is compromised, the attacker only has a temporary token that expires quickly, which reduces the blast radius significantly compared to static keys.&lt;/p&gt;
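
&lt;p&gt;On the AWS side, the role's trust policy is what enforces the claim checks. A minimal sketch (the account ID and repository are placeholders) looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;sub&lt;/code&gt; condition is the critical line: it restricts the role to workflows from one repository and branch, so a token minted by any other repo is useless.&lt;/p&gt;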

&lt;h2&gt;
  
  
  Hardening the Supply Chain: The Danger of Mutable Tags
&lt;/h2&gt;

&lt;p&gt;Most DevOps engineers use version tags when referencing actions, such as &lt;code&gt;uses: actions/checkout@v4&lt;/code&gt;. This looks clean, but it is a security anti-pattern. Tags in Git are mutable; a maintainer (or an attacker who has hijacked the account) can move the &lt;code&gt;v4&lt;/code&gt; tag to a different, malicious commit. You think you are using a trusted version, but the underlying code has changed without your knowledge.&lt;/p&gt;

&lt;p&gt;To eliminate this risk, pin actions to a full-length commit SHA. A SHA is an immutable fingerprint of the code. If the code changes by a single character, the SHA changes. While this makes updating actions more tedious, it is the only way to guarantee that the code you audited is the code running today.&lt;/p&gt;

&lt;p&gt;I have seen this fail in organizations with dozens of repositories, where a single compromised community action allowed an attacker to exfiltrate internal environment variables across all of them. In a production environment with over 100 repositories, manually updating SHAs is a burden. Use a tool like Renovate Bot or Dependabot to automate these updates while keeping them pinned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# UNSAFE: Using a mutable tag&lt;/span&gt;
&lt;span class="c1"&gt;# If the maintainer changes what @v4 points to, your pipeline is compromised.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

&lt;span class="c1"&gt;# SAFE: Using a full-length commit SHA&lt;/span&gt;
&lt;span class="c1"&gt;# This code will NEVER change, regardless of what happens to the repository tags.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11&lt;/span&gt; &lt;span class="c1"&gt;# v4.1.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When pinning, always include a comment noting which version the SHA corresponds to. In environments with strict security compliance requirements, this level of granularity is often mandatory to pass SOC 2 or ISO 27001 audits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defending Against "Pwn-Requests" and Fork Attacks
&lt;/h2&gt;

&lt;p&gt;One of the most overlooked vulnerabilities in GitHub Actions is the handling of Pull Requests from forks. By default, the &lt;code&gt;pull_request&lt;/code&gt; event does not grant secrets to the runner for security reasons. However, developers often find this frustrating when they need to run integration tests that require a database key. To solve this, they use the &lt;code&gt;pull_request_target&lt;/code&gt; event.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pull_request_target&lt;/code&gt; event is extremely dangerous. Unlike &lt;code&gt;pull_request&lt;/code&gt;, it runs in the context of the base branch (usually &lt;code&gt;main&lt;/code&gt;) and has access to secrets. If you have a workflow triggered by &lt;code&gt;pull_request_target&lt;/code&gt; that checks out the code from the PR branch and then runs a script, a malicious contributor can modify that script in their fork to &lt;code&gt;echo $SECRET | base64&lt;/code&gt;. Since the workflow runs with the base branch's permissions, the attacker steals your production credentials.&lt;/p&gt;

&lt;p&gt;To safely handle external contributions, never execute untrusted code from a fork while secrets are present. If you need to run tests on a PR, use the standard &lt;code&gt;pull_request&lt;/code&gt; event and utilize "Environment" protections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# DANGEROUS: Vulnerable to pwn-requests&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request_target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt; &lt;span class="c1"&gt;# This checks out the PR code from the fork&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm install &amp;amp;&amp;amp; npm test&lt;/span&gt; &lt;span class="c1"&gt;# The PR author can change 'npm test' to steal secrets&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.API_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The correct pattern is to require a manual approval from a maintainer before a workflow can access a protected environment's secrets. This creates a human-in-the-loop firewall that prevents automated credential theft.&lt;/p&gt;
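
&lt;p&gt;In workflow terms, the safe pattern recommended above binds the job to a protected environment. This is a sketch: the environment name is an assumption, and the "Required reviewers" rule itself is configured in the repository settings, not in YAML:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;on:
  pull_request:   # Secrets are not exposed to fork code by default

jobs:
  integration-test:
    runs-on: ubuntu-latest
    # Secrets scoped to this environment are only released after a
    # maintainer approves the run in the GitHub UI
    environment: integration
    steps:
      - uses: actions/checkout@v4
      - run: npm install &amp;&amp; npm test
        env:
          API_KEY: ${{ secrets.API_KEY }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;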

&lt;h2&gt;
  
  
  Best Practices for CI/CD Hardening
&lt;/h2&gt;

&lt;p&gt;To maintain a secure posture, implement these five practices across every repository in your organization.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Implement a Global Permissions Policy&lt;/strong&gt;: Start every job with the most restrictive permissions. Use &lt;code&gt;permissions: contents: read&lt;/code&gt; by default and only add &lt;code&gt;id-token: write&lt;/code&gt; or &lt;code&gt;packages: write&lt;/code&gt; when specifically required. This prevents a compromised action from pushing code or publishing packages with the workflow's token.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Environment-Based Secrets&lt;/strong&gt;: Do not put production secrets in the global "Repository Secrets" section. Create a "Production" environment and assign secrets there. This allows you to enforce "Required Reviewers", meaning no code can access production keys without a senior engineer's sign-off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate Secret Scanning&lt;/strong&gt;: Integrate Gitleaks or TruffleHog into your pipeline as a pre-commit hook or an initial CI step. These tools look for patterns (like &lt;code&gt;AKIA...&lt;/code&gt; for AWS) and fail the build if a secret is detected in the commit history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid Secret Passing via Env&lt;/strong&gt;: Instead of passing secrets as environment variables to every step, pass them only to the specific step that needs them. This minimizes the number of processes that have the secret in their memory space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotate Credentials Every 90 Days&lt;/strong&gt;: Even with OIDC, some legacy systems require static keys. Implement a strict rotation policy. If a key is not rotated regularly, a leak might go undetected for months, giving attackers a permanent backdoor.&lt;/li&gt;
&lt;/ol&gt;
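
&lt;p&gt;Practice 4 above is easy to get wrong; as a sketch, the difference is simply where the &lt;code&gt;env&lt;/code&gt; block sits (&lt;code&gt;deploy.sh&lt;/code&gt; is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;jobs:
  deploy:
    runs-on: ubuntu-latest
    # Avoid: a job-level env exposes the secret to every step,
    # including third-party actions that never need it.
    # env:
    #   API_KEY: ${{ secrets.API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - name: Run tests (no secret needed, none provided)
        run: npm test
      - name: Deploy (the only step that needs the key)
        run: ./deploy.sh
        env:
          API_KEY: ${{ secrets.API_KEY }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;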

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does GitHub really mask all my secrets in the logs?
&lt;/h3&gt;

&lt;p&gt;No. GitHub only masks the exact string stored in the secret. If your code transforms the secret (e.g., base64 encoding, URL encoding or splitting the string), the resulting output will not be masked. Never rely on masking as a primary security control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is &lt;code&gt;pull_request_target&lt;/code&gt; worse than &lt;code&gt;pull_request&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pull_request&lt;/code&gt; runs in the context of the merge commit and, for pull requests from forks, has no access to secrets from the base repository. &lt;code&gt;pull_request_target&lt;/code&gt; runs in the context of the base branch and has full access to secrets, meaning any code introduced by a contributor in a fork can access those secrets if the workflow executes that code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I use OIDC for every single cloud provider?
&lt;/h3&gt;

&lt;p&gt;Yes. Every major provider (AWS, Azure, GCP and HashiCorp Vault) now supports OIDC for GitHub Actions. Moving away from static JSON keys or CSV credential files reduces your operational overhead and eliminates the risk of "stale" credentials living in your repository settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I still use version tags like &lt;code&gt;@v4&lt;/code&gt; if I use a private runner?
&lt;/h3&gt;

&lt;p&gt;Yes, but it is still a bad practice. Even on a private runner, a compromised third-party action can exfiltrate data from your internal network or steal the &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; to modify your source code. The location of the runner does not protect you from supply chain attacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Securing GitHub Actions requires moving away from the "trust by default" mindset. The combination of OIDC for identity, SHA pinning for supply chain integrity and strict &lt;code&gt;permissions&lt;/code&gt; blocks creates a defense-in-depth strategy. The most critical immediate step you can take is auditing your workflows for &lt;code&gt;pull_request_target&lt;/code&gt; and replacing static cloud keys with OIDC roles.&lt;/p&gt;

&lt;p&gt;Start by implementing these three actionable steps today: first, replace all &lt;code&gt;v*&lt;/code&gt; tags with commit SHAs in your most critical deployment pipeline. Second, migrate your production cloud authentication to OIDC to eliminate long-lived keys. Third, configure GitHub Environments with mandatory reviewers for all production secrets. By shifting security left into your CI/CD configuration, you ensure that your pipeline is a tool for delivery, not a liability.&lt;/p&gt;

</description>
      <category>githubactionssecurity</category>
      <category>oidcauthentication</category>
      <category>cicdhardening</category>
      <category>supplychainsecurity</category>
    </item>
    <item>
      <title>Cursor vs Copilot vs Cody: Best AI Editor for DevOps</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Mon, 20 Apr 2026 21:41:26 +0000</pubDate>
      <link>https://forem.com/devopsstart/cursor-vs-copilot-vs-cody-best-ai-editor-for-devops-5a42</link>
      <guid>https://forem.com/devopsstart/cursor-vs-copilot-vs-cody-best-ai-editor-for-devops-5a42</guid>
      <description>&lt;p&gt;&lt;em&gt;Choosing the right AI editor for DevOps is about more than just autocomplete—it's about codebase context. Originally published on devopsstart.com, this guide compares Cursor, Copilot, and Cody for IaC and Kubernetes workflows.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Choosing an AI code assistant for DevOps isn't about who can write the cleanest Python function; it's about who understands the relationship between your &lt;code&gt;variables.tf&lt;/code&gt;, your Helm charts and your GitHub Actions workflow. Most AI tools are built for application developers, which means they often fail when faced with the fragmented nature of infrastructure. If you've ever had Copilot suggest a deprecated Terraform provider or a Kubernetes API version that hasn't existed since 1.16, you know the "context problem" firsthand.&lt;/p&gt;

&lt;p&gt;In this guide, you'll learn how to navigate the trade-offs between GitHub Copilot, Cursor and Sourcegraph Cody specifically through the lens of a Platform or DevOps engineer. We will dive into how each tool handles codebase indexing, how they manage the hallucinations common in YAML and HCL, and which one actually helps you reduce "time to first green build" in a complex CI/CD pipeline. By the end, you'll have a clear decision matrix to determine which tool fits your specific organizational scale, security requirements and infrastructure complexity. Whether you are managing a handful of scripts or a massive polyglot monorepo, the right choice depends on how the AI "sees" your architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Problem: Why General AI Fails DevOps
&lt;/h2&gt;

&lt;p&gt;DevOps engineers don't write linear code; they build distributed systems. A single change in a Terraform module might require updates to a Kubernetes manifest and a corresponding change in a CI pipeline. Standard AI completions fail here because they typically rely on "active tab" context. If you are editing &lt;code&gt;deployment.yaml&lt;/code&gt; but the relevant environment variable is defined in &lt;code&gt;terraform/outputs.tf&lt;/code&gt; (which is closed), the AI is guessing based on generic internet patterns, not your actual architecture.&lt;/p&gt;

&lt;p&gt;For example, imagine you are trying to reference a secret created by an ExternalSecrets operator. A generic AI will suggest a standard Kubernetes Secret syntax. A context-aware AI knows you are using &lt;code&gt;ExternalSecret&lt;/code&gt; objects and will suggest the correct API group. This is the difference between a tool that saves you five seconds of typing and a tool that prevents a production outage. To solve this, tools have moved toward Retrieval-Augmented Generation (RAG), which indexes your local or remote files to provide actual project awareness. You can read more about the complexities of managing these environments in the &lt;a href="https://dev.to/blog/kubernetes-v136-features-deprecations-upgrade-guide"&gt;Kubernetes v1.36 Features, Deprecations &amp;amp; Upgrade Guide&lt;/a&gt; to see why version-specific context is so critical.&lt;/p&gt;
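
&lt;p&gt;To make that difference concrete, here is the kind of manifest a context-aware assistant should produce when the repository uses the External Secrets Operator (the store name and key paths are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# A generic AI tends to emit a plain v1 Secret; the project actually uses:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
spec:
  secretStoreRef:
    name: aws-secrets-manager   # placeholder store
    kind: ClusterSecretStore
  target:
    name: app-credentials       # the Kubernetes Secret it materializes
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: prod/app/db-password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;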

&lt;p&gt;Consider this scenario: you need to add a new resource to a Terraform module that already has a strict naming convention and specific tagging requirements defined in a separate &lt;code&gt;locals.tf&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# locals.tf&lt;/span&gt;
&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;common_tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;
    &lt;span class="nx"&gt;Project&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Phoenix"&lt;/span&gt;
    &lt;span class="nx"&gt;ManagedBy&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Terraform"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# main.tf&lt;/span&gt;
&lt;span class="c1"&gt;# You start typing: resource "aws_s3_bucket" "logs" {&lt;/span&gt;
&lt;span class="c1"&gt;# A context-blind AI suggests: tags = { Name = "logs" }&lt;/span&gt;
&lt;span class="c1"&gt;# A context-aware AI suggests: tags = local.common_tags&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the AI knows your &lt;code&gt;locals.tf&lt;/code&gt; exists, it stops hallucinating generic tags and starts following your internal standards. This eliminates the manual "copy-paste" cycle that often leads to inconsistent infrastructure and failed compliance checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cursor: The AI-Native Powerhouse for IaC
&lt;/h2&gt;

&lt;p&gt;Cursor is not a plugin; it is a fork of VS Code. This architectural choice is a game changer for DevOps engineers because it allows the AI to integrate deeply with the IDE's indexing engine. While Copilot feels like a sophisticated autocomplete, Cursor feels like a pair programmer that has actually read your entire repository. It uses a local index of your files, meaning when you ask it to "Add a new environment to the staging cluster," it scans your existing &lt;code&gt;.tfvars&lt;/code&gt; and &lt;code&gt;kustomize&lt;/code&gt; overlays to mirror the pattern exactly.&lt;/p&gt;

&lt;p&gt;For those managing complex Terraform projects, Cursor's &lt;code&gt;@Codebase&lt;/code&gt; feature is indispensable. You can prompt the AI to analyze the relationship between different modules without opening every file. This is particularly useful when you are implementing &lt;a href="https://dev.to/blog/terraform-testing-best-practices-beyond-plan-and-pray"&gt;Terraform Testing Best Practices&lt;/a&gt; and need the AI to generate test cases based on the actual resource dependencies. In large codebases with hundreds of modules, where naming conventions are strict and dependencies are deep, this level of indexing prevents the "hallucinated resource" error that plagues plugin-based assistants.&lt;/p&gt;

&lt;p&gt;Here is how you would actually use Cursor to refactor a Kubernetes manifest to use a new ConfigMap source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In Cursor, you use the Cmd+K (or Ctrl+K) interface.&lt;/span&gt;
&lt;span class="c"&gt;# Prompt: "@Codebase update all deployments in /k8s/overlays/prod to use the &lt;/span&gt;
&lt;span class="c"&gt;# new configmap-v2 defined in configmap.yaml"&lt;/span&gt;

&lt;span class="c"&gt;# Cursor identifies all files in the directory and applies the change:&lt;/span&gt;
&lt;span class="c"&gt;# Before:&lt;/span&gt;
&lt;span class="c"&gt;# configMapRef:&lt;/span&gt;
&lt;span class="c"&gt;#   name: app-config&lt;/span&gt;
&lt;span class="c"&gt;# After:&lt;/span&gt;
&lt;span class="c"&gt;# configMapRef:&lt;/span&gt;
&lt;span class="c"&gt;#   name: app-config-v2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic here is that Cursor doesn't just find and replace text; it understands that the &lt;code&gt;configMapRef&lt;/code&gt; is a Kubernetes object property. It maintains the indentation of your YAML (which is the bane of every DevOps engineer's existence) and ensures that the change is consistent across all target files. This removes the tedious manual verification usually required after a bulk edit.&lt;/p&gt;
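&lt;p&gt;Even after a clean-looking bulk edit, a scripted check is cheap insurance before you commit. Here is a minimal sketch of that verification step; the directory layout and ConfigMap names mirror the hypothetical example above, and a throwaway overlay is created so the script runs anywhere:&lt;/p&gt;

```shell
#!/usr/bin/env sh
# Verify a bulk ConfigMap rename: fail if any manifest still references
# the old name. OVERLAY_DIR stands in for k8s/overlays/prod from the
# example prompt above.
set -eu
OVERLAY_DIR="$(mktemp -d)"
OLD_NAME="app-config"
NEW_NAME="app-config-v2"

# Simulate two deployment manifests after the AI applied the edit.
printf 'configMapRef:\n  name: %s\n' "${NEW_NAME}" > "${OVERLAY_DIR}/api.yaml"
printf 'configMapRef:\n  name: %s\n' "${NEW_NAME}" > "${OVERLAY_DIR}/worker.yaml"

# Anchor the pattern so app-config does not also match app-config-v2.
if grep -rq "name: ${OLD_NAME}\$" "${OVERLAY_DIR}"; then
  echo "FAIL: stale references to ${OLD_NAME} remain"
  exit 1
fi
echo "OK: all manifests reference ${NEW_NAME}"
```

&lt;p&gt;In a real repository you would point &lt;code&gt;OVERLAY_DIR&lt;/code&gt; at the actual overlay path and wire the script into a pre-commit hook.&lt;/p&gt;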

&lt;h2&gt;
  
  
  Sourcegraph Cody: Mastering the Enterprise Monorepo
&lt;/h2&gt;

&lt;p&gt;While Cursor excels at local indexing, Sourcegraph Cody is designed for the enterprise scale. Many Platform teams work in massive polyglot monorepos where the Terraform code is in one directory, the Go-based operator is in another and the documentation is in a separate Wiki or GitHub Pages site. Cody's strength lies in its ability to pull context from remote repositories and external documentation via the Sourcegraph index.&lt;/p&gt;

&lt;p&gt;Cody is the "Enterprise Context King" because it doesn't just look at your open files; it looks at your entire organization's knowledge graph. If your company has a proprietary way of handling VPC peering or a specific wrapper around Pulumi, Cody can be configured to prioritize those internal patterns over generic public documentation. This is vital for SOC2 or HIPAA compliant environments where "following the internal standard" is not a suggestion, but a legal requirement.&lt;/p&gt;

&lt;p&gt;Imagine you are tasked with updating a CI pipeline using a custom internal GitHub Action that isn't documented on the public web.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/deploy.yml&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Internal Deploy&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-corp/deploy-helper@v2&lt;/span&gt; &lt;span class="c1"&gt;# Cody knows this action exists in your org&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cluster_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.CLUSTER_ID }}&lt;/span&gt;
          &lt;span class="c1"&gt;# Cody suggests the 'environment' input because it indexed &lt;/span&gt;
          &lt;span class="c1"&gt;# the 'deploy-helper' repo in the same organization.&lt;/span&gt;
          &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production'&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By indexing the &lt;code&gt;my-corp/deploy-helper&lt;/code&gt; repository, Cody provides suggestions for inputs and outputs that GitHub Copilot would simply guess. This reduces the need to constantly switch between your editor and the internal documentation browser. For teams implementing &lt;a href="https://dev.to/blog/gitops-testing-strategies-validate-deployments-with-argocd"&gt;GitOps Testing Strategies&lt;/a&gt;, Cody can help bridge the gap between the ArgoCD configuration and the underlying Kubernetes manifests by tracing the logic across different repositories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing AI Performance on YAML and HCL
&lt;/h2&gt;

&lt;p&gt;When it comes to Infrastructure as Code (IaC), the biggest risk is the "confidently wrong" suggestion. YAML is whitespace-sensitive, and both it and HCL (HashiCorp Configuration Language) are schema-dependent. GitHub Copilot is generally the fastest for simple snippets, but it is the most prone to hallucinating API versions. For example, it might suggest &lt;code&gt;apiVersion: extensions/v1beta1&lt;/code&gt; for an Ingress resource, which was deprecated years ago and removed outright in Kubernetes v1.22.&lt;/p&gt;

&lt;p&gt;Cursor and Cody perform better here because they can be anchored to specific versions of your codebase. If your project specifies Terraform v1.7.0 in a &lt;code&gt;.terraform-version&lt;/code&gt; file, Cursor is more likely to suggest syntax compatible with that version. In a head-to-head comparison for generating a complex Kubernetes NetworkPolicy, Cursor typically wins on formatting, while Cody wins on referencing your existing network architecture.&lt;/p&gt;

&lt;p&gt;Let's look at a practical comparison of how these tools handle a request to create a Kubernetes Service of type &lt;code&gt;LoadBalancer&lt;/code&gt; with specific cloud annotations for AWS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prompt: "Create a LoadBalancer service for the 'api' deployment with AWS NLB annotations"&lt;/span&gt;

&lt;span class="c1"&gt;# Copilot: Often gives a generic LoadBalancer without the specific &lt;/span&gt;
&lt;span class="c1"&gt;# service.beta.kubernetes.io/aws-load-balancer-type: nlb annotation.&lt;/span&gt;

&lt;span class="c1"&gt;# Cursor: Checks your other services, sees you use 'nlb-ip' mode, and suggests:&lt;/span&gt;
&lt;span class="c1"&gt;# annotations:&lt;/span&gt;
&lt;span class="c1"&gt;#   service.beta.kubernetes.io/aws-load-balancer-type: "nlb-ip"&lt;/span&gt;
&lt;span class="c1"&gt;#   service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"&lt;/span&gt;

&lt;span class="c1"&gt;# Cody: References the official AWS Load Balancer Controller docs (if indexed)&lt;/span&gt;
&lt;span class="c1"&gt;# and suggests the most current annotation for your specific K8s version.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "hallucination risk" in Kubernetes is particularly high because the API evolves so rapidly. A tool that relies on a training set from 2022 will lead you toward deprecated fields. A tool that uses RAG to look at your current &lt;code&gt;kubectl version&lt;/code&gt; or your manifest files will guide you toward the current standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices for AI-Driven DevOps
&lt;/h2&gt;

&lt;p&gt;To get the most out of these tools without introducing security vulnerabilities or infrastructure drift, you must treat AI output as a "proposed change" rather than "final code." Follow these guidelines to maintain stability.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use Version Pinning in Prompts&lt;/strong&gt;: Never just ask for a "Terraform script." Specify the version. Use prompts like "Using Terraform v1.7.x and the AWS provider v5.0, create a VPC..." This forces the AI to narrow its search space and reduces the likelihood of deprecated syntax.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify with Static Analysis&lt;/strong&gt;: AI is great at writing code but terrible at verifying it. Always pipe AI-generated HCL through &lt;code&gt;terraform validate&lt;/code&gt; and YAML through &lt;code&gt;kube-linter&lt;/code&gt; or &lt;code&gt;datree&lt;/code&gt;. This catches the small indentation errors that AI frequently introduces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-Seed Your Prompts&lt;/strong&gt;: In Cursor or Cody, explicitly tag the files that define your architecture. Instead of "Fix this error," use "@variables.tf @main.tf fix the mismatch in the subnet ID." This provides the RAG engine with a direct path to the answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sanitize Secrets Before Indexing&lt;/strong&gt;: Ensure your &lt;code&gt;.gitignore&lt;/code&gt; is robust. While most modern AI editors respect &lt;code&gt;.gitignore&lt;/code&gt;, double-check that you aren't indexing &lt;code&gt;terraform.tfstate&lt;/code&gt; files or the &lt;code&gt;.terraform/&lt;/code&gt; directory, which can contain credentials and other sensitive metadata.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterative Refinement&lt;/strong&gt;: Start with a high-level architecture prompt, then drill down into specific resources. Asking an AI to "Write my entire EKS cluster" usually results in a mess. Ask it to "Define the VPC," then "Define the EKS cluster using that VPC," and finally "Add the node groups."&lt;/li&gt;
&lt;/ol&gt;
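&lt;p&gt;Practices 2 and 4 are easy to wire into CI. The sketch below shows one way to chain validators while skipping any tool that isn't installed; the real invocations (&lt;code&gt;terraform validate&lt;/code&gt;, &lt;code&gt;kube-linter lint&lt;/code&gt;) are left as comments so the script itself stays runnable on any machine:&lt;/p&gt;

```shell
#!/usr/bin/env sh
# A "trust but verify" gate for AI-generated IaC. Each validator runs
# only if installed, so the same script works on laptops and CI runners.
set -eu

run_if_present() {
  tool="$1"
  shift
  if command -v "${tool}" > /dev/null 2>/dev/null; then
    "${tool}" "$@"
  else
    echo "SKIP: ${tool} not installed"
  fi
}

# Typical calls in a real pipeline (uncomment in your repo):
#   run_if_present terraform validate
#   run_if_present kube-linter lint k8s/
run_if_present echo "gate wired"
```

&lt;p&gt;Because missing tools only produce a &lt;code&gt;SKIP&lt;/code&gt; line, you can roll the gate out gradually and tighten it to a hard failure once every runner has the validators installed.&lt;/p&gt;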

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Which AI editor is the most secure for corporate code?
&lt;/h3&gt;

&lt;p&gt;Sourcegraph Cody generally leads in enterprise security because it offers robust controls over where data is stored and how it is indexed. For organizations with strict data residency requirements, Cody's ability to run on-premises or in a private cloud is a major advantage. Cursor and Copilot have "Privacy Modes" that promise not to train on your data, but for SOC2/HIPAA environments, the transparency of Cody's indexing layer is typically more acceptable to security auditors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can these tools actually replace writing Terraform by hand?
&lt;/h3&gt;

&lt;p&gt;No, and attempting to do so is dangerous. AI is excellent at boilerplate (creating 10 similar S3 buckets) and translation (converting a Helm chart to a Kustomize overlay), but it cannot reason about your business logic or the cost implications of a specific instance type. Use AI to handle the "syntax toil" while you handle the "architectural intent."&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I stop the AI from suggesting deprecated Kubernetes APIs?
&lt;/h3&gt;

&lt;p&gt;The best way is to provide a "source of truth" file in your repository. Create a &lt;code&gt;K8S_STANDARDS.md&lt;/code&gt; file that lists your cluster version and preferred API versions. In Cursor or Cody, refer to this file using &lt;code&gt;@K8S_STANDARDS.md&lt;/code&gt; in your prompt. This overrides the AI's general training data with your specific project requirements.&lt;/p&gt;
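&lt;p&gt;The standards file itself can be generated rather than hand-maintained, which keeps it from drifting out of date. A minimal sketch, with placeholder versions you would normally derive from &lt;code&gt;kubectl version&lt;/code&gt; and your manifests:&lt;/p&gt;

```shell
#!/usr/bin/env sh
# Generate the source-of-truth file the AI is told to read. The values
# below are placeholders; derive them from your real cluster in practice.
set -eu
CLUSTER_VERSION="1.29"
OUT_FILE="K8S_STANDARDS.md"

{
  echo "# Kubernetes Standards"
  echo "- Cluster version: ${CLUSTER_VERSION}"
  echo "- Ingress: networking.k8s.io/v1 only"
  echo "- Workloads: apps/v1 only"
  echo "- CronJobs: batch/v1 only"
} > "${OUT_FILE}"

echo "Wrote ${OUT_FILE}"
```

&lt;p&gt;Regenerating this file in CI after every cluster upgrade means your &lt;code&gt;@K8S_STANDARDS.md&lt;/code&gt; prompt context can never lag behind reality.&lt;/p&gt;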

&lt;h3&gt;
  
  
  Does using a fork like Cursor break my VS Code extensions?
&lt;/h3&gt;

&lt;p&gt;Since Cursor is a fork of VS Code, it is compatible with almost all VS Code extensions. You can import your existing themes, keybindings and plugins (like the HashiCorp Terraform extension) directly. The primary difference is the built-in AI layer, which replaces the need for a separate Copilot plugin.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The transition from "AI as a plugin" to "AI as an environment" is the most significant shift in DevOps productivity since the rise of GitOps. GitHub Copilot remains a solid choice for generalists who want a low-friction experience. However, for the specialized needs of a Platform Engineer, Cursor's local codebase indexing provides a level of precision in HCL and YAML that plugins cannot match. For those operating at a massive corporate scale, Sourcegraph Cody's remote context capabilities make it the only viable choice for navigating polyglot monorepos.&lt;/p&gt;

&lt;p&gt;Your next step should be a two-week trial: install Cursor for your local feature development to see if the &lt;code&gt;@Codebase&lt;/code&gt; indexing reduces your context-switching. Simultaneously, if you are in a large team, evaluate Cody's ability to index your internal documentation. Once you've chosen your tool, integrate a static analysis step into your CI pipeline to ensure that AI-generated speed doesn't come at the cost of production stability. Stop fighting with YAML indentation and start leveraging the context of your entire architecture.&lt;/p&gt;

</description>
      <category>aicodeeditors</category>
      <category>infrastructureascode</category>
      <category>kubernetesautomation</category>
      <category>devopstools</category>
    </item>
    <item>
<title>Build an Internal Developer Platform with Backstage and Crossplane</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Mon, 20 Apr 2026 21:36:22 +0000</pubDate>
      <link>https://forem.com/devopsstart/build-an-internal-developer-platform-with-backstage-and-5gjp</link>
      <guid>https://forem.com/devopsstart/build-an-internal-developer-platform-with-backstage-and-5gjp</guid>
      <description>&lt;p&gt;&lt;em&gt;Stop the 'ticket-ops' madness! This guide, originally published on devopsstart.com, shows you how to combine Backstage and Crossplane to build a true self-service Internal Developer Platform.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Stop forcing your developers to learn the intricacies of cloud provider consoles or struggle with 500-line Terraform modules just to get a database. The gap between raw infrastructure and developer productivity is where "ticket ops" thrives, slowing down deployment cycles and frustrating engineers. To solve this, you need an Internal Developer Platform (IDP) that abstracts infrastructure complexity into a self-service experience.&lt;/p&gt;

&lt;p&gt;An IDP allows developers to provision resources via a simplified interface without needing to be cloud experts. In this guide, you will learn how to build a production-ready IDP by combining Backstage and Crossplane. Backstage acts as your front-end portal, providing a unified interface for service discovery and software templates. Crossplane serves as the back-end control plane, turning Kubernetes into a universal API for managing cloud resources.&lt;/p&gt;

&lt;p&gt;By the end of this article, you will understand the architecture required to move from manual Infrastructure as Code (IaC) workflows to a scalable, self-service platform model. You'll see exactly how to map a button click in a UI to a live AWS RDS instance via GitOps, reducing the cognitive load on your developers while maintaining strict governance for your platform team. For more on managing the underlying clusters, you can check out &lt;a href="https://dev.to/blog/kubernetes-for-beginners-deploy-your-first-application"&gt;Kubernetes for Beginners: Deploy Your First Application&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture: Connecting Backstage to Crossplane
&lt;/h2&gt;

&lt;p&gt;Building an IDP isn't about one tool; it's about the pipeline. The most common mistake is trying to connect Backstage directly to a cloud API. That is a security nightmare and lacks auditability. Instead, use a GitOps-driven control plane architecture. In this flow, Backstage doesn't "create" the infrastructure; it "requests" it by committing a manifest to Git.&lt;/p&gt;

&lt;p&gt;The sequence works as follows: a developer selects a "Provision Postgres" template in the Backstage Scaffolder. Backstage then triggers a commit of a simple YAML file to a Git repository. An automated GitOps controller, such as ArgoCD, detects this change and syncs the manifest to a Kubernetes cluster. Inside that cluster, Crossplane v1.14.x sees the new Custom Resource (CR) and communicates with the cloud provider's API to provision the actual resource.&lt;/p&gt;

&lt;p&gt;This ensures that your Git history is the single source of truth, which is critical for compliance and disaster recovery. To ensure these deployments are handled reliably, you should learn &lt;a href="https://dev.to/tutorials/how-to-set-up-argo-cd-gitops-for-kubernetes-automation"&gt;How to Set Up Argo CD GitOps for Kubernetes Automation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The "connective tissue" here is the YAML schema. Backstage must output a manifest that exactly matches the &lt;code&gt;CompositeResourceDefinition&lt;/code&gt; (XRD) you've defined in Crossplane. If the Scaffolder outputs &lt;code&gt;db_size: small&lt;/code&gt; but Crossplane expects &lt;code&gt;storageClass: small&lt;/code&gt;, the request will hang in a "Pending" state. You must treat your XRDs as the API contract between your platform team and your developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Abstracting Cloud Complexity with Crossplane Compositions
&lt;/h2&gt;

&lt;p&gt;If you give developers raw Crossplane resources, you've just traded Terraform for Kubernetes YAML, which does not reduce cognitive load. The real power of Crossplane lies in Compositions. A Composition allows you to bundle multiple low-level resources (like a VPC, a Subnet, and an RDS instance) into a single, high-level "Composite Resource" (XR) that developers can actually understand.&lt;/p&gt;

&lt;p&gt;For example, instead of requiring a developer to specify &lt;code&gt;rds.aws.upbound.io/v1beta1&lt;/code&gt; resources with 20 mandatory fields, you create a &lt;code&gt;CompositeDatabase&lt;/code&gt; definition. The developer only provides a name and a size. Your platform team defines the "blueprint" that maps &lt;code&gt;size: small&lt;/code&gt; to a &lt;code&gt;t3.micro&lt;/code&gt; instance with 20GB of encrypted GP3 storage.&lt;/p&gt;

&lt;p&gt;Here is an example of a simplified &lt;code&gt;CompositeResourceDefinition&lt;/code&gt; (XRD) that defines the API your developers will use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apiextensions.crossplane.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CompositeResourceDefinition&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xpostgresdatabases.platform.example.org&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform.example.org&lt;/span&gt;
  &lt;span class="na"&gt;names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XPostgresDatabase&lt;/span&gt;
    &lt;span class="na"&gt;plural&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xpostgresdatabases&lt;/span&gt;
  &lt;span class="na"&gt;versions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1alpha1&lt;/span&gt;
    &lt;span class="na"&gt;served&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;referenceable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;openAPIV3Schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;object&lt;/span&gt;
        &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;object&lt;/span&gt;
            &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;storageGb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;integer&lt;/span&gt;
              &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here is how the developer's request (the "Claim") looks. This is the exact YAML that Backstage will generate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform.example.org/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PostgresDatabase&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order-service-db&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order-service-prod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;storageGb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By using this approach, you eliminate the need for developers to know AWS-specific jargon. You can change the underlying instance type or backup policy in the Composition without ever touching the developer's manifest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing the Backstage Scaffolder for Self-Service
&lt;/h2&gt;

&lt;p&gt;The Backstage Scaffolder is the engine that turns a user's form input into a Git commit. To make this work with Crossplane, you create a &lt;code&gt;template.yaml&lt;/code&gt; file. This template defines the UI form (the questions you ask the developer) and the "steps" required to process the answer.&lt;/p&gt;

&lt;p&gt;In a production setup, your template should not just create a file; it should validate the input. For example, if a developer requests 10,000GB of storage, your template or a validating admission webhook in Kubernetes should catch it. The template uses "Nunjucks" templating to inject the form values into the Crossplane Claim YAML.&lt;/p&gt;

&lt;p&gt;Below is a snippet of a Backstage software template designed to provision a Crossplane database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage.io/template/scaffolder-entity/v1.0.0&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;provision-rds-postgres&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Provision RDS Postgres&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Creates a production-ready Postgres DB via Crossplane&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Database Details&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;dbName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Database Name&lt;/span&gt;
        &lt;span class="na"&gt;storageGb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;integer&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Storage Size (GB)&lt;/span&gt;
          &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Environment&lt;/span&gt;
          &lt;span class="na"&gt;enum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;prod&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetch-base&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetch:template&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;templateRepo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;templates/infrastructure/rds&lt;/span&gt;
        &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.dbName }}&lt;/span&gt;
          &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.storageGb }}&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.environment }}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;publish&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;publish:github&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;allowedStatuses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;success&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;repoUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.com?owner=my-org&amp;amp;repo=${{ parameters.dbName }}-infra&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the developer clicks "Create," Backstage creates a new repository (or updates an existing one) with the resulting YAML. The critical part is the &lt;code&gt;fetch:template&lt;/code&gt; step. It takes the generic &lt;code&gt;claim.yaml&lt;/code&gt; from your template repository and fills it with the user's specific requirements. This removes the possibility of syntax errors in the YAML, as the developer never actually writes the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The GitOps Feedback Loop and Production Gotchas
&lt;/h2&gt;

&lt;p&gt;A major pain point in IDPs is the "black hole" effect: a developer clicks a button in Backstage, the commit happens, and then nothing. They have no idea if the database is actually ready or if the Crossplane provider is stuck in a back-off loop. To solve this, you must implement a feedback loop.&lt;/p&gt;

&lt;p&gt;One effective method is using the Backstage Kubernetes plugin combined with the Crossplane status fields. Crossplane updates the &lt;code&gt;status&lt;/code&gt; section of the Claim resource once the cloud provider confirms the resource is &lt;code&gt;Ready: True&lt;/code&gt;. You can configure Backstage to surface these Kubernetes resource statuses directly on the service's catalog page. If a resource is failing, the developer sees a "Warning" status in the portal, which links them to the logs.&lt;/p&gt;
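&lt;p&gt;Under the hood, that status surface boils down to reading the claim's &lt;code&gt;Ready&lt;/code&gt; condition. The sketch below parses it from a saved manifest so it runs without a cluster; against a live cluster you would feed it the output of &lt;code&gt;kubectl get ... -o yaml&lt;/code&gt; instead:&lt;/p&gt;

```shell
#!/usr/bin/env sh
# Read the Ready condition from a claim manifest, the same field the
# Backstage Kubernetes plugin surfaces. A saved status is inlined so
# the sketch runs without a cluster; live, you would pipe in:
#   kubectl get postgresdatabase order-service-db -n order-service-prod -o yaml
set -eu
STATUS_FILE="$(mktemp)"
printf 'status:\n  conditions:\n  - type: Ready\n    status: "True"\n' > "${STATUS_FILE}"

# Grab the status line that follows the Ready condition entry.
READY="$(grep -A1 'type: Ready' "${STATUS_FILE}" | grep 'status:' | tr -d ' "' | cut -d: -f2)"

if test "${READY}" = "True"; then
  echo "Database is Ready"
else
  echo "Still provisioning (or stuck in back-off)"
fi
```

&lt;p&gt;The same one-liner works in a CI smoke test: poll until &lt;code&gt;Ready&lt;/code&gt; flips to &lt;code&gt;True&lt;/code&gt;, or fail the pipeline after a timeout.&lt;/p&gt;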

&lt;p&gt;In clusters with &amp;gt;100 nodes, you'll notice that Crossplane's reconciliation loop can put significant pressure on the Kubernetes API server. I've seen cases where frequent status updates across 500+ cloud resources caused measurable API latency. To mitigate this, tune the &lt;code&gt;--poll-interval&lt;/code&gt; argument on your Crossplane providers. Don't check every 60 seconds whether a database is ready; 5 or 10 minutes is usually sufficient for infrastructure that takes 15 minutes to provision.&lt;/p&gt;

&lt;p&gt;Another production gotcha is "orphaned resources." If a developer deletes the manifest from Git, ArgoCD deletes the Claim from Kubernetes, and Crossplane deletes the RDS instance. This is great for dev environments but catastrophic for production. You must implement a "deletion policy" in your Compositions. Set &lt;code&gt;deletionPolicy: Orphan&lt;/code&gt; for production workloads. This ensures that if the YAML is accidentally deleted, the actual cloud resource remains intact.&lt;/p&gt;
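&lt;p&gt;This is worth enforcing with a check rather than a convention. Here is a sketch that scans a directory of production Compositions and fails if any of them would delete cloud resources along with their YAML (the file layout is hypothetical; a sample Composition is created in place so the script runs anywhere):&lt;/p&gt;

```shell
#!/usr/bin/env sh
# Guardrail: every prod Composition must set deletionPolicy: Orphan.
# COMP_DIR stands in for your real compositions/prod directory.
set -eu
COMP_DIR="$(mktemp -d)"
printf 'kind: Composition\nspec:\n  deletionPolicy: Orphan\n' > "${COMP_DIR}/rds-prod.yaml"

VIOLATIONS=0
for f in "${COMP_DIR}"/*.yaml; do
  if ! grep -q 'deletionPolicy: Orphan' "${f}"; then
    echo "UNSAFE: ${f} would delete cloud resources on Git removal"
    VIOLATIONS=1
  fi
done
test "${VIOLATIONS}" -eq 0
echo "All prod Compositions orphan on delete"
```

&lt;p&gt;Run this as a required CI check on the repository that holds your Compositions, so an accidental policy regression is caught at review time rather than at 2 a.m.&lt;/p&gt;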

&lt;h2&gt;
  
  
  Best Practices for Platform Engineering
&lt;/h2&gt;

&lt;p&gt;Implementing an IDP is more of an organizational challenge than a technical one. If you build a perfect platform that no one uses, you've failed. Follow these principles to ensure adoption:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with the "Golden Path":&lt;/strong&gt; Do not try to automate every possible cloud resource on day one. Identify the three most requested resources (for example, S3 buckets, Postgres DBs, and Redis caches) and build high-quality templates for those. This provides immediate value and builds trust.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforce Governance via Compositions:&lt;/strong&gt; Use Crossplane Compositions to bake in security. Ensure every S3 bucket is encrypted and every RDS instance is in a private subnet by default. The developer shouldn't even see the "Encryption" checkbox; it should be mandatory and invisible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat your IDP as a Product:&lt;/strong&gt; Your developers are your customers. Conduct user interviews to find where the friction is. If they find the Backstage form too long, simplify it. If they need more visibility into costs, integrate a cost-tracking plugin.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement Strong RBAC:&lt;/strong&gt; Use Kubernetes namespaces to isolate claims. Ensure that a developer in the &lt;code&gt;team-a&lt;/code&gt; namespace cannot modify a &lt;code&gt;PostgresDatabase&lt;/code&gt; claim in the &lt;code&gt;team-b&lt;/code&gt; namespace. Use a tool like Kyverno to enforce these boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version your Compositions:&lt;/strong&gt; When you update a Composition (for example, upgrading the RDS instance class), don't just push it to production. Version your XRDs and Compositions so you can migrate services gradually rather than forcing a global update.&lt;/li&gt;
&lt;/ol&gt;
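&lt;p&gt;To make point 4 concrete, here is a minimal Kyverno policy sketch. The policy name, the claim kind, and the &lt;code&gt;team-*&lt;/code&gt; namespace convention are hypothetical and would need to match your own XRDs and namespace layout:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-db-claims            # hypothetical policy name
spec:
  validationFailureAction: Enforce    # reject violating requests at admission
  rules:
    - name: claims-only-in-team-namespaces
      match:
        any:
          - resources:
              kinds:
                - PostgresDatabase    # your XRD claim kind
      validate:
        message: "Database claims may only be created in team namespaces."
        pattern:
          metadata:
            namespace: "team-*"       # assumed namespace naming convention
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Combined with standard Kubernetes RBAC (which controls &lt;em&gt;who&lt;/em&gt; can act in each namespace), an admission policy like this controls &lt;em&gt;where&lt;/em&gt; claims are allowed to exist at all.&lt;/p&gt;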

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How does this approach differ from using Terraform with a CI/CD pipeline?&lt;/strong&gt;&lt;br&gt;
Traditional Terraform requires a "push" model where a pipeline runs &lt;code&gt;terraform apply&lt;/code&gt;. This often leads to state locking issues and configuration drift. The Backstage + Crossplane approach uses a "pull" model (Control Plane). Crossplane constantly monitors the state of the cloud and automatically corrects drift without needing a manual pipeline trigger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this mean I have to migrate all my existing Terraform code to Crossplane?&lt;/strong&gt;&lt;br&gt;
No. You can run them side-by-side. Use Crossplane for new, self-service workloads while keeping your core networking and foundation (VPCs, IAM roles) in Terraform. You can even use the Terraform provider for Crossplane to manage existing Terraform modules through the Kubernetes API.&lt;/p&gt;
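&lt;p&gt;To illustrate that last point: the Terraform provider for Crossplane exposes a &lt;code&gt;Workspace&lt;/code&gt; resource that can reference an existing module. The sketch below is illustrative only; the module URL is hypothetical, and field names should be checked against the provider's current documentation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: tf.upbound.io/v1beta1
kind: Workspace
metadata:
  name: legacy-vpc                  # hypothetical name
spec:
  forProvider:
    source: Remote
    # hypothetical repository holding your existing Terraform module
    module: git::https://github.com/example-org/terraform-vpc.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This lets the same GitOps pipeline reconcile legacy Terraform alongside native Crossplane resources while you migrate incrementally.&lt;/p&gt;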

&lt;p&gt;&lt;strong&gt;What happens if the cloud provider API is down during provisioning?&lt;/strong&gt;&lt;br&gt;
Crossplane employs an exponential back-off strategy. If the AWS API returns a 500 error, Crossplane will keep retrying the request. The Kubernetes resource will stay in a &lt;code&gt;Synced: False&lt;/code&gt; state. Because you have a GitOps audit trail, you can easily see which resources are stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Backstage overkill for small teams?&lt;/strong&gt;&lt;br&gt;
If you have fewer than five developers, a simple README and a set of shared Terraform modules might suffice. However, once you hit a scale where the platform team becomes a bottleneck for "simple" requests, the investment in Backstage pays off by eliminating the ticket queue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Combining Backstage and Crossplane allows you to move from a culture of "ticket-based infrastructure" to true self-service. By using Backstage as the user interface and Crossplane as the control plane, you create a system where developers can provision production-ready resources in minutes, not days. This doesn't just speed up delivery; it allows your platform engineers to stop performing repetitive manual tasks and start focusing on high-value architectural improvements.&lt;/p&gt;

&lt;p&gt;To get started, your first actionable step is to install Crossplane v1.14.x on a development cluster and create your first &lt;code&gt;CompositeResourceDefinition&lt;/code&gt; for a simple resource, like an S3 bucket. Once the API is working, set up a basic Backstage instance and create a software template that outputs the YAML required by that XRD. Start small, validate the "Golden Path" with one team, and then scale the platform to the rest of your organization.&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>backstageio</category>
      <category>crossplane</category>
      <category>internaldeveloperplatform</category>
    </item>
    <item>
      <title>Essential kubectl Commands Cheat Sheet</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:36:24 +0000</pubDate>
      <link>https://forem.com/devopsstart/essential-kubectl-commands-cheat-sheet-2elo</link>
      <guid>https://forem.com/devopsstart/essential-kubectl-commands-cheat-sheet-2elo</guid>
      <description>&lt;p&gt;&lt;em&gt;Stop memorizing every flag! I've put together a handy kubectl cheat sheet for managing pods, deployments, and debugging. Originally published on devopsstart.com.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pod Management
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get pods&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List all pods in current namespace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get pods -A&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List pods across all namespaces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl describe pod &amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show detailed pod information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl delete pod &amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Delete a specific pod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl logs &amp;lt;pod&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View pod logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl logs &amp;lt;pod&amp;gt; -f&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stream pod logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl exec -it &amp;lt;pod&amp;gt; -- sh&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Open shell in pod&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Deployments
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get deployments&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List all deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl scale deploy &amp;lt;name&amp;gt; --replicas=3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scale a deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl rollout status deploy/&amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check rollout status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl rollout undo deploy/&amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rollback a deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl set image deploy/&amp;lt;name&amp;gt; &amp;lt;container&amp;gt;=&amp;lt;image&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Update container image&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Services and Networking
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get svc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List all services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get ingress&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List all ingress resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl port-forward svc/&amp;lt;name&amp;gt; 8080:80&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Forward local port to service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get endpoints&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List service endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Debugging
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get events --sort-by=.metadata.creationTimestamp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View cluster events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl top pods&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show pod resource usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl top nodes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show node resource usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl logs &amp;lt;pod&amp;gt; --previous&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View logs from crashed container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl describe node &amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check node conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Context and Config
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl config get-contexts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List all contexts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl config use-context &amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Switch context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl config current-context&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show current context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get ns&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List namespaces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl config set-context --current --namespace=&amp;lt;ns&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Set default namespace&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>kubernetes</category>
      <category>kubectl</category>
      <category>cheatsheet</category>
    </item>
    <item>
      <title>Debug Kubernetes CrashLoopBackOff in 30 Seconds</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:31:20 +0000</pubDate>
      <link>https://forem.com/devopsstart/debug-kubernetes-crashloopbackoff-in-30-seconds-1c7c</link>
      <guid>https://forem.com/devopsstart/debug-kubernetes-crashloopbackoff-in-30-seconds-1c7c</guid>
      <description>&lt;p&gt;&lt;em&gt;Struggling with a pod stuck in CrashLoopBackOff? This quick guide, originally published on devopsstart.com, shows you the exact commands to diagnose the root cause in seconds.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Your pod is stuck in &lt;code&gt;CrashLoopBackOff&lt;/code&gt; and you need to find out why — fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;--previous&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--previous&lt;/code&gt; flag shows logs from the last crashed container instance. This is the single most useful flag for debugging crash loops.&lt;/p&gt;

&lt;h2&gt;
  
  
  Combine with describe for the full picture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 5 &lt;span class="s2"&gt;"Last State"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows the exit code and reason for the last termination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common Exit Codes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Application error&lt;/td&gt;
&lt;td&gt;Check app logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;137&lt;/td&gt;
&lt;td&gt;OOMKilled&lt;/td&gt;
&lt;td&gt;Increase memory limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;139&lt;/td&gt;
&lt;td&gt;Segfault&lt;/td&gt;
&lt;td&gt;Check binary compatibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;143&lt;/td&gt;
&lt;td&gt;SIGTERM&lt;/td&gt;
&lt;td&gt;Graceful shutdown issue&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why It Works
&lt;/h2&gt;

&lt;p&gt;Kubernetes keeps logs from the previous container instance even after it crashes. Without &lt;code&gt;--previous&lt;/code&gt;, you'd only see logs from the current (possibly empty) instance that hasn't had time to produce output before crashing again.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>debugging</category>
      <category>pods</category>
    </item>
    <item>
      <title>Rapid Rollback: `kubectl set image` for Urgent Fixes</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:26:17 +0000</pubDate>
      <link>https://forem.com/devopsstart/rapid-rollback-kubectl-set-image-for-urgent-fixes-52l5</link>
      <guid>https://forem.com/devopsstart/rapid-rollback-kubectl-set-image-for-urgent-fixes-52l5</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on devopsstart.com. When production breaks, every second counts—here is how to use &lt;code&gt;kubectl set image&lt;/code&gt; for a precise and rapid rollback.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;You've just deployed a new container image to production, and almost immediately, monitoring alerts start screaming. Latency is spiking, error rates are through the roof, and your customers are experiencing service degradation. In these high-pressure moments, a fast, reliable rollback mechanism is critical. While Kubernetes offers robust rollout and rollback capabilities via &lt;code&gt;kubectl rollout undo&lt;/code&gt;, there are specific scenarios where &lt;code&gt;kubectl set image&lt;/code&gt; can provide a quicker, more direct path to recovery, especially when you know &lt;em&gt;exactly&lt;/em&gt; which image version you need to revert to.&lt;/p&gt;

&lt;p&gt;This tip focuses on leveraging &lt;code&gt;kubectl set image&lt;/code&gt; for urgent rollbacks. You'll learn when this command is most effective, how to accurately identify the correct previous image tag, and how to execute the command to quickly stabilize your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding &lt;code&gt;kubectl set image&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;kubectl set image&lt;/code&gt; command is primarily designed to atomically update the image of one or more specific containers within a Kubernetes resource. It typically targets Deployments, StatefulSets, DaemonSets, or ReplicationControllers. When executed, it modifies the resource's Pod template to point to the new image tag, which then triggers a new rolling update.&lt;/p&gt;

&lt;p&gt;While &lt;code&gt;kubectl set image&lt;/code&gt; is frequently used for &lt;em&gt;forward&lt;/em&gt; deployments (e.g., updating &lt;code&gt;v1.1.9&lt;/code&gt; to &lt;code&gt;v1.2.0&lt;/code&gt;), its direct nature makes it exceptionally well-suited for rapid rollbacks. When you specify a previous, stable image, Kubernetes initiates a new rollout toward that desired state. This behavior differentiates it from &lt;code&gt;kubectl rollout undo&lt;/code&gt;, which inherently steps back through the deployment's recorded history, revision by revision.&lt;/p&gt;

&lt;p&gt;Here’s a common example of how &lt;code&gt;kubectl set image&lt;/code&gt; is used to update an image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;set &lt;/span&gt;image deployment/my-app my-container&lt;span class="o"&gt;=&lt;/span&gt;my-registry/my-app:v1.2.0 &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;deployment.apps/my-app image updated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command updates &lt;code&gt;my-container&lt;/code&gt; within the &lt;code&gt;my-app&lt;/code&gt; deployment in the &lt;code&gt;production&lt;/code&gt; namespace to use the &lt;code&gt;v1.2.0&lt;/code&gt; image from &lt;code&gt;my-registry&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Urgent Rollback Scenario
&lt;/h2&gt;

&lt;p&gt;Consider this scenario: your &lt;code&gt;my-app:v1.2.0&lt;/code&gt; release introduced a critical bug that bypassed your staging environment checks. You pushed it to production an hour ago, and now, critical alerts are firing, indicating significant application failures. You need to revert to the last known good image, let's say &lt;code&gt;my-app:v1.1.9&lt;/code&gt;, &lt;em&gt;immediately&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Why might &lt;code&gt;kubectl set image&lt;/code&gt; be preferred over &lt;code&gt;kubectl rollout undo&lt;/code&gt; in such a situation?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Directness and Precision:&lt;/strong&gt; If you know the exact, stable image tag to which you need to revert, &lt;code&gt;kubectl set image&lt;/code&gt; offers an explicit and precise command. This avoids ambiguity and ensures you land on the intended stable state directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bypassing Unhealthy Revisions:&lt;/strong&gt; If multiple faulty deployments occurred after your last stable one (e.g., you tried &lt;code&gt;v1.2.0&lt;/code&gt;, then &lt;code&gt;v1.2.1-hotfix&lt;/code&gt;, both failed), &lt;code&gt;kubectl rollout undo&lt;/code&gt; would sequentially step back through these potentially problematic revisions. &lt;code&gt;kubectl set image&lt;/code&gt; allows you to jump directly to the known good &lt;code&gt;v1.1.9&lt;/code&gt; without traversing the unstable intermediate states.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forced Redeploy (Edge Cases):&lt;/strong&gt; In rare cases, even if an image tag is theoretically the same, you might want to force Kubernetes to re-pull container images and redeploy pods due to local caching issues or other inconsistencies. Re-setting the image explicitly with &lt;code&gt;kubectl set image&lt;/code&gt; can achieve this, ensuring fresh pods are created. For more on debugging common Kubernetes issues, refer to our article on &lt;a href="https://dev.to/troubleshooting/crashloopbackoff-kubernetes"&gt;Troubleshooting CrashLoopBackOff in Kubernetes&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Identifying the Previous Image Tag
&lt;/h2&gt;

&lt;p&gt;The critical first step for a &lt;code&gt;kubectl set image&lt;/code&gt; rollback is accurately identifying the last known good image tag. You can achieve this by inspecting your deployment's revision history:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check Rollout History:&lt;/strong&gt;&lt;br&gt;
This command provides a concise summary of your deployment's revision history, showing the changes made at each step.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout &lt;span class="nb"&gt;history &lt;/span&gt;deployment/my-app &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
shell

    **Expected output:**


```bash
    deployment.apps/my-app 
    REVISION  CHANGE-CAUSE
    1         &amp;lt;none&amp;gt;
    2         my-container: my-registry/my-app:v1.1.8
    3         my-container: my-registry/my-app:v1.1.9
    4         my-container: my-registry/my-app:v1.2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;From this output, if `v1.2.0` (revision 4) is currently causing issues, then `v1.1.9` (revision 3) is your immediate target for rollback. Note that `CHANGE-CAUSE` may also contain details if `--record` was used during deployment.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Describe a Specific Revision (Optional Verification):&lt;/strong&gt;&lt;br&gt;
To be absolutely certain about the container images used in a particular revision, you can describe it in detail. This is a good verification step before initiating a rollback.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout &lt;span class="nb"&gt;history &lt;/span&gt;deployment/my-app &lt;span class="nt"&gt;--revision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
shell

    **Expected (truncated) output:**


```bash
    deployment.apps/my-app with revision 3
    Pod Template:
      Labels:       app=my-app
                    pod-template-hash=54c9c76...
      Containers:
        my-container:
          Image:        my-registry/my-app:v1.1.9
          Port:         8080/TCP
          Environment:  &amp;lt;none&amp;gt;
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This confirms that `my-registry/my-app:v1.1.9` was indeed the image used for revision 3, making it a reliable rollback target.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Executing the &lt;code&gt;kubectl set image&lt;/code&gt; Rollback
&lt;/h2&gt;

&lt;p&gt;Once you have identified the precise desired image tag (e.g., &lt;code&gt;my-registry/my-app:v1.1.9&lt;/code&gt; in our example), executing the rollback is straightforward and immediate:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;set &lt;/span&gt;image deployment/my-app my-container&lt;span class="o"&gt;=&lt;/span&gt;my-registry/my-app:v1.1.9 &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;deployment.apps/my-app image updated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upon execution, Kubernetes will immediately initiate a new rolling update. It will begin replacing the currently failing &lt;code&gt;v1.2.0&lt;/code&gt; pods with new ones running the specified stable &lt;code&gt;v1.1.9&lt;/code&gt; image.&lt;/p&gt;

&lt;p&gt;You can monitor the progress of this new rollout using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout status deployment/my-app &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output during rollout:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Waiting &lt;span class="k"&gt;for &lt;/span&gt;deployment &lt;span class="s2"&gt;"my-app"&lt;/span&gt; rollout to finish: 1 old replicas are pending termination...
Waiting &lt;span class="k"&gt;for &lt;/span&gt;deployment &lt;span class="s2"&gt;"my-app"&lt;/span&gt; rollout to finish: 1 old replicas are pending termination...
deployment &lt;span class="s2"&gt;"my-app"&lt;/span&gt; successfully rolled out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the rollout is complete, your application should be consistently running the stable &lt;code&gt;v1.1.9&lt;/code&gt; image, and your monitoring alerts should ideally begin to subside as service is restored.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Considerations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rollback Strategy Impact:&lt;/strong&gt; This &lt;code&gt;kubectl set image&lt;/code&gt; method performs a rolling update. It's crucial that your application is designed to handle a brief period where both the old (problematic) and new (stable) versions of pods are running concurrently. This typically means ensuring backward and forward compatibility for APIs and data schemas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Immutability:&lt;/strong&gt; Always strive to use immutable image tags (e.g., &lt;code&gt;v1.1.9&lt;/code&gt;, &lt;code&gt;v1.2.0&lt;/code&gt;, &lt;code&gt;sha256:abcdef...&lt;/code&gt;) rather than mutable tags like &lt;code&gt;latest&lt;/code&gt;. Immutable tags guarantee that a specific tag always refers to the exact same image content, which is fundamental for reliable and reproducible rollbacks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditing and History:&lt;/strong&gt; Using &lt;code&gt;kubectl set image&lt;/code&gt; creates a new revision in the deployment's history. This automatically ensures that your rollback action is recorded, providing a clear audit trail of changes made to your deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful Workloads:&lt;/strong&gt; For StatefulSets, exercising caution when changing image versions is paramount. If a new image version introduces changes that affect persistent storage or state, a simple image rollback might not fully resolve database schema migrations or data portability issues. Always understand the data implications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;When a problematic image release throws production into disarray, reaction time is paramount. While &lt;code&gt;kubectl rollout undo&lt;/code&gt; is a valuable tool, &lt;code&gt;kubectl set image&lt;/code&gt; provides a direct, efficient, and precise alternative for reverting to a specific, known-good image. This capability can significantly reduce Mean Time To Recovery (MTTR) by allowing you to bypass potentially multiple failing revisions and jump straight to stability. By understanding your deployment history and precisely targeting the last stable image, you can restore service in minutes instead of stepping back through broken revisions under pressure.&lt;/p&gt;

</description>
      <category>kubectl</category>
      <category>kubernetes</category>
      <category>rollback</category>
      <category>deployment</category>
    </item>
  </channel>
</rss>
