<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Yogesh VK</title>
    <description>The latest articles on Forem by Yogesh VK (@yogesh_vk).</description>
    <link>https://forem.com/yogesh_vk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3821385%2Fce580426-0152-47ef-a0df-c1df7d4f33bb.png</url>
      <title>Forem: Yogesh VK</title>
      <link>https://forem.com/yogesh_vk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/yogesh_vk"/>
    <language>en</language>
    <item>
      <title>AI + GitOps: Why Context Matters More Than Reconciliation</title>
      <dc:creator>Yogesh VK</dc:creator>
      <pubDate>Fri, 08 May 2026 06:15:00 +0000</pubDate>
      <link>https://forem.com/yogesh_vk/ai-gitops-why-context-matters-more-than-reconciliation-45e4</link>
      <guid>https://forem.com/yogesh_vk/ai-gitops-why-context-matters-more-than-reconciliation-45e4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;There’s something very satisfying about watching a system converge. You push a change to Git. A pipeline runs. A few minutes later, the cluster reflects exactly what you defined.&lt;/p&gt;

&lt;p&gt;The system was clean, predictable, and repeatable.&lt;/p&gt;

&lt;p&gt;For a long time, I thought that was the hard part. It turns out, it isn’t.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;This was a fairly typical setup.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes cluster on GKE&lt;/li&gt;
&lt;li&gt;Applications deployed using Helm&lt;/li&gt;
&lt;li&gt;GitLab CI driving deployments&lt;/li&gt;
&lt;li&gt;Terraform managing underlying infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The deployment flow was simple:&lt;br&gt;
&lt;code&gt;git push origin main&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Which triggered a GitLab pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;helm upgrade --install payments ./chart -f values.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Nothing fancy. A basic Helm chart defined a service with autoscaling.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# values.yaml&lt;/span&gt;
&lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="na"&gt;autoscaling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;200m"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;256Mi"&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Everything was stable.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Change
&lt;/h2&gt;

&lt;p&gt;We had started experimenting with using AI to assist in updating configurations. One suggestion came in to “optimize performance under load.” The diff looked reasonable.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt; autoscaling:
   enabled: true
&lt;span class="gd"&gt;-  maxReplicas: 10
&lt;/span&gt;&lt;span class="gi"&gt;+  maxReplicas: 30
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt; resources:
   requests:
&lt;span class="gd"&gt;-    cpu: "200m"
&lt;/span&gt;&lt;span class="gi"&gt;+    cpu: "500m"
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Nothing obviously wrong. More replicas. More CPU. Better performance.&lt;/p&gt;
&lt;h2&gt;
  
  
  The System Converges
&lt;/h2&gt;

&lt;p&gt;The change was merged. GitLab CI picked it up.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Running with gitlab-runner...
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; payments ./chart &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml
&lt;span class="go"&gt;Release "payments" has been upgraded
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A quick check:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get pods -n payments
payments-7d9f6c8c4d-abc12   1/1 Running
payments-7d9f6c8c4d-def34   1/1 Running

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Everything looked healthy. No errors. No failed deploys. From a system perspective:&lt;/p&gt;

&lt;p&gt;Everything worked.&lt;/p&gt;
&lt;h2&gt;
  
  
  But Something Was Off…
&lt;/h2&gt;

&lt;p&gt;A few hours later, the signals started showing up.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cluster CPU usage was higher than usual&lt;/li&gt;
&lt;li&gt;node autoscaling kicked in aggressively&lt;/li&gt;
&lt;li&gt;costs started creeping up&lt;/li&gt;
&lt;li&gt;some unrelated workloads saw intermittent throttling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing had technically failed. But the system behavior had changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem Wasn’t the Pipeline&lt;/strong&gt;. The pipeline did exactly what it was designed to do.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git was the source of truth&lt;/li&gt;
&lt;li&gt;the change was applied&lt;/li&gt;
&lt;li&gt;the cluster matched the desired state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There was no drift. No inconsistency. The problem wasn’t reconciliation. It was the definition of the desired state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Context Was Missing&lt;/strong&gt;. The AI-generated suggestion didn’t know that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this service wasn’t latency-critical&lt;/li&gt;
&lt;li&gt;the cluster had shared resource constraints&lt;/li&gt;
&lt;li&gt;aggressive scaling could impact other services&lt;/li&gt;
&lt;li&gt;cost limits were important in this environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From a configuration standpoint, the change was valid. From a system standpoint, it was incomplete.&lt;/p&gt;
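
&lt;p&gt;The arithmetic behind the diff makes the gap concrete. Below is a minimal sketch that uses only the values from the diff shown earlier; the helper names &lt;code&gt;millicores&lt;/code&gt; and &lt;code&gt;worst_case_request&lt;/code&gt; are illustrative and are not part of Helm or Kubernetes.&lt;/p&gt;

```python
def millicores(cpu: str) -> int:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to millicores."""
    if cpu.endswith("m"):
        return int(cpu[:-1])
    return int(float(cpu) * 1000)


def worst_case_request(max_replicas: int, cpu_request: str) -> int:
    """CPU (in millicores) that must be reserved if the HPA scales to its ceiling."""
    return max_replicas * millicores(cpu_request)


# Values taken from the diff: maxReplicas 10 to 30, CPU request 200m to 500m.
before = worst_case_request(10, "200m")
after = worst_case_request(30, "500m")

print(f"before: {before}m, after: {after}m, growth: {after / before:.1f}x")
# prints "before: 2000m, after: 15000m, growth: 7.5x"
```

&lt;p&gt;Two small edits multiply the CPU this one service can reserve at its ceiling from 2 cores to 15. Whether that is acceptable depends on what else shares the cluster, which is exactly the information the suggestion did not have.&lt;/p&gt;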

&lt;p&gt;&lt;strong&gt;The Subtle Danger&lt;/strong&gt; — This is where AI + GitOps becomes interesting.&lt;/p&gt;

&lt;p&gt;GitOps gives us a powerful guarantee:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;the system will converge to the declared state&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But it does not guarantee:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;that the declared state is the right one&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And AI, without sufficient context, can generate configurations that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;technically correct&lt;/li&gt;
&lt;li&gt;syntactically valid&lt;/li&gt;
&lt;li&gt;operationally deployable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…but not aligned with the system as a whole.&lt;/p&gt;

&lt;p&gt;Everything Looks “Green”. Even the deployment flow looks clean:&lt;br&gt;
&lt;code&gt;helm upgrade --install payments ./chart -f values.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And Kubernetes happily reports:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kubectl get hpa

NAME       REFERENCE             TARGETS   MINPODS   MAXPODS
payments   Deployment/payments   40%/80%   2         30

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;No alerts. No failures. Just a system doing exactly what it was told to do.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Realization
&lt;/h2&gt;

&lt;p&gt;Over time, this became clearer. Reconciliation is deterministic. Context is not. GitLab CI will apply whatever is in Git. Kubernetes will enforce whatever is defined. But neither of them understands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;intent&lt;/li&gt;
&lt;li&gt;trade-offs&lt;/li&gt;
&lt;li&gt;system-wide impact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That layer still depends on context.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where AI Actually Helps
&lt;/h2&gt;

&lt;p&gt;AI becomes genuinely useful when it helps us understand changes, not blindly generate them.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;explaining what a Helm diff actually means&lt;/li&gt;
&lt;li&gt;highlighting scaling implications&lt;/li&gt;
&lt;li&gt;surfacing cost impact&lt;/li&gt;
&lt;li&gt;identifying potential blast radius&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s where AI shines and actually adds value.&lt;/p&gt;
&lt;h2&gt;
  
  
  Closing Thought
&lt;/h2&gt;

&lt;p&gt;GitOps is incredibly powerful. It gives us consistency, traceability, and convergence. But it assumes that the desired state is correct. In AI-assisted workflows, that assumption becomes weaker. Because now, the desired state may be influenced by a system that doesn’t fully understand the environment. And that shifts the problem.&lt;/p&gt;

&lt;p&gt;From:&lt;br&gt;
&lt;code&gt;Will the system converge?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To:&lt;br&gt;
&lt;code&gt;Are we converging to the right thing?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;GitOps guarantees convergence. It does not guarantee correctness. That depends on context — and context still needs humans.&lt;/p&gt;

&lt;p&gt;Do you agree?&lt;/p&gt;

&lt;p&gt;Note: This post was developed using AI-assisted writing tools. While AI helped with structuring and phrasing, all concepts and examples reflect real-world engineering experience.&lt;/p&gt;

&lt;p&gt;Originally published on Medium:&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://medium.com/@yogesh.vk/ai-gitops-why-context-matters-more-than-reconciliation-f9e067d20e9a" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;medium.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>devops</category>
      <category>cicd</category>
    </item>
    <item>
      <title>AI as a Junior Platform Engineer: How I "Onboard" Coding Agents</title>
      <dc:creator>Yogesh VK</dc:creator>
      <pubDate>Fri, 01 May 2026 07:00:00 +0000</pubDate>
      <link>https://forem.com/yogesh_vk/ai-as-a-junior-platform-engineer-how-i-onboard-coding-agents-1nh2</link>
      <guid>https://forem.com/yogesh_vk/ai-as-a-junior-platform-engineer-how-i-onboard-coding-agents-1nh2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The first time I started seriously using AI in my DevOps workflows, I made the same mistake I've seen many others make.&lt;br&gt;
I treated it like a tool.&lt;br&gt;
Something you prompt, get an answer from, and move on. It worked, to a point. But the results were inconsistent. Sometimes surprisingly good, sometimes completely off. It felt less like working with a system and more like rolling dice.&lt;br&gt;
That changed when I started thinking about AI differently. Not as a tool - but as a junior platform engineer joining the team. That shift alone made everything more predictable.&lt;/p&gt;
&lt;h2&gt;
  
  
  The First Day Problem
&lt;/h2&gt;

&lt;p&gt;When a new engineer joins a team, we don't expect them to be productive immediately. We don't just hand them access to production systems and ask them to be productive.&lt;br&gt;
Instead, we onboard them. We give them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;context about the system&lt;/li&gt;
&lt;li&gt;documentation&lt;/li&gt;
&lt;li&gt;boundaries&lt;/li&gt;
&lt;li&gt;a safe environment to contribute&lt;/li&gt;
&lt;li&gt;time to understand how things work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without that, even a talented engineer will struggle. AI is no different.&lt;/p&gt;
&lt;h2&gt;
  
  
  Context Is the Difference Between Useful and Dangerous
&lt;/h2&gt;

&lt;p&gt;One of the biggest differences between good and bad AI output is context. Without context, an AI agent will give you generic answers. They might be technically correct, but not aligned with your system, your architecture, or your constraints. This is where something like a context.md file becomes incredibly powerful.&lt;br&gt;
Think of it as the onboarding document you would give a new engineer. It might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how your infrastructure is structured&lt;/li&gt;
&lt;li&gt;naming conventions&lt;/li&gt;
&lt;li&gt;environments and workflows&lt;/li&gt;
&lt;li&gt;constraints (cost, security, compliance)&lt;/li&gt;
&lt;li&gt;how Terraform modules are organized&lt;/li&gt;
&lt;li&gt;what "good" looks like in your system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the AI has this context, its suggestions start to feel less generic and more like they belong to your system. Just like a junior engineer who finally understands how things are wired.&lt;br&gt;
Sample context.md:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Platform Context&lt;/span&gt;

&lt;span class="gu"&gt;## Overview&lt;/span&gt;
This repository manages AWS infrastructure using Terraform.
Primary workloads run on EKS clusters across dev, staging, and production environments.

&lt;span class="gu"&gt;## Key Principles&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Prefer managed services where possible
&lt;span class="p"&gt;-&lt;/span&gt; Minimize blast radius of changes
&lt;span class="p"&gt;-&lt;/span&gt; Avoid cross-environment coupling
&lt;span class="p"&gt;-&lt;/span&gt; All changes must go through PR review

&lt;span class="gu"&gt;## Terraform Structure&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; modules/ → reusable infrastructure components
&lt;span class="p"&gt;-&lt;/span&gt; envs/dev → development environment
&lt;span class="p"&gt;-&lt;/span&gt; envs/staging → staging environment
&lt;span class="p"&gt;-&lt;/span&gt; envs/prod → production environment

&lt;span class="gu"&gt;## Naming Conventions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Resources follow: &lt;span class="nt"&gt;&amp;lt;env&amp;gt;&lt;/span&gt;-&lt;span class="nt"&gt;&amp;lt;service&amp;gt;&lt;/span&gt;-&lt;span class="nt"&gt;&amp;lt;type&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Example: prod-payments-eks

&lt;span class="gu"&gt;## Guardrails&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Never modify production directly
&lt;span class="p"&gt;-&lt;/span&gt; No &lt;span class="sb"&gt;`terraform apply`&lt;/span&gt; without PR approval
&lt;span class="p"&gt;-&lt;/span&gt; Avoid changes that trigger resource replacement unless explicitly required

&lt;span class="gu"&gt;## Cost Constraints&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Prefer smaller instance types unless justified
&lt;span class="p"&gt;-&lt;/span&gt; Autoscaling should always have upper limits defined

&lt;span class="gu"&gt;## Security&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; IAM roles must follow least privilege
&lt;span class="p"&gt;-&lt;/span&gt; No wildcard permissions unless explicitly approved

&lt;span class="gu"&gt;## Review Expectations&lt;/span&gt;
When reviewing a Terraform plan, focus on:
&lt;span class="p"&gt;-&lt;/span&gt; Resource replacements
&lt;span class="p"&gt;-&lt;/span&gt; Changes in networking or IAM
&lt;span class="p"&gt;-&lt;/span&gt; Scaling or cost implications
&lt;span class="p"&gt;-&lt;/span&gt; Cross-module impact

&lt;span class="gu"&gt;## What "Good" Looks Like&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Small, isolated changes
&lt;span class="p"&gt;-&lt;/span&gt; Clear PR descriptions
&lt;span class="p"&gt;-&lt;/span&gt; Minimal blast radius
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Once I started using something like this, the difference was noticeable.&lt;br&gt;
The AI responses became less generic and more aligned with how the system was actually designed. It started picking up on patterns like naming conventions, environment separation, and even risk signals like resource replacements.&lt;br&gt;
It felt much closer to working with someone who had been onboarded into the system, rather than someone guessing from scratch.&lt;/p&gt;
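
&lt;p&gt;How the file reaches the model depends on the tool. Some agents read a context file from the repository on their own; others need it passed in explicitly. For the explicit case, a minimal sketch might look like the following. The function name &lt;code&gt;build_prompt&lt;/code&gt; and the prompt wording are assumptions made for illustration, not any particular agent's API.&lt;/p&gt;

```python
import tempfile
from pathlib import Path


def build_prompt(context_path: Path, task: str) -> str:
    """Prepend the onboarding document to a task so every request carries the same context."""
    context = context_path.read_text(encoding="utf-8").strip()
    return (
        "You are assisting on the platform described below.\n\n"
        f"{context}\n\n"
        "Follow the guardrails above. Task:\n"
        f"{task.strip()}\n"
    )


# Demo with a throwaway file; in a real repository this would be the checked-in context.md.
with tempfile.TemporaryDirectory() as tmp:
    context_file = Path(tmp) / "context.md"
    context_file.write_text(
        "## Guardrails\n- Never modify production directly\n", encoding="utf-8"
    )
    prompt = build_prompt(context_file, "Explain the attached Terraform plan.")

print(prompt)
```

&lt;p&gt;Keeping the context in a versioned file, rather than retyping it per prompt, means it can be reviewed and improved through the same PR workflow as everything else.&lt;/p&gt;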
&lt;h2&gt;
  
  
  Guardrails Matter More Than Intelligence
&lt;/h2&gt;

&lt;p&gt;When onboarding a new engineer, we don't just give context. We also define boundaries. What they should and should not do. Where they can make changes. What requires review.&lt;br&gt;
AI needs the same guardrails. For example, I'm comfortable letting AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;suggest Terraform changes&lt;/li&gt;
&lt;li&gt;explain plan outputs&lt;/li&gt;
&lt;li&gt;summarize pull requests&lt;/li&gt;
&lt;li&gt;generate draft configurations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there are clear boundaries. AI should not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;directly apply infrastructure changes&lt;/li&gt;
&lt;li&gt;bypass review processes&lt;/li&gt;
&lt;li&gt;make decisions that require operational judgment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not limitations of capability. They are intentional design choices. Because just like with a new engineer, the goal is not maximum autonomy - it is safe contribution.&lt;/p&gt;
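
&lt;p&gt;One way to make such boundaries explicit is an allow-list that sits between the agent and anything it can trigger. The sketch below is hypothetical: the action names simply mirror the two lists above and are not taken from a real agent framework.&lt;/p&gt;

```python
from enum import Enum


class Action(Enum):
    SUGGEST_CHANGE = "suggest_change"
    EXPLAIN_PLAN = "explain_plan"
    SUMMARIZE_PR = "summarize_pr"
    DRAFT_CONFIG = "draft_config"
    APPLY_CHANGE = "apply_change"  # directly applying infrastructure changes
    APPROVE_PR = "approve_pr"      # would bypass the review process


# Only actions on this list are permitted; everything else is denied by default.
ALLOWED = frozenset(
    {
        Action.SUGGEST_CHANGE,
        Action.EXPLAIN_PLAN,
        Action.SUMMARIZE_PR,
        Action.DRAFT_CONFIG,
    }
)


def authorize(action: Action) -> bool:
    """Return True only for actions the agent is explicitly allowed to perform."""
    return action in ALLOWED


for action in (Action.EXPLAIN_PLAN, Action.APPLY_CHANGE):
    verdict = "allowed" if authorize(action) else "requires a human"
    print(f"{action.value}: {verdict}")
```

&lt;p&gt;The default-deny shape is the point: a new capability stays off until someone deliberately adds it to the list.&lt;/p&gt;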
&lt;h2&gt;
  
  
  Start With PRs, Not Production
&lt;/h2&gt;

&lt;p&gt;When a new engineer joins, we usually don't give them direct production access on day one. We ask them to start with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;small changes&lt;/li&gt;
&lt;li&gt;pull requests&lt;/li&gt;
&lt;li&gt;code reviews&lt;/li&gt;
&lt;li&gt;guided feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This builds confidence and trust over time. The same model works extremely well with AI. Instead of letting AI operate directly on infrastructure, I treat it as a contributor to the PR workflow. It can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate changes&lt;/li&gt;
&lt;li&gt;explain diffs&lt;/li&gt;
&lt;li&gt;highlight potential issues&lt;/li&gt;
&lt;li&gt;improve readability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the final decision still goes through human review. This keeps the system safe while still benefiting from AI acceleration.&lt;/p&gt;
&lt;h2&gt;
  
  
  Feedback Loops Make It Better
&lt;/h2&gt;

&lt;p&gt;A junior engineer improves with feedback. AI systems also improve with iteration. When something is off, the answer is almost never:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"AI doesn't work"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;More often, it means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The context was incomplete"&lt;br&gt;
 "The prompt didn't reflect constraints"&lt;br&gt;
 "The guardrails weren't clear"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Over time, refining context and expectations makes AI far more reliable. It starts behaving less like a random generator and more like a team member who understands the system.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Real Shift
&lt;/h2&gt;

&lt;p&gt;Thinking of AI as a junior platform engineer changes how you design workflows. Instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What can this tool do?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You start asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How would I onboard someone into this system?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question naturally leads you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better context&lt;/li&gt;
&lt;li&gt;clearer boundaries&lt;/li&gt;
&lt;li&gt;safer workflows&lt;/li&gt;
&lt;li&gt;more predictable outcomes&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Closing Thought
&lt;/h2&gt;

&lt;p&gt;AI in DevOps doesn't need to be treated as an autonomous operator. In many cases, it works best as a well-onboarded junior engineer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;guided by context&lt;/li&gt;
&lt;li&gt;constrained by guardrails&lt;/li&gt;
&lt;li&gt;contributing through safe workflows&lt;/li&gt;
&lt;li&gt;improving over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is not to replace engineers. It is to make systems easier to understand, safer to operate, and faster to evolve. And sometimes, the best way to do that is not to give AI more power - but to onboard it more thoughtfully.&lt;/p&gt;

&lt;p&gt;Curious to know what you think of this approach.&lt;/p&gt;

&lt;p&gt;Originally published on Medium:&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://medium.com/@yogesh.vk/ai-as-a-junior-platform-engineer-how-i-onboard-coding-agents-963e68e22742" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;medium.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>devops</category>
      <category>cloudcomputing</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Why AI and Automation Are Not Always the Right Answer in DevOps</title>
      <dc:creator>Yogesh VK</dc:creator>
      <pubDate>Fri, 10 Apr 2026 06:00:00 +0000</pubDate>
      <link>https://forem.com/yogesh_vk/why-ai-and-automation-are-not-always-the-right-answer-in-devops-18ch</link>
      <guid>https://forem.com/yogesh_vk/why-ai-and-automation-are-not-always-the-right-answer-in-devops-18ch</guid>
      <description>&lt;p&gt;In DevOps, AI, and platform engineering, speed without understanding can amplify failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Every engineering team reaches this moment sooner or later.&lt;/p&gt;

&lt;p&gt;A repetitive task appears. A process feels slow, and someone asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can we just automate this?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On the surface, it sounds like the right instinct. DevOps, after all, was built on the idea of reducing manual effort and increasing reliability through automation.&lt;/p&gt;

&lt;p&gt;But in my experience, automation is not always the right first answer.&lt;/p&gt;

&lt;p&gt;Sometimes the process is slow because it contains necessary human judgment. Sometimes the repetition is actually a signal that the underlying system design needs improvement. And increasingly, with AI entering DevOps workflows, there is a temptation to automate decisions that should still remain human.&lt;/p&gt;

&lt;p&gt;The question is not whether something can be automated. The more important question is whether it should be.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automation Often Scales Existing Problems
&lt;/h2&gt;

&lt;p&gt;One of the most common mistakes teams make is automating a broken or poorly understood workflow.&lt;/p&gt;

&lt;p&gt;A manual deployment process may feel slow and frustrating. But if the underlying release steps are unclear, automating them simply means the same confusion now happens faster and at larger scale.&lt;/p&gt;

&lt;p&gt;In infrastructure systems, this can be dangerous.&lt;/p&gt;

&lt;p&gt;A pipeline that automatically pushes Terraform changes into production may look efficient, but if reviewers do not fully understand the blast radius of those changes, automation simply accelerates risk.&lt;/p&gt;

&lt;p&gt;The result is not better engineering. It is faster failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Difference Between Repetition and Judgment
&lt;/h2&gt;

&lt;p&gt;Not every repetitive task is suitable for automation.&lt;/p&gt;

&lt;p&gt;Some tasks are repetitive because they are operationally necessary. Others require human context and judgment, even if the steps appear similar.&lt;/p&gt;

&lt;p&gt;For example, reviewing a Terraform plan may seem repetitive. But what the reviewer is actually doing is not checking syntax. They are making decisions about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;operational risk&lt;/li&gt;
&lt;li&gt;rollback impact&lt;/li&gt;
&lt;li&gt;customer-facing downtime&lt;/li&gt;
&lt;li&gt;security implications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not repetition. That is judgment.&lt;/p&gt;

&lt;p&gt;Automating the process without preserving that judgment layer often removes the most valuable part of the workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Makes This Even More Tempting
&lt;/h2&gt;

&lt;p&gt;The rise of AI in DevOps workflows makes this challenge even more relevant. AI tools can now do all of these and more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summarize pull requests&lt;/li&gt;
&lt;li&gt;explain Terraform plans&lt;/li&gt;
&lt;li&gt;analyze logs&lt;/li&gt;
&lt;li&gt;suggest infrastructure changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are genuinely useful capabilities. But there is an important boundary.&lt;/p&gt;

&lt;p&gt;AI should help engineers understand systems better. It should not replace ownership of decisions.&lt;/p&gt;

&lt;p&gt;For example, using AI to explain a Terraform plan is helpful. Using AI to automatically approve and apply infrastructure changes is often the wrong answer. The operational responsibility still belongs to humans.&lt;/p&gt;

&lt;h2&gt;
  
  
  Good Automation Removes Toil, Not Thinking
&lt;/h2&gt;

&lt;p&gt;The best automation removes toil, not thought. Good examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;automatic formatting checks&lt;/li&gt;
&lt;li&gt;CI validation pipelines&lt;/li&gt;
&lt;li&gt;policy enforcement&lt;/li&gt;
&lt;li&gt;environment cleanup schedules&lt;/li&gt;
&lt;li&gt;cost anomaly alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tasks are repetitive, rules-driven, and low in ambiguity. They benefit greatly from automation. What should remain human are workflows involving uncertainty, trade-offs, and accountability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Better Question to Ask
&lt;/h2&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can we automate this?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A better question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What part of this process is pure toil, and what part requires human judgment?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction changes everything. Once you separate toil from judgment, automation becomes much safer and much more effective.&lt;/p&gt;
&lt;h2&gt;
  
  
  Closing Thought
&lt;/h2&gt;

&lt;p&gt;In DevOps and platform engineering, automation is incredibly powerful.&lt;/p&gt;

&lt;p&gt;But the goal should never be automation for its own sake. The goal, in my opinion, is better systems.&lt;/p&gt;

&lt;p&gt;Sometimes that means automation. Sometimes it means improving engineers' understanding of the system first.&lt;/p&gt;

&lt;p&gt;And increasingly, with AI in the mix, it means being very deliberate about what we allow machines to decide on our behalf.&lt;/p&gt;

&lt;p&gt;Because not every slow process is a bad one. Some of them are where engineering judgment actually lives.&lt;/p&gt;

&lt;p&gt;Do you feel the same way?&lt;/p&gt;

&lt;p&gt;Originally published on Medium:&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://medium.com/@yogesh.vk/why-ai-and-automation-are-not-always-the-right-answer-in-devops-8c0cb5e439bf" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;medium.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Using AI to Explain Terraform Plans to Humans</title>
      <dc:creator>Yogesh VK</dc:creator>
      <pubDate>Fri, 27 Mar 2026 06:16:00 +0000</pubDate>
      <link>https://forem.com/yogesh_vk/using-ai-to-explain-terraform-plans-to-humans-3dp4</link>
      <guid>https://forem.com/yogesh_vk/using-ai-to-explain-terraform-plans-to-humans-3dp4</guid>
      <description>&lt;p&gt;Turning raw infrastructure diffs into decisions engineers can actually understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  INTRODUCTION
&lt;/h2&gt;

&lt;p&gt;Terraform plans are incredibly precise. They show every resource change, attribute modification, and dependency update that will occur during an apply.&lt;br&gt;
But precision is not the same as clarity.&lt;br&gt;
For many engineers reviewing infrastructure changes, Terraform plans feel more like a wall of text than a meaningful explanation of what is about to happen. The information is there, but extracting the real implications often requires experience and careful reading.&lt;/p&gt;

&lt;p&gt;This is exactly where AI can become useful. Not by executing infrastructure changes, but by translating Terraform plans into something humans can reason about.&lt;/p&gt;
&lt;h2&gt;
  
  
  THE PROBLEM WITH RAW TERRAFORM PLANS
&lt;/h2&gt;

&lt;p&gt;Terraform's plan output is designed for correctness, not readability.&lt;br&gt;
It faithfully lists changes such as resource replacements, attribute updates, and dependency adjustments. While this is ideal for machines and precise workflows, it can make reviews difficult for humans, especially in larger environments.&lt;br&gt;
A single plan might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hundreds of attribute updates&lt;/li&gt;
&lt;li&gt;nested resource changes&lt;/li&gt;
&lt;li&gt;implicit dependencies across modules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What reviewers actually want to know is much simpler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What changed?&lt;/li&gt;
&lt;li&gt;Why does it matter?&lt;/li&gt;
&lt;li&gt;Is the risk acceptable?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Terraform itself does not answer those questions.&lt;/p&gt;
&lt;h2&gt;
  
  
  WHERE HUMAN REVIEW BREAKS DOWN
&lt;/h2&gt;

&lt;p&gt;Experienced engineers eventually develop an instinct for reading Terraform plans. They scan for dangerous signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;resource replacement&lt;/li&gt;
&lt;li&gt;subnet or network changes&lt;/li&gt;
&lt;li&gt;IAM policy expansions&lt;/li&gt;
&lt;li&gt;scaling changes in compute clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But this intuition takes time to build, and even experienced reviewers can miss subtle interactions when reviewing large changes late in the day or under delivery pressure.&lt;br&gt;
The real problem isn't lack of information. It's cognitive load.&lt;/p&gt;

&lt;p&gt;Terraform tells us everything. Humans only need to understand the important parts.&lt;/p&gt;
&lt;h2&gt;
  
  
  WHY AI IS GOOD AT THIS PROBLEM
&lt;/h2&gt;

&lt;p&gt;AI models are particularly good at summarizing structured text and identifying patterns.&lt;br&gt;
A Terraform plan contains many signals that AI can interpret effectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which resources will be created, updated, or destroyed&lt;/li&gt;
&lt;li&gt;whether replacements will occur&lt;/li&gt;
&lt;li&gt;potential cost changes&lt;/li&gt;
&lt;li&gt;security-sensitive modifications&lt;/li&gt;
&lt;li&gt;large blast-radius changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of forcing humans to parse hundreds of lines of output, AI can produce a concise summary describing the operational impact.&lt;br&gt;
This transforms the Terraform plan from a raw diff into an explanation.&lt;/p&gt;
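&lt;p&gt;Before any model is involved, the first two signals in that list can be read straight from the plan's JSON form. The snippet below is a minimal sketch, assuming the plan was exported with terraform show -json; the sample data and the function names are illustrative, not part of any tool.&lt;/p&gt;

```python
import json
from collections import Counter


def classify(actions):
    """Map a Terraform `change.actions` list to a single label."""
    if "delete" in actions and "create" in actions:
        return "replace"  # appears as ["delete", "create"] or ["create", "delete"]
    return actions[0] if actions else "no-op"


def summarize_plan(plan):
    """Count planned actions and collect addresses being replaced or destroyed."""
    counts = Counter()
    risky = []
    for rc in plan.get("resource_changes", []):
        label = classify(rc["change"]["actions"])
        counts[label] += 1
        if label in ("replace", "delete"):
            risky.append((label, rc["address"]))
    return counts, risky


# Hypothetical excerpt shaped like `terraform show -json` output.
sample = json.loads("""
{"resource_changes": [
  {"address": "aws_eks_node_group.platform_nodes",
   "change": {"actions": ["delete", "create"]}},
  {"address": "aws_iam_role.nodes",
   "change": {"actions": ["no-op"]}}
]}
""")

counts, risky = summarize_plan(sample)
print(dict(counts))
print(risky)
```

&lt;p&gt;Counts like these do not explain anything on their own, but they give the model (and the reviewer) a reliable starting point that does not depend on parsing free-form text.&lt;/p&gt;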
&lt;h2&gt;
  
  
  AI AS A REVIEW ASSISTANT IN CI/CD
&lt;/h2&gt;

&lt;p&gt;A practical place to integrate this capability is within CI/CD pipelines.&lt;br&gt;
After generating a Terraform plan, a pipeline step can feed the plan output into an AI model. The model then produces a human-readable summary that is attached to the pull request.&lt;br&gt;
Instead of reviewing raw plan text alone, engineers see a structured explanation such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Risk Summary: This change replaces the EKS node group, which will trigger a rolling replacement of worker nodes.
Security Impact: No IAM policies were expanded.
Cost Impact: Estimated monthly increase: approximately $120 due to increased instance size.
Operational Notes: Node replacement may temporarily reduce cluster capacity during rollout.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This type of explanation does not replace the Terraform plan. It simply helps humans understand it faster.&lt;/p&gt;
&lt;h2&gt;
  
  
  USING GITHUB ACTIONS FOR AI-ASSISTED PLAN REVIEWS
&lt;/h2&gt;

&lt;p&gt;GitHub Actions provides a natural place to implement this pattern.&lt;br&gt;
A typical pipeline already includes steps like formatting, validation, and plan generation. Adding an AI analysis step is straightforward and can operate entirely in read-only mode.&lt;br&gt;
The workflow might look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run terraform plan&lt;/li&gt;
&lt;li&gt;Export plan output as JSON&lt;/li&gt;
&lt;li&gt;Send plan summary to an AI model&lt;/li&gt;
&lt;li&gt;Post a structured explanation as a pull request comment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key point is that the AI does not change infrastructure or execute Terraform commands. It only interprets the plan and produces a human-readable summary.&lt;br&gt;
This keeps the decision-making process firmly in human hands.&lt;/p&gt;
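&lt;p&gt;The "send plan summary to an AI model" step could be sketched roughly as follows. This assumes the plan JSON has already been loaded into a dictionary. The ask_model argument is a hypothetical stand-in for whichever model client a team uses, and the prompt wording is only an example. Attribute values are left out of the prompt on purpose, since plans can contain sensitive data.&lt;/p&gt;

```python
import json


def compact_plan(plan):
    """Keep only address, type, and actions for resources that actually change."""
    rows = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions == ["no-op"]:
            continue  # unchanged resources only add noise to the prompt
        rows.append({"address": rc["address"], "type": rc["type"], "actions": actions})
    return rows


def build_prompt(plan):
    """Assemble the text sent to the model; attribute values are deliberately omitted."""
    header = (
        "Summarize the risk, security, cost, and operational impact "
        "of this Terraform plan for a pull request reviewer.\n"
    )
    return header + json.dumps(compact_plan(plan), indent=2)


def review(plan, ask_model):
    """`ask_model` is a hypothetical callable: prompt text in, summary text out."""
    return ask_model(build_prompt(plan))
```

&lt;p&gt;Nothing in this sketch can run Terraform or touch infrastructure. It reads a dictionary and returns text, which is what keeps the step read-only.&lt;/p&gt;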
&lt;h2&gt;
  
  
  WHY THIS IMPROVES INFRASTRUCTURE SAFETY
&lt;/h2&gt;

&lt;p&gt;When infrastructure reviews fail, it is rarely because Terraform produced incorrect output.&lt;br&gt;
Failures occur because reviewers misinterpret the impact or miss important signals hidden within large plans.&lt;br&gt;
AI-assisted explanations reduce that risk by highlighting the kinds of changes humans care about most:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;replacements&lt;/li&gt;
&lt;li&gt;deletions&lt;/li&gt;
&lt;li&gt;network changes&lt;/li&gt;
&lt;li&gt;permission expansions&lt;/li&gt;
&lt;li&gt;scaling adjustments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI becomes a second set of eyes, helping reviewers focus their attention where it matters.&lt;/p&gt;
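&lt;p&gt;Some of these categories can also be pre-flagged with plain rules before the model adds its explanation. The sketch below is illustrative only: the prefix table is a hypothetical, AWS-only mapping that a team would need to extend, and it can tell that a permission-related resource changed but not whether permissions were actually expanded.&lt;/p&gt;

```python
# Illustrative mapping from AWS resource-type prefixes to review categories.
CATEGORY_PREFIXES = {
    "network": ("aws_vpc", "aws_subnet", "aws_security_group", "aws_route_table"),
    "permissions": ("aws_iam_",),
    "scaling": ("aws_autoscaling_group", "aws_eks_node_group"),
}


def flag_change(resource_type, actions):
    """Return review flags for one planned change, given its type and actions."""
    flags = []
    if "delete" in actions:
        # delete plus create is a replacement; delete alone is a removal
        flags.append("replacement" if "create" in actions else "deletion")
    for category, prefixes in CATEGORY_PREFIXES.items():
        if resource_type.startswith(prefixes):
            flags.append(category)
    return flags


print(flag_change("aws_eks_node_group", ["delete", "create"]))
print(flag_change("aws_iam_policy", ["update"]))
```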
&lt;h2&gt;
  
  
  THE IMPORTANT BOUNDARY
&lt;/h2&gt;

&lt;p&gt;Even though AI can interpret plans effectively, it should never be allowed to execute them.&lt;br&gt;
Running terraform apply still requires human ownership and operational judgment. AI can explain consequences, but it cannot decide whether those consequences are acceptable.&lt;br&gt;
That boundary is what keeps AI useful rather than dangerous.&lt;/p&gt;
&lt;h2&gt;
  
  
  CLOSING THOUGHT
&lt;/h2&gt;

&lt;p&gt;Terraform already tells us what will change.&lt;br&gt;
AI helps answer the more useful question: What does this change actually mean?&lt;/p&gt;

&lt;p&gt;By turning raw infrastructure diffs into clear explanations, AI allows DevOps teams to review changes faster, understand risk better, and make more confident decisions.&lt;br&gt;
And that is exactly where AI belongs in infrastructure workflows - helping humans think more clearly, not replacing their judgment.&lt;/p&gt;

&lt;p&gt;How does your team review Terraform plans today - raw output, custom tooling, or something smarter?&lt;/p&gt;

&lt;p&gt;Originally published on Medium:&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://medium.com/@yogesh.vk/using-ai-to-explain-terraform-plans-to-humans-e631b264fafd" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;medium.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>devops</category>
      <category>terraform</category>
      <category>cicd</category>
    </item>
    <item>
      <title>5 Expensive Terraform Mistakes I Keep Seeing in Real Infrastructure and How AI can Help</title>
      <dc:creator>Yogesh VK</dc:creator>
      <pubDate>Fri, 20 Mar 2026 06:15:00 +0000</pubDate>
      <link>https://forem.com/yogesh_vk/5-expensive-terraform-mistakes-i-keep-seeing-in-real-infrastructure-and-how-ai-can-help-2m38</link>
      <guid>https://forem.com/yogesh_vk/5-expensive-terraform-mistakes-i-keep-seeing-in-real-infrastructure-and-how-ai-can-help-2m38</guid>
      <description>&lt;p&gt;Small infrastructure decisions that quietly turn into large cloud bills&lt;br&gt;
Infrastructure-as-Code has dramatically improved how teams manage cloud environments. Terraform in particular has made it possible to define infrastructure in version-controlled, repeatable configurations.&lt;/p&gt;

&lt;p&gt;In theory, this should make infrastructure both predictable and efficient.&lt;/p&gt;

&lt;p&gt;In practice, however, Terraform does not automatically make systems cost-efficient. It simply makes infrastructure changes easier to reproduce. If inefficient patterns exist in the configuration, Terraform will reproduce them perfectly.&lt;/p&gt;

&lt;p&gt;Over time, small infrastructure decisions accumulate. Many of them appear harmless when introduced, but months later they become visible as unexpectedly large cloud bills.&lt;/p&gt;

&lt;p&gt;Here are some of the most common Terraform patterns I keep seeing that quietly drive up infrastructure costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Oversized Compute That Never Gets Revisited
&lt;/h2&gt;

&lt;p&gt;One of the most common patterns starts early in a project.&lt;/p&gt;

&lt;p&gt;During initial development, engineers often choose slightly larger instance types to avoid performance issues. It is safer to start with extra capacity rather than risk under-provisioning a critical service.&lt;/p&gt;

&lt;p&gt;The problem is that these instance sizes often remain unchanged long after workloads stabilize.&lt;/p&gt;

&lt;p&gt;Terraform makes it easy to define infrastructure once and leave it untouched. As long as systems continue running without obvious performance problems, there is little incentive to revisit instance sizing decisions.&lt;/p&gt;

&lt;p&gt;Over time, this leads to clusters and services running on instance types that are significantly larger than necessary.&lt;/p&gt;

&lt;p&gt;This issue is particularly visible in Kubernetes clusters, where node groups are frequently defined with conservative sizing assumptions. If workloads later become more efficient, the underlying infrastructure may remain over-provisioned indefinitely.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Resource Replacements Triggered by Small Configuration Changes
&lt;/h2&gt;

&lt;p&gt;Terraform’s declarative model means that certain configuration changes require resources to be replaced rather than updated.&lt;/p&gt;

&lt;p&gt;For example, modifying attributes such as subnet associations, encryption settings, or instance types may cause Terraform to destroy and recreate a resource.&lt;/p&gt;

&lt;p&gt;While Terraform clearly reports these replacements in the plan output, the operational and financial impact is not always obvious during review.&lt;/p&gt;

&lt;p&gt;Replacing compute clusters, databases, or node groups can temporarily increase infrastructure usage, create additional storage snapshots, or trigger redeployment processes that consume additional resources.&lt;/p&gt;

&lt;p&gt;When these replacements happen frequently across environments, they can contribute to unexpectedly high infrastructure costs.&lt;/p&gt;

&lt;p&gt;This is one reason many teams are beginning to use AI-assisted plan analysis in CI pipelines — to highlight resource replacements and explain their operational impact before they are applied.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Logging and Observability Configurations That Grow Without Limits
&lt;/h2&gt;

&lt;p&gt;Terraform is often used to provision logging pipelines and observability systems. These systems are essential for debugging and monitoring production environments.&lt;/p&gt;

&lt;p&gt;However, logging configurations are frequently defined with very generous defaults.&lt;/p&gt;

&lt;p&gt;For example, teams may configure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;high verbosity log levels&lt;/li&gt;
&lt;li&gt;long retention periods&lt;/li&gt;
&lt;li&gt;large ingestion pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These settings are useful during development but are rarely revisited as systems mature.&lt;/p&gt;

&lt;p&gt;Because Terraform configurations remain stable over time, these logging pipelines can continue collecting massive volumes of data long after the original debugging needs have passed.&lt;/p&gt;

&lt;p&gt;In some environments, observability costs eventually exceed the cost of the infrastructure being monitored.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Idle Infrastructure Environments
&lt;/h2&gt;

&lt;p&gt;Another common Terraform pattern involves environment duplication.&lt;/p&gt;

&lt;p&gt;Many organizations create separate environments for development, staging, integration testing, and experimentation. Terraform makes it easy to spin up these environments using identical modules.&lt;/p&gt;

&lt;p&gt;The problem is that these environments often remain running continuously even when they are rarely used.&lt;/p&gt;

&lt;p&gt;A staging environment that runs databases, compute nodes, load balancers, and storage resources can easily cost hundreds of dollars per month. Multiply that across multiple teams and environments, and the cost grows quickly.&lt;/p&gt;

&lt;p&gt;In many cases, these environments are only actively used during working hours.&lt;br&gt;
Automated scheduling policies or environment lifecycle management can dramatically reduce this waste, but these controls are rarely implemented initially.&lt;/p&gt;
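&lt;p&gt;A quick back-of-the-envelope calculation shows why scheduling matters. The figures below are hypothetical: a staging environment billed at $300 per month when always on, needed for 10 hours a day on 5 days a week. The sketch also assumes every cost scales with uptime, which is not true for storage or reserved capacity, so treat the result as an upper bound on savings.&lt;/p&gt;

```python
HOURS_PER_WEEK = 24 * 7  # 168


def scheduled_cost(monthly_cost, hours_per_day, days_per_week):
    """Scale an always-on monthly cost by the share of the week the environment runs."""
    active_fraction = (hours_per_day * days_per_week) / HOURS_PER_WEEK
    return monthly_cost * active_fraction


always_on = 300.0  # hypothetical staging bill, USD per month
scheduled = scheduled_cost(always_on, hours_per_day=10, days_per_week=5)
print(f"scheduled: ${scheduled:.2f}, saved: ${always_on - scheduled:.2f}")
# prints "scheduled: $89.29, saved: $210.71"
```

&lt;p&gt;Running 50 of 168 weekly hours cuts the uptime-dependent portion of the bill by roughly 70%.&lt;/p&gt;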

&lt;h2&gt;
  
  
  5. Storage That Quietly Accumulates
&lt;/h2&gt;

&lt;p&gt;Storage resources are particularly prone to long-term cost growth.&lt;/p&gt;

&lt;p&gt;Terraform configurations frequently create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;snapshots&lt;/li&gt;
&lt;li&gt;backups&lt;/li&gt;
&lt;li&gt;object storage buckets&lt;/li&gt;
&lt;li&gt;artifact repositories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because storage is relatively inexpensive per gigabyte, these resources often grow without much scrutiny.&lt;/p&gt;

&lt;p&gt;Over time, however, storage layers accumulate historical artifacts that are rarely accessed but continue to incur costs.&lt;/p&gt;

&lt;p&gt;Common examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;old database snapshots that were never cleaned up&lt;/li&gt;
&lt;li&gt;log archives retained indefinitely&lt;/li&gt;
&lt;li&gt;unused container images in registries&lt;/li&gt;
&lt;li&gt;artifact storage from old CI pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without lifecycle policies, these storage systems gradually become long-term archives rather than operational infrastructure.&lt;/p&gt;
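&lt;p&gt;The logic behind a lifecycle policy is simple enough to sketch in a few lines. This is an illustration of the rule, not a cloud API call: the snapshot records are hypothetical dictionaries, and the 30-day and keep-latest-3 defaults are assumptions a team would tune.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def expired(snapshots, now, keep_days=30, keep_latest=3):
    """Return ids of snapshots older than keep_days, always sparing the newest keep_latest."""
    newest_first = sorted(snapshots, key=lambda s: s["created"], reverse=True)
    candidates = newest_first[keep_latest:]
    return [s["id"] for s in candidates if now - s["created"] > timedelta(days=keep_days)]


# Hypothetical snapshot inventory, aged 1, 10, 40, 50, and 90 days.
now = datetime(2026, 3, 1, tzinfo=timezone.utc)
snapshots = [
    {"id": f"snap-{age}d", "created": now - timedelta(days=age)}
    for age in (1, 10, 40, 50, 90)
]
print(expired(snapshots, now))
```

&lt;p&gt;Keeping the newest few snapshots regardless of age guards against deleting the only usable backup of a system that simply has not changed in a while.&lt;/p&gt;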

&lt;h2&gt;
  
  
  Why These Issues Are Hard to Detect
&lt;/h2&gt;

&lt;p&gt;The most expensive Terraform mistakes rarely appear as obvious misconfigurations.&lt;/p&gt;

&lt;p&gt;Instead, they emerge gradually as systems evolve.&lt;/p&gt;

&lt;p&gt;Each decision may appear reasonable in isolation. The instance size seems safe. The logging level helps debugging. The staging environment might be needed later.&lt;/p&gt;

&lt;p&gt;The problem is that Terraform faithfully preserves these decisions over time.&lt;/p&gt;

&lt;p&gt;Without regular review, infrastructure configurations slowly drift away from the actual needs of the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AI Can Help Detect These Patterns Earlier
&lt;/h2&gt;

&lt;p&gt;AI cannot fix infrastructure architecture problems automatically. But it can help identify patterns that humans might overlook.&lt;/p&gt;

&lt;p&gt;For example, AI systems analyzing infrastructure configurations or Terraform plans can highlight signals such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compute resources that appear significantly over-provisioned&lt;/li&gt;
&lt;li&gt;environments that remain idle for long periods&lt;/li&gt;
&lt;li&gt;storage resources that grow continuously without access&lt;/li&gt;
&lt;li&gt;resource replacements that may trigger unnecessary redeployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These insights allow teams to review infrastructure decisions earlier rather than discovering problems only when the monthly bill arrives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;Terraform is an incredibly powerful tool for managing infrastructure. But like any automation system, it faithfully executes the decisions encoded within it.&lt;/p&gt;

&lt;p&gt;If inefficient patterns exist in the configuration, Terraform will reproduce them perfectly every time.&lt;/p&gt;

&lt;p&gt;The goal is not to avoid mistakes entirely. That is unrealistic in complex systems.&lt;/p&gt;

&lt;p&gt;The goal is to detect small inefficiencies early — before they accumulate into large and expensive infrastructure problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thought
&lt;/h2&gt;

&lt;p&gt;Cloud costs rarely explode because of one catastrophic decision.&lt;/p&gt;

&lt;p&gt;More often, they grow quietly from dozens of small infrastructure choices that were never revisited.&lt;/p&gt;

&lt;p&gt;Terraform gives us the power to manage infrastructure systematically. The challenge is making sure the systems we define remain aligned with how they are actually used.&lt;/p&gt;

&lt;p&gt;That requires continuous review, feedback, and sometimes a second set of eyes — whether human or machine.&lt;/p&gt;

&lt;p&gt;Originally published on Medium:&lt;br&gt;
&lt;a href="https://medium.com/@yogesh.vk/5-expensive-terraform-mistakes-i-keep-seeing-in-real-infrastructure-and-how-ai-can-help-9a4849ddfc91" rel="noopener noreferrer"&gt;https://medium.com/@yogesh.vk/5-expensive-terraform-mistakes-i-keep-seeing-in-real-infrastructure-and-how-ai-can-help-9a4849ddfc91&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>terraform</category>
      <category>aws</category>
    </item>
    <item>
      <title>AI for DevOps and Platform Engineering: Practical Use Cases That Actually Work</title>
      <dc:creator>Yogesh VK</dc:creator>
      <pubDate>Fri, 13 Mar 2026 04:53:55 +0000</pubDate>
      <link>https://forem.com/yogesh_vk/ai-for-devops-and-platform-engineering-practical-use-cases-that-actually-work-2a63</link>
      <guid>https://forem.com/yogesh_vk/ai-for-devops-and-platform-engineering-practical-use-cases-that-actually-work-2a63</guid>
      <description>&lt;p&gt;Moving beyond hype to real workflows where AI improves infrastructure engineering, and where AI is actually useful for DevOps and Platform Engineering teams today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;INTRODUCTION&lt;/strong&gt;&lt;br&gt;
AI is rapidly entering every corner of software engineering. DevOps and platform teams are no exception. New tools promise to generate infrastructure code, manage deployments, and even run operations autonomously.&lt;/p&gt;

&lt;p&gt;But most experienced infrastructure engineers react with skepticism.&lt;/p&gt;

&lt;p&gt;Infrastructure systems are complex, stateful, and deeply interconnected. Blind automation often introduces more risk than it removes. The question is not whether AI can be used in DevOps workflows — it is where it should be used, and where it should not.&lt;/p&gt;

&lt;p&gt;The most effective teams are not replacing engineers with AI. They are using AI to reduce cognitive load, surface hidden risks, and make better operational decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE SHIFT FROM AUTOMATION TO ASSISTED DECISION-MAKING&lt;/strong&gt;&lt;br&gt;
For years, DevOps focused heavily on automation. CI/CD pipelines automated builds, tests, deployments, and infrastructure provisioning. Infrastructure-as-Code tools like Terraform allowed teams to define environments in reproducible ways.&lt;/p&gt;

&lt;p&gt;AI introduces a new layer to this model.&lt;/p&gt;

&lt;p&gt;Instead of simply automating actions, AI can assist engineers in understanding the consequences of those actions. It becomes a reasoning layer that helps interpret complex systems rather than directly controlling them.&lt;/p&gt;

&lt;p&gt;In practice, this means AI is most valuable when it explains systems, analyzes changes, and highlights risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI FOR INFRASTRUCTURE CODE REVIEWS&lt;/strong&gt;&lt;br&gt;
Infrastructure changes often carry significant risk. A single change in Terraform can replace compute clusters, modify network boundaries, or expand IAM permissions.&lt;/p&gt;

&lt;p&gt;Traditional CI pipelines verify syntax and policy compliance, but they rarely explain the real impact of a change.&lt;/p&gt;

&lt;p&gt;AI can help fill this gap by reviewing Terraform plans and summarizing their implications. Instead of manually scanning hundreds of lines of plan output, engineers can see a concise explanation of what will change and why it matters.&lt;/p&gt;

&lt;p&gt;This turns infrastructure reviews into clearer conversations about risk and intent.&lt;/p&gt;

&lt;p&gt;Raw Terraform Plan (excerpt):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# aws_eks_node_group.platform_nodes must be replaced&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt;&lt;span class="err"&gt;/+&lt;/span&gt; &lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_eks_node_group"&lt;/span&gt; &lt;span class="s2"&gt;"platform_nodes"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;instance_types&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"t3.large"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="nx"&gt;-&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"m5.large"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nx"&gt;scaling_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;desired_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="nx"&gt;-&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;AI-Generated Explanation:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Terraform Plan Summary
High Impact Change
- EKS node group "platform_nodes" will be replaced
- Worker nodes will be recreated
Operational Impact
- Pods will be rescheduled during node replacement
- Temporary capacity reduction possible
Cost Impact
- Instance type upgrade (t3.large → m5.large)
- Estimated monthly increase: ~$120
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;AI IN CI/CD PIPELINES&lt;/strong&gt;&lt;br&gt;
CI/CD pipelines are another natural integration point for AI.&lt;/p&gt;

&lt;p&gt;Modern pipelines already perform many automated checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;formatting validation&lt;/li&gt;
&lt;li&gt;policy enforcement&lt;/li&gt;
&lt;li&gt;dependency scanning&lt;/li&gt;
&lt;li&gt;infrastructure plan generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI can extend this pipeline by interpreting the results of those checks.&lt;/p&gt;

&lt;p&gt;For example, an AI step in a GitHub Actions workflow might analyze a Terraform plan and generate a structured summary highlighting resource replacements, cost changes, or security-sensitive updates.&lt;/p&gt;

&lt;p&gt;The pipeline still requires human approval before changes are applied. AI simply improves the context available to reviewers.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Terraform Plan Review&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;terraform-plan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
   &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
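     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hashicorp/setup-terraform@v3&lt;/span&gt;
       &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;terraform_wrapper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; false
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Initialize Terraform&lt;/span&gt;
       &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform init -input=false&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Terraform Plan&lt;/span&gt;
       &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform plan -input=false -out=tfplan&lt;/span&gt;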
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Convert plan to JSON&lt;/span&gt;
       &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform show -json tfplan &amp;gt; plan.json&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AI Plan Analysis&lt;/span&gt;
       &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
         &lt;span class="s"&gt;ai-review plan.json &amp;gt; plan-summary.md&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post summary to PR&lt;/span&gt;
       &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;marocchino/sticky-pull-request-comment@v2&lt;/span&gt;
       &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;plan-summary.md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The AI step reads the Terraform plan and generates a human-readable summary posted directly into the pull request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI FOR SHIFT-LEFT INFRASTRUCTURE SECURITY&lt;/strong&gt;&lt;br&gt;
DevSecOps practices encourage teams to identify security risks earlier in the development lifecycle. However, infrastructure security policies are often difficult to interpret or enforce consistently.&lt;br&gt;
AI can assist by analyzing infrastructure definitions and identifying potential issues before they reach production.&lt;/p&gt;

&lt;p&gt;For example, an AI assistant could flag:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;overly permissive IAM policies&lt;/li&gt;
&lt;li&gt;public exposure of internal services&lt;/li&gt;
&lt;li&gt;misconfigured storage access&lt;/li&gt;
&lt;li&gt;network boundary changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These insights can appear during pull request reviews or pipeline checks, allowing teams to address security concerns before deployment.&lt;/p&gt;

&lt;p&gt;Example PR comment:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Infrastructure Security Review
Issue Detected
- S3 bucket allows public read access
Resource
aws_s3_bucket.website_assets
Risk
Public exposure of application assets.

Suggested Fix
Add an aws_s3_bucket_public_access_block resource for this bucket with:
block_public_acls = true
block_public_policy = true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;AI FOR OBSERVABILITY AND INCIDENT RESPONSE&lt;/strong&gt;&lt;br&gt;
Operations teams often face the challenge of interpreting large volumes of monitoring data.&lt;/p&gt;

&lt;p&gt;Logs, metrics, and alerts can provide enormous amounts of information, but identifying the root cause of an issue still requires human reasoning.&lt;/p&gt;

&lt;p&gt;AI can assist by analyzing telemetry data and highlighting patterns that indicate emerging problems. Instead of scanning dashboards and logs manually, engineers receive summaries that connect signals across systems.&lt;/p&gt;

&lt;p&gt;Used carefully, this can reduce alert fatigue and accelerate incident investigation.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
Raw logs:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR connection timeout db-primary
ERROR connection timeout db-primary
ERROR connection timeout db-primary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;AI explanation:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert Analysis
Pattern Detected
    Repeated connection failures to database cluster.
Likely Cause
    Database connection pool exhaustion.
Suggested Investigation
    Check RDS connection limits and application pool size.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This ties AI to real operations.&lt;/p&gt;
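&lt;p&gt;The "Pattern Detected" part of that explanation does not even need a model. A minimal sketch, assuming plain-text log lines that start with their level and an arbitrary threshold of three repeats, could group the noise before anything is sent for interpretation:&lt;/p&gt;

```python
from collections import Counter


def repeated_errors(lines, threshold=3):
    """Return (message, count) pairs for ERROR lines seen at least `threshold` times."""
    errors = Counter(line.strip() for line in lines if line.startswith("ERROR"))
    return [(msg, n) for msg, n in errors.most_common() if n >= threshold]


logs = [
    "ERROR connection timeout db-primary",
    "ERROR connection timeout db-primary",
    "ERROR connection timeout db-primary",
    "INFO health check ok",
]
print(repeated_errors(logs))
# prints [('ERROR connection timeout db-primary', 3)]
```

&lt;p&gt;The model's contribution is the next step: connecting a repeated message like this to a likely cause and a sensible place to start investigating.&lt;/p&gt;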

&lt;p&gt;&lt;strong&gt;WHERE AI SHOULD NOT BE USED&lt;/strong&gt;&lt;br&gt;
Despite its strengths, AI should not be allowed to control critical infrastructure operations without human oversight.&lt;/p&gt;

&lt;p&gt;Executing infrastructure changes, approving deployments, or modifying security policies are decisions that carry operational responsibility.&lt;/p&gt;

&lt;p&gt;AI can provide insight, but it cannot own the consequences of those decisions.&lt;/p&gt;

&lt;p&gt;The most effective DevOps teams treat AI as an assistant rather than an operator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BUILDING AI-AUGMENTED PLATFORM WORKFLOWS&lt;/strong&gt;&lt;br&gt;
The real opportunity is not replacing DevOps workflows, but enhancing them.&lt;/p&gt;

&lt;p&gt;A healthy AI-assisted platform might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI explanations for Terraform plans&lt;/li&gt;
&lt;li&gt;AI-generated summaries for infrastructure pull requests&lt;/li&gt;
&lt;li&gt;AI-assisted security analysis during CI/CD&lt;/li&gt;
&lt;li&gt;AI-powered analysis of observability data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each capability improves clarity and reduces cognitive load while preserving human ownership of operational decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLOSING THOUGHT&lt;/strong&gt;&lt;br&gt;
AI will undoubtedly influence how infrastructure systems are built and operated. But its greatest value will not come from replacing engineers.&lt;/p&gt;

&lt;p&gt;It will come from helping them understand increasingly complex systems.&lt;/p&gt;

&lt;p&gt;DevOps was originally about bringing development and operations closer together. The next phase may be about bringing human judgment and machine insight into better balance.&lt;/p&gt;

&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://medium.com/@yogesh.vk/ai-for-devops-and-platform-engineering-practical-use-cases-that-actually-work-efadc4a90f70?postPublishedType=initial" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;medium.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





</description>
      <category>terraform</category>
      <category>ai</category>
      <category>cicd</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
