Forem: Yogesh VK

AI + GitOps: Why Context Matters More Than Reconciliation

Yogesh VK — Fri, 08 May 2026 06:15:00 +0000

Introduction

There’s something very satisfying about watching a system converge. You push a change to Git. A pipeline runs. A few minutes later, the cluster reflects exactly what you defined.

The system was clean, predictable, and repeatable.

For a long time, I thought that was the hard part. It turns out, it isn’t.

The Setup

This was a fairly typical setup.

Kubernetes cluster on GKE
Applications deployed using Helm
GitLab CI driving deployments
Terraform managing underlying infrastructure

The deployment flow was simple:
git push origin main

Which triggered a GitLab pipeline:

deploy:
  stage: deploy
  script:
    - helm upgrade --install payments ./chart -f values.yaml

Nothing fancy. A basic Helm chart defined a service with autoscaling.

# values.yaml
replicaCount: 2

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10

resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

Everything was stable.

The Change

We had started experimenting with using AI to assist in updating configurations. One suggestion came in to “optimize performance under load.” The diff looked reasonable.

 autoscaling:
   enabled: true
-  maxReplicas: 10
+  maxReplicas: 30

 resources:
   requests:
-    cpu: "200m"
+    cpu: "500m"

Nothing obviously wrong. More replicas. More CPU. Better performance.

The System Converges

The change was merged. GitLab CI picked it up.

Running with gitlab-runner...
$ helm upgrade --install payments ./chart -f values.yaml
Release "payments" has been upgraded

A quick check:

kubectl get pods -n payments
payments-7d9f6c8c4d-abc12   1/1 Running
payments-7d9f6c8c4d-def34   1/1 Running

Everything looked healthy. No errors. No failed deploys. From a system perspective:

Everything worked.

But Something Was Off…

A few hours later, the signals started showing up.

cluster CPU usage was higher than usual
node autoscaling kicked in aggressively
costs started creeping up
some unrelated workloads saw intermittent throttling

Nothing had technically failed. But the system behavior had changed.

The Problem Wasn’t the Pipeline

The pipeline did exactly what it was designed to do.

Git was the source of truth
the change was applied
the cluster matched the desired state

There was no drift. No inconsistency. The problem wasn’t reconciliation. It was the definition of the desired state.

Where Context Was Missing

The AI-generated suggestion didn’t know:

this service wasn’t latency-critical
the cluster had shared resource constraints
aggressive scaling could impact other services
cost limits were important in this environment

From a configuration standpoint, the change was valid. From a system standpoint, it was incomplete.
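To make the gap concrete, it can help to compute the worst-case CPU the autoscaler is allowed to request before and after the change. The sketch below is a minimal illustration, not part of the original pipeline. The helper names are hypothetical, and it assumes CPU requests are written either in millicores ("200m") or in whole or fractional cores ("1", "0.5").

```python
def cpu_to_millicores(cpu: str) -> int:
    """Convert a Kubernetes CPU quantity such as "200m" or "1" to millicores."""
    if cpu.endswith("m"):
        return int(cpu[:-1])
    return int(float(cpu) * 1000)


def worst_case_cpu_request(max_replicas: int, cpu_request: str) -> int:
    """Total CPU (in millicores) requested if the HPA scales to its maximum."""
    return max_replicas * cpu_to_millicores(cpu_request)


before = worst_case_cpu_request(10, "200m")  # values before the change
after = worst_case_cpu_request(30, "500m")   # values after the change
print(f"before: {before}m, after: {after}m, growth: {after / before:.1f}x")
# prints "before: 2000m, after: 15000m, growth: 7.5x"
```

Under these assumptions the ceiling moves from 2 cores to 15 cores of requested CPU. Neither the diff nor the pipeline output makes that 7.5x jump visible.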

The Subtle Danger

This is where AI + GitOps becomes interesting.

GitOps gives us a powerful guarantee:

the system will converge to the declared state

But it does not guarantee:

that the declared state is the right one

And AI, without sufficient context, can generate configurations that are:

technically correct
syntactically valid
operationally deployable

…but not aligned with the system as a whole.

Everything Looks “Green”

Even the deployment flow looks clean:
helm upgrade --install payments ./chart -f values.yaml

And Kubernetes happily reports:

kubectl get hpa

NAME       REFERENCE             TARGETS   MINPODS   MAXPODS
payments   Deployment/payments   40%/80%   2         30

No alerts. No failures. Just a system doing exactly what it was told to do.

The Realization

Over time, this became clearer. Reconciliation is deterministic. Context is not. GitLab CI will apply whatever is in Git. Kubernetes will enforce whatever is defined. But neither of them understands:

intent
trade-offs
system-wide impact

That layer still depends on context.

Where AI Actually Helps

AI becomes genuinely useful when it helps us understand changes, not blindly generate them.

For example:

explaining what a Helm diff actually means
highlighting scaling implications
surfacing cost impact
identifying potential blast radius

That’s where AI shines and actually adds value.
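As a rough illustration of the first item, the sketch below flattens two versions of a values file into dotted keys and reports what changed. It is a hypothetical helper, assuming the values have already been loaded into plain dictionaries. The output still needs a human, or an AI that has been given real context, to say what the changes mean for the system.

```python
def flatten(values: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into {"a.b.c": value} form."""
    flat = {}
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_values(old: dict, new: dict) -> dict:
    """Return {path: (old_value, new_value)} for every key that differs."""
    old_flat, new_flat = flatten(old), flatten(new)
    changed = {}
    for path in sorted(old_flat.keys() | new_flat.keys()):
        if old_flat.get(path) != new_flat.get(path):
            changed[path] = (old_flat.get(path), new_flat.get(path))
    return changed


old = {"autoscaling": {"enabled": True, "minReplicas": 2, "maxReplicas": 10},
       "resources": {"requests": {"cpu": "200m", "memory": "256Mi"}}}
new = {"autoscaling": {"enabled": True, "minReplicas": 2, "maxReplicas": 30},
       "resources": {"requests": {"cpu": "500m", "memory": "256Mi"}}}

for path, (was, now) in diff_values(old, new).items():
    print(f"{path}: {was} -> {now}")
```

For the change in this post, the loop prints the two lines that mattered: the new maxReplicas and the new CPU request.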

Closing Thought

GitOps is incredibly powerful. It gives us consistency, traceability, and convergence. But it assumes that the desired state is correct. In AI-assisted workflows, that assumption becomes weaker. Because now, the desired state may be influenced by a system that doesn’t fully understand the environment. And that shifts the problem.

From:
Will the system converge?

To:
Are we converging to the right thing?

GitOps guarantees convergence. It does not guarantee correctness. That depends on context — and context still needs humans.

Do you agree?

Note: This post was developed using AI-assisted writing tools. While AI helped with structuring and phrasing, all concepts and examples reflect real-world engineering experience.

Originally published on Medium:

medium.com

AI as a Junior Platform Engineer: How I "Onboard" Coding Agents

Yogesh VK — Fri, 01 May 2026 07:00:00 +0000

Introduction

The first time I started seriously using AI in my DevOps workflows, I made the same mistake I've seen many others make.
I treated it like a tool.
Something you prompt, get an answer from, and move on. It worked, to a point. But the results were inconsistent. Sometimes surprisingly good, sometimes completely off. It felt less like working with a system and more like rolling dice.
That changed when I started thinking about AI differently. Not as a tool - but as a junior platform engineer joining the team. That shift alone made everything more predictable.

The First Day Problem

When a new engineer joins a team, we don't expect them to be productive immediately. We don't just hand them access to production systems and ask them to be productive.
Instead, we onboard them. We give them:

context about the system
documentation
boundaries
a safe environment to contribute
time to understand how things work

Without that, even a talented engineer will struggle. AI is no different.

Context Is the Difference Between Useful and Dangerous

One of the biggest differences between good and bad AI output is context. Without context, an AI agent will give you generic answers. They might be technically correct, but not aligned with your system, your architecture, or your constraints. This is where something like a context.md file becomes incredibly powerful.
Think of it as the onboarding document you would give a new engineer. It might include:

how your infrastructure is structured
naming conventions
environments and workflows
constraints (cost, security, compliance)
how Terraform modules are organized
what "good" looks like in your system

Once the AI has this context, its suggestions start to feel less generic and more like they belong to your system. Just like a junior engineer who finally understands how things are wired.
Sample context.md:

# Platform Context

## Overview
This repository manages AWS infrastructure using Terraform.
Primary workloads run on EKS clusters across dev, staging, and production environments.

## Key Principles
- Prefer managed services where possible
- Minimize blast radius of changes
- Avoid cross-environment coupling
- All changes must go through PR review

## Terraform Structure
- modules/ → reusable infrastructure components
- envs/dev → development environment
- envs/staging → staging environment
- envs/prod → production environment

## Naming Conventions
- Resources follow: <env>-<service>-<type>
- Example: prod-payments-eks

## Guardrails
- Never modify production directly
- No `terraform apply` without PR approval
- Avoid changes that trigger resource replacement unless explicitly required

## Cost Constraints
- Prefer smaller instance types unless justified
- Autoscaling should always have upper limits defined

## Security
- IAM roles must follow least privilege
- No wildcard permissions unless explicitly approved

## Review Expectations
When reviewing a Terraform plan, focus on:
- Resource replacements
- Changes in networking or IAM
- Scaling or cost implications
- Cross-module impact

## What "Good" Looks Like
- Small, isolated changes
- Clear PR descriptions
- Minimal blast radius

Once I started using something like this, the difference was noticeable.
The AI responses became less generic and more aligned with how the system was actually designed. It started picking up on patterns like naming conventions, environment separation, and even risk signals like resource replacements.
It felt much closer to working with someone who had been onboarded into the system, rather than someone guessing from scratch.
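How the context reaches the model depends on the tool. As a minimal sketch, assuming the context lives in a file such as context.md and the AI tool accepts a plain-text prompt, a hypothetical helper could prepend it to every task:

```python
from pathlib import Path


def build_prompt(context_path: str, task: str) -> str:
    """Prepend the onboarding context to a task description."""
    context = Path(context_path).read_text(encoding="utf-8").strip()
    return (
        "You are assisting a platform engineering team.\n\n"
        "## Platform context\n"
        f"{context}\n\n"
        "## Task\n"
        f"{task.strip()}\n"
    )
```

For example, build_prompt("context.md", "Review this Terraform plan for replacements.") would return a single string containing the platform context followed by the task. The function name and the prompt wording are illustrative, not a specific product's API.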

Guardrails Matter More Than Intelligence

When onboarding a new engineer, we don't just give context. We also define boundaries. What they should and should not do. Where they can make changes. What requires review.
AI needs the same guardrails. For example, I'm comfortable letting AI:

suggest Terraform changes
explain plan outputs
summarize pull requests
generate draft configurations

But there are clear boundaries. AI should not:

directly apply infrastructure changes
bypass review processes
make decisions that require operational judgment

These are not limitations of capability. They are intentional design choices. Because just like with a new engineer, the goal is not maximum autonomy - it is safe contribution.
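A boundary like "suggest, but never apply" is easier to trust when it is enforced in code rather than only written down. The sketch below is one hypothetical way to do that, assuming an agent proposes Terraform commands as strings before anything is executed. The allowlist and function name are illustrative.

```python
import shlex

# Hypothetical allowlist: Terraform subcommands that only read or preview.
READ_ONLY_SUBCOMMANDS = {"plan", "show", "validate"}

# Characters that could chain or redirect commands if a shell were involved.
SHELL_METACHARACTERS = set(";&|`$<>\n")


def is_allowed(command: str) -> bool:
    """Return True only for Terraform commands on the read-only allowlist."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False  # fail closed on anything that looks like shell chaining
    try:
        parts = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes and similar parse errors
    if len(parts) < 2 or parts[0] != "terraform":
        return False
    # Global flags before the subcommand (e.g. -chdir=...) are rejected too.
    return parts[1] in READ_ONLY_SUBCOMMANDS
```

The check deliberately fails closed: anything it does not recognize is refused, which mirrors how we would scope a new engineer's access in their first weeks.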

Start With PRs, Not Production

When a new engineer joins, we usually don't give them direct production access on day one. We ask them to start with:

small changes
pull requests
code reviews
guided feedback

This builds confidence and trust over time. The same model works extremely well with AI. Instead of letting AI operate directly on infrastructure, I treat it as a contributor to the PR workflow. It can:

generate changes
explain diffs
highlight potential issues
improve readability

But the final decision still goes through human review. This keeps the system safe while still benefiting from AI acceleration.

Feedback Loops Make It Better

A junior engineer improves with feedback. AI systems also improve with iteration. When something is off, the answer is almost never:

"AI doesn't work"

More often, it means:

"The context was incomplete"
"The prompt didn't reflect constraints"
"The guardrails weren't clear"

Over time, refining context and expectations makes AI far more reliable. It starts behaving less like a random generator and more like a team member who understands the system.

The Real Shift

Thinking of AI as a junior platform engineer changes how you design workflows. Instead of asking:

"What can this tool do?"

You start asking:

"How would I onboard someone into this system?"

That question naturally leads you to:

better context
clearer boundaries
safer workflows
more predictable outcomes

Closing Thought

AI in DevOps doesn't need to be treated as an autonomous operator. In many cases, it works best as a well-onboarded junior engineer:

guided by context
constrained by guardrails
contributing through safe workflows
improving over time

The goal is not to replace engineers. It is to make systems easier to understand, safer to operate, and faster to evolve. And sometimes, the best way to do that is not to give AI more power - but to onboard it more thoughtfully.

Curious to know what you think of this approach.

Originally published on Medium:

medium.com

Why AI and Automation Are Not Always the Right Answer in DevOps

Yogesh VK — Fri, 10 Apr 2026 06:00:00 +0000

In DevOps, AI, and platform engineering, speed without understanding can amplify failure.

Introduction

Every engineering team reaches this moment sooner or later.

A repetitive task appears. A process feels slow, and someone asks:

Can we just automate this?

On the surface, it sounds like the right instinct. DevOps, after all, was built on the idea of reducing manual effort and increasing reliability through automation.

But in my experience, automation is not always the right first answer.

Sometimes the process is slow because it contains necessary human judgment. Sometimes the repetition is actually a signal that the underlying system design needs improvement. And increasingly, with AI entering DevOps workflows, there is a temptation to automate decisions that should still remain human.

The question is not whether something can be automated. The more important question is whether it should be.

Automation Often Scales Existing Problems

One of the most common mistakes teams make is automating a broken or poorly understood workflow.

A manual deployment process may feel slow and frustrating. But if the underlying release steps are unclear, automating them simply means the same confusion now happens faster and at larger scale.

In infrastructure systems, this can be dangerous.

A pipeline that automatically pushes Terraform changes into production may look efficient, but if reviewers do not fully understand the blast radius of those changes, automation simply accelerates risk.

The result is not better engineering. It is faster failure.

The Difference Between Repetition and Judgment

Not every repetitive task is suitable for automation.

Some tasks are repetitive because they are operationally necessary. Others require human context and judgment, even if the steps appear similar.

For example, reviewing a Terraform plan may seem repetitive. But what the reviewer is actually doing is not checking syntax. They are making decisions about:

operational risk
rollback impact
customer-facing downtime
security implications

That is not repetition. I think that is judgment.

Automating the process without preserving that judgment layer often removes the most valuable part of the workflow.

AI Makes This Even More Tempting

The rise of AI in DevOps workflows makes this challenge even more relevant. AI tools can now do all of these and more:

summarize pull requests
explain Terraform plans
analyze logs
suggest infrastructure changes

These are genuinely useful capabilities. But there is an important boundary.

AI should help engineers understand systems better. It should not replace ownership of decisions.

For example, using AI to explain a Terraform plan is helpful. Using AI to automatically approve and apply infrastructure changes is often the wrong answer. The operational responsibility still belongs to humans.

Good Automation Removes Toil, Not Thinking

The best automation removes toil, not thought. Good examples include some of these:

automatic formatting checks
CI validation pipelines
policy enforcement
environment cleanup schedules
cost anomaly alerts

These tasks are repetitive, rules-driven, and low in ambiguity. They benefit greatly from automation. What should remain human are workflows involving uncertainty, trade-offs, and accountability.
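A cost anomaly alert is a good example of a rules-driven check. As a minimal sketch, assuming daily cost figures are already available as a list of numbers, a simple rule flags a day that sits well above the recent average. The function name and the threshold of three standard deviations are assumptions, not a recommendation.

```python
from statistics import mean, stdev


def is_cost_anomaly(history: list[float], today: float, k: float = 3.0) -> bool:
    """Flag today's cost if it exceeds the trailing mean by more than k standard deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    threshold = mean(history) + k * stdev(history)
    return today > threshold


recent = [100, 102, 98, 101, 99]     # hypothetical daily costs
print(is_cost_anomaly(recent, 150))  # prints "True"
print(is_cost_anomaly(recent, 103))  # prints "False"
```

The rule raises the alert. Deciding whether the spike is a problem or an expected launch still belongs to a person.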

The Better Question to Ask
Instead of asking:

Can we automate this?

A better question is:

What part of this process is pure toil, and what part requires human judgment?

That distinction changes everything. Once you separate toil from judgment, automation becomes much safer and much more effective.

Closing Thought

In DevOps and platform engineering, automation is incredibly powerful.

But the goal should never be automation for its own sake. In my opinion, the goal is better systems.

Sometimes that means automation. Sometimes it means improving engineers' understanding of the system first.

And increasingly, with AI in the mix, it means being very deliberate about what we allow machines to decide on our behalf.

Because not every slow process is a bad one. Some of them are where engineering judgment actually lives.

Do you feel the same way?

Originally published on Medium:

medium.com

Using AI to Explain Terraform Plans to Humans

Yogesh VK — Fri, 27 Mar 2026 06:16:00 +0000

Turning raw infrastructure diffs into decisions engineers can actually understand.

INTRODUCTION

Terraform plans are incredibly precise. They show every resource change, attribute modification, and dependency update that will occur during an apply.
But precision is not the same as clarity.
For many engineers reviewing infrastructure changes, Terraform plans feel more like a wall of text than a meaningful explanation of what is about to happen. The information is there, but extracting the real implications often requires experience and careful reading.

This is exactly where AI can become useful. Not by executing infrastructure changes, but by translating Terraform plans into something humans can reason about.

THE PROBLEM WITH RAW TERRAFORM PLANS

Terraform's plan output is designed for correctness, not readability.
It faithfully lists changes such as resource replacements, attribute updates, and dependency adjustments. While this is ideal for machines and precise workflows, it can make reviews difficult for humans, especially in larger environments.
Even a routine plan might include:

hundreds of attribute updates
nested resource changes
implicit dependencies across modules

What reviewers actually want to know is much simpler:

What changed?
Why does it matter?
Is the risk acceptable?

Terraform itself does not answer those questions.

WHERE HUMAN REVIEW BREAKS DOWN

Experienced engineers eventually develop an instinct for reading Terraform plans. They scan for dangerous signals:

resource replacement
subnet or network changes
IAM policy expansions
scaling changes in compute clusters

But this intuition takes time to build, and even experienced reviewers can miss subtle interactions when reviewing large changes late in the day or under delivery pressure.
The real problem isn't lack of information. It's cognitive load.

Terraform tells us everything. Humans only need to understand the important parts.

WHY AI IS GOOD AT THIS PROBLEM

AI models are particularly good at summarizing structured text and identifying patterns.
A Terraform plan contains many signals that AI can interpret effectively:

which resources will be created, updated, or destroyed
whether replacements will occur
potential cost changes
security-sensitive modifications
large blast-radius changes

Instead of forcing humans to parse hundreds of lines of output, AI can produce a concise summary describing the operational impact.
This transforms the Terraform plan from a raw diff into an explanation.

AI AS A REVIEW ASSISTANT IN CI/CD

A practical place to integrate this capability is within CI/CD pipelines.
After generating a Terraform plan, a pipeline step can feed the plan output into an AI model. The model then produces a human-readable summary that is attached to the pull request.
Instead of reviewing raw plan text alone, engineers see a structured explanation such as:

Risk Summary: This change replaces the EKS node group, which will trigger a rolling replacement of worker nodes.
Security Impact: No IAM policies were expanded.
Cost Impact: Estimated monthly increase: approximately $120 due to increased instance size.
Operational Notes: Node replacement may temporarily reduce cluster capacity during rollout.

This type of explanation does not replace the Terraform plan. It simply helps humans understand it faster.

USING GITHUB ACTIONS FOR AI-ASSISTED PLAN REVIEWS

GitHub Actions provides a natural place to implement this pattern.
A typical pipeline already includes steps like formatting, validation, and plan generation. Adding an AI analysis step is straightforward and can operate entirely in read-only mode.
The workflow might look like:

Run terraform plan
Export plan output as JSON
Send plan summary to an AI model
Post a structured explanation as a pull request comment

The key point is that the AI does not change infrastructure or execute Terraform commands. It only interprets the plan and produces a human-readable summary.
This keeps the decision-making process firmly in human hands.
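Before any model is involved, the JSON plan can be reduced to the signals reviewers care about. The sketch below assumes the JSON produced by terraform show -json, whose resource_changes entries carry an address and a change.actions list. The function names are hypothetical, and the grouping is a simplification of what a real review step would need.

```python
import json


def load_plan(path: str) -> dict:
    """Load a plan previously exported with `terraform show -json`."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_plan(plan: dict) -> dict:
    """Group resource addresses by the kind of change Terraform plans to make."""
    summary = {"create": [], "update": [], "delete": [], "replace": []}
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if "create" in actions and "delete" in actions:
            # Terraform reports a replacement as a delete plus a create.
            summary["replace"].append(rc["address"])
        elif actions == ["create"]:
            summary["create"].append(rc["address"])
        elif actions == ["update"]:
            summary["update"].append(rc["address"])
        elif actions == ["delete"]:
            summary["delete"].append(rc["address"])
        # "no-op" and "read" entries are ignored on purpose.
    return summary
```

Calling summarize_plan(load_plan("plan.json")) returns a small dictionary that can be handed to the model alongside the raw plan, or checked directly, for example to require an extra approval when the replace list is not empty.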

WHY THIS IMPROVES INFRASTRUCTURE SAFETY

When infrastructure reviews fail, it is rarely because Terraform produced incorrect output.
Failures occur because reviewers misinterpret the impact or miss important signals hidden within large plans.
AI-assisted explanations reduce that risk by highlighting the kinds of changes humans care about most:

replacements
deletions
network changes
permission expansions
scaling adjustments

The AI becomes a second set of eyes, helping reviewers focus their attention where it matters.

THE IMPORTANT BOUNDARY

Even though AI can interpret plans effectively, it should never be allowed to execute them.
Running terraform apply still requires human ownership and operational judgment. AI can explain consequences, but it cannot decide whether those consequences are acceptable.
That boundary is what keeps AI useful rather than dangerous.

CLOSING THOUGHT

Terraform already tells us what will change.
AI helps answer the more useful question: What does this change actually mean?

By turning raw infrastructure diffs into clear explanations, AI allows DevOps teams to review changes faster, understand risk better, and make more confident decisions.
And that is exactly where AI belongs in infrastructure workflows - helping humans think more clearly, not replacing their judgment.

How does your team review Terraform plans today - raw output, custom tooling, or something smarter?

Originally published on Medium:

medium.com

5 Expensive Terraform Mistakes I Keep Seeing in Real Infrastructure and How AI can Help

Yogesh VK — Fri, 20 Mar 2026 06:15:00 +0000

Small infrastructure decisions that quietly turn into large cloud bills

Infrastructure-as-Code has dramatically improved how teams manage cloud environments. Terraform in particular has made it possible to define infrastructure in version-controlled, repeatable configurations.

In theory, this should make infrastructure both predictable and efficient.

In practice, however, Terraform does not automatically make systems cost-efficient. It simply makes infrastructure changes easier to reproduce. If inefficient patterns exist in the configuration, Terraform will reproduce them perfectly.

Over time, small infrastructure decisions accumulate. Many of them appear harmless when introduced, but months later they become visible as unexpectedly large cloud bills.

Here are some of the most common Terraform patterns I keep seeing that quietly drive up infrastructure costs.

1. Oversized Compute That Never Gets Revisited

One of the most common patterns starts early in a project.

During initial development, engineers often choose slightly larger instance types to avoid performance issues. It is safer to start with extra capacity rather than risk under-provisioning a critical service.

The problem is that these instance sizes often remain unchanged long after workloads stabilize.

Terraform makes it easy to define infrastructure once and leave it untouched. As long as systems continue running without obvious performance problems, there is little incentive to revisit instance sizing decisions.

Over time, this leads to clusters and services running on instance types that are significantly larger than necessary.

This issue is particularly visible in Kubernetes clusters, where node groups are frequently defined with conservative sizing assumptions. If workloads later become more efficient, the underlying infrastructure may remain over-provisioned indefinitely.

2. Resource Replacements Triggered by Small Configuration Changes

Terraform’s declarative model means that certain configuration changes require resources to be replaced rather than updated.

For example, modifying attributes such as subnet associations, encryption settings, or instance types may cause Terraform to destroy and recreate a resource.

While Terraform clearly reports these replacements in the plan output, the operational and financial impact is not always obvious during review.

Replacing compute clusters, databases, or node groups can temporarily increase infrastructure usage, create additional storage snapshots, or trigger redeployment processes that consume additional resources.

When these replacements happen frequently across environments, they can contribute to unexpectedly high infrastructure costs.

This is one reason many teams are beginning to use AI-assisted plan analysis in CI pipelines — to highlight resource replacements and explain their operational impact before they are applied.

3. Logging and Observability Configurations That Grow Without Limits

Terraform is often used to provision logging pipelines and observability systems. These systems are essential for debugging and monitoring production environments.

However, logging configurations are frequently defined with very generous defaults.

For example, teams may configure:

high verbosity log levels
long retention periods
large ingestion pipelines

These settings are useful during development but are rarely revisited as systems mature.

Because Terraform configurations remain stable over time, these logging pipelines can continue collecting massive volumes of data long after the original debugging needs have passed.

In some environments, observability costs eventually exceed the cost of the infrastructure being monitored.

4. Idle Infrastructure Environments

Another common Terraform pattern involves environment duplication.

Many organizations create separate environments for development, staging, integration testing, and experimentation. Terraform makes it easy to spin up these environments using identical modules.

The problem is that these environments often remain running continuously even when they are rarely used.

A staging environment that runs databases, compute nodes, load balancers, and storage resources can easily cost hundreds of dollars per month. Multiply that across multiple teams and environments, and the cost grows quickly.

In many cases, these environments are only actively used during working hours.
Automated scheduling policies or environment lifecycle management can dramatically reduce this waste, but these controls are rarely implemented initially.
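A quick back-of-the-envelope calculation shows why scheduling matters. The sketch below assumes a hypothetical schedule of 10 hours a day, 5 days a week, and only applies to resources billed by running time. Storage and other always-on charges are not reduced by stopping compute.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def running_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of the week an environment runs under a fixed schedule."""
    return (hours_per_day * days_per_week) / HOURS_PER_WEEK


fraction = running_fraction(hours_per_day=10, days_per_week=5)  # assumed schedule
print(f"running {fraction:.0%} of the week, idle {1 - fraction:.0%}")
# prints "running 30% of the week, idle 70%"
```

Under that assumed schedule, an always-on environment spends roughly 70% of the week running with nobody using it.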

5. Storage That Quietly Accumulates

Storage resources are particularly prone to long-term cost growth.

Terraform configurations frequently create:

snapshots
backups
object storage buckets
artifact repositories

Because storage is relatively inexpensive per gigabyte, these resources often grow without much scrutiny.

Over time, however, storage layers accumulate historical artifacts that are rarely accessed but continue to incur costs.

Common examples include:

old database snapshots that were never cleaned up
log archives retained indefinitely
unused container images in registries
artifact storage from old CI pipelines

Without lifecycle policies, these storage systems gradually become long-term archives rather than operational infrastructure.
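At its core, a lifecycle policy is a retention rule. As a minimal sketch, assuming snapshot metadata has already been fetched into a list of dictionaries with an id and a timezone-aware created_at, a hypothetical helper can list what falls outside the retention window. The field names are assumptions, not a cloud provider's schema.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshots(snapshots: list[dict], now: datetime, retention_days: int) -> list[str]:
    """Return IDs of snapshots created more than retention_days before now."""
    cutoff = now - timedelta(days=retention_days)
    return [s["id"] for s in snapshots if s["created_at"] < cutoff]


now = datetime(2026, 3, 20, tzinfo=timezone.utc)
snapshots = [
    {"id": "snap-old", "created_at": datetime(2025, 6, 1, tzinfo=timezone.utc)},
    {"id": "snap-recent", "created_at": datetime(2026, 3, 10, tzinfo=timezone.utc)},
]
print(expired_snapshots(snapshots, now, retention_days=90))  # prints "['snap-old']"
```

The helper only reports candidates. Whether to delete them is still a decision for the team that owns the data.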

Why These Issues Are Hard to Detect
The most expensive Terraform mistakes rarely appear as obvious misconfigurations.

Instead, they emerge gradually as systems evolve.

Each decision may appear reasonable in isolation. The instance size seems safe. The logging level helps debugging. The staging environment might be needed later.

The problem is that Terraform faithfully preserves these decisions over time.

Without regular review, infrastructure configurations slowly drift away from the actual needs of the system.

How AI Can Help Detect These Patterns Earlier
AI cannot fix infrastructure architecture problems automatically. But it can help identify patterns that humans might overlook.

For example, AI systems analyzing infrastructure configurations or Terraform plans can highlight signals such as:

compute resources that appear significantly over-provisioned
environments that remain idle for long periods
storage resources that grow continuously without access
resource replacements that may trigger unnecessary redeployments

These insights allow teams to review infrastructure decisions earlier rather than discovering problems only when the monthly bill arrives.

The Real Lesson
Terraform is an incredibly powerful tool for managing infrastructure. But like any automation system, it faithfully executes the decisions encoded within it.

If inefficient patterns exist in the configuration, Terraform will reproduce them perfectly every time.

The goal is not to avoid mistakes entirely. That is unrealistic in complex systems.

The goal is to detect small inefficiencies early — before they accumulate into large and expensive infrastructure problems.

Closing Thought
Cloud costs rarely explode because of one catastrophic decision.

More often, they grow quietly from dozens of small infrastructure choices that were never revisited.

Terraform gives us the power to manage infrastructure systematically. The challenge is making sure the systems we define remain aligned with how they are actually used.

That requires continuous review, feedback, and sometimes a second set of eyes — whether human or machine.

Originally published on Medium:
https://medium.com/@yogesh.vk/5-expensive-terraform-mistakes-i-keep-seeing-in-real-infrastructure-and-how-ai-can-help-9a4849ddfc91

AI for DevOps and Platform Engineering: Practical Use Cases That Actually Work

Yogesh VK — Fri, 13 Mar 2026 04:53:55 +0000

Moving beyond the hype to real workflows where AI is actually useful for DevOps and platform engineering teams today.

INTRODUCTION
AI is rapidly entering every corner of software engineering. DevOps and platform teams are no exception. New tools promise to generate infrastructure code, manage deployments, and even run operations autonomously.

But most experienced infrastructure engineers react with skepticism.

Infrastructure systems are complex, stateful, and deeply interconnected. Blind automation often introduces more risk than it removes. The question is not whether AI can be used in DevOps workflows — it is where it should be used, and where it should not.

The most effective teams are not replacing engineers with AI. They are using AI to reduce cognitive load, surface hidden risks, and make better operational decisions.

THE SHIFT FROM AUTOMATION TO ASSISTED DECISION-MAKING
For years, DevOps focused heavily on automation. CI/CD pipelines automated builds, tests, deployments, and infrastructure provisioning. Infrastructure-as-Code tools like Terraform allowed teams to define environments in reproducible ways.

AI introduces a new layer to this model.

Instead of simply automating actions, AI can assist engineers in understanding the consequences of those actions. It becomes a reasoning layer that helps interpret complex systems rather than directly controlling them.

In practice, this means AI is most valuable when it explains systems, analyzes changes, and highlights risk.

AI FOR INFRASTRUCTURE CODE REVIEWS
Infrastructure changes often carry significant risk. A single change in Terraform can replace compute clusters, modify network boundaries, or expand IAM permissions.

Traditional CI pipelines verify syntax and policy compliance, but they rarely explain the real impact of a change.

AI can help fill this gap by reviewing Terraform plans and summarizing their implications. Instead of manually scanning hundreds of lines of plan output, engineers can see a concise explanation of what will change and why it matters.

This turns infrastructure reviews into clearer conversations about risk and intent.

Raw Terraform Plan (excerpt):

  # aws_eks_node_group.platform_nodes must be replaced
-/+ resource "aws_eks_node_group" "platform_nodes" {
      ~ instance_types = ["t3.large"] -> ["m5.large"] # forces replacement

        scaling_config {
            desired_size = 3
        }
    }

AI-Generated Explanation:

Terraform Plan Summary
High Impact Change
- EKS node group "platform_nodes" will be replaced
- Worker nodes will be recreated
Operational Impact
- Pods will be rescheduled during node replacement
- Temporary capacity reduction possible
Cost Impact
- Instance type upgrade (t3.large → m5.large)
- Estimated monthly increase: ~$120
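
Before any model is involved, the "High Impact Change" part of that summary can be derived deterministically from the JSON form of the plan (`terraform show -json`). The sketch below is a minimal illustration, not a complete reviewer: it reads the `resource_changes` list and groups addresses by planned action. The function name `classify_changes`, the inline sample plan, and the `aws_iam_role.nodes` entry are assumptions made up for the example.

```python
import json


def classify_changes(plan: dict) -> dict[str, list[str]]:
    """Group resource addresses by the kind of change Terraform plans to make."""
    groups: dict[str, list[str]] = {
        "replace": [], "create": [], "update": [], "delete": [],
    }
    for rc in plan.get("resource_changes", []):
        actions = set(rc["change"]["actions"])
        # A replacement is reported as both "delete" and "create" (in either order).
        if {"create", "delete"} <= actions:
            groups["replace"].append(rc["address"])
        elif "create" in actions:
            groups["create"].append(rc["address"])
        elif "update" in actions:
            groups["update"].append(rc["address"])
        elif "delete" in actions:
            groups["delete"].append(rc["address"])
        # "no-op" and "read" entries are skipped on purpose.
    return groups


# Hypothetical, heavily trimmed plan JSON used only for illustration.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_eks_node_group.platform_nodes",
     "change": {"actions": ["delete", "create"]}},
    {"address": "aws_iam_role.nodes",
     "change": {"actions": ["no-op"]}}
  ]
}
""")

summary = classify_changes(sample_plan)
print(summary["replace"])
```

Feeding a pre-classified list like this to the AI step, rather than raw plan text, keeps the facts (what is replaced) separate from the interpretation (why it matters).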

AI IN CI/CD PIPELINES
CI/CD pipelines are another natural integration point for AI.

Modern pipelines already perform many automated checks:

formatting validation
policy enforcement
dependency scanning
infrastructure plan generation

AI can extend this pipeline by interpreting the results of those checks.

For example, an AI step in a GitHub Actions workflow might analyze a Terraform plan and generate a structured summary highlighting resource replacements, cost changes, or security-sensitive updates.

The pipeline still requires human approval before changes are applied. AI simply improves the context available to reviewers.

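name: Terraform Plan Review
on:
  pull_request:
permissions:
  contents: read
  pull-requests: write   # needed to post the summary comment
jobs:
  terraform-plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false   # keep `terraform show -json` output clean
      - name: Run Terraform Plan
        run: |
          terraform init -input=false
          terraform plan -input=false -out=tfplan
      - name: Convert plan to JSON
        run: terraform show -json tfplan > plan.json
      - name: AI Plan Analysis
        # ai-review is a placeholder for whatever analysis tool the team uses
        run: ai-review plan.json > plan-summary.md
      - name: Post summary to PR
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: plan-summary.md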

The AI step reads the Terraform plan and generates a human-readable summary posted directly into the pull request.
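
A full `plan.json` can be large and may contain attribute values that should not leave the pipeline. One possible pre-processing step, sketched here under assumptions, reduces each entry to its address, its actions, and the names of the top-level attributes that differ between `before` and `after`. The helper names `changed_attributes` and `compact_plan` and the sample data are hypothetical. Attributes only known after apply are missing from `after`, so they also show up as changed.

```python
def changed_attributes(resource_change: dict) -> list[str]:
    """Return names of top-level attributes whose planned value differs."""
    change = resource_change["change"]
    before = change.get("before") or {}   # null for newly created resources
    after = change.get("after") or {}     # null for deleted resources
    keys = set(before) | set(after)
    return sorted(k for k in keys if before.get(k) != after.get(k))


def compact_plan(plan: dict) -> list[dict]:
    """Keep only address, actions, and changed attribute names (no values)."""
    compact = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions == ["no-op"]:
            continue
        compact.append({
            "address": rc["address"],
            "actions": actions,
            "changed": changed_attributes(rc),
        })
    return compact


# Hypothetical trimmed plan mirroring the node group example.
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_eks_node_group.platform_nodes",
            "change": {
                "actions": ["delete", "create"],
                "before": {"instance_types": ["t3.large"],
                           "scaling_config": [{"desired_size": 3}]},
                "after": {"instance_types": ["m5.large"],
                          "scaling_config": [{"desired_size": 3}]},
            },
        }
    ]
}

compact = compact_plan(sample_plan)
print(compact)
```

The compact form is what a reviewer would want the model to explain. Whether values can be included alongside the attribute names depends on how sensitive the configuration is.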

AI FOR SHIFT-LEFT INFRASTRUCTURE SECURITY
DevSecOps practices encourage teams to identify security risks earlier in the development lifecycle. However, infrastructure security policies are often difficult to interpret or enforce consistently.

AI can assist by analyzing infrastructure definitions and identifying potential issues before they reach production.

For example, an AI assistant could flag:

overly permissive IAM policies
public exposure of internal services
misconfigured storage access
network boundary changes

These insights can appear during pull request reviews or pipeline checks, allowing teams to address security concerns before deployment.

Example PR comment:

Infrastructure Security Review
Issue Detected
- S3 bucket allows public read access
Resource
aws_s3_bucket.website_assets
Risk
Public exposure of application assets.

Suggested Fix
Attach an aws_s3_bucket_public_access_block resource to the bucket with:
block_public_acls = true
block_public_policy = true
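
Some of the findings listed above do not need a model at all, and a plain rule can supply the evidence an AI reviewer then explains. As a rough sketch for the "overly permissive IAM policies" case, the function below flags `Allow` statements that combine a wildcard action with a wildcard resource. The function names, the sample policy, and its `Sid` values are assumptions for illustration, and real policies have more cases (conditions, `NotAction`, principals) that this does not cover.

```python
def as_list(value):
    """IAM allows a single string or a list for Action/Resource; normalize to a list."""
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy: dict) -> list[dict]:
    """Flag Allow statements that pair wildcard actions with Resource '*'."""
    findings = []
    for stmt in as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        wide_actions = [a for a in actions if a == "*" or a.endswith(":*")]
        if wide_actions and "*" in resources:
            findings.append({
                "sid": stmt.get("Sid", "(no Sid)"),
                "actions": wide_actions,
            })
    return findings


# Hypothetical policy document used only to exercise the check.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "ReadAssets", "Effect": "Allow",
         "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::website-assets/*"},
        {"Sid": "TooBroad", "Effect": "Allow",
         "Action": ["s3:*"],
         "Resource": "*"},
    ],
}

findings = find_wildcard_statements(sample_policy)
print(findings)
```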

AI FOR OBSERVABILITY AND INCIDENT RESPONSE
Operations teams often face the challenge of interpreting large volumes of monitoring data.

Logs, metrics, and alerts can provide enormous amounts of information, but identifying the root cause of an issue still requires human reasoning.

AI can assist by analyzing telemetry data and highlighting patterns that indicate emerging problems. Instead of scanning dashboards and logs manually, engineers receive summaries that connect signals across systems.

Used carefully, this can reduce alert fatigue and accelerate incident investigation.

Example:
Raw logs:

ERROR connection timeout db-primary
ERROR connection timeout db-primary
ERROR connection timeout db-primary

AI explanation:

Alert Analysis
Pattern Detected
    Repeated connection failures to database cluster.
Likely Cause
    Database connection pool exhaustion.
Suggested Investigation
    Check RDS connection limits and application pool size.

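This is where AI connects directly to day-to-day operations: three identical log lines become a named pattern and a concrete place to start investigating.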
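
The "Pattern Detected" step can be approximated with simple counting before anything is handed to a model. The sketch below, which assumes log lines are already available as strings, collapses identical messages and keeps those that repeat at least `min_count` times. The function name `top_patterns`, the threshold, and the extra `INFO` line are assumptions for the example.

```python
from collections import Counter


def top_patterns(lines: list[str], min_count: int = 3) -> list[tuple[str, int]]:
    """Return (message, count) pairs for messages repeated at least min_count times."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return [(msg, n) for msg, n in counts.most_common() if n >= min_count]


# The three errors from the example above plus one hypothetical unrelated line.
logs = [
    "ERROR connection timeout db-primary",
    "ERROR connection timeout db-primary",
    "INFO request completed",
    "ERROR connection timeout db-primary",
]

patterns = top_patterns(logs)
print(patterns)
```

Real logs carry timestamps and request IDs that make lines unique, so a production version would need to normalize those fields first. The point of the sketch is only that the model receives a short list of patterns instead of the raw stream.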

WHERE AI SHOULD NOT BE USED
Despite its strengths, AI should not be allowed to control critical infrastructure operations without human oversight.

Executing infrastructure changes, approving deployments, or modifying security policies are decisions that carry operational responsibility.

AI can provide insight, but it cannot own the consequences of those decisions.

The most effective DevOps teams treat AI as an assistant rather than an operator.

BUILDING AI-AUGMENTED PLATFORM WORKFLOWS
The real opportunity is not replacing DevOps workflows, but enhancing them.

A healthy AI-assisted platform might include:

AI explanations for Terraform plans
AI-generated summaries for infrastructure pull requests
AI-assisted security analysis during CI/CD
AI-powered analysis of observability data

Each capability improves clarity and reduces cognitive load while preserving human ownership of operational decisions.

CLOSING THOUGHT
AI will undoubtedly influence how infrastructure systems are built and operated. But its greatest value will not come from replacing engineers.

It will come from helping them understand increasingly complex systems.

DevOps was originally about bringing development and operations closer together. The next phase may be about bringing human judgment and machine insight into better balance.
