<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ravi Teja Reddy Mandala</title>
    <description>The latest articles on Forem by Ravi Teja Reddy Mandala (@ravi_teja_8b63d9205dc7a13).</description>
    <link>https://forem.com/ravi_teja_8b63d9205dc7a13</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3740202%2Fa312e715-340a-465f-8b32-799b2d694bb6.png</url>
      <title>Forem: Ravi Teja Reddy Mandala</title>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ravi_teja_8b63d9205dc7a13"/>
    <language>en</language>
    <item>
      <title>Why Most AI Agents Fail in Production Systems: A Systems Perspective</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Mon, 13 Apr 2026 22:00:12 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/why-most-ai-agents-fail-in-production-systems-a-systems-perspective-5dmk</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/why-most-ai-agents-fail-in-production-systems-a-systems-perspective-5dmk</guid>
      <description>&lt;p&gt;Most conversations around AI agents focus on model performance.&lt;/p&gt;

&lt;p&gt;In real production environments, that is rarely the limiting factor.&lt;/p&gt;

&lt;p&gt;After working closely with production systems, a clear pattern emerges:&lt;/p&gt;

&lt;p&gt;AI does not fail because of intelligence limitations.&lt;br&gt;
It fails because of system design gaps.&lt;/p&gt;

&lt;p&gt;Let’s break this down from a systems engineering perspective.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Signal Quality &amp;gt; Model Quality
&lt;/h3&gt;

&lt;p&gt;AI systems rely entirely on input signals.&lt;/p&gt;

&lt;p&gt;But most production environments expose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;logs without context&lt;/li&gt;
&lt;li&gt;metrics without causality&lt;/li&gt;
&lt;li&gt;alerts without correlation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates fragmented visibility.&lt;/p&gt;

&lt;p&gt;Even a highly capable model cannot make reliable decisions on inconsistent signals.&lt;/p&gt;

&lt;p&gt;In practice, poor observability architecture is the first failure point.&lt;/p&gt;
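
&lt;p&gt;As a rough sketch (field names like &lt;code&gt;trace_id&lt;/code&gt; are purely illustrative, not from any specific stack), the fix starts with normalizing every signal into one shared schema, then correlating them before a model ever sees them:&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Hypothetical normalized event: every signal (log, metric, alert)
# is mapped onto one schema before any model consumes it.
@dataclass
class Signal:
    source: str          # "logs", "metrics", or "alerts"
    service: str
    trace_id: str        # correlation key shared across sources
    message: str
    attributes: dict = field(default_factory=dict)

def correlate(signals):
    """Group signals by trace_id so a model receives one coherent view."""
    grouped = {}
    for s in signals:
        grouped.setdefault(s.trace_id, []).append(s)
    return grouped

incident = correlate([
    Signal("alerts", "checkout", "t-42", "p99 latency breach"),
    Signal("logs", "checkout", "t-42", "timeout calling payments"),
    Signal("metrics", "payments", "t-42", "connection pool saturated"),
])
print(len(incident["t-42"]))  # 3 signals, one correlated incident view
```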




&lt;h3&gt;
  
  
  2. Missing System Abstractions
&lt;/h3&gt;

&lt;p&gt;Human operators rely on implicit understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;service dependencies&lt;/li&gt;
&lt;li&gt;failure blast radius&lt;/li&gt;
&lt;li&gt;historical patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI systems do not have this intuition.&lt;/p&gt;

&lt;p&gt;If your architecture does not explicitly define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;service relationships&lt;/li&gt;
&lt;li&gt;ownership boundaries&lt;/li&gt;
&lt;li&gt;failure domains&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then the system is effectively uninterpretable to machines.&lt;/p&gt;

&lt;p&gt;AI requires structured abstractions. Most systems were never designed for that.&lt;/p&gt;
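
&lt;p&gt;A minimal sketch of what "structured abstractions" means here, with made-up service names: the dependency graph, ownership, and failure domains written down where a machine can query them:&lt;/p&gt;

```python
# Hypothetical explicit system model: the relationships a human keeps
# in their head, written down where a machine can read them.
SERVICES = {
    "checkout": {
        "depends_on": ["payments", "inventory"],
        "owner": "team-commerce",
        "failure_domain": "customer-facing",
    },
    "payments": {
        "depends_on": ["payments-db"],
        "owner": "team-payments",
        "failure_domain": "customer-facing",
    },
    "payments-db": {
        "depends_on": [],
        "owner": "team-payments",
        "failure_domain": "stateful",
    },
}

def blast_radius(service, model=SERVICES):
    """Every service that transitively depends on `service`."""
    affected = set()
    for name, spec in model.items():
        if service in spec["depends_on"] and name not in affected:
            affected.add(name)
            affected |= blast_radius(name, model)
    return affected

print(blast_radius("payments-db"))  # {'payments', 'checkout'}
```

&lt;p&gt;Once this exists, "what breaks if X fails?" stops being intuition and becomes a query.&lt;/p&gt;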




&lt;h3&gt;
  
  
  3. Non-Deterministic Workflows
&lt;/h3&gt;

&lt;p&gt;Incident response in many teams is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;partially documented&lt;/li&gt;
&lt;li&gt;context-driven&lt;/li&gt;
&lt;li&gt;experience-heavy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This works well for humans.&lt;/p&gt;

&lt;p&gt;But AI systems require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deterministic steps&lt;/li&gt;
&lt;li&gt;clearly defined decision paths&lt;/li&gt;
&lt;li&gt;reproducible workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this, automation becomes unreliable and unpredictable.&lt;/p&gt;
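
&lt;p&gt;The point can be sketched in a few lines (step names are hypothetical): incident response encoded as an ordered, explicit sequence instead of tribal knowledge, so the same input always produces the same, auditable path:&lt;/p&gt;

```python
# A deterministic workflow: each step is a named, pure-ish transform
# over a shared context dict.
WORKFLOW = [
    ("classify", lambda ctx: ctx.update(severity="high") or ctx),
    ("check_recent_deploys", lambda ctx: ctx.update(deploy_suspect=True) or ctx),
    ("decide", lambda ctx: ctx.update(
        action="rollback" if ctx.get("deploy_suspect") else "escalate") or ctx),
]

def run(workflow, ctx):
    """Execute every step in order and record the decision path."""
    trail = []
    for name, step in workflow:
        ctx = step(ctx)
        trail.append(name)   # reproducible, auditable trail
    return ctx, trail

ctx, trail = run(WORKFLOW, {"alert": "latency spike"})
print(trail)          # ['classify', 'check_recent_deploys', 'decide']
print(ctx["action"])  # 'rollback'
```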




&lt;h3&gt;
  
  
  4. The Hidden Constraint: System Readiness
&lt;/h3&gt;

&lt;p&gt;Before introducing AI into production, ask a more important question first:&lt;/p&gt;

&lt;p&gt;Is the system ready for AI?&lt;/p&gt;

&lt;p&gt;A production system is “AI-ready” only if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;signals are consistent and correlated&lt;/li&gt;
&lt;li&gt;dependencies are explicitly modeled&lt;/li&gt;
&lt;li&gt;workflows are structured and repeatable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these, AI will amplify system weaknesses instead of solving them.&lt;/p&gt;




&lt;h3&gt;
  
  
  Key Insight
&lt;/h3&gt;

&lt;p&gt;We are trying to apply AI to systems that were never designed to be understood by machines.&lt;/p&gt;

&lt;p&gt;That is the core problem.&lt;/p&gt;




&lt;h3&gt;
  
  
  A Better Approach
&lt;/h3&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;p&gt;“How do we improve the AI model?”&lt;/p&gt;

&lt;p&gt;We should ask:&lt;/p&gt;

&lt;p&gt;“How do we redesign systems to be machine-interpretable?”&lt;/p&gt;

&lt;p&gt;That shift changes everything.&lt;/p&gt;




&lt;p&gt;For engineers already experimenting with AI in production:&lt;/p&gt;

&lt;p&gt;What has been the hardest challenge so far&lt;br&gt;
— signal quality, dependency visibility, or workflow reliability?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>I Put an AI Agent in My Incident Workflow for 7 Days. Here’s What Actually Broke.</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Thu, 09 Apr 2026 00:50:17 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/i-put-an-ai-agent-in-my-incident-workflow-for-7-days-heres-what-actually-broke-4jlc</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/i-put-an-ai-agent-in-my-incident-workflow-for-7-days-heres-what-actually-broke-4jlc</guid>
      <description>&lt;p&gt;Everyone says AI agents will reduce on-call fatigue.&lt;/p&gt;

&lt;p&gt;So I added one to a real production incident workflow, not to replace engineers, but to assist with triage, summarization, and next-step recommendations.&lt;/p&gt;

&lt;p&gt;It helped in some places.&lt;br&gt;
It failed in others.&lt;br&gt;
And the biggest lesson had less to do with the model and more to do with system design.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I integrated an AI agent into a typical incident response flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incoming alerts from monitoring systems&lt;/li&gt;
&lt;li&gt;Initial triage and classification&lt;/li&gt;
&lt;li&gt;Root cause hypothesis&lt;/li&gt;
&lt;li&gt;Suggested remediation steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent was allowed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summarize alerts&lt;/li&gt;
&lt;li&gt;Group duplicate incidents&lt;/li&gt;
&lt;li&gt;Suggest possible causes&lt;/li&gt;
&lt;li&gt;Draft remediation steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent was NOT allowed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute production changes&lt;/li&gt;
&lt;li&gt;Restart services&lt;/li&gt;
&lt;li&gt;Modify configs&lt;/li&gt;
&lt;li&gt;Trigger escalations automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was intentional. I wanted to see where it adds value without risking production.&lt;/p&gt;
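
&lt;p&gt;That permission boundary can be enforced mechanically rather than by convention. A sketch (capability names mirror the lists above but are otherwise illustrative):&lt;/p&gt;

```python
# Guardrail: the agent may only invoke read/suggest capabilities;
# anything mutating production is rejected before it runs.
ALLOWED = {"summarize_alerts", "group_incidents", "suggest_causes", "draft_steps"}
DENIED = {"execute_change", "restart_service", "modify_config", "auto_escalate"}

def invoke(capability, handler, *args):
    if capability in DENIED or capability not in ALLOWED:
        raise PermissionError(f"agent may not call {capability!r}")
    return handler(*args)

print(invoke("summarize_alerts",
             lambda alerts: f"{len(alerts)} alerts grouped", ["a", "b"]))
# invoke("restart_service", ...) raises PermissionError
```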




&lt;h2&gt;
  
  
  What Worked Surprisingly Well
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Alert Summarization
&lt;/h3&gt;

&lt;p&gt;The agent reduced noisy alerts into clean summaries.&lt;/p&gt;

&lt;p&gt;Instead of reading through logs, I got:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“High latency observed in service X after deployment Y. Likely related to dependency Z.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This alone saved time during high-pressure incidents.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Duplicate Incident Grouping
&lt;/h3&gt;

&lt;p&gt;It grouped alerts that were actually the same issue.&lt;/p&gt;

&lt;p&gt;This reduced alert fatigue and helped focus on the real root cause faster.&lt;/p&gt;
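
&lt;p&gt;The grouping itself does not require a model at all for the easy cases. A rough sketch (alert fields are illustrative): alerts sharing a fingerprint of service plus symptom collapse into one incident:&lt;/p&gt;

```python
# Duplicate grouping by fingerprint: same service + same symptom
# means the same underlying incident.
def fingerprint(alert):
    return (alert["service"], alert["symptom"])

def group(alerts):
    incidents = {}
    for a in alerts:
        incidents.setdefault(fingerprint(a), []).append(a)
    return incidents

alerts = [
    {"service": "api", "symptom": "5xx", "source": "monitor-1"},
    {"service": "api", "symptom": "5xx", "source": "monitor-2"},
    {"service": "db", "symptom": "cpu", "source": "monitor-1"},
]
print(len(group(alerts)))  # 2 incidents from 3 alerts
```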




&lt;h3&gt;
  
  
  3. Drafting Next Steps
&lt;/h3&gt;

&lt;p&gt;It suggested reasonable first actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check recent deployments&lt;/li&gt;
&lt;li&gt;Validate dependency health&lt;/li&gt;
&lt;li&gt;Inspect error spikes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not perfect, but a solid starting point.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Broke Almost Immediately
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Wrong Prioritization
&lt;/h3&gt;

&lt;p&gt;The agent sometimes treated low-impact issues as critical.&lt;/p&gt;

&lt;p&gt;Severity is not just data. It is context.&lt;br&gt;
And context is hard.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. False Confidence
&lt;/h3&gt;

&lt;p&gt;The responses sounded very confident even when wrong.&lt;/p&gt;

&lt;p&gt;This is dangerous in production systems.&lt;br&gt;
Confidence ≠ correctness.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Noisy Recommendations
&lt;/h3&gt;

&lt;p&gt;Some suggestions were technically valid but operationally useless.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Restart the service”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In production, that is not always acceptable without deeper checks.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Escalation Confusion
&lt;/h3&gt;

&lt;p&gt;It struggled to decide when to involve humans.&lt;/p&gt;

&lt;p&gt;Too early → noise&lt;br&gt;
Too late → risk&lt;/p&gt;

&lt;p&gt;That balance is harder than it looks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem: System Design
&lt;/h2&gt;

&lt;p&gt;After a week, it became clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI agent was not the main problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The real issues were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weak incident workflows&lt;/li&gt;
&lt;li&gt;Poor escalation design&lt;/li&gt;
&lt;li&gt;Lack of structured context&lt;/li&gt;
&lt;li&gt;No clear guardrails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your system is messy, the AI will reflect that mess faster.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture That Works Better
&lt;/h2&gt;

&lt;p&gt;Here is what I would recommend instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Alert comes in
&lt;/li&gt;
&lt;li&gt;AI summarizes + groups signals
&lt;/li&gt;
&lt;li&gt;AI suggests possible causes
&lt;/li&gt;
&lt;li&gt;Human validates context
&lt;/li&gt;
&lt;li&gt;AI drafts remediation options
&lt;/li&gt;
&lt;li&gt;Human approves final action
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;AI as a co-pilot, not an autopilot.&lt;/strong&gt;&lt;/p&gt;
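
&lt;p&gt;The six steps above can be sketched as a pipeline where AI stages produce drafts and human stages gate them. Everything here is stubbed; &lt;code&gt;approve&lt;/code&gt; stands in for a real paging or approval UI:&lt;/p&gt;

```python
# Human-in-the-loop incident pipeline: AI drafts, humans gate.
def pipeline(alert, approve=lambda prompt: True):
    summary = f"summary of {alert}"              # 2. AI summarizes + groups
    causes = ["recent deploy", "dependency"]     # 3. AI suggests causes
    if not approve(f"validate context for {summary}?"):  # 4. human validates
        return "escalated to human"
    options = [f"mitigate {c}" for c in causes]  # 5. AI drafts remediation
    if not approve(f"apply {options[0]}?"):      # 6. human approves action
        return "no action taken"
    return options[0]

print(pipeline("latency alert"))  # 'mitigate recent deploy'
```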




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AI is great at summarization and pattern detection
&lt;/li&gt;
&lt;li&gt;It struggles with context and real-world constraints
&lt;/li&gt;
&lt;li&gt;Confidence can be misleading
&lt;/li&gt;
&lt;li&gt;System design matters more than model capability
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams trying AI in incident response are not failing because of the model.&lt;/p&gt;

&lt;p&gt;They are failing because their &lt;strong&gt;workflow is not designed for AI.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;AI can absolutely improve incident response.&lt;/p&gt;

&lt;p&gt;But if your escalation paths, permissions, and observability are weak,&lt;br&gt;&lt;br&gt;
the agent will not fix your system.&lt;/p&gt;

&lt;p&gt;It will expose it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Question for You
&lt;/h2&gt;

&lt;p&gt;Would you allow an AI agent in your on-call workflow?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recommendation only
&lt;/li&gt;
&lt;li&gt;Limited action with approval
&lt;/li&gt;
&lt;li&gt;Full automation
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Curious to hear how others are approaching this.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>Your AI Agent Is Not Failing. Your System Design Is.</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Fri, 27 Mar 2026 01:39:04 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/your-ai-agent-is-not-failing-your-system-design-is-3k90</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/your-ai-agent-is-not-failing-your-system-design-is-3k90</guid>
      <description>&lt;p&gt;Everyone is blaming AI agents.&lt;/p&gt;

&lt;p&gt;“They hallucinate.”&lt;br&gt;
“They don’t scale.”&lt;br&gt;
“They can’t handle production.”&lt;/p&gt;

&lt;p&gt;That’s not the real problem.&lt;/p&gt;

&lt;p&gt;The real problem?&lt;/p&gt;

&lt;p&gt;We are treating AI agents like tools.&lt;/p&gt;

&lt;p&gt;Instead of systems.&lt;/p&gt;

&lt;p&gt;In production, nothing works in isolation.&lt;/p&gt;

&lt;p&gt;Not your services.&lt;br&gt;
Not your pipelines.&lt;br&gt;
Not your on-call workflows.&lt;/p&gt;

&lt;p&gt;But somehow…&lt;/p&gt;

&lt;p&gt;We expect AI agents to just “figure it out.”&lt;/p&gt;

&lt;p&gt;Here’s what I’ve seen in real systems:&lt;/p&gt;

&lt;p&gt;AI fails when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context is fragmented&lt;/li&gt;
&lt;li&gt;State is lost between steps&lt;/li&gt;
&lt;li&gt;Decisions are not traceable&lt;/li&gt;
&lt;li&gt;There are no guardrails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not because the model is bad.&lt;/p&gt;

&lt;p&gt;Most teams are building:&lt;/p&gt;

&lt;p&gt;❌ Prompt → Response → Done&lt;/p&gt;

&lt;p&gt;But production needs:&lt;/p&gt;

&lt;p&gt;✅ Context → State → Memory → Feedback → Control&lt;/p&gt;

&lt;p&gt;That’s the difference between:&lt;/p&gt;

&lt;p&gt;👉 Demo AI&lt;br&gt;
vs&lt;br&gt;
👉 Production AI&lt;/p&gt;

&lt;p&gt;The shift is simple, but most miss it:&lt;/p&gt;

&lt;p&gt;AI agents are not features.&lt;/p&gt;

&lt;p&gt;They are distributed systems with reasoning loops.&lt;/p&gt;

&lt;p&gt;Until we design them that way…&lt;/p&gt;

&lt;p&gt;We’ll keep blaming the model&lt;br&gt;
for system problems.&lt;/p&gt;
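
&lt;p&gt;The Context → State → Memory → Feedback → Control loop can be sketched in a few lines. The model call is stubbed and every name here is illustrative; the point is only the shape of the loop:&lt;/p&gt;

```python
# One agent turn that threads context, state, and memory through the
# model call, with a control check on the way out.
def agent_turn(task, state, memory, model=lambda prompt: "restart pod"):
    context = {"task": task, "state": state, "history": memory[-3:]}
    proposal = model(str(context))                        # reasoning step
    memory.append({"task": task, "proposal": proposal})   # memory
    allowed = proposal in state["permitted_actions"]      # control
    state["last_feedback"] = "accepted" if allowed else "rejected"  # feedback
    return proposal if allowed else None

state = {"permitted_actions": {"restart pod"}, "last_feedback": None}
memory = []
print(agent_turn("pod crashloop", state, memory))  # 'restart pod'
```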


</description>
      <category>ai</category>
      <category>sre</category>
      <category>code</category>
      <category>developers</category>
    </item>
    <item>
      <title>What Actually Happens When You Put an AI Agent on Call</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Thu, 19 Mar 2026 02:05:33 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/what-actually-happens-when-you-put-an-ai-agent-on-call-1llf</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/what-actually-happens-when-you-put-an-ai-agent-on-call-1llf</guid>
      <description>&lt;p&gt;&lt;em&gt;AI agent assisting with real-time production incident response&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Everyone is talking about AI agents.&lt;/p&gt;

&lt;p&gt;But very few are actually using them in real production workflows.&lt;/p&gt;

&lt;p&gt;And almost no one talks about what really happens when they are put on call.&lt;/p&gt;

&lt;p&gt;We already trust AI with the everyday work:&lt;/p&gt;

&lt;p&gt;Writing code.&lt;br&gt;
Reviewing pull requests.&lt;br&gt;
Answering questions.&lt;br&gt;
Automating workflows.&lt;/p&gt;

&lt;p&gt;That part is easy.&lt;/p&gt;

&lt;p&gt;But production is different.&lt;/p&gt;

&lt;p&gt;It is noisy.&lt;br&gt;&lt;br&gt;
It is unpredictable.&lt;br&gt;&lt;br&gt;
And it does not forgive mistakes.&lt;/p&gt;

&lt;p&gt;So I kept thinking about one question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually happens when an AI agent becomes part of an on-call workflow?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not in a demo.&lt;br&gt;&lt;br&gt;
Not in a toy setup.&lt;br&gt;&lt;br&gt;
But in real systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Modern production systems generate too much information during an incident.&lt;/p&gt;

&lt;p&gt;A single issue can create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;alerts from multiple services
&lt;/li&gt;
&lt;li&gt;spikes in logs and metrics
&lt;/li&gt;
&lt;li&gt;duplicate symptom reports
&lt;/li&gt;
&lt;li&gt;confusion around root cause
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly where an AI agent looks useful.&lt;/p&gt;

&lt;p&gt;In theory, it can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summarize the incident
&lt;/li&gt;
&lt;li&gt;correlate alerts
&lt;/li&gt;
&lt;li&gt;scan logs
&lt;/li&gt;
&lt;li&gt;suggest likely causes
&lt;/li&gt;
&lt;li&gt;recommend next actions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sounds great.&lt;/p&gt;

&lt;p&gt;But the real value is not in replacing the engineer.&lt;/p&gt;

&lt;p&gt;It is in reducing the time spent navigating noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where AI actually helps
&lt;/h2&gt;

&lt;p&gt;After thinking through how an AI agent fits into on-call, I see four areas where it can be genuinely useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Incident summarization
&lt;/h3&gt;

&lt;p&gt;During an active issue, the first problem is usually information overload.&lt;/p&gt;

&lt;p&gt;An AI agent can quickly turn scattered signals into something readable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what changed
&lt;/li&gt;
&lt;li&gt;which services are affected
&lt;/li&gt;
&lt;li&gt;when the issue started
&lt;/li&gt;
&lt;li&gt;what symptoms are most visible
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That alone can save valuable time.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Log and alert correlation
&lt;/h3&gt;

&lt;p&gt;A human usually jumps between dashboards, logs, alerts, and deployment history.&lt;/p&gt;

&lt;p&gt;An AI agent can act like a first-pass investigator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;group similar errors
&lt;/li&gt;
&lt;li&gt;detect repeated patterns
&lt;/li&gt;
&lt;li&gt;connect alerts across services
&lt;/li&gt;
&lt;li&gt;highlight suspicious deployments or config changes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This does not replace debugging.&lt;/p&gt;

&lt;p&gt;But it gives the engineer a much better starting point.&lt;/p&gt;
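
&lt;p&gt;One concrete first-pass check is deploy correlation. As a sketch (timestamps in seconds, data shapes illustrative): flag any alert that fired within a short window after a deployment:&lt;/p&gt;

```python
# First-pass correlation: alerts that start shortly after a deploy
# are flagged together with the suspect service.
def correlate_deploys(alerts, deploys, window=300):
    """Flag alerts that began within `window` seconds after a deploy."""
    flagged = []
    for a in alerts:
        for d in deploys:
            gap = a["ts"] - d["ts"]
            if window >= gap >= 0:
                flagged.append((a["name"], d["service"]))
    return flagged

alerts = [{"name": "checkout 5xx", "ts": 1100}]
deploys = [{"service": "payments", "ts": 1000},
           {"service": "search", "ts": 200}]
print(correlate_deploys(alerts, deploys))  # [('checkout 5xx', 'payments')]
```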




&lt;h3&gt;
  
  
  3. Runbook guidance
&lt;/h3&gt;

&lt;p&gt;During an incident, even experienced engineers forget details.&lt;/p&gt;

&lt;p&gt;An AI agent can help by pulling the most relevant runbook steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;known mitigation paths
&lt;/li&gt;
&lt;li&gt;rollback instructions
&lt;/li&gt;
&lt;li&gt;common checks
&lt;/li&gt;
&lt;li&gt;escalation conditions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially useful when incidents happen outside normal working hours.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Post-incident support
&lt;/h3&gt;

&lt;p&gt;AI can also help after the issue is resolved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summarize the timeline
&lt;/li&gt;
&lt;li&gt;draft incident notes
&lt;/li&gt;
&lt;li&gt;organize contributing factors
&lt;/li&gt;
&lt;li&gt;prepare a clean starting point for postmortem review
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That reduces operational overhead and improves documentation quality.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where people get it wrong
&lt;/h2&gt;

&lt;p&gt;The biggest mistake is expecting AI to function like an autonomous incident commander.&lt;/p&gt;

&lt;p&gt;That is risky.&lt;/p&gt;

&lt;p&gt;Production systems are full of edge cases, hidden dependencies, and partial signals.&lt;/p&gt;

&lt;p&gt;An AI agent may sound confident while still being wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  A safer way to use it
&lt;/h2&gt;

&lt;p&gt;If you are introducing an AI agent into an on-call workflow, boundaries matter.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summarize, not blindly decide
&lt;/li&gt;
&lt;li&gt;recommend, not auto-execute
&lt;/li&gt;
&lt;li&gt;explain reasoning before suggesting action
&lt;/li&gt;
&lt;li&gt;stay within approved operational limits
&lt;/li&gt;
&lt;li&gt;escalate to humans for risky changes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where AI becomes useful without becoming dangerous.&lt;/p&gt;
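
&lt;p&gt;Those boundaries can be made mechanical. A sketch (action names and risk classes are hypothetical): every suggested action carries a risk class, and only low-risk, read-only actions pass without a human:&lt;/p&gt;

```python
# Risk-gated actions: low risk is auto-approved, everything else
# needs a human or gets escalated. Unknown actions default to high.
RISK = {"read_logs": "low", "scale_up": "medium", "failover_db": "high"}

def gate(action, human_approves=lambda a: False):
    risk = RISK.get(action, "high")
    if risk == "low":
        return "auto-approved"
    if human_approves(action):
        return "approved by human"
    return "escalated"

print(gate("read_logs"))    # 'auto-approved'
print(gate("failover_db"))  # 'escalated'
```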




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;AI in production is not about intelligence.&lt;/p&gt;

&lt;p&gt;It is about control, constraints, and safety.&lt;/p&gt;

&lt;p&gt;The teams that benefit most will not be the ones that hand over control too early.&lt;/p&gt;

&lt;p&gt;They will be the ones that use AI to make engineers faster, clearer, and more informed.&lt;/p&gt;




&lt;h2&gt;
  
  
  👇 Curious
&lt;/h2&gt;

&lt;p&gt;Would you trust an AI agent to take action in production?&lt;/p&gt;

&lt;p&gt;Or should it stay as a recommendation layer?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>agents</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Let AI Review 1,000 Lines of My Production Code — The Bugs It Found Shocked Me</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Sat, 14 Mar 2026 05:31:45 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/i-let-ai-review-1000-lines-of-my-production-code-the-bugs-it-found-shocked-me-660</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/i-let-ai-review-1000-lines-of-my-production-code-the-bugs-it-found-shocked-me-660</guid>
      <description>&lt;p&gt;Last week I ran an experiment.&lt;/p&gt;

&lt;p&gt;Instead of reviewing a new production service manually, I asked an AI model to analyze around 1,000 lines of production code.&lt;/p&gt;

&lt;p&gt;The goal was simple:&lt;/p&gt;

&lt;p&gt;Find bugs a human reviewer might miss.&lt;/p&gt;

&lt;p&gt;The result surprised me.&lt;/p&gt;

&lt;p&gt;The AI identified multiple potential issues in less than two minutes — including a race condition and an error handling problem that had already caused a production incident months ago.&lt;/p&gt;

&lt;p&gt;Here’s exactly what happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;The codebase contained roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,000 lines of production service code&lt;/li&gt;
&lt;li&gt;several async workflows&lt;/li&gt;
&lt;li&gt;API retry logic&lt;/li&gt;
&lt;li&gt;distributed system error handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The service runs in a cloud environment and processes internal infrastructure requests.&lt;/p&gt;

&lt;p&gt;Instead of performing a traditional code review, I asked AI to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;analyze the code&lt;/li&gt;
&lt;li&gt;identify risky patterns&lt;/li&gt;
&lt;li&gt;suggest improvements&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What the AI Found
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Hidden Race Condition
&lt;/h3&gt;

&lt;p&gt;The AI detected a potential race condition involving asynchronous task execution.&lt;/p&gt;

&lt;p&gt;The issue occurred when multiple requests triggered the same background worker.&lt;/p&gt;

&lt;p&gt;This could lead to duplicate processing.&lt;/p&gt;

&lt;p&gt;It wasn’t obvious during normal code review.&lt;/p&gt;
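
&lt;p&gt;The general fix for this class of bug (sketched here with asyncio, not the actual service code) is to share one in-flight task per job instead of letting each request start its own:&lt;/p&gt;

```python
import asyncio

calls = {"n": 0}   # counts real executions, to show dedup works
_inflight = {}     # job_id -> the single in-flight task

async def process(job_id):
    calls["n"] += 1
    await asyncio.sleep(0.01)   # stand-in for real work
    return f"processed {job_id}"

async def run_once(job_id):
    # No await between the check and the insert, so in single-threaded
    # asyncio this check-then-create is race-free.
    if job_id not in _inflight:
        _inflight[job_id] = asyncio.create_task(process(job_id))
    try:
        return await _inflight[job_id]
    finally:
        _inflight.pop(job_id, None)

async def main():
    # Three concurrent "requests" for the same job: one execution.
    results = await asyncio.gather(*(run_once("job-1") for _ in range(3)))
    print(results, calls["n"])

asyncio.run(main())
```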




&lt;h3&gt;
  
  
  2. Silent Failure in Error Handling
&lt;/h3&gt;

&lt;p&gt;One block caught exceptions but never logged them.&lt;/p&gt;

&lt;p&gt;That meant failures could occur silently.&lt;/p&gt;

&lt;p&gt;In production systems, silent failures are extremely dangerous because they hide operational issues.&lt;/p&gt;
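
&lt;p&gt;The anti-pattern and its minimal correction look like this (the failing function is a stand-in, not the reviewed code):&lt;/p&gt;

```python
import logging

logger = logging.getLogger("worker")

def risky():
    raise ValueError("upstream returned malformed payload")

def handle_silently():
    try:
        risky()
    except ValueError:
        pass                      # failure disappears; nobody is paged

def handle_loudly():
    try:
        risky()
    except ValueError:
        # logger.exception records the full traceback at ERROR level
        logger.exception("risky() failed; continuing with fallback")
        return "fallback"

print(handle_loudly())  # 'fallback' (plus a logged traceback)
```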




&lt;h3&gt;
  
  
  3. Retry Logic That Could Amplify Outages
&lt;/h3&gt;

&lt;p&gt;The AI also pointed out a retry pattern that could unintentionally amplify incidents.&lt;/p&gt;

&lt;p&gt;Instead of exponential backoff, the system retried requests too aggressively.&lt;/p&gt;

&lt;p&gt;Under heavy load, this could worsen outages.&lt;/p&gt;
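
&lt;p&gt;For contrast with that aggressive pattern, here is a minimal sketch of exponential backoff with "full jitter", which spreads retries out instead of hammering a struggling dependency:&lt;/p&gt;

```python
import random

# Exponential backoff with full jitter: each delay is drawn uniformly
# from [0, base * 2**attempt], capped so it never grows unbounded.
def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    delays = []
    for attempt in range(attempts):
        exp = min(cap, base * (2 ** attempt))
        delays.append(rng() * exp)
    return delays

# With a fixed rng for illustration, the ceiling doubles each attempt:
print(backoff_delays(5, rng=lambda: 1.0))  # [0.5, 1.0, 2.0, 4.0, 8.0]
```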




&lt;h2&gt;
  
  
  Where AI Still Struggles
&lt;/h2&gt;

&lt;p&gt;AI analysis isn't perfect.&lt;/p&gt;

&lt;p&gt;In some cases the model suggested improvements that were unnecessary.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;refactoring code that was already optimized&lt;/li&gt;
&lt;li&gt;simplifying logic that existed for historical reasons&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why &lt;strong&gt;human review is still critical.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means for Engineers
&lt;/h2&gt;

&lt;p&gt;AI won't replace engineers.&lt;/p&gt;

&lt;p&gt;But it will dramatically change how we work.&lt;/p&gt;

&lt;p&gt;Instead of reviewing every line of code manually, engineers will increasingly rely on AI to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scan large codebases&lt;/li&gt;
&lt;li&gt;identify risky patterns&lt;/li&gt;
&lt;li&gt;detect hidden bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engineer's role becomes more about &lt;strong&gt;system design and decision making.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI code analysis tools are improving rapidly.&lt;/p&gt;

&lt;p&gt;They won't eliminate traditional reviews, but they can dramatically reduce the time it takes to detect problems in production systems.&lt;/p&gt;

&lt;p&gt;And sometimes they find things humans miss.&lt;/p&gt;

&lt;p&gt;The real question is:&lt;/p&gt;

&lt;p&gt;How soon will AI become part of every engineering workflow?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>code</category>
      <category>developers</category>
    </item>
    <item>
      <title>I Replaced My On-Call Runbook with AI — Here’s What Happened in Production</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Thu, 12 Mar 2026 03:31:32 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/i-replaced-my-on-call-runbook-with-ai-heres-what-happened-in-production-lc5</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/i-replaced-my-on-call-runbook-with-ai-heres-what-happened-in-production-lc5</guid>
      <description>&lt;p&gt;Last month I tried something risky.&lt;/p&gt;

&lt;p&gt;Instead of waking up at 3AM to debug production incidents, I experimented with an AI assistant handling the &lt;strong&gt;first layer of incident triage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;No runbook.&lt;br&gt;
No manual log digging.&lt;br&gt;
Just AI analyzing alerts, logs, and metrics.&lt;/p&gt;

&lt;p&gt;Here’s what actually happened in production.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Problem Every On-Call Engineer Knows
&lt;/h2&gt;

&lt;p&gt;If you've ever been on call, you know the routine.&lt;/p&gt;

&lt;p&gt;PagerDuty fires.&lt;/p&gt;

&lt;p&gt;You open logs.&lt;/p&gt;

&lt;p&gt;You check dashboards.&lt;/p&gt;

&lt;p&gt;You run the same 5 commands.&lt;/p&gt;

&lt;p&gt;Every single time.&lt;/p&gt;

&lt;p&gt;The process is predictable, but it still requires a human in the loop.&lt;/p&gt;

&lt;p&gt;So I asked a simple question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why can't AI do the first layer of incident investigation?&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;Instead of engineers performing repetitive triage, I built a simple &lt;strong&gt;AI incident assistant&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The AI receives alerts and performs initial debugging steps automatically.&lt;/p&gt;

&lt;p&gt;The architecture looked like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert → AI Agent → Log Analysis → Root Cause Guess → Suggested Fix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tools used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI API&lt;/li&gt;
&lt;li&gt;GitHub Actions&lt;/li&gt;
&lt;li&gt;Kubernetes logs&lt;/li&gt;
&lt;li&gt;Prometheus metrics&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The AI Prompt
&lt;/h2&gt;

&lt;p&gt;The core of the system was surprisingly simple.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a Site Reliability Engineer assistant.

Analyze the following production logs and metrics.

Tasks:
1. Identify possible root causes
2. Classify incident severity
3. Suggest debugging steps
4. Provide likely remediation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prompt runs every time a critical alert fires.&lt;/p&gt;
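&lt;p&gt;For illustration, here is a minimal Python sketch of the glue between an alert and the model. The function and field names are mine, not the exact production code, but the shape is the same: bundle the alert, recent logs, and metrics into one context payload and send it along with the system prompt.&lt;/p&gt;

```python
import json

# System prompt used for every critical alert (mirrors the prompt above).
TRIAGE_PROMPT = """You are a Site Reliability Engineer assistant.

Analyze the following production logs and metrics.

Tasks:
1. Identify possible root causes
2. Classify incident severity
3. Suggest debugging steps
4. Provide likely remediation"""

def build_triage_messages(alert: dict, log_lines: list, metrics: dict) -> list:
    """Assemble the chat messages sent to the model for one alert."""
    context = json.dumps(
        {"alert": alert, "logs": log_lines[-200:], "metrics": metrics},
        indent=2,
    )
    return [
        {"role": "system", "content": TRIAGE_PROMPT},
        {"role": "user", "content": context},
    ]

# In the real pipeline these messages would be passed to the OpenAI API,
# e.g. client.chat.completions.create(model=..., messages=messages),
# and the response posted to the incident channel.
messages = build_triage_messages(
    alert={"name": "APILatencyHigh", "severity": "critical"},
    log_lines=["GET /pay 200 1840ms", "redis timeout after 2000ms"],
    metrics={"p99_latency_ms": 1900},
)
print(messages[1]["content"])
```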




&lt;h2&gt;
  
  
  Real Incident Example
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Incident:&lt;/strong&gt; API latency spike&lt;/p&gt;

&lt;p&gt;Logs showed increased response times.&lt;/p&gt;

&lt;p&gt;The AI analyzed the logs and returned:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Possible Root Cause&lt;/strong&gt;&lt;br&gt;
Redis latency increase due to connection pool saturation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suggested Debugging Steps&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check Redis CPU usage&lt;/li&gt;
&lt;li&gt;Inspect connection pool metrics&lt;/li&gt;
&lt;li&gt;Verify recent deployment changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Suggested Fix&lt;/strong&gt;&lt;br&gt;
Scale Redis replicas or increase connection pool size.&lt;/p&gt;

&lt;p&gt;Time to initial diagnosis:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3 minutes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Typical human triage time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;15–20 minutes&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What Worked Surprisingly Well
&lt;/h2&gt;

&lt;p&gt;The AI was very good at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pattern recognition in logs&lt;/li&gt;
&lt;li&gt;Suggesting common infrastructure fixes&lt;/li&gt;
&lt;li&gt;Identifying deployment-related issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It reduced time spent on &lt;strong&gt;basic incident investigation&lt;/strong&gt; dramatically.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Failed
&lt;/h2&gt;

&lt;p&gt;AI is not perfect.&lt;/p&gt;

&lt;p&gt;Twice it suggested completely wrong root causes.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;It blamed database contention when the real issue was a &lt;strong&gt;misconfigured feature flag&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Lesson learned:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Never allow AI to make production changes automatically.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI should assist engineers, not replace them.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Future of On-Call Engineering
&lt;/h2&gt;

&lt;p&gt;The biggest realization was this:&lt;/p&gt;

&lt;p&gt;AI doesn't replace engineers.&lt;/p&gt;

&lt;p&gt;It replaces the &lt;strong&gt;boring parts of operations&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The repetitive steps.&lt;br&gt;
The predictable debugging paths.&lt;br&gt;
The manual log searching.&lt;/p&gt;

&lt;p&gt;The future of SRE might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert → AI Investigation → Engineer Decision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Engineers focus on solving real problems.&lt;/p&gt;

&lt;p&gt;AI handles the repetitive investigation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;After running this experiment for a few weeks, one thing became clear.&lt;/p&gt;

&lt;p&gt;AI is incredibly useful for &lt;strong&gt;incident triage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not perfect.&lt;/p&gt;

&lt;p&gt;But powerful enough to reduce on-call fatigue significantly.&lt;/p&gt;

&lt;p&gt;And honestly…&lt;/p&gt;

&lt;p&gt;Anything that reduces 3AM debugging sessions is worth exploring.&lt;/p&gt;




&lt;p&gt;If you're experimenting with AI in DevOps or SRE workflows, I'd love to hear what you're building.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>programming</category>
      <category>sre</category>
    </item>
    <item>
      <title>When AI Becomes Your On-Call Engineer: The Future of Incident Response</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Tue, 10 Mar 2026 00:34:31 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/when-ai-becomes-your-on-call-engineer-the-future-of-incident-response-5bb9</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/when-ai-becomes-your-on-call-engineer-the-future-of-incident-response-5bb9</guid>
      <description>&lt;p&gt;Modern production systems generate millions of logs and alerts. But what happens when AI starts acting like an on-call engineer? Let’s explore how AI is changing incident response forever.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Traditional Incident Response
&lt;/h2&gt;

&lt;p&gt;Most incident workflows still look like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Alert fires
&lt;/li&gt;
&lt;li&gt;PagerDuty wakes someone up
&lt;/li&gt;
&lt;li&gt;Engineer opens dashboards
&lt;/li&gt;
&lt;li&gt;Checks logs
&lt;/li&gt;
&lt;li&gt;Checks metrics
&lt;/li&gt;
&lt;li&gt;Correlates changes
&lt;/li&gt;
&lt;li&gt;Identifies root cause
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even for experienced engineers, this process often takes &lt;strong&gt;20–60 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The real challenge isn't fixing the issue.&lt;/p&gt;

&lt;p&gt;The real challenge is &lt;strong&gt;finding the signal inside massive operational noise&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In large cloud systems we often deal with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Millions of logs&lt;/li&gt;
&lt;li&gt;Hundreds of deployments&lt;/li&gt;
&lt;li&gt;Thousands of metrics&lt;/li&gt;
&lt;li&gt;Dozens of dependent services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Humans simply cannot analyze all this information quickly enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enter AI-Driven Incident Triage
&lt;/h2&gt;

&lt;p&gt;AI systems are starting to change how incidents are investigated.&lt;/p&gt;

&lt;p&gt;Instead of engineers manually searching through dashboards and logs, AI can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;correlate logs across services
&lt;/li&gt;
&lt;li&gt;detect anomaly patterns
&lt;/li&gt;
&lt;li&gt;identify suspicious deployments
&lt;/li&gt;
&lt;li&gt;analyze request traces
&lt;/li&gt;
&lt;li&gt;generate possible root causes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a new workflow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert → AI Investigation → Human Confirmation → Fix&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The engineer becomes the &lt;strong&gt;decision maker&lt;/strong&gt;, not the &lt;strong&gt;log detective&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Example: AI Debugging a Production Incident
&lt;/h2&gt;

&lt;p&gt;Imagine a latency spike in a payment API.&lt;/p&gt;

&lt;p&gt;Traditional debugging might look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check Grafana dashboards
&lt;/li&gt;
&lt;li&gt;Search logs across services
&lt;/li&gt;
&lt;li&gt;Look at recent deployments
&lt;/li&gt;
&lt;li&gt;Analyze request traces
&lt;/li&gt;
&lt;li&gt;Compare infrastructure metrics
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This investigation could easily take &lt;strong&gt;30 minutes or more&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An AI system, however, could analyze all signals in seconds and return something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Latency spike likely caused by increased retries between &lt;code&gt;payment-service&lt;/code&gt; and &lt;code&gt;auth-service&lt;/code&gt; after deployment version &lt;code&gt;v2.4.1&lt;/code&gt;.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of digging through dashboards, the engineer immediately focuses on the real issue.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Next Evolution: Autonomous Incident Response
&lt;/h2&gt;

&lt;p&gt;The next phase is even more interesting.&lt;/p&gt;

&lt;p&gt;AI systems will not only &lt;strong&gt;analyze incidents&lt;/strong&gt; — they will start &lt;strong&gt;resolving them automatically&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We are already seeing early versions of this in modern platforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;automatic rollback of faulty deployments
&lt;/li&gt;
&lt;li&gt;restarting unhealthy services
&lt;/li&gt;
&lt;li&gt;dynamic traffic routing
&lt;/li&gt;
&lt;li&gt;automated scaling decisions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means many incidents could be resolved &lt;strong&gt;before engineers even notice them&lt;/strong&gt;.&lt;/p&gt;
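&lt;p&gt;If you experiment with this, the safety layer matters more than the model. As a hedged sketch (playbook names and the confidence threshold are illustrative, not from any specific platform): automation should only fire when a finding matches a pre-approved playbook and the confidence is high; everything else still pages a human.&lt;/p&gt;

```python
# Hypothetical guardrail layer: AI findings only trigger automation when
# they match a pre-approved playbook AND the model is highly confident.
APPROVED_PLAYBOOKS = {
    "deploy_regression": "rollback_last_deployment",
    "unhealthy_pods": "restart_unhealthy_pods",
}

def decide_action(finding: dict) -> str:
    """Map an AI finding to an automated action, or page a human."""
    playbook = APPROVED_PLAYBOOKS.get(finding.get("cause", ""))
    confident = finding.get("confidence", 0.0) > 0.9
    if playbook and confident:
        return playbook
    return "page_oncall"  # default: human stays in the loop

print(decide_action({"cause": "deploy_regression", "confidence": 0.95}))
print(decide_action({"cause": "redis_saturation", "confidence": 0.95}))
```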




&lt;h2&gt;
  
  
  What This Means for SREs
&lt;/h2&gt;

&lt;p&gt;AI will not replace SREs.&lt;/p&gt;

&lt;p&gt;But it will significantly &lt;strong&gt;change the role of reliability engineers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of spending time manually debugging incidents, engineers will focus more on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;designing resilient architectures
&lt;/li&gt;
&lt;li&gt;building observability pipelines
&lt;/li&gt;
&lt;li&gt;training AI operational models
&lt;/li&gt;
&lt;li&gt;validating automated responses
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SREs will shift from &lt;strong&gt;incident responders&lt;/strong&gt; to &lt;strong&gt;reliability architects&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Challenge: Trust
&lt;/h2&gt;

&lt;p&gt;The biggest challenge isn't technology.&lt;/p&gt;

&lt;p&gt;It's trust.&lt;/p&gt;

&lt;p&gt;Engineers must learn to trust systems that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;investigate incidents
&lt;/li&gt;
&lt;li&gt;recommend fixes
&lt;/li&gt;
&lt;li&gt;automatically resolve problems
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But this pattern isn't new.&lt;/p&gt;

&lt;p&gt;Years ago engineers were hesitant to trust:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;automated deployments
&lt;/li&gt;
&lt;li&gt;autoscaling systems
&lt;/li&gt;
&lt;li&gt;infrastructure as code
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today those tools are essential.&lt;/p&gt;

&lt;p&gt;AI-driven operations will likely follow the same path.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The future of reliability engineering may look very different from today.&lt;/p&gt;

&lt;p&gt;Engineers will design systems.&lt;/p&gt;

&lt;p&gt;AI will monitor them.&lt;/p&gt;

&lt;p&gt;Many incidents will be detected, analyzed, and resolved automatically.&lt;/p&gt;

&lt;p&gt;And the dreaded &lt;strong&gt;2 AM production page&lt;/strong&gt; might finally become rare.&lt;/p&gt;

&lt;p&gt;Or at least… much quieter.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>cloud</category>
      <category>sre</category>
    </item>
    <item>
      <title>Build an AI Code Review Agent in GitHub Actions (That Actually Reduces Incidents)</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Sun, 22 Feb 2026 01:30:54 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/build-an-ai-code-review-agent-in-github-actions-that-actually-reduces-incidents-439</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/build-an-ai-code-review-agent-in-github-actions-that-actually-reduces-incidents-439</guid>
      <description>&lt;p&gt;Build an AI Code Review Agent in GitHub Actions (That Actually Reduces Incidents)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A production-grade GitHub Actions workflow + an SRE reliability rubric that transforms AI from a code suggester into a structured risk detection system.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We tried AI code review in CI.&lt;/p&gt;

&lt;p&gt;It was fast.&lt;br&gt;
It was confident.&lt;br&gt;
It was mostly noise.&lt;/p&gt;

&lt;p&gt;It praised trivial refactors.&lt;br&gt;
It nitpicked formatting.&lt;br&gt;
It occasionally hallucinated “critical issues.”&lt;/p&gt;

&lt;p&gt;And it did absolutely nothing to reduce production incidents.&lt;/p&gt;

&lt;p&gt;The mistake wasn’t using AI.&lt;/p&gt;

&lt;p&gt;The mistake was asking AI to “review code.”&lt;/p&gt;

&lt;p&gt;In reliability engineering, we don’t ask:&lt;br&gt;
“Is this code good?”&lt;/p&gt;

&lt;p&gt;We ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the blast radius?&lt;/li&gt;
&lt;li&gt;What is the rollback plan?&lt;/li&gt;
&lt;li&gt;What happens under failure?&lt;/li&gt;
&lt;li&gt;What is the operational risk?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we rebuilt our AI reviewer using SRE principles.&lt;/p&gt;

&lt;p&gt;This is the exact system.&lt;/p&gt;


&lt;h2&gt;
  
  
  🚨 Why Most AI Code Review Systems Fail
&lt;/h2&gt;

&lt;p&gt;Most implementations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run an LLM over a PR diff&lt;/li&gt;
&lt;li&gt;Ask for general feedback&lt;/li&gt;
&lt;li&gt;Post suggestions as a comment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result?&lt;/p&gt;

&lt;p&gt;Unstructured opinions.&lt;/p&gt;

&lt;p&gt;But production incidents are rarely caused by style issues.&lt;br&gt;
They’re caused by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing rollback strategy&lt;/li&gt;
&lt;li&gt;Untested edge cases&lt;/li&gt;
&lt;li&gt;Configuration drift&lt;/li&gt;
&lt;li&gt;Silent failure paths&lt;/li&gt;
&lt;li&gt;Inconsistent validation&lt;/li&gt;
&lt;li&gt;Operational blind spots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your AI doesn’t classify risk, it cannot reduce incidents.&lt;/p&gt;


&lt;h2&gt;
  
  
  🧠 The Shift: From “Suggestions” to “Structured Risk Classification”
&lt;/h2&gt;

&lt;p&gt;We introduced a mandatory review schema.&lt;/p&gt;

&lt;p&gt;AI must output:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Category&lt;/li&gt;
&lt;li&gt;Severity&lt;/li&gt;
&lt;li&gt;Confidence&lt;/li&gt;
&lt;li&gt;Production Impact&lt;/li&gt;
&lt;li&gt;Required Action&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If it cannot classify something within this structure — it doesn’t get posted.&lt;/p&gt;

&lt;p&gt;This immediately reduced noise by ~60%.&lt;/p&gt;

&lt;p&gt;Because vague suggestions were eliminated.&lt;/p&gt;


&lt;h2&gt;
  
  
  📋 The Reliability Review Rubric
&lt;/h2&gt;

&lt;p&gt;This is the foundation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Confidence&lt;/th&gt;
&lt;th&gt;Required Output&lt;/th&gt;
&lt;th&gt;Production Lens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reliability&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Certain&lt;/td&gt;
&lt;td&gt;Rollback plan&lt;/td&gt;
&lt;td&gt;Data loss? Downtime?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Likely&lt;/td&gt;
&lt;td&gt;Validation proof&lt;/td&gt;
&lt;td&gt;External input risk?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Certain&lt;/td&gt;
&lt;td&gt;Missing tests&lt;/td&gt;
&lt;td&gt;Edge-case exposure?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operability&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Likely&lt;/td&gt;
&lt;td&gt;Logging/metrics&lt;/td&gt;
&lt;td&gt;Debuggability risk?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Uncertain&lt;/td&gt;
&lt;td&gt;Benchmark proof&lt;/td&gt;
&lt;td&gt;Latency spike risk?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key Rule:&lt;/p&gt;

&lt;p&gt;The AI is not allowed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Approve code&lt;/li&gt;
&lt;li&gt;Suggest stylistic improvements unless they impact reliability&lt;/li&gt;
&lt;li&gt;Comment without severity classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This converts AI from opinion engine → reliability signal engine.&lt;/p&gt;


&lt;h2&gt;
  
  
  ⚙️ GitHub Actions Architecture
&lt;/h2&gt;

&lt;p&gt;We designed the workflow in 4 stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Diff Extraction&lt;/li&gt;
&lt;li&gt;Context Enrichment&lt;/li&gt;
&lt;li&gt;AI Risk Classification&lt;/li&gt;
&lt;li&gt;Structured PR Feedback&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  🛠️ Step 1: Extract the True PR Surface Area
&lt;/h2&gt;

&lt;p&gt;We only feed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changed files&lt;/li&gt;
&lt;li&gt;Unified diff&lt;/li&gt;
&lt;li&gt;File ownership context&lt;/li&gt;
&lt;li&gt;Environment metadata (service type, criticality level)
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AI Reliability Code Review&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;synchronize&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;reopened&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;reliability-review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate PR Diff&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;git diff origin/main...HEAD &amp;gt; pr.diff&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Collect Metadata&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;echo "service=payments-api" &amp;gt;&amp;gt; context.txt&lt;/span&gt;
          &lt;span class="s"&gt;echo "tier=critical" &amp;gt;&amp;gt; context.txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Why this matters:&lt;/p&gt;

&lt;p&gt;Context drastically improves classification accuracy.&lt;br&gt;
A migration change in a critical payments service ≠ UI change in a dashboard.&lt;/p&gt;


&lt;h2&gt;
  
  
  🤖 Step 2: AI Classification Layer
&lt;/h2&gt;

&lt;p&gt;Instead of prompting:&lt;/p&gt;

&lt;p&gt;“Review this code.”&lt;/p&gt;

&lt;p&gt;We prompt:&lt;/p&gt;

&lt;p&gt;“Classify each risk under this schema. If uncertain, mark confidence as uncertain. Do not speculate.”&lt;/p&gt;

&lt;p&gt;Example expected output:&lt;/p&gt;
&lt;h3&gt;
  
  
  AI Risk Report
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Confidence&lt;/th&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reliability&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Certain&lt;/td&gt;
&lt;td&gt;DB migration lacks rollback path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Likely&lt;/td&gt;
&lt;td&gt;No null-input test coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operability&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Certain&lt;/td&gt;
&lt;td&gt;No structured error logging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No essays.&lt;br&gt;
No praise.&lt;br&gt;
Just structured risk.&lt;/p&gt;


&lt;h2&gt;
  
  
  💬 Step 3: Structured PR Comment Output
&lt;/h2&gt;

&lt;p&gt;We auto-generate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### 🔎 AI Reliability Review&lt;/span&gt;

| Category | Severity | Confidence | Impact |
|----------|----------|------------|--------|
| Reliability | High | Certain | Possible migration failure without rollback |
| Testing | Medium | Likely | Edge case failure under null input |

&lt;span class="gu"&gt;### Required Actions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Document rollback strategy
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Add null-input test
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Add structured logging for error path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the human reviewer can triage immediately.&lt;/p&gt;
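&lt;p&gt;Rendering that comment is deliberately mechanical. A rough Python sketch (field names are illustrative, not our exact schema):&lt;/p&gt;

```python
def render_review_comment(findings: list) -> str:
    """Render validated AI findings as the markdown table posted on the PR."""
    lines = [
        "### 🔎 AI Reliability Review",
        "",
        "| Category | Severity | Confidence | Impact |",
        "|----------|----------|------------|--------|",
    ]
    for f in findings:
        lines.append(
            "| {category} | {severity} | {confidence} | {impact} |".format(**f)
        )
    lines.append("")
    lines.append("### Required Actions")
    for f in findings:
        lines.append("- [ ] " + f["action"])
    return "\n".join(lines)

comment = render_review_comment([
    {
        "category": "Reliability", "severity": "High", "confidence": "Certain",
        "impact": "Possible migration failure without rollback",
        "action": "Document rollback strategy",
    },
])
print(comment)
```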




&lt;h2&gt;
  
  
  📊 Real Operational Impact
&lt;/h2&gt;

&lt;p&gt;After deploying this system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review noise reduced significantly&lt;/li&gt;
&lt;li&gt;Reviewers focused on high-severity items first&lt;/li&gt;
&lt;li&gt;Rollback plans increased across PRs&lt;/li&gt;
&lt;li&gt;Edge-case test coverage improved&lt;/li&gt;
&lt;li&gt;Incident retros showed fewer “missing test / missing rollback” root causes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI didn’t reduce incidents.&lt;/p&gt;

&lt;p&gt;Structured enforcement did.&lt;/p&gt;

&lt;p&gt;AI simply enforced discipline consistently.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔒 Guardrails That Made It Production-Safe
&lt;/h2&gt;

&lt;p&gt;We added strict constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI cannot block merge directly&lt;/li&gt;
&lt;li&gt;High severity items require human acknowledgment&lt;/li&gt;
&lt;li&gt;Low confidence findings are labeled informational&lt;/li&gt;
&lt;li&gt;The model cannot auto-edit code&lt;/li&gt;
&lt;li&gt;Outputs must match strict JSON schema&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If schema validation fails → comment not posted.&lt;/p&gt;

&lt;p&gt;This prevents hallucination-driven chaos.&lt;/p&gt;
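&lt;p&gt;The schema gate itself can be a few lines of Python. This is a simplified sketch (the real schema is stricter, and the field and enum names here are illustrative), but it shows the rule: anything that fails validation is dropped, never posted.&lt;/p&gt;

```python
# Minimal validator for the "no valid schema, no comment" rule.
REQUIRED_FIELDS = {"category", "severity", "confidence", "impact", "action"}
SEVERITIES = {"high", "medium", "low"}
CONFIDENCES = {"certain", "likely", "uncertain"}

def is_valid_finding(finding: dict) -> bool:
    """Reject any AI output that does not match the strict schema."""
    if not REQUIRED_FIELDS.issubset(finding):
        return False
    if finding["severity"].lower() not in SEVERITIES:
        return False
    if finding["confidence"].lower() not in CONFIDENCES:
        return False
    return True

def postable(findings: list) -> list:
    """Drop malformed findings so hallucinated free text is never posted."""
    return [f for f in findings if is_valid_finding(f)]
```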




&lt;h2&gt;
  
  
  📦 Sample Rubric Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Rollback Plan Missing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reliability"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"confidence_threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"likely"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"required_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Explicit rollback steps documented"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edge Case Test Missing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"testing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"confidence_threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"certain"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"required_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Add test covering null and boundary inputs"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧭 Why This Works (Engineering Psychology)
&lt;/h2&gt;

&lt;p&gt;Developers ignore vague feedback.&lt;/p&gt;

&lt;p&gt;They respond to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Severity&lt;/li&gt;
&lt;li&gt;Production impact&lt;/li&gt;
&lt;li&gt;Explicit required actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By aligning AI output with how SRE teams think during incidents, we shifted code review from “opinion discussion” to “risk mitigation workflow.”&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Final Takeaway
&lt;/h2&gt;

&lt;p&gt;AI in CI is not about automation.&lt;/p&gt;

&lt;p&gt;It is about structured risk visibility at scale.&lt;/p&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Force classification&lt;/li&gt;
&lt;li&gt;Enforce severity&lt;/li&gt;
&lt;li&gt;Include confidence&lt;/li&gt;
&lt;li&gt;Require action&lt;/li&gt;
&lt;li&gt;Validate schema&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You transform AI from a novelty into a reliability multiplier.&lt;/p&gt;

&lt;p&gt;The difference isn’t the model.&lt;/p&gt;

&lt;p&gt;It’s the framework around it.&lt;/p&gt;




&lt;p&gt;What reliability rule would you add to this rubric to prevent your most painful incident?&lt;/p&gt;

&lt;p&gt;Drop it below. I’ll expand this framework in a follow-up post.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>sre</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>How I Reduced Production Incidents as a Senior SRE (Without Slowing Releases)</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Thu, 29 Jan 2026 22:54:09 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/how-i-reduced-production-incidents-as-a-senior-sre-without-slowing-releases-49kb</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/how-i-reduced-production-incidents-as-a-senior-sre-without-slowing-releases-49kb</guid>
      <description>&lt;h2&gt;
  
  
  Why reliability work fails in many teams
&lt;/h2&gt;

&lt;p&gt;Most teams try to improve reliability by adding more monitoring or writing longer runbooks. That usually increases operational overhead without reducing incidents.&lt;/p&gt;

&lt;p&gt;Real reliability improvements come from making change delivery predictable, alerts actionable, and incident response repeatable.&lt;/p&gt;

&lt;p&gt;This article explains the practical steps I used as a Senior Site Reliability Engineer to reduce production incidents &lt;strong&gt;without slowing release velocity&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Establish a reliability baseline
&lt;/h2&gt;

&lt;p&gt;Before fixing anything, I standardized how reliability was measured so decisions were driven by data, not opinions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;What it tells you&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Change failure rate&lt;/td&gt;
&lt;td&gt;How often deployments cause incidents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTTR (p50 / p90)&lt;/td&gt;
&lt;td&gt;How quickly the system recovers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SEV-2+ incidents&lt;/td&gt;
&lt;td&gt;Overall production stability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alert volume&lt;/td&gt;
&lt;td&gt;Signal-to-noise for on-call engineers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error budget burn&lt;/td&gt;
&lt;td&gt;User-visible reliability impact&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y5a6t4jdvj9ufg4cb71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y5a6t4jdvj9ufg4cb71.png" alt=" " width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Make deployments safe by default
&lt;/h2&gt;

&lt;p&gt;Most incidents originate from change. Instead of reducing deployments, I focused on reducing deployment risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  What changed
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Progressive delivery (canary → staged → full rollout)&lt;/li&gt;
&lt;li&gt;Health gates on error rate and latency&lt;/li&gt;
&lt;li&gt;Automatic rollback on failed gates&lt;/li&gt;
&lt;li&gt;Consistent release validation across services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach allowed frequent deployments while dramatically lowering failure impact.&lt;/p&gt;
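&lt;p&gt;As an illustration of a health gate (metric names and tolerances are examples, not our production values), the promote-or-rollback decision can be expressed as a pure function over canary and baseline metrics:&lt;/p&gt;

```python
def gate_passes(canary: dict, baseline: dict) -> bool:
    """True when canary metrics stay within tolerance of the stable baseline."""
    error_ok = baseline["error_rate"] * 1.1 + 0.001 >= canary["error_rate"]
    latency_ok = baseline["p99_ms"] * 1.2 >= canary["p99_ms"]
    return error_ok and latency_ok

def next_step(canary: dict, baseline: dict) -> str:
    """Health gate decision: promote the rollout or roll back automatically."""
    if gate_passes(canary, baseline):
        return "promote"
    return "rollback"

print(next_step({"error_rate": 0.002, "p99_ms": 240},
                {"error_rate": 0.002, "p99_ms": 230}))
```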

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wk8s7waea9rl6cugnf8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wk8s7waea9rl6cugnf8.png" alt=" " width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Replace alert noise with SLO-based paging
&lt;/h2&gt;

&lt;p&gt;Noisy alerts train engineers to ignore production signals. I enforced a simple rule:&lt;/p&gt;

&lt;p&gt;If an alert doesn’t require human action, it should not page.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alerting improvements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Removed non-actionable alerts&lt;/li&gt;
&lt;li&gt;Converted threshold alerts to SLO burn-rate alerts&lt;/li&gt;
&lt;li&gt;Required ownership and runbooks for every page&lt;/li&gt;
&lt;li&gt;Standardized severity definitions (SEV-1 to SEV-3)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduced on-call fatigue while improving detection of real user impact.&lt;/p&gt;
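&lt;p&gt;For readers new to burn-rate alerting: burn rate is the observed error ratio divided by the error budget the SLO allows. The sketch below uses the commonly cited multiwindow pattern from the Google SRE workbook, where a burn rate above 14.4 means roughly 2% of a 30-day budget is consumed in one hour. The windows and thresholds are tunable; the numbers here are illustrative.&lt;/p&gt;

```python
# Burn rate = observed error ratio divided by the error budget implied by
# the SLO. A 99.9% SLO leaves a 0.1% budget; burn rate 1.0 spends that
# budget exactly over the SLO window.
def burn_rate(error_ratio: float, slo: float) -> float:
    budget = 1.0 - slo
    return error_ratio / budget

def should_page(short_window_errors: float, long_window_errors: float,
                slo: float = 0.999) -> bool:
    """Page only on a fast burn confirmed in both a short and a long window."""
    fast = burn_rate(short_window_errors, slo) > 14.4
    sustained = burn_rate(long_window_errors, slo) > 14.4
    return fast and sustained

print(should_page(0.02, 0.016))
```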

&lt;h2&gt;
  
  
  Reduce MTTR with operational runbooks
&lt;/h2&gt;

&lt;p&gt;Incidents are time-critical. Long documentation does not help during outages.&lt;/p&gt;

&lt;p&gt;I rewrote runbooks to focus on the &lt;strong&gt;first 10 minutes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symptoms to confirm&lt;/li&gt;
&lt;li&gt;Likely root causes&lt;/li&gt;
&lt;li&gt;Safe mitigation steps&lt;/li&gt;
&lt;li&gt;Validation checks&lt;/li&gt;
&lt;li&gt;Escalation path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This significantly improved recovery speed and on-call confidence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotixk7ukz5mgcfyu4tqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotixk7ukz5mgcfyu4tqi.png" alt=" " width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Make incidents produce engineering improvements
&lt;/h2&gt;

&lt;p&gt;Every incident must result in a system change — not just documentation.&lt;/p&gt;

&lt;p&gt;Effective post-incident actions included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment guardrails&lt;/li&gt;
&lt;li&gt;Automated tests&lt;/li&gt;
&lt;li&gt;Capacity limits&lt;/li&gt;
&lt;li&gt;Retry and timeout tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If an action item does not change the system, it does not prevent recurrence.&lt;/p&gt;
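&lt;p&gt;As one example of a system-level action item, retry and timeout tuning can be made explicit and reviewable in code. A minimal sketch of a capped exponential backoff schedule (the base, factor, and cap values here are illustrative):&lt;/p&gt;

```python
def backoff_schedule(base=0.2, factor=2.0, cap=5.0, attempts=5):
    """Exponential backoff delays in seconds, capped so retries
    cannot pile up indefinitely against a struggling dependency."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= factor
    return delays

# backoff_schedule() -> [0.2, 0.4, 0.8, 1.6, 3.2]
```

&lt;p&gt;Encoding the schedule this way means the tuning survives the incident review: it is a tested artifact, not a line in a document.&lt;/p&gt;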

&lt;h2&gt;
  
  
  What worked consistently
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Deployment safety delivers the highest reliability ROI&lt;/li&gt;
&lt;li&gt;A small set of actionable alerts outperforms a large volume of noisy ones&lt;/li&gt;
&lt;li&gt;SLOs align engineering work with user experience&lt;/li&gt;
&lt;li&gt;Short runbooks beat long documentation&lt;/li&gt;
&lt;li&gt;Reliability must scale beyond individual engineers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;Reliability is not about heroics or slowing teams down. It is about building systems that make the safe action the easy action.&lt;/p&gt;

&lt;p&gt;When reliability becomes part of delivery rather than an afterthought, both stability and velocity improve.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>software</category>
      <category>incident</category>
    </item>
    <item>
      <title>AI-Assisted Incident Triage in Large-Scale Cloud Systems: A Human-Centered Reliability Framework</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Thu, 29 Jan 2026 18:57:15 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/ai-assisted-incident-triage-in-large-scale-cloud-systems-a-human-centered-reliability-framework-3gp8</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/ai-assisted-incident-triage-in-large-scale-cloud-systems-a-human-centered-reliability-framework-3gp8</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As cloud infrastructures evolve toward extreme scale, incident response has transitioned from a primarily reactive engineering function to a core reliability discipline.&lt;/p&gt;

&lt;p&gt;Modern cloud incidents are rarely caused by single-point failures. Instead, they emerge from complex interactions between services, control planes, configuration systems, and external dependencies. In this environment, the central challenge of incident response is no longer detection, but interpretation.&lt;/p&gt;

&lt;p&gt;Artificial intelligence is increasingly proposed as a solution to this challenge. However, the most effective use of AI in incident management is not autonomous remediation, but the augmentation of human decision-making.&lt;/p&gt;

&lt;p&gt;This article presents a practical, production-informed framework for AI-assisted incident triage, grounded in large-scale cloud operations and real-world reliability constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Traditional Incident Response Breaks at Scale
&lt;/h2&gt;

&lt;p&gt;Traditional incident response models assume relatively isolated failure domains and linear root-cause analysis. These assumptions do not hold in modern cloud platforms operating across thousands of services and regions.&lt;/p&gt;

&lt;p&gt;Alert floods, noisy signals, and partial telemetry often overwhelm on-call engineers. Even well-instrumented systems struggle to provide actionable context during cascading failures. As system complexity grows, human operators are forced to reason under uncertainty, time pressure, and incomplete information.&lt;/p&gt;

&lt;p&gt;At scale, the limiting factor is not tooling or observability coverage, but cognitive load.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Role of AI: Augmentation, Not Automation
&lt;/h2&gt;

&lt;p&gt;AI systems are frequently positioned as autonomous responders capable of diagnosing and resolving incidents end-to-end. In practice, fully autonomous remediation introduces unacceptable risk in high-stakes production environments.&lt;/p&gt;

&lt;p&gt;A more effective and realistic role for AI is decision support. AI can assist engineers by correlating signals, surfacing historical patterns, ranking hypotheses, and narrowing the search space during triage.&lt;/p&gt;

&lt;p&gt;When used correctly, AI reduces time-to-understanding rather than attempting to replace human judgment.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Architecture for AI-Assisted Triage
&lt;/h2&gt;

&lt;p&gt;A production-grade AI-assisted triage system should operate as a layered decision-support pipeline rather than a monolithic model.&lt;/p&gt;

&lt;p&gt;At a high level, the architecture consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Signal ingestion from metrics, logs, traces, and alerts
&lt;/li&gt;
&lt;li&gt;Context enrichment using topology, ownership, and recent changes
&lt;/li&gt;
&lt;li&gt;Hypothesis generation based on historical incidents and failure patterns
&lt;/li&gt;
&lt;li&gt;Confidence scoring and prioritization for human review
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach preserves human control while accelerating insight generation during critical incidents.&lt;/p&gt;
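&lt;p&gt;The layered pipeline can be sketched as plain function composition. This is an illustrative skeleton only; the type names, fields, and confidence values are hypothetical, and a real system would back hypothesis generation with learned models rather than a single rule:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Signal:
    source: str   # "metrics", "logs", "traces", "alerts"
    service: str
    detail: str

@dataclass
class Hypothesis:
    explanation: str
    confidence: float  # 0.0-1.0, kept conservative by design

def enrich(signals, recently_deployed_services):
    """Context enrichment: flag signals from services that changed recently."""
    return [(s, s.service in recently_deployed_services) for s in signals]

def generate_hypotheses(enriched):
    """Hypothesis generation and ranking for human review."""
    hyps = []
    for signal, recently_deployed in enriched:
        if recently_deployed:
            hyps.append(Hypothesis(
                explanation=f"Regression from recent deploy of {signal.service}",
                confidence=0.7))
        else:
            hyps.append(Hypothesis(
                explanation=f"Dependency or environmental issue near {signal.service}",
                confidence=0.3))
    return sorted(hyps, key=lambda h: h.confidence, reverse=True)

# The ranked output goes to a human; nothing is executed automatically.
signals = [Signal("metrics", "checkout", "p99 latency spike"),
           Signal("alerts", "payments", "error rate above threshold")]
ranked = generate_hypotheses(enrich(signals, recently_deployed_services={"checkout"}))
```

&lt;p&gt;The key structural property is that each layer narrows the search space while leaving the final decision, and all execution, with the on-call engineer.&lt;/p&gt;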

&lt;h2&gt;
  
  
  Signals, Context, and Correlation at Runtime
&lt;/h2&gt;

&lt;p&gt;Raw signals are rarely meaningful in isolation. Metric spikes, error rates, and latency anomalies must be interpreted within operational context.&lt;/p&gt;

&lt;p&gt;Effective triage systems correlate runtime signals with deployment events, configuration changes, dependency health, and blast radius estimation. AI models excel at identifying non-obvious relationships across these dimensions, especially under time pressure.&lt;/p&gt;

&lt;p&gt;The goal is not to identify a single root cause immediately, but to continuously refine the most plausible explanations as new data arrives.&lt;/p&gt;
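&lt;p&gt;One of the simplest correlations, anomaly onset against recent change events, can be sketched as a time-window filter. The event names and the 30-minute window below are illustrative assumptions:&lt;/p&gt;

```python
from datetime import datetime

def correlate(anomaly_time, change_events, window_minutes=30):
    """Return change events that landed shortly before the anomaly.

    A change preceding an anomaly inside the window is a candidate
    explanation to surface for review, not a confirmed root cause.
    """
    window_seconds = window_minutes * 60
    candidates = []
    for name, ts in change_events:
        gap = (anomaly_time - ts).total_seconds()
        # Keep only changes that happened before the anomaly, within the window.
        if gap >= 0 and window_seconds >= gap:
            candidates.append((name, ts))
    return candidates

anomaly = datetime(2026, 1, 29, 18, 0)
changes = [
    ("config push: rate-limits", datetime(2026, 1, 29, 17, 45)),
    ("deploy: search-service",   datetime(2026, 1, 29, 12, 10)),
]
candidates = correlate(anomaly, changes)
# Only the config push falls inside the 30-minute window.
```

&lt;p&gt;In production this filter would run over topology-aware change streams, but the principle is the same: shrink the hypothesis space before a human ever looks at it.&lt;/p&gt;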

&lt;h2&gt;
  
  
  Failure Modes and Guardrails
&lt;/h2&gt;

&lt;p&gt;AI-assisted systems introduce their own failure modes, including overconfidence, stale learning, and bias toward historically frequent issues.&lt;/p&gt;

&lt;p&gt;To mitigate these risks, guardrails are essential. These include human-in-the-loop validation, transparency in model reasoning, conservative confidence thresholds, and strict separation between recommendation and execution.&lt;/p&gt;

&lt;p&gt;AI should inform decisions, not make irreversible changes independently.&lt;/p&gt;
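&lt;p&gt;The recommendation/execution separation can be enforced structurally rather than by policy alone. A minimal sketch, with a hypothetical threshold value:&lt;/p&gt;

```python
RECOMMEND_THRESHOLD = 0.6   # below this, the suggestion is suppressed entirely

def present(hypothesis: str, confidence: float) -> dict:
    """Gate AI output: recommend to a human or stay silent. There is
    deliberately no code path here that executes a remediation."""
    if confidence >= RECOMMEND_THRESHOLD:
        return {
            "action": "recommend",
            "hypothesis": hypothesis,
            "confidence": round(confidence, 2),
            "requires_human_ack": True,  # strict recommendation/execution split
        }
    return {"action": "suppress", "reason": "confidence below threshold"}

out = present("roll back checkout deployment", 0.82)
```

&lt;p&gt;Because the gate can only recommend or suppress, overconfident or stale model output degrades to noise for a reviewer rather than to an irreversible production change.&lt;/p&gt;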

&lt;h2&gt;
  
  
  Production Lessons from Large-Scale Cloud Operations
&lt;/h2&gt;

&lt;p&gt;In real-world operations, the most valuable AI systems are those that respect operational realities: partial data, evolving architectures, and the need for fast, defensible decisions.&lt;/p&gt;

&lt;p&gt;Teams that successfully integrate AI into incident response focus on incremental adoption, continuous feedback, and tight integration with existing workflows rather than wholesale automation.&lt;/p&gt;

&lt;p&gt;The result is not fewer incidents, but faster understanding, reduced mean time to recovery, and more sustainable on-call practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As cloud systems continue to scale in complexity, incident response must evolve beyond manual triage and reactive tooling.&lt;/p&gt;

&lt;p&gt;AI-assisted incident triage offers a pragmatic path forward when applied as a cognitive amplifier rather than an autonomous actor. By augmenting human judgment with context-aware analysis and signal correlation, organizations can respond to incidents with greater speed, confidence, and resilience.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>sre</category>
      <category>ai</category>
      <category>distributedsystems</category>
    </item>
  </channel>
</rss>
