<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ravi Teja Reddy Mandala</title>
    <description>The latest articles on Forem by Ravi Teja Reddy Mandala (@ravi_teja_8b63d9205dc7a13).</description>
    <link>https://forem.com/ravi_teja_8b63d9205dc7a13</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3740202%2Fa312e715-340a-465f-8b32-799b2d694bb6.png</url>
      <title>Forem: Ravi Teja Reddy Mandala</title>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ravi_teja_8b63d9205dc7a13"/>
    <language>en</language>
    <item>
      <title>Why Most AI Agents Fail in Production Systems: A Systems Perspective</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Mon, 13 Apr 2026 22:00:12 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/why-most-ai-agents-fail-in-production-systems-a-systems-perspective-5dmk</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/why-most-ai-agents-fail-in-production-systems-a-systems-perspective-5dmk</guid>
      <description>&lt;p&gt;Most conversations around AI agents focus on model performance.&lt;/p&gt;

&lt;p&gt;In real production environments, that is rarely the limiting factor.&lt;/p&gt;

&lt;p&gt;After working closely with production systems, a clear pattern emerges:&lt;/p&gt;

&lt;p&gt;AI does not fail because of intelligence limitations.&lt;br&gt;
It fails because of system design gaps.&lt;/p&gt;

&lt;p&gt;Let’s break this down from a systems engineering perspective.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Signal Quality &amp;gt; Model Quality
&lt;/h3&gt;

&lt;p&gt;AI systems rely entirely on input signals.&lt;/p&gt;

&lt;p&gt;But most production environments expose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;logs without context&lt;/li&gt;
&lt;li&gt;metrics without causality&lt;/li&gt;
&lt;li&gt;alerts without correlation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates fragmented visibility.&lt;/p&gt;

&lt;p&gt;Even a highly capable model cannot make reliable decisions on inconsistent signals.&lt;/p&gt;

&lt;p&gt;In practice, poor observability architecture is the first failure point.&lt;/p&gt;
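
&lt;p&gt;As a rough sketch (field names like &lt;code&gt;trace_id&lt;/code&gt; are purely illustrative, not from any specific stack), the fix starts with normalizing every signal into one shared schema, then correlating them before a model ever sees them:&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Hypothetical normalized event: every signal (log, metric, alert)
# is mapped onto one schema before any model consumes it.
@dataclass
class Signal:
    source: str          # "logs", "metrics", or "alerts"
    service: str
    trace_id: str        # correlation key shared across sources
    message: str
    attributes: dict = field(default_factory=dict)

def correlate(signals):
    """Group signals by trace_id so a model receives one coherent view."""
    grouped = {}
    for s in signals:
        grouped.setdefault(s.trace_id, []).append(s)
    return grouped

incident = correlate([
    Signal("alerts", "checkout", "t-42", "p99 latency breach"),
    Signal("logs", "checkout", "t-42", "timeout calling payments"),
    Signal("metrics", "payments", "t-42", "connection pool saturated"),
])
print(len(incident["t-42"]))  # 3 signals, one correlated incident view
```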




&lt;h3&gt;
  
  
  2. Missing System Abstractions
&lt;/h3&gt;

&lt;p&gt;Human operators rely on implicit understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;service dependencies&lt;/li&gt;
&lt;li&gt;failure blast radius&lt;/li&gt;
&lt;li&gt;historical patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI systems do not have this intuition.&lt;/p&gt;

&lt;p&gt;If your architecture does not explicitly define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;service relationships&lt;/li&gt;
&lt;li&gt;ownership boundaries&lt;/li&gt;
&lt;li&gt;failure domains&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then the system is effectively uninterpretable to machines.&lt;/p&gt;

&lt;p&gt;AI requires structured abstractions. Most systems were never designed for that.&lt;/p&gt;
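
&lt;p&gt;A minimal sketch of what "structured abstractions" means here, with made-up service names: the dependency graph, ownership, and failure domains written down where a machine can query them:&lt;/p&gt;

```python
# Hypothetical explicit system model: the relationships a human keeps
# in their head, written down where a machine can read them.
SERVICES = {
    "checkout": {
        "depends_on": ["payments", "inventory"],
        "owner": "team-commerce",
        "failure_domain": "customer-facing",
    },
    "payments": {
        "depends_on": ["payments-db"],
        "owner": "team-payments",
        "failure_domain": "customer-facing",
    },
    "payments-db": {
        "depends_on": [],
        "owner": "team-payments",
        "failure_domain": "stateful",
    },
}

def blast_radius(service, model=SERVICES):
    """Every service that transitively depends on `service`."""
    affected = set()
    for name, spec in model.items():
        if service in spec["depends_on"] and name not in affected:
            affected.add(name)
            affected |= blast_radius(name, model)
    return affected

print(blast_radius("payments-db"))  # {'payments', 'checkout'}
```

&lt;p&gt;Once this exists, "what breaks if X fails?" stops being intuition and becomes a query.&lt;/p&gt;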




&lt;h3&gt;
  
  
  3. Non-Deterministic Workflows
&lt;/h3&gt;

&lt;p&gt;Incident response in many teams is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;partially documented&lt;/li&gt;
&lt;li&gt;context-driven&lt;/li&gt;
&lt;li&gt;experience-heavy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This works well for humans.&lt;/p&gt;

&lt;p&gt;But AI systems require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deterministic steps&lt;/li&gt;
&lt;li&gt;clearly defined decision paths&lt;/li&gt;
&lt;li&gt;reproducible workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this, automation becomes unreliable and unpredictable.&lt;/p&gt;
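
&lt;p&gt;The point can be sketched in a few lines (step names are hypothetical): incident response encoded as an ordered, explicit sequence instead of tribal knowledge, so the same input always produces the same, auditable path:&lt;/p&gt;

```python
# A deterministic workflow: each step is a named, pure-ish transform
# over a shared context dict.
WORKFLOW = [
    ("classify", lambda ctx: ctx.update(severity="high") or ctx),
    ("check_recent_deploys", lambda ctx: ctx.update(deploy_suspect=True) or ctx),
    ("decide", lambda ctx: ctx.update(
        action="rollback" if ctx.get("deploy_suspect") else "escalate") or ctx),
]

def run(workflow, ctx):
    """Execute every step in order and record the decision path."""
    trail = []
    for name, step in workflow:
        ctx = step(ctx)
        trail.append(name)   # reproducible, auditable trail
    return ctx, trail

ctx, trail = run(WORKFLOW, {"alert": "latency spike"})
print(trail)          # ['classify', 'check_recent_deploys', 'decide']
print(ctx["action"])  # 'rollback'
```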




&lt;h3&gt;
  
  
  4. The Hidden Constraint: System Readiness
&lt;/h3&gt;

&lt;p&gt;Before introducing AI into production, ask a more important question first:&lt;/p&gt;

&lt;p&gt;Is the system ready for AI?&lt;/p&gt;

&lt;p&gt;A production system is “AI-ready” only if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;signals are consistent and correlated&lt;/li&gt;
&lt;li&gt;dependencies are explicitly modeled&lt;/li&gt;
&lt;li&gt;workflows are structured and repeatable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these, AI will amplify system weaknesses instead of solving them.&lt;/p&gt;




&lt;h3&gt;
  
  
  Key Insight
&lt;/h3&gt;

&lt;p&gt;We are trying to apply AI to systems that were never designed to be understood by machines.&lt;/p&gt;

&lt;p&gt;That is the core problem.&lt;/p&gt;




&lt;h3&gt;
  
  
  A Better Approach
&lt;/h3&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;p&gt;“How do we improve the AI model?”&lt;/p&gt;

&lt;p&gt;We should ask:&lt;/p&gt;

&lt;p&gt;“How do we redesign systems to be machine-interpretable?”&lt;/p&gt;

&lt;p&gt;That shift changes everything.&lt;/p&gt;




&lt;p&gt;For engineers already experimenting with AI in production:&lt;/p&gt;

&lt;p&gt;What has been the hardest challenge so far&lt;br&gt;
— signal quality, dependency visibility, or workflow reliability?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>I Put an AI Agent in My Incident Workflow for 7 Days. Here’s What Actually Broke.</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Thu, 09 Apr 2026 00:50:17 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/i-put-an-ai-agent-in-my-incident-workflow-for-7-days-heres-what-actually-broke-4jlc</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/i-put-an-ai-agent-in-my-incident-workflow-for-7-days-heres-what-actually-broke-4jlc</guid>
      <description>&lt;p&gt;Everyone says AI agents will reduce on-call fatigue.&lt;/p&gt;

&lt;p&gt;So I added one to a real production incident workflow, not to replace engineers, but to assist with triage, summarization, and next-step recommendations.&lt;/p&gt;

&lt;p&gt;It helped in some places.&lt;br&gt;
It failed in others.&lt;br&gt;
And the biggest lesson had less to do with the model and more to do with system design.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I integrated an AI agent into a typical incident response flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incoming alerts from monitoring systems&lt;/li&gt;
&lt;li&gt;Initial triage and classification&lt;/li&gt;
&lt;li&gt;Root cause hypothesis&lt;/li&gt;
&lt;li&gt;Suggested remediation steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent was allowed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summarize alerts&lt;/li&gt;
&lt;li&gt;Group duplicate incidents&lt;/li&gt;
&lt;li&gt;Suggest possible causes&lt;/li&gt;
&lt;li&gt;Draft remediation steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent was NOT allowed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute production changes&lt;/li&gt;
&lt;li&gt;Restart services&lt;/li&gt;
&lt;li&gt;Modify configs&lt;/li&gt;
&lt;li&gt;Trigger escalations automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was intentional. I wanted to see where it adds value without risking production.&lt;/p&gt;
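
&lt;p&gt;That permission boundary can be enforced mechanically rather than by convention. A sketch (capability names mirror the lists above but are otherwise illustrative):&lt;/p&gt;

```python
# Guardrail: the agent may only invoke read/suggest capabilities;
# anything mutating production is rejected before it runs.
ALLOWED = {"summarize_alerts", "group_incidents", "suggest_causes", "draft_steps"}
DENIED = {"execute_change", "restart_service", "modify_config", "auto_escalate"}

def invoke(capability, handler, *args):
    if capability in DENIED or capability not in ALLOWED:
        raise PermissionError(f"agent may not call {capability!r}")
    return handler(*args)

print(invoke("summarize_alerts",
             lambda alerts: f"{len(alerts)} alerts grouped", ["a", "b"]))
# invoke("restart_service", ...) raises PermissionError
```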




&lt;h2&gt;
  
  
  What Worked Surprisingly Well
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Alert Summarization
&lt;/h3&gt;

&lt;p&gt;The agent reduced noisy alerts into clean summaries.&lt;/p&gt;

&lt;p&gt;Instead of reading through logs, I got:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“High latency observed in service X after deployment Y. Likely related to dependency Z.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This alone saved time during high-pressure incidents.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Duplicate Incident Grouping
&lt;/h3&gt;

&lt;p&gt;It grouped alerts that were actually the same issue.&lt;/p&gt;

&lt;p&gt;This reduced alert fatigue and helped focus on the real root cause faster.&lt;/p&gt;
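
&lt;p&gt;The grouping itself does not require a model at all for the easy cases. A rough sketch (alert fields are illustrative): alerts sharing a fingerprint of service plus symptom collapse into one incident:&lt;/p&gt;

```python
# Duplicate grouping by fingerprint: same service + same symptom
# means the same underlying incident.
def fingerprint(alert):
    return (alert["service"], alert["symptom"])

def group(alerts):
    incidents = {}
    for a in alerts:
        incidents.setdefault(fingerprint(a), []).append(a)
    return incidents

alerts = [
    {"service": "api", "symptom": "5xx", "source": "monitor-1"},
    {"service": "api", "symptom": "5xx", "source": "monitor-2"},
    {"service": "db", "symptom": "cpu", "source": "monitor-1"},
]
print(len(group(alerts)))  # 2 incidents from 3 alerts
```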




&lt;h3&gt;
  
  
  3. Drafting Next Steps
&lt;/h3&gt;

&lt;p&gt;It suggested reasonable first actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check recent deployments&lt;/li&gt;
&lt;li&gt;Validate dependency health&lt;/li&gt;
&lt;li&gt;Inspect error spikes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not perfect, but a solid starting point.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Broke Almost Immediately
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Wrong Prioritization
&lt;/h3&gt;

&lt;p&gt;The agent sometimes treated low-impact issues as critical.&lt;/p&gt;

&lt;p&gt;Severity is not just data. It is context.&lt;br&gt;
And context is hard.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. False Confidence
&lt;/h3&gt;

&lt;p&gt;The responses sounded very confident even when wrong.&lt;/p&gt;

&lt;p&gt;This is dangerous in production systems.&lt;br&gt;
Confidence ≠ correctness.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Noisy Recommendations
&lt;/h3&gt;

&lt;p&gt;Some suggestions were technically valid but operationally useless.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Restart the service”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In production, that is not always acceptable without deeper checks.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Escalation Confusion
&lt;/h3&gt;

&lt;p&gt;It struggled to decide when to involve humans.&lt;/p&gt;

&lt;p&gt;Too early → noise&lt;br&gt;
Too late → risk&lt;/p&gt;

&lt;p&gt;That balance is harder than it looks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem: System Design
&lt;/h2&gt;

&lt;p&gt;After a week, it became clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI agent was not the main problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The real issues were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weak incident workflows&lt;/li&gt;
&lt;li&gt;Poor escalation design&lt;/li&gt;
&lt;li&gt;Lack of structured context&lt;/li&gt;
&lt;li&gt;No clear guardrails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your system is messy, the AI will reflect that mess faster.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture That Works Better
&lt;/h2&gt;

&lt;p&gt;Here is what I would recommend instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Alert comes in
&lt;/li&gt;
&lt;li&gt;AI summarizes + groups signals
&lt;/li&gt;
&lt;li&gt;AI suggests possible causes
&lt;/li&gt;
&lt;li&gt;Human validates context
&lt;/li&gt;
&lt;li&gt;AI drafts remediation options
&lt;/li&gt;
&lt;li&gt;Human approves final action
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;AI as a co-pilot, not an autopilot.&lt;/strong&gt;&lt;/p&gt;
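
&lt;p&gt;The six steps above can be sketched as a pipeline where AI stages produce drafts and human stages gate them. Everything here is stubbed; &lt;code&gt;approve&lt;/code&gt; stands in for a real paging or approval UI:&lt;/p&gt;

```python
# Human-in-the-loop incident pipeline: AI drafts, humans gate.
def pipeline(alert, approve=lambda prompt: True):
    summary = f"summary of {alert}"              # 2. AI summarizes + groups
    causes = ["recent deploy", "dependency"]     # 3. AI suggests causes
    if not approve(f"validate context for {summary}?"):  # 4. human validates
        return "escalated to human"
    options = [f"mitigate {c}" for c in causes]  # 5. AI drafts remediation
    if not approve(f"apply {options[0]}?"):      # 6. human approves action
        return "no action taken"
    return options[0]

print(pipeline("latency alert"))  # 'mitigate recent deploy'
```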




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AI is great at summarization and pattern detection
&lt;/li&gt;
&lt;li&gt;It struggles with context and real-world constraints
&lt;/li&gt;
&lt;li&gt;Confidence can be misleading
&lt;/li&gt;
&lt;li&gt;System design matters more than model capability
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams trying AI in incident response are not failing because of the model.&lt;/p&gt;

&lt;p&gt;They are failing because their &lt;strong&gt;workflow is not designed for AI.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;AI can absolutely improve incident response.&lt;/p&gt;

&lt;p&gt;But if your escalation paths, permissions, and observability are weak,&lt;br&gt;&lt;br&gt;
the agent will not fix your system.&lt;/p&gt;

&lt;p&gt;It will expose it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Question for You
&lt;/h2&gt;

&lt;p&gt;Would you allow an AI agent in your on-call workflow?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recommendation only
&lt;/li&gt;
&lt;li&gt;Limited action with approval
&lt;/li&gt;
&lt;li&gt;Full automation
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Curious to hear how others are approaching this.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>Your AI Agent Is Not Failing. Your System Design Is.</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Fri, 27 Mar 2026 01:39:04 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/your-ai-agent-is-not-failing-your-system-design-is-3k90</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/your-ai-agent-is-not-failing-your-system-design-is-3k90</guid>
      <description>&lt;p&gt;Everyone is blaming AI agents.&lt;/p&gt;

&lt;p&gt;“They hallucinate.”&lt;br&gt;
“They don’t scale.”&lt;br&gt;
“They can’t handle production.”&lt;/p&gt;

&lt;p&gt;That’s not the real problem.&lt;/p&gt;

&lt;p&gt;The real problem?&lt;/p&gt;

&lt;p&gt;We are treating AI agents like tools.&lt;/p&gt;

&lt;p&gt;Instead of systems.&lt;/p&gt;

&lt;p&gt;In production, nothing works in isolation.&lt;/p&gt;

&lt;p&gt;Not your services.&lt;br&gt;
Not your pipelines.&lt;br&gt;
Not your on-call workflows.&lt;/p&gt;

&lt;p&gt;But somehow…&lt;/p&gt;

&lt;p&gt;We expect AI agents to just “figure it out.”&lt;/p&gt;

&lt;p&gt;Here’s what I’ve seen in real systems:&lt;/p&gt;

&lt;p&gt;AI fails when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context is fragmented&lt;/li&gt;
&lt;li&gt;State is lost between steps&lt;/li&gt;
&lt;li&gt;Decisions are not traceable&lt;/li&gt;
&lt;li&gt;There are no guardrails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not because the model is bad.&lt;/p&gt;

&lt;p&gt;Most teams are building:&lt;/p&gt;

&lt;p&gt;❌ Prompt → Response → Done&lt;/p&gt;

&lt;p&gt;But production needs:&lt;/p&gt;

&lt;p&gt;✅ Context → State → Memory → Feedback → Control&lt;/p&gt;

&lt;p&gt;That’s the difference between:&lt;/p&gt;

&lt;p&gt;👉 Demo AI&lt;br&gt;
vs&lt;br&gt;
👉 Production AI&lt;/p&gt;

&lt;p&gt;The shift is simple, but most miss it:&lt;/p&gt;

&lt;p&gt;AI agents are not features.&lt;/p&gt;

&lt;p&gt;They are distributed systems with reasoning loops.&lt;/p&gt;

&lt;p&gt;Until we design them that way…&lt;/p&gt;

&lt;p&gt;We’ll keep blaming the model&lt;br&gt;
for system problems.&lt;/p&gt;
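
&lt;p&gt;The Context → State → Memory → Feedback → Control loop can be sketched in a few lines. The model call is stubbed and every name here is illustrative; the point is only the shape of the loop:&lt;/p&gt;

```python
# One agent turn that threads context, state, and memory through the
# model call, with a control check on the way out.
def agent_turn(task, state, memory, model=lambda prompt: "restart pod"):
    context = {"task": task, "state": state, "history": memory[-3:]}
    proposal = model(str(context))                        # reasoning step
    memory.append({"task": task, "proposal": proposal})   # memory
    allowed = proposal in state["permitted_actions"]      # control
    state["last_feedback"] = "accepted" if allowed else "rejected"  # feedback
    return proposal if allowed else None

state = {"permitted_actions": {"restart pod"}, "last_feedback": None}
memory = []
print(agent_turn("pod crashloop", state, memory))  # 'restart pod'
```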


</description>
      <category>ai</category>
      <category>sre</category>
      <category>code</category>
      <category>developers</category>
    </item>
    <item>
      <title>What Actually Happens When You Put an AI Agent on Call</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Thu, 19 Mar 2026 02:05:33 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/what-actually-happens-when-you-put-an-ai-agent-on-call-1llf</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/what-actually-happens-when-you-put-an-ai-agent-on-call-1llf</guid>
      <description>&lt;p&gt;&lt;em&gt;AI agent assisting with real-time production incident response&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Everyone is talking about AI agents.&lt;/p&gt;

&lt;p&gt;But very few are actually using them in real production workflows.&lt;/p&gt;

&lt;p&gt;And almost no one talks about what really happens when they are put on call.&lt;/p&gt;

&lt;p&gt;We already trust AI with the everyday work:&lt;/p&gt;

&lt;p&gt;Writing code.&lt;br&gt;
Reviewing pull requests.&lt;br&gt;
Answering questions.&lt;br&gt;
Automating workflows.&lt;/p&gt;

&lt;p&gt;That part is easy.&lt;/p&gt;

&lt;p&gt;But production is different.&lt;/p&gt;

&lt;p&gt;It is noisy.&lt;br&gt;&lt;br&gt;
It is unpredictable.&lt;br&gt;&lt;br&gt;
And it does not forgive mistakes.&lt;/p&gt;

&lt;p&gt;So I kept thinking about one question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually happens when an AI agent becomes part of an on-call workflow?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not in a demo.&lt;br&gt;&lt;br&gt;
Not in a toy setup.&lt;br&gt;&lt;br&gt;
But in real systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Modern production systems generate too much information during an incident.&lt;/p&gt;

&lt;p&gt;A single issue can create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;alerts from multiple services
&lt;/li&gt;
&lt;li&gt;spikes in logs and metrics
&lt;/li&gt;
&lt;li&gt;duplicate symptom reports
&lt;/li&gt;
&lt;li&gt;confusion around root cause
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly where an AI agent looks useful.&lt;/p&gt;

&lt;p&gt;In theory, it can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summarize the incident
&lt;/li&gt;
&lt;li&gt;correlate alerts
&lt;/li&gt;
&lt;li&gt;scan logs
&lt;/li&gt;
&lt;li&gt;suggest likely causes
&lt;/li&gt;
&lt;li&gt;recommend next actions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sounds great.&lt;/p&gt;

&lt;p&gt;But the real value is not in replacing the engineer.&lt;/p&gt;

&lt;p&gt;It is in reducing the time spent navigating noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where AI actually helps
&lt;/h2&gt;

&lt;p&gt;After thinking through how an AI agent fits into on-call, I see four areas where it can be genuinely useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Incident summarization
&lt;/h3&gt;

&lt;p&gt;During an active issue, the first problem is usually information overload.&lt;/p&gt;

&lt;p&gt;An AI agent can quickly turn scattered signals into something readable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what changed
&lt;/li&gt;
&lt;li&gt;which services are affected
&lt;/li&gt;
&lt;li&gt;when the issue started
&lt;/li&gt;
&lt;li&gt;what symptoms are most visible
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That alone can save valuable time.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Log and alert correlation
&lt;/h3&gt;

&lt;p&gt;A human usually jumps between dashboards, logs, alerts, and deployment history.&lt;/p&gt;

&lt;p&gt;An AI agent can act like a first-pass investigator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;group similar errors
&lt;/li&gt;
&lt;li&gt;detect repeated patterns
&lt;/li&gt;
&lt;li&gt;connect alerts across services
&lt;/li&gt;
&lt;li&gt;highlight suspicious deployments or config changes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This does not replace debugging.&lt;/p&gt;

&lt;p&gt;But it gives the engineer a much better starting point.&lt;/p&gt;
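
&lt;p&gt;One concrete first-pass check is deploy correlation. As a sketch (timestamps in seconds, data shapes illustrative): flag any alert that fired within a short window after a deployment:&lt;/p&gt;

```python
# First-pass correlation: alerts that start shortly after a deploy
# are flagged together with the suspect service.
def correlate_deploys(alerts, deploys, window=300):
    """Flag alerts that began within `window` seconds after a deploy."""
    flagged = []
    for a in alerts:
        for d in deploys:
            gap = a["ts"] - d["ts"]
            if window >= gap >= 0:
                flagged.append((a["name"], d["service"]))
    return flagged

alerts = [{"name": "checkout 5xx", "ts": 1100}]
deploys = [{"service": "payments", "ts": 1000},
           {"service": "search", "ts": 200}]
print(correlate_deploys(alerts, deploys))  # [('checkout 5xx', 'payments')]
```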




&lt;h3&gt;
  
  
  3. Runbook guidance
&lt;/h3&gt;

&lt;p&gt;During an incident, even experienced engineers forget details.&lt;/p&gt;

&lt;p&gt;An AI agent can help by pulling the most relevant runbook steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;known mitigation paths
&lt;/li&gt;
&lt;li&gt;rollback instructions
&lt;/li&gt;
&lt;li&gt;common checks
&lt;/li&gt;
&lt;li&gt;escalation conditions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially useful when incidents happen outside normal working hours.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Post-incident support
&lt;/h3&gt;

&lt;p&gt;AI can also help after the issue is resolved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summarize the timeline
&lt;/li&gt;
&lt;li&gt;draft incident notes
&lt;/li&gt;
&lt;li&gt;organize contributing factors
&lt;/li&gt;
&lt;li&gt;prepare a clean starting point for postmortem review
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That reduces operational overhead and improves documentation quality.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where people get it wrong
&lt;/h2&gt;

&lt;p&gt;The biggest mistake is expecting AI to function like an autonomous incident commander.&lt;/p&gt;

&lt;p&gt;That is risky.&lt;/p&gt;

&lt;p&gt;Production systems are full of edge cases, hidden dependencies, and partial signals.&lt;/p&gt;

&lt;p&gt;An AI agent may sound confident while still being wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  A safer way to use it
&lt;/h2&gt;

&lt;p&gt;If you are introducing an AI agent into an on-call workflow, boundaries matter.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summarize, not blindly decide
&lt;/li&gt;
&lt;li&gt;recommend, not auto-execute
&lt;/li&gt;
&lt;li&gt;explain reasoning before suggesting action
&lt;/li&gt;
&lt;li&gt;stay within approved operational limits
&lt;/li&gt;
&lt;li&gt;escalate to humans for risky changes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where AI becomes useful without becoming dangerous.&lt;/p&gt;
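
&lt;p&gt;Those boundaries can be made mechanical. A sketch (action names and risk classes are hypothetical): every suggested action carries a risk class, and only low-risk, read-only actions pass without a human:&lt;/p&gt;

```python
# Risk-gated actions: low risk is auto-approved, everything else
# needs a human or gets escalated. Unknown actions default to high.
RISK = {"read_logs": "low", "scale_up": "medium", "failover_db": "high"}

def gate(action, human_approves=lambda a: False):
    risk = RISK.get(action, "high")
    if risk == "low":
        return "auto-approved"
    if human_approves(action):
        return "approved by human"
    return "escalated"

print(gate("read_logs"))    # 'auto-approved'
print(gate("failover_db"))  # 'escalated'
```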




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;AI in production is not about intelligence.&lt;/p&gt;

&lt;p&gt;It is about control, constraints, and safety.&lt;/p&gt;

&lt;p&gt;The teams that benefit most will not be the ones that hand over control too early.&lt;/p&gt;

&lt;p&gt;They will be the ones that use AI to make engineers faster, clearer, and more informed.&lt;/p&gt;




&lt;h2&gt;
  
  
  👇 Curious
&lt;/h2&gt;

&lt;p&gt;Would you trust an AI agent to take action in production?&lt;/p&gt;

&lt;p&gt;Or should it stay as a recommendation layer?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>agents</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Let AI Review 1,000 Lines of My Production Code — The Bugs It Found Shocked Me</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Sat, 14 Mar 2026 05:31:45 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/i-let-ai-review-1000-lines-of-my-production-code-the-bugs-it-found-shocked-me-660</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/i-let-ai-review-1000-lines-of-my-production-code-the-bugs-it-found-shocked-me-660</guid>
      <description>&lt;p&gt;Last week I ran an experiment.&lt;/p&gt;

&lt;p&gt;Instead of reviewing a new production service manually, I asked an AI model to analyze around 1,000 lines of production code.&lt;/p&gt;

&lt;p&gt;The goal was simple:&lt;/p&gt;

&lt;p&gt;Find bugs a human reviewer might miss.&lt;/p&gt;

&lt;p&gt;The result surprised me.&lt;/p&gt;

&lt;p&gt;The AI identified multiple potential issues in less than two minutes — including a race condition and an error handling problem that had already caused a production incident months ago.&lt;/p&gt;

&lt;p&gt;Here’s exactly what happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;The codebase contained roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,000 lines of production service code&lt;/li&gt;
&lt;li&gt;several async workflows&lt;/li&gt;
&lt;li&gt;API retry logic&lt;/li&gt;
&lt;li&gt;distributed system error handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The service runs in a cloud environment and processes internal infrastructure requests.&lt;/p&gt;

&lt;p&gt;Instead of performing a traditional code review, I asked AI to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;analyze the code&lt;/li&gt;
&lt;li&gt;identify risky patterns&lt;/li&gt;
&lt;li&gt;suggest improvements&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What the AI Found
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Hidden Race Condition
&lt;/h3&gt;

&lt;p&gt;The AI detected a potential race condition involving asynchronous task execution.&lt;/p&gt;

&lt;p&gt;The issue occurred when multiple requests triggered the same background worker.&lt;/p&gt;

&lt;p&gt;This could lead to duplicate processing.&lt;/p&gt;

&lt;p&gt;It wasn’t obvious during normal code review.&lt;/p&gt;
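
&lt;p&gt;The general fix for this class of bug (sketched here with asyncio, not the actual service code) is to share one in-flight task per job instead of letting each request start its own:&lt;/p&gt;

```python
import asyncio

calls = {"n": 0}   # counts real executions, to show dedup works
_inflight = {}     # job_id -> the single in-flight task

async def process(job_id):
    calls["n"] += 1
    await asyncio.sleep(0.01)   # stand-in for real work
    return f"processed {job_id}"

async def run_once(job_id):
    # No await between the check and the insert, so in single-threaded
    # asyncio this check-then-create is race-free.
    if job_id not in _inflight:
        _inflight[job_id] = asyncio.create_task(process(job_id))
    try:
        return await _inflight[job_id]
    finally:
        _inflight.pop(job_id, None)

async def main():
    # Three concurrent "requests" for the same job: one execution.
    results = await asyncio.gather(*(run_once("job-1") for _ in range(3)))
    print(results, calls["n"])

asyncio.run(main())
```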




&lt;h3&gt;
  
  
  2. Silent Failure in Error Handling
&lt;/h3&gt;

&lt;p&gt;One block caught exceptions but never logged them.&lt;/p&gt;

&lt;p&gt;That meant failures could occur silently.&lt;/p&gt;

&lt;p&gt;In production systems, silent failures are extremely dangerous because they hide operational issues.&lt;/p&gt;
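
&lt;p&gt;The anti-pattern and its minimal correction look like this (the failing function is a stand-in, not the reviewed code):&lt;/p&gt;

```python
import logging

logger = logging.getLogger("worker")

def risky():
    raise ValueError("upstream returned malformed payload")

def handle_silently():
    try:
        risky()
    except ValueError:
        pass                      # failure disappears; nobody is paged

def handle_loudly():
    try:
        risky()
    except ValueError:
        # logger.exception records the full traceback at ERROR level
        logger.exception("risky() failed; continuing with fallback")
        return "fallback"

print(handle_loudly())  # 'fallback' (plus a logged traceback)
```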




&lt;h3&gt;
  
  
  3. Retry Logic That Could Amplify Outages
&lt;/h3&gt;

&lt;p&gt;The AI also pointed out a retry pattern that could unintentionally amplify incidents.&lt;/p&gt;

&lt;p&gt;Instead of exponential backoff, the system retried requests too aggressively.&lt;/p&gt;

&lt;p&gt;Under heavy load, this could worsen outages.&lt;/p&gt;
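
&lt;p&gt;For contrast with that aggressive pattern, here is a minimal sketch of exponential backoff with "full jitter", which spreads retries out instead of hammering a struggling dependency:&lt;/p&gt;

```python
import random

# Exponential backoff with full jitter: each delay is drawn uniformly
# from [0, base * 2**attempt], capped so it never grows unbounded.
def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    delays = []
    for attempt in range(attempts):
        exp = min(cap, base * (2 ** attempt))
        delays.append(rng() * exp)
    return delays

# With a fixed rng for illustration, the ceiling doubles each attempt:
print(backoff_delays(5, rng=lambda: 1.0))  # [0.5, 1.0, 2.0, 4.0, 8.0]
```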




&lt;h2&gt;
  
  
  Where AI Still Struggles
&lt;/h2&gt;

&lt;p&gt;AI analysis isn't perfect.&lt;/p&gt;

&lt;p&gt;In some cases the model suggested improvements that were unnecessary.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;refactoring code that was already optimized&lt;/li&gt;
&lt;li&gt;simplifying logic that existed for historical reasons&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why &lt;strong&gt;human review is still critical.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means for Engineers
&lt;/h2&gt;

&lt;p&gt;AI won't replace engineers.&lt;/p&gt;

&lt;p&gt;But it will dramatically change how we work.&lt;/p&gt;

&lt;p&gt;Instead of reviewing every line of code manually, engineers will increasingly rely on AI to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scan large codebases&lt;/li&gt;
&lt;li&gt;identify risky patterns&lt;/li&gt;
&lt;li&gt;detect hidden bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engineer's role becomes more about &lt;strong&gt;system design and decision making.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI code analysis tools are improving rapidly.&lt;/p&gt;

&lt;p&gt;They won't eliminate traditional reviews, but they can dramatically reduce the time it takes to detect problems in production systems.&lt;/p&gt;

&lt;p&gt;And sometimes they find things humans miss.&lt;/p&gt;

&lt;p&gt;The real question is:&lt;/p&gt;

&lt;p&gt;How soon will AI become part of every engineering workflow?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>code</category>
      <category>developers</category>
    </item>
    <item>
      <title>I Replaced My On-Call Runbook with AI — Here’s What Happened in Production</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Thu, 12 Mar 2026 03:31:32 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/i-replaced-my-on-call-runbook-with-ai-heres-what-happened-in-production-lc5</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/i-replaced-my-on-call-runbook-with-ai-heres-what-happened-in-production-lc5</guid>
      <description>&lt;p&gt;Last month I tried something risky.&lt;/p&gt;

&lt;p&gt;Instead of waking up at 3AM to debug production incidents, I experimented with an AI assistant handling the &lt;strong&gt;first layer of incident triage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;No runbook.&lt;br&gt;
No manual log digging.&lt;br&gt;
Just AI analyzing alerts, logs, and metrics.&lt;/p&gt;

&lt;p&gt;Here’s what actually happened in production.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Problem Every On-Call Engineer Knows
&lt;/h2&gt;

&lt;p&gt;If you've ever been on call, you know the routine.&lt;/p&gt;

&lt;p&gt;PagerDuty fires.&lt;/p&gt;

&lt;p&gt;You open logs.&lt;/p&gt;

&lt;p&gt;You check dashboards.&lt;/p&gt;

&lt;p&gt;You run the same 5 commands.&lt;/p&gt;

&lt;p&gt;Every single time.&lt;/p&gt;

&lt;p&gt;The process is predictable, but it still requires a human in the loop.&lt;/p&gt;

&lt;p&gt;So I asked a simple question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why can't AI do the first layer of incident investigation?&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;Instead of engineers performing repetitive triage, I built a simple &lt;strong&gt;AI incident assistant&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The AI receives alerts and performs initial debugging steps automatically.&lt;/p&gt;

&lt;p&gt;The architecture looked like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert → AI Agent → Log Analysis → Root Cause Guess → Suggested Fix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tools used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI API&lt;/li&gt;
&lt;li&gt;GitHub Actions&lt;/li&gt;
&lt;li&gt;Kubernetes logs&lt;/li&gt;
&lt;li&gt;Prometheus metrics&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The AI Prompt
&lt;/h2&gt;

&lt;p&gt;The core of the system was surprisingly simple.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a Site Reliability Engineer assistant.

Analyze the following production logs and metrics.

Tasks:
1. Identify possible root causes
2. Classify incident severity
3. Suggest debugging steps
4. Provide likely remediation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prompt runs every time a critical alert fires.&lt;/p&gt;
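&lt;p&gt;For illustration, here is a minimal Python sketch of the glue between an alert and the model. The function and field names are mine, not the exact production code, but the shape is the same: bundle the alert, recent logs, and metrics into one context payload and send it along with the system prompt.&lt;/p&gt;

```python
import json

# System prompt used for every critical alert (mirrors the prompt above).
TRIAGE_PROMPT = """You are a Site Reliability Engineer assistant.

Analyze the following production logs and metrics.

Tasks:
1. Identify possible root causes
2. Classify incident severity
3. Suggest debugging steps
4. Provide likely remediation"""

def build_triage_messages(alert: dict, log_lines: list, metrics: dict) -> list:
    """Assemble the chat messages sent to the model for one alert."""
    context = json.dumps(
        {"alert": alert, "logs": log_lines[-200:], "metrics": metrics},
        indent=2,
    )
    return [
        {"role": "system", "content": TRIAGE_PROMPT},
        {"role": "user", "content": context},
    ]

# In the real pipeline these messages would be passed to the OpenAI API,
# e.g. client.chat.completions.create(model=..., messages=messages),
# and the response posted to the incident channel.
messages = build_triage_messages(
    alert={"name": "APILatencyHigh", "severity": "critical"},
    log_lines=["GET /pay 200 1840ms", "redis timeout after 2000ms"],
    metrics={"p99_latency_ms": 1900},
)
print(messages[1]["content"])
```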




&lt;h2&gt;
  
  
  Real Incident Example
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Incident:&lt;/strong&gt; API latency spike&lt;/p&gt;

&lt;p&gt;Logs showed increased response times.&lt;/p&gt;

&lt;p&gt;The AI analyzed the logs and returned:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Possible Root Cause&lt;/strong&gt;&lt;br&gt;
Redis latency increase due to connection pool saturation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suggested Debugging Steps&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check Redis CPU usage&lt;/li&gt;
&lt;li&gt;Inspect connection pool metrics&lt;/li&gt;
&lt;li&gt;Verify recent deployment changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Suggested Fix&lt;/strong&gt;&lt;br&gt;
Scale Redis replicas or increase connection pool size.&lt;/p&gt;

&lt;p&gt;Time to initial diagnosis:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3 minutes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Typical human triage time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;15–20 minutes&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What Worked Surprisingly Well
&lt;/h2&gt;

&lt;p&gt;The AI was very good at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pattern recognition in logs&lt;/li&gt;
&lt;li&gt;Suggesting common infrastructure fixes&lt;/li&gt;
&lt;li&gt;Identifying deployment-related issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It reduced time spent on &lt;strong&gt;basic incident investigation&lt;/strong&gt; dramatically.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Failed
&lt;/h2&gt;

&lt;p&gt;AI is not perfect.&lt;/p&gt;

&lt;p&gt;Twice it suggested completely wrong root causes.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;It blamed database contention when the real issue was a &lt;strong&gt;misconfigured feature flag&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Lesson learned:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Never allow AI to make production changes automatically.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI should assist engineers, not replace them.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Future of On-Call Engineering
&lt;/h2&gt;

&lt;p&gt;The biggest realization was this:&lt;/p&gt;

&lt;p&gt;AI doesn't replace engineers.&lt;/p&gt;

&lt;p&gt;It replaces the &lt;strong&gt;boring parts of operations&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The repetitive steps.&lt;br&gt;
The predictable debugging paths.&lt;br&gt;
The manual log searching.&lt;/p&gt;

&lt;p&gt;The future of SRE might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert → AI Investigation → Engineer Decision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Engineers focus on solving real problems.&lt;/p&gt;

&lt;p&gt;AI handles the repetitive investigation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;After running this experiment for a few weeks, one thing became clear.&lt;/p&gt;

&lt;p&gt;AI is incredibly useful for &lt;strong&gt;incident triage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not perfect.&lt;/p&gt;

&lt;p&gt;But powerful enough to reduce on-call fatigue significantly.&lt;/p&gt;

&lt;p&gt;And honestly…&lt;/p&gt;

&lt;p&gt;Anything that reduces 3AM debugging sessions is worth exploring.&lt;/p&gt;




&lt;p&gt;If you're experimenting with AI in DevOps or SRE workflows, I'd love to hear what you're building.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>programming</category>
      <category>sre</category>
    </item>
    <item>
      <title>When AI Becomes Your On-Call Engineer: The Future of Incident Response</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Tue, 10 Mar 2026 00:34:31 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/when-ai-becomes-your-on-call-engineer-the-future-of-incident-response-5bb9</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/when-ai-becomes-your-on-call-engineer-the-future-of-incident-response-5bb9</guid>
      <description>&lt;p&gt;Modern production systems generate millions of logs and alerts. But what happens when AI starts acting like an on-call engineer? Let’s explore how AI is changing incident response forever.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Traditional Incident Response
&lt;/h2&gt;

&lt;p&gt;Most incident workflows still look like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Alert fires
&lt;/li&gt;
&lt;li&gt;PagerDuty wakes someone up
&lt;/li&gt;
&lt;li&gt;Engineer opens dashboards
&lt;/li&gt;
&lt;li&gt;Checks logs
&lt;/li&gt;
&lt;li&gt;Checks metrics
&lt;/li&gt;
&lt;li&gt;Correlates changes
&lt;/li&gt;
&lt;li&gt;Identifies root cause
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even for experienced engineers, this process often takes &lt;strong&gt;20–60 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The real challenge isn't fixing the issue.&lt;/p&gt;

&lt;p&gt;The real challenge is &lt;strong&gt;finding the signal inside massive operational noise&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In large cloud systems we often deal with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Millions of logs&lt;/li&gt;
&lt;li&gt;Hundreds of deployments&lt;/li&gt;
&lt;li&gt;Thousands of metrics&lt;/li&gt;
&lt;li&gt;Dozens of dependent services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Humans simply cannot analyze all this information quickly enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enter AI-Driven Incident Triage
&lt;/h2&gt;

&lt;p&gt;AI systems are starting to change how incidents are investigated.&lt;/p&gt;

&lt;p&gt;Instead of engineers manually searching through dashboards and logs, AI can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;correlate logs across services
&lt;/li&gt;
&lt;li&gt;detect anomaly patterns
&lt;/li&gt;
&lt;li&gt;identify suspicious deployments
&lt;/li&gt;
&lt;li&gt;analyze request traces
&lt;/li&gt;
&lt;li&gt;generate possible root causes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a new workflow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert → AI Investigation → Human Confirmation → Fix&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The engineer becomes the &lt;strong&gt;decision maker&lt;/strong&gt;, not the &lt;strong&gt;log detective&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Example: AI Debugging a Production Incident
&lt;/h2&gt;

&lt;p&gt;Imagine a latency spike in a payment API.&lt;/p&gt;

&lt;p&gt;Traditional debugging might look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check Grafana dashboards
&lt;/li&gt;
&lt;li&gt;Search logs across services
&lt;/li&gt;
&lt;li&gt;Look at recent deployments
&lt;/li&gt;
&lt;li&gt;Analyze request traces
&lt;/li&gt;
&lt;li&gt;Compare infrastructure metrics
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This investigation could easily take &lt;strong&gt;30 minutes or more&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An AI system, however, could analyze all signals in seconds and return something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Latency spike likely caused by increased retries between &lt;code&gt;payment-service&lt;/code&gt; and &lt;code&gt;auth-service&lt;/code&gt; after deployment version &lt;code&gt;v2.4.1&lt;/code&gt;.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of digging through dashboards, the engineer immediately focuses on the real issue.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Next Evolution: Autonomous Incident Response
&lt;/h2&gt;

&lt;p&gt;The next phase is even more interesting.&lt;/p&gt;

&lt;p&gt;AI systems will not only &lt;strong&gt;analyze incidents&lt;/strong&gt; — they will start &lt;strong&gt;resolving them automatically&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We are already seeing early versions of this in modern platforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;automatic rollback of faulty deployments
&lt;/li&gt;
&lt;li&gt;restarting unhealthy services
&lt;/li&gt;
&lt;li&gt;dynamic traffic routing
&lt;/li&gt;
&lt;li&gt;automated scaling decisions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means many incidents could be resolved &lt;strong&gt;before engineers even notice them&lt;/strong&gt;.&lt;/p&gt;
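&lt;p&gt;If you experiment with this, the safety layer matters more than the model. As a hedged sketch (playbook names and the confidence threshold are illustrative, not from any specific platform): automation should only fire when a finding matches a pre-approved playbook and the confidence is high; everything else still pages a human.&lt;/p&gt;

```python
# Hypothetical guardrail layer: AI findings only trigger automation when
# they match a pre-approved playbook AND the model is highly confident.
APPROVED_PLAYBOOKS = {
    "deploy_regression": "rollback_last_deployment",
    "unhealthy_pods": "restart_unhealthy_pods",
}

def decide_action(finding: dict) -> str:
    """Map an AI finding to an automated action, or page a human."""
    playbook = APPROVED_PLAYBOOKS.get(finding.get("cause", ""))
    confident = finding.get("confidence", 0.0) > 0.9
    if playbook and confident:
        return playbook
    return "page_oncall"  # default: human stays in the loop

print(decide_action({"cause": "deploy_regression", "confidence": 0.95}))
print(decide_action({"cause": "redis_saturation", "confidence": 0.95}))
```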




&lt;h2&gt;
  
  
  What This Means for SREs
&lt;/h2&gt;

&lt;p&gt;AI will not replace SREs.&lt;/p&gt;

&lt;p&gt;But it will significantly &lt;strong&gt;change the role of reliability engineers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of spending time manually debugging incidents, engineers will focus more on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;designing resilient architectures
&lt;/li&gt;
&lt;li&gt;building observability pipelines
&lt;/li&gt;
&lt;li&gt;training AI operational models
&lt;/li&gt;
&lt;li&gt;validating automated responses
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SREs will shift from &lt;strong&gt;incident responders&lt;/strong&gt; to &lt;strong&gt;reliability architects&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Challenge: Trust
&lt;/h2&gt;

&lt;p&gt;The biggest challenge isn't technology.&lt;/p&gt;

&lt;p&gt;It's trust.&lt;/p&gt;

&lt;p&gt;Engineers must learn to trust systems that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;investigate incidents
&lt;/li&gt;
&lt;li&gt;recommend fixes
&lt;/li&gt;
&lt;li&gt;automatically resolve problems
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But this pattern isn't new.&lt;/p&gt;

&lt;p&gt;Years ago engineers were hesitant to trust:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;automated deployments
&lt;/li&gt;
&lt;li&gt;autoscaling systems
&lt;/li&gt;
&lt;li&gt;infrastructure as code
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today those tools are essential.&lt;/p&gt;

&lt;p&gt;AI-driven operations will likely follow the same path.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The future of reliability engineering may look very different from today.&lt;/p&gt;

&lt;p&gt;Engineers will design systems.&lt;/p&gt;

&lt;p&gt;AI will monitor them.&lt;/p&gt;

&lt;p&gt;Many incidents will be detected, analyzed, and resolved automatically.&lt;/p&gt;

&lt;p&gt;And the dreaded &lt;strong&gt;2 AM production page&lt;/strong&gt; might finally become rare.&lt;/p&gt;

&lt;p&gt;Or at least… much quieter.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>cloud</category>
      <category>sre</category>
    </item>
    <item>
      <title>Build an AI Code Review Agent in GitHub Actions (That Actually Reduces Incidents)</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Sun, 22 Feb 2026 01:30:54 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/build-an-ai-code-review-agent-in-github-actions-that-actually-reduces-incidents-439</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/build-an-ai-code-review-agent-in-github-actions-that-actually-reduces-incidents-439</guid>
      <description>&lt;p&gt;Build an AI Code Review Agent in GitHub Actions (That Actually Reduces Incidents)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A production-grade GitHub Actions workflow + an SRE reliability rubric that transforms AI from a code suggester into a structured risk detection system.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We tried AI code review in CI.&lt;/p&gt;

&lt;p&gt;It was fast.&lt;br&gt;
It was confident.&lt;br&gt;
It was mostly noise.&lt;/p&gt;

&lt;p&gt;It praised trivial refactors.&lt;br&gt;
It nitpicked formatting.&lt;br&gt;
It occasionally hallucinated “critical issues.”&lt;/p&gt;

&lt;p&gt;And it did absolutely nothing to reduce production incidents.&lt;/p&gt;

&lt;p&gt;The mistake wasn’t using AI.&lt;/p&gt;

&lt;p&gt;The mistake was asking AI to “review code.”&lt;/p&gt;

&lt;p&gt;In reliability engineering, we don’t ask:&lt;br&gt;
“Is this code good?”&lt;/p&gt;

&lt;p&gt;We ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the blast radius?&lt;/li&gt;
&lt;li&gt;What is the rollback plan?&lt;/li&gt;
&lt;li&gt;What happens under failure?&lt;/li&gt;
&lt;li&gt;What is the operational risk?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we rebuilt our AI reviewer using SRE principles.&lt;/p&gt;

&lt;p&gt;This is the exact system.&lt;/p&gt;


&lt;h2&gt;
  
  
  🚨 Why Most AI Code Review Systems Fail
&lt;/h2&gt;

&lt;p&gt;Most implementations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run an LLM over a PR diff&lt;/li&gt;
&lt;li&gt;Ask for general feedback&lt;/li&gt;
&lt;li&gt;Post suggestions as a comment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result?&lt;/p&gt;

&lt;p&gt;Unstructured opinions.&lt;/p&gt;

&lt;p&gt;But production incidents are rarely caused by style issues.&lt;br&gt;
They’re caused by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing rollback strategy&lt;/li&gt;
&lt;li&gt;Untested edge cases&lt;/li&gt;
&lt;li&gt;Configuration drift&lt;/li&gt;
&lt;li&gt;Silent failure paths&lt;/li&gt;
&lt;li&gt;Inconsistent validation&lt;/li&gt;
&lt;li&gt;Operational blind spots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your AI doesn’t classify risk, it cannot reduce incidents.&lt;/p&gt;


&lt;h2&gt;
  
  
  🧠 The Shift: From “Suggestions” to “Structured Risk Classification”
&lt;/h2&gt;

&lt;p&gt;We introduced a mandatory review schema.&lt;/p&gt;

&lt;p&gt;AI must output:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Category&lt;/li&gt;
&lt;li&gt;Severity&lt;/li&gt;
&lt;li&gt;Confidence&lt;/li&gt;
&lt;li&gt;Production Impact&lt;/li&gt;
&lt;li&gt;Required Action&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If it cannot classify something within this structure — it doesn’t get posted.&lt;/p&gt;

&lt;p&gt;This immediately reduced noise by ~60%.&lt;/p&gt;

&lt;p&gt;Because vague suggestions were eliminated.&lt;/p&gt;


&lt;h2&gt;
  
  
  📋 The Reliability Review Rubric
&lt;/h2&gt;

&lt;p&gt;This is the foundation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Confidence&lt;/th&gt;
&lt;th&gt;Required Output&lt;/th&gt;
&lt;th&gt;Production Lens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reliability&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Certain&lt;/td&gt;
&lt;td&gt;Rollback plan&lt;/td&gt;
&lt;td&gt;Data loss? Downtime?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Likely&lt;/td&gt;
&lt;td&gt;Validation proof&lt;/td&gt;
&lt;td&gt;External input risk?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Certain&lt;/td&gt;
&lt;td&gt;Missing tests&lt;/td&gt;
&lt;td&gt;Edge-case exposure?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operability&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Likely&lt;/td&gt;
&lt;td&gt;Logging/metrics&lt;/td&gt;
&lt;td&gt;Debuggability risk?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Uncertain&lt;/td&gt;
&lt;td&gt;Benchmark proof&lt;/td&gt;
&lt;td&gt;Latency spike risk?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key Rule:&lt;/p&gt;

&lt;p&gt;The AI is not allowed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Approve code&lt;/li&gt;
&lt;li&gt;Suggest stylistic improvements unless they impact reliability&lt;/li&gt;
&lt;li&gt;Comment without severity classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This converts AI from opinion engine → reliability signal engine.&lt;/p&gt;


&lt;h2&gt;
  
  
  ⚙️ GitHub Actions Architecture
&lt;/h2&gt;

&lt;p&gt;We designed the workflow in 4 stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Diff Extraction&lt;/li&gt;
&lt;li&gt;Context Enrichment&lt;/li&gt;
&lt;li&gt;AI Risk Classification&lt;/li&gt;
&lt;li&gt;Structured PR Feedback&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  🛠️ Step 1: Extract the True PR Surface Area
&lt;/h2&gt;

&lt;p&gt;We only feed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changed files&lt;/li&gt;
&lt;li&gt;Unified diff&lt;/li&gt;
&lt;li&gt;File ownership context&lt;/li&gt;
&lt;li&gt;Environment metadata (service type, criticality level)
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AI Reliability Code Review&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;synchronize&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;reopened&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;reliability-review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate PR Diff&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;git diff origin/main...HEAD &amp;gt; pr.diff&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Collect Metadata&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;echo "service=payments-api" &amp;gt;&amp;gt; context.txt&lt;/span&gt;
          &lt;span class="s"&gt;echo "tier=critical" &amp;gt;&amp;gt; context.txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Why this matters:&lt;/p&gt;

&lt;p&gt;Context drastically improves classification accuracy.&lt;br&gt;
A migration change in a critical payments service ≠ UI change in a dashboard.&lt;/p&gt;


&lt;h2&gt;
  
  
  🤖 Step 2: AI Classification Layer
&lt;/h2&gt;

&lt;p&gt;Instead of prompting:&lt;/p&gt;

&lt;p&gt;“Review this code.”&lt;/p&gt;

&lt;p&gt;We prompt:&lt;/p&gt;

&lt;p&gt;“Classify each risk under this schema. If uncertain, mark confidence as uncertain. Do not speculate.”&lt;/p&gt;

&lt;p&gt;Example expected output:&lt;/p&gt;
&lt;h3&gt;
  
  
  AI Risk Report
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Confidence&lt;/th&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reliability&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Certain&lt;/td&gt;
&lt;td&gt;DB migration lacks rollback path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Likely&lt;/td&gt;
&lt;td&gt;No null-input test coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operability&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Certain&lt;/td&gt;
&lt;td&gt;No structured error logging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No essays.&lt;br&gt;
No praise.&lt;br&gt;
Just structured risk.&lt;/p&gt;


&lt;h2&gt;
  
  
  💬 Step 3: Structured PR Comment Output
&lt;/h2&gt;

&lt;p&gt;We auto-generate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### 🔎 AI Reliability Review&lt;/span&gt;

| Category | Severity | Confidence | Impact |
|----------|----------|------------|--------|
| Reliability | High | Certain | Possible migration failure without rollback |
| Testing | Medium | Likely | Edge case failure under null input |

&lt;span class="gu"&gt;### Required Actions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Document rollback strategy
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Add null-input test
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Add structured logging for error path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the human reviewer can triage immediately.&lt;/p&gt;
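&lt;p&gt;Rendering that comment is deliberately mechanical. A rough Python sketch (field names are illustrative, not our exact schema):&lt;/p&gt;

```python
def render_review_comment(findings: list) -> str:
    """Render validated AI findings as the markdown table posted on the PR."""
    lines = [
        "### 🔎 AI Reliability Review",
        "",
        "| Category | Severity | Confidence | Impact |",
        "|----------|----------|------------|--------|",
    ]
    for f in findings:
        lines.append(
            "| {category} | {severity} | {confidence} | {impact} |".format(**f)
        )
    lines.append("")
    lines.append("### Required Actions")
    for f in findings:
        lines.append("- [ ] " + f["action"])
    return "\n".join(lines)

comment = render_review_comment([
    {
        "category": "Reliability", "severity": "High", "confidence": "Certain",
        "impact": "Possible migration failure without rollback",
        "action": "Document rollback strategy",
    },
])
print(comment)
```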




&lt;h2&gt;
  
  
  📊 Real Operational Impact
&lt;/h2&gt;

&lt;p&gt;After deploying this system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review noise reduced significantly&lt;/li&gt;
&lt;li&gt;Reviewers focused on high-severity items first&lt;/li&gt;
&lt;li&gt;Rollback plans increased across PRs&lt;/li&gt;
&lt;li&gt;Edge-case test coverage improved&lt;/li&gt;
&lt;li&gt;Incident retros showed fewer “missing test / missing rollback” root causes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI didn’t reduce incidents.&lt;/p&gt;

&lt;p&gt;Structured enforcement did.&lt;/p&gt;

&lt;p&gt;AI simply enforced discipline consistently.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔒 Guardrails That Made It Production-Safe
&lt;/h2&gt;

&lt;p&gt;We added strict constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI cannot block merge directly&lt;/li&gt;
&lt;li&gt;High severity items require human acknowledgment&lt;/li&gt;
&lt;li&gt;Low confidence findings are labeled informational&lt;/li&gt;
&lt;li&gt;The model cannot auto-edit code&lt;/li&gt;
&lt;li&gt;Outputs must match strict JSON schema&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If schema validation fails → comment not posted.&lt;/p&gt;

&lt;p&gt;This prevents hallucination-driven chaos.&lt;/p&gt;
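&lt;p&gt;The schema gate itself can be a few lines of Python. This is a simplified sketch (the real schema is stricter, and the field and enum names here are illustrative), but it shows the rule: anything that fails validation is dropped, never posted.&lt;/p&gt;

```python
# Minimal validator for the "no valid schema, no comment" rule.
REQUIRED_FIELDS = {"category", "severity", "confidence", "impact", "action"}
SEVERITIES = {"high", "medium", "low"}
CONFIDENCES = {"certain", "likely", "uncertain"}

def is_valid_finding(finding: dict) -> bool:
    """Reject any AI output that does not match the strict schema."""
    if not REQUIRED_FIELDS.issubset(finding):
        return False
    if finding["severity"].lower() not in SEVERITIES:
        return False
    if finding["confidence"].lower() not in CONFIDENCES:
        return False
    return True

def postable(findings: list) -> list:
    """Drop malformed findings so hallucinated free text is never posted."""
    return [f for f in findings if is_valid_finding(f)]
```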




&lt;h2&gt;
  
  
  📦 Sample Rubric Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Rollback Plan Missing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reliability"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"confidence_threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"likely"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"required_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Explicit rollback steps documented"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edge Case Test Missing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"testing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"confidence_threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"certain"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"required_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Add test covering null and boundary inputs"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧭 Why This Works (Engineering Psychology)
&lt;/h2&gt;

&lt;p&gt;Developers ignore vague feedback.&lt;/p&gt;

&lt;p&gt;They respond to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Severity&lt;/li&gt;
&lt;li&gt;Production impact&lt;/li&gt;
&lt;li&gt;Explicit required actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By aligning AI output with how SRE teams think during incidents, we shifted code review from “opinion discussion” to “risk mitigation workflow.”&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Final Takeaway
&lt;/h2&gt;

&lt;p&gt;AI in CI is not about automation.&lt;/p&gt;

&lt;p&gt;It is about structured risk visibility at scale.&lt;/p&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Force classification&lt;/li&gt;
&lt;li&gt;Enforce severity&lt;/li&gt;
&lt;li&gt;Include confidence&lt;/li&gt;
&lt;li&gt;Require action&lt;/li&gt;
&lt;li&gt;Validate schema&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You transform AI from a novelty into a reliability multiplier.&lt;/p&gt;

&lt;p&gt;The difference isn’t the model.&lt;/p&gt;

&lt;p&gt;It’s the framework around it.&lt;/p&gt;




&lt;p&gt;What reliability rule would you add to this rubric to prevent your most painful incident?&lt;/p&gt;

&lt;p&gt;Drop it below. I’ll expand this framework in a follow-up post.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>sre</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>How I Reduced Production Incidents as a Senior SRE (Without Slowing Releases)</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Thu, 29 Jan 2026 22:54:09 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/how-i-reduced-production-incidents-as-a-senior-sre-without-slowing-releases-49kb</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/how-i-reduced-production-incidents-as-a-senior-sre-without-slowing-releases-49kb</guid>
      <description>&lt;h2&gt;
  
  
  Why reliability work fails in many teams
&lt;/h2&gt;

&lt;p&gt;Most teams try to improve reliability by adding more monitoring or writing longer runbooks. That usually increases operational overhead without reducing incidents.&lt;/p&gt;

&lt;p&gt;Real reliability improvements come from making change delivery predictable, alerts actionable, and incident response repeatable.&lt;/p&gt;

&lt;p&gt;This article explains the practical steps I used as a Senior Site Reliability Engineer to reduce production incidents &lt;strong&gt;without slowing release velocity&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Establish a reliability baseline
&lt;/h2&gt;

&lt;p&gt;Before fixing anything, I standardized how reliability was measured so decisions were driven by data, not opinions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;What it tells you&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Change failure rate&lt;/td&gt;
&lt;td&gt;How often deployments cause incidents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTTR (p50 / p90)&lt;/td&gt;
&lt;td&gt;How quickly the system recovers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SEV-2+ incidents&lt;/td&gt;
&lt;td&gt;Overall production stability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alert volume&lt;/td&gt;
&lt;td&gt;Signal-to-noise for on-call engineers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error budget burn&lt;/td&gt;
&lt;td&gt;User-visible reliability impact&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y5a6t4jdvj9ufg4cb71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y5a6t4jdvj9ufg4cb71.png" alt=" " width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Make deployments safe by default
&lt;/h2&gt;

&lt;p&gt;Most incidents originate from change. Instead of reducing deployments, I focused on reducing deployment risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  What changed
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Progressive delivery (canary → staged → full rollout)&lt;/li&gt;
&lt;li&gt;Health gates on error rate and latency&lt;/li&gt;
&lt;li&gt;Automatic rollback on failed gates&lt;/li&gt;
&lt;li&gt;Consistent release validation across services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach allowed frequent deployments while dramatically lowering failure impact.&lt;/p&gt;
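&lt;p&gt;As an illustration of a health gate (metric names and tolerances are examples, not our production values), the promote-or-rollback decision can be expressed as a pure function over canary and baseline metrics:&lt;/p&gt;

```python
def gate_passes(canary: dict, baseline: dict) -> bool:
    """True when canary metrics stay within tolerance of the stable baseline."""
    error_ok = baseline["error_rate"] * 1.1 + 0.001 >= canary["error_rate"]
    latency_ok = baseline["p99_ms"] * 1.2 >= canary["p99_ms"]
    return error_ok and latency_ok

def next_step(canary: dict, baseline: dict) -> str:
    """Health gate decision: promote the rollout or roll back automatically."""
    if gate_passes(canary, baseline):
        return "promote"
    return "rollback"

print(next_step({"error_rate": 0.002, "p99_ms": 240},
                {"error_rate": 0.002, "p99_ms": 230}))
```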

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wk8s7waea9rl6cugnf8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wk8s7waea9rl6cugnf8.png" alt=" " width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Replace alert noise with SLO-based paging
&lt;/h2&gt;

&lt;p&gt;Noisy alerts train engineers to ignore production signals. I enforced a simple rule:&lt;/p&gt;

&lt;p&gt;If an alert doesn’t require human action, it should not page.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alerting improvements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Removed non-actionable alerts&lt;/li&gt;
&lt;li&gt;Converted threshold alerts to SLO burn-rate alerts&lt;/li&gt;
&lt;li&gt;Required ownership and runbooks for every page&lt;/li&gt;
&lt;li&gt;Standardized severity definitions (SEV-1 to SEV-3)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduced on-call fatigue while improving detection of real user impact.&lt;/p&gt;
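&lt;p&gt;For readers new to burn-rate alerting: burn rate is the observed error ratio divided by the error budget the SLO allows. The sketch below uses the commonly cited multiwindow pattern from the Google SRE workbook, where a burn rate above 14.4 means roughly 2% of a 30-day budget is consumed in one hour. The windows and thresholds are tunable; the numbers here are illustrative.&lt;/p&gt;

```python
# Burn rate = observed error ratio divided by the error budget implied by
# the SLO. A 99.9% SLO leaves a 0.1% budget; burn rate 1.0 spends that
# budget exactly over the SLO window.
def burn_rate(error_ratio: float, slo: float) -> float:
    budget = 1.0 - slo
    return error_ratio / budget

def should_page(short_window_errors: float, long_window_errors: float,
                slo: float = 0.999) -> bool:
    """Page only on a fast burn confirmed in both a short and a long window."""
    fast = burn_rate(short_window_errors, slo) > 14.4
    sustained = burn_rate(long_window_errors, slo) > 14.4
    return fast and sustained

print(should_page(0.02, 0.016))
```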

&lt;h2&gt;
  
  
  Reduce MTTR with operational runbooks
&lt;/h2&gt;

&lt;p&gt;Incidents are time-critical. Long documentation does not help during outages.&lt;/p&gt;

&lt;p&gt;I rewrote runbooks to focus on the &lt;strong&gt;first 10 minutes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symptoms to confirm&lt;/li&gt;
&lt;li&gt;Likely root causes&lt;/li&gt;
&lt;li&gt;Safe mitigation steps&lt;/li&gt;
&lt;li&gt;Validation checks&lt;/li&gt;
&lt;li&gt;Escalation path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This significantly improved recovery speed and on-call confidence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotixk7ukz5mgcfyu4tqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotixk7ukz5mgcfyu4tqi.png" alt=" " width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Make incidents produce engineering improvements
&lt;/h2&gt;

&lt;p&gt;Every incident must result in a system change — not just documentation.&lt;/p&gt;

&lt;p&gt;Effective post-incident actions included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment guardrails&lt;/li&gt;
&lt;li&gt;Automated tests&lt;/li&gt;
&lt;li&gt;Capacity limits&lt;/li&gt;
&lt;li&gt;Retry and timeout tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If an action item does not change the system, it does not prevent recurrence.&lt;/p&gt;
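&lt;p&gt;As one example of a system-level action item, retry and timeout tuning can be made explicit and reviewable in code. A minimal sketch of a capped exponential backoff schedule (the base, factor, and cap values here are illustrative):&lt;/p&gt;

```python
def backoff_schedule(base=0.2, factor=2.0, cap=5.0, attempts=5):
    """Exponential backoff delays in seconds, capped so retries
    cannot pile up indefinitely against a struggling dependency."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= factor
    return delays

# backoff_schedule() -> [0.2, 0.4, 0.8, 1.6, 3.2]
```

&lt;p&gt;Encoding the schedule this way means the tuning survives the incident review: it is a tested artifact, not a line in a document.&lt;/p&gt;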

&lt;h2&gt;
  
  
  What worked consistently
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Deployment safety delivers the highest reliability ROI&lt;/li&gt;
&lt;li&gt;A small set of actionable alerts outperforms a large volume of noisy ones&lt;/li&gt;
&lt;li&gt;SLOs align engineering work with user experience&lt;/li&gt;
&lt;li&gt;Short runbooks beat long documentation&lt;/li&gt;
&lt;li&gt;Reliability must scale beyond individual engineers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;Reliability is not about heroics or slowing teams down. It is about building systems that make the safe action the easy action.&lt;/p&gt;

&lt;p&gt;When reliability becomes part of delivery rather than an afterthought, both stability and velocity improve.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>software</category>
      <category>incident</category>
    </item>
    <item>
      <title>AI-Assisted Incident Triage in Large-Scale Cloud Systems: A Human-Centered Reliability Framework</title>
      <dc:creator>Ravi Teja Reddy Mandala</dc:creator>
      <pubDate>Thu, 29 Jan 2026 18:57:15 +0000</pubDate>
      <link>https://forem.com/ravi_teja_8b63d9205dc7a13/ai-assisted-incident-triage-in-large-scale-cloud-systems-a-human-centered-reliability-framework-3gp8</link>
      <guid>https://forem.com/ravi_teja_8b63d9205dc7a13/ai-assisted-incident-triage-in-large-scale-cloud-systems-a-human-centered-reliability-framework-3gp8</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As cloud infrastructures evolve toward extreme scale, incident response has transitioned from a primarily reactive engineering function to a core reliability discipline.&lt;/p&gt;

&lt;p&gt;Modern cloud incidents are rarely caused by single-point failures. Instead, they emerge from complex interactions between services, control planes, configuration systems, and external dependencies. In this environment, the central challenge of incident response is no longer detection, but interpretation.&lt;/p&gt;

&lt;p&gt;Artificial intelligence is increasingly proposed as a solution to this challenge. However, the most effective use of AI in incident management is not autonomous remediation, but the augmentation of human decision-making.&lt;/p&gt;

&lt;p&gt;This article presents a practical, production-informed framework for AI-assisted incident triage, grounded in large-scale cloud operations and real-world reliability constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Traditional Incident Response Breaks at Scale
&lt;/h2&gt;

&lt;p&gt;Traditional incident response models assume relatively isolated failure domains and linear root-cause analysis. These assumptions do not hold in modern cloud platforms operating across thousands of services and regions.&lt;/p&gt;

&lt;p&gt;Alert floods, noisy signals, and partial telemetry often overwhelm on-call engineers. Even well-instrumented systems struggle to provide actionable context during cascading failures. As system complexity grows, human operators are forced to reason under uncertainty, time pressure, and incomplete information.&lt;/p&gt;

&lt;p&gt;At scale, the limiting factor is not tooling or observability coverage, but cognitive load.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Role of AI: Augmentation, Not Automation
&lt;/h2&gt;

&lt;p&gt;AI systems are frequently positioned as autonomous responders capable of diagnosing and resolving incidents end-to-end. In practice, fully autonomous remediation introduces unacceptable risk in high-stakes production environments.&lt;/p&gt;

&lt;p&gt;A more effective and realistic role for AI is decision support. AI can assist engineers by correlating signals, surfacing historical patterns, ranking hypotheses, and narrowing the search space during triage.&lt;/p&gt;

&lt;p&gt;When used correctly, AI reduces time-to-understanding rather than attempting to replace human judgment.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Architecture for AI-Assisted Triage
&lt;/h2&gt;

&lt;p&gt;A production-grade AI-assisted triage system should operate as a layered decision-support pipeline rather than a monolithic model.&lt;/p&gt;

&lt;p&gt;At a high level, the architecture consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Signal ingestion from metrics, logs, traces, and alerts
&lt;/li&gt;
&lt;li&gt;Context enrichment using topology, ownership, and recent changes
&lt;/li&gt;
&lt;li&gt;Hypothesis generation based on historical incidents and failure patterns
&lt;/li&gt;
&lt;li&gt;Confidence scoring and prioritization for human review
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach preserves human control while accelerating insight generation during critical incidents.&lt;/p&gt;
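&lt;p&gt;The layered pipeline can be sketched as plain function composition. This is an illustrative skeleton only; the type names, fields, and confidence values are hypothetical, and a real system would back hypothesis generation with learned models rather than a single rule:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Signal:
    source: str   # "metrics", "logs", "traces", "alerts"
    service: str
    detail: str

@dataclass
class Hypothesis:
    explanation: str
    confidence: float  # 0.0-1.0, kept conservative by design

def enrich(signals, recently_deployed_services):
    """Context enrichment: flag signals from services that changed recently."""
    return [(s, s.service in recently_deployed_services) for s in signals]

def generate_hypotheses(enriched):
    """Hypothesis generation and ranking for human review."""
    hyps = []
    for signal, recently_deployed in enriched:
        if recently_deployed:
            hyps.append(Hypothesis(
                explanation=f"Regression from recent deploy of {signal.service}",
                confidence=0.7))
        else:
            hyps.append(Hypothesis(
                explanation=f"Dependency or environmental issue near {signal.service}",
                confidence=0.3))
    return sorted(hyps, key=lambda h: h.confidence, reverse=True)

# The ranked output goes to a human; nothing is executed automatically.
signals = [Signal("metrics", "checkout", "p99 latency spike"),
           Signal("alerts", "payments", "error rate above threshold")]
ranked = generate_hypotheses(enrich(signals, recently_deployed_services={"checkout"}))
```

&lt;p&gt;The key structural property is that each layer narrows the search space while leaving the final decision, and all execution, with the on-call engineer.&lt;/p&gt;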

&lt;h2&gt;
  
  
  Signals, Context, and Correlation at Runtime
&lt;/h2&gt;

&lt;p&gt;Raw signals are rarely meaningful in isolation. Metric spikes, error rates, and latency anomalies must be interpreted within operational context.&lt;/p&gt;

&lt;p&gt;Effective triage systems correlate runtime signals with deployment events, configuration changes, dependency health, and blast radius estimation. AI models excel at identifying non-obvious relationships across these dimensions, especially under time pressure.&lt;/p&gt;

&lt;p&gt;The goal is not to identify a single root cause immediately, but to continuously refine the most plausible explanations as new data arrives.&lt;/p&gt;
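&lt;p&gt;One of the simplest correlations, anomaly onset against recent change events, can be sketched as a time-window filter. The event names and the 30-minute window below are illustrative assumptions:&lt;/p&gt;

```python
from datetime import datetime

def correlate(anomaly_time, change_events, window_minutes=30):
    """Return change events that landed shortly before the anomaly.

    A change preceding an anomaly inside the window is a candidate
    explanation to surface for review, not a confirmed root cause.
    """
    window_seconds = window_minutes * 60
    candidates = []
    for name, ts in change_events:
        gap = (anomaly_time - ts).total_seconds()
        # Keep only changes that happened before the anomaly, within the window.
        if gap >= 0 and window_seconds >= gap:
            candidates.append((name, ts))
    return candidates

anomaly = datetime(2026, 1, 29, 18, 0)
changes = [
    ("config push: rate-limits", datetime(2026, 1, 29, 17, 45)),
    ("deploy: search-service",   datetime(2026, 1, 29, 12, 10)),
]
candidates = correlate(anomaly, changes)
# Only the config push falls inside the 30-minute window.
```

&lt;p&gt;In production this filter would run over topology-aware change streams, but the principle is the same: shrink the hypothesis space before a human ever looks at it.&lt;/p&gt;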

&lt;h2&gt;
  
  
  Failure Modes and Guardrails
&lt;/h2&gt;

&lt;p&gt;AI-assisted systems introduce their own failure modes, including overconfidence, stale learning, and bias toward historically frequent issues.&lt;/p&gt;

&lt;p&gt;To mitigate these risks, guardrails are essential. These include human-in-the-loop validation, transparency in model reasoning, conservative confidence thresholds, and strict separation between recommendation and execution.&lt;/p&gt;

&lt;p&gt;AI should inform decisions, not make irreversible changes independently.&lt;/p&gt;
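&lt;p&gt;The recommendation/execution separation can be enforced structurally rather than by policy alone. A minimal sketch, with a hypothetical threshold value:&lt;/p&gt;

```python
RECOMMEND_THRESHOLD = 0.6   # below this, the suggestion is suppressed entirely

def present(hypothesis: str, confidence: float) -> dict:
    """Gate AI output: recommend to a human or stay silent. There is
    deliberately no code path here that executes a remediation."""
    if confidence >= RECOMMEND_THRESHOLD:
        return {
            "action": "recommend",
            "hypothesis": hypothesis,
            "confidence": round(confidence, 2),
            "requires_human_ack": True,  # strict recommendation/execution split
        }
    return {"action": "suppress", "reason": "confidence below threshold"}

out = present("roll back checkout deployment", 0.82)
```

&lt;p&gt;Because the gate can only recommend or suppress, overconfident or stale model output degrades to noise for a reviewer rather than to an irreversible production change.&lt;/p&gt;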

&lt;h2&gt;
  
  
  Production Lessons from Large-Scale Cloud Operations
&lt;/h2&gt;

&lt;p&gt;In real-world operations, the most valuable AI systems are those that respect operational realities: partial data, evolving architectures, and the need for fast, defensible decisions.&lt;/p&gt;

&lt;p&gt;Teams that successfully integrate AI into incident response focus on incremental adoption, continuous feedback, and tight integration with existing workflows rather than wholesale automation.&lt;/p&gt;

&lt;p&gt;The result is not fewer incidents, but faster understanding, reduced mean time to recovery, and more sustainable on-call practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As cloud systems continue to scale in complexity, incident response must evolve beyond manual triage and reactive tooling.&lt;/p&gt;

&lt;p&gt;AI-assisted incident triage offers a pragmatic path forward when applied as a cognitive amplifier rather than an autonomous actor. By augmenting human judgment with context-aware analysis and signal correlation, organizations can respond to incidents with greater speed, confidence, and resilience.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>sre</category>
      <category>ai</category>
      <category>distributedsystems</category>
    </item>
  </channel>
</rss>
