Forem: Ajay Devineni

Your Agent Acts Without Checking Your Error Budget — That's the Failure Mode Nobody Is Tracking

Ajay Devineni — Tue, 26 May 2026 17:34:07 +0000

Yesterday a piece came out that framed something I've been watching build across production environments for months.
There is a category of production incident that engineering teams are not tracking yet, because it doesn't fit any existing postmortem template. The agent initiated an action. The action was technically correct given the agent's context. The context was incomplete. The infrastructure cascaded. By the time the incident review happened, three teams were arguing about whether it was an agent failure or an infrastructure failure. (Kore.ai)
That argument happens because the two disciplines — SRE and autonomous agents — have never been formally connected at the decision layer.
Here's the connection I want to make explicit.
What Chaos Engineering Gets Right
Mature chaos engineering programs have a property that's easy to overlook because it's invisible when it's working. Before a human engineer initiates any experiment — a fault injection, a latency spike, a dependency kill — they make a judgment call: does this system have capacity to absorb a perturbation right now?
They check error budget burn rate. They look at whether upstream dependencies are stable. They assess whether the on-call team has bandwidth to respond if something goes wrong. They check whether there's a deploy in flight that makes this a bad time.
That judgment call is informal, often intuitive, and sometimes wrong. But it exists. It's the human-in-the-loop that decides whether the system is in a state to safely absorb autonomous action.
Agents don't make that call. They evaluate their task context, form a plan, and execute. The question "is right now a safe time for this action given the current reliability state of the system?" is not in their decision loop.
The agents delivering production value in 2026 share one defining property: bounded scope. The agent handles one domain, with a defined tool set, and explicitly refuses tasks outside that boundary. The boundary is what makes autonomous deployment safe. (GlobeNewswire)
Boundary on task scope is necessary. It's not sufficient. You also need a boundary on timing — a gate that checks whether the system's current reliability state can absorb what the agent is about to do.
The Pre-Action SRE Gate
I want to introduce a concrete pattern here: the Pre-Action SRE Gate — a check an agent runs against your existing SRE signals before executing any state-changing action.
The gate has three checks, all using metrics I've built out across this series:
Check 1 — Error Budget Headroom
Before acting, the agent queries current SLO error budget remaining for the services in its blast radius. If error budget is below threshold — the system is already burning faster than acceptable — the agent does not act autonomously. It escalates.
This is the chaos engineering judgment call, formalized as a programmatic check.
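As a rough sketch of where the error budget figure for Check 1 could come from: the function below derives the remaining budget from an SLO target and event counts for the current window. The function name and the event-count inputs are assumptions for illustration; your SLO tooling may already expose this number directly.

```python
def error_budget_remaining_pct(slo_target: float,
                               total_events: int,
                               failed_events: int) -> float:
    """Percent of the error budget still unspent in the current SLO window.

    Hypothetical helper: slo_target is a fraction such as 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_events <= 0:
        return 100.0  # no traffic yet, so nothing has been spent

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_events
    spent_fraction = failed_events / allowed_failures

    # Clamp at zero: an overspent budget is reported as 0% remaining.
    return max(0.0, (1.0 - spent_fraction) * 100.0)


# 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
remaining = error_budget_remaining_pct(0.999, 1_000_000, 850)
print(f"{remaining:.1f}% of error budget remaining")
```

With 850 of roughly 1,000 tolerated failures already spent, about 15% of the budget is left, which would fall below a 20% minimum and route the agent to escalation.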
Check 2 — AQDD State
Approval Queue Depth Drift tells you whether the human oversight layer is already backed up. If AQDD is elevated — meaning humans can't process approvals fast enough — autonomous action during that window means any mistake won't be caught in time. Agent holds.
Check 3 — HER Trend
If the agent's own Human Escalation Rate has been elevated in the recent window, it's operating outside its reliable envelope. Letting it take autonomous action in that state compounds the risk. Agent escalates.
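One way to get a "recent window" HER for Check 3 is a fixed-size rolling window over task outcomes. This is a minimal sketch; the `RollingHER` class and the window size are assumptions, not part of the series' code.

```python
from collections import deque


class RollingHER:
    """Human Escalation Rate over the most recent N tasks (hypothetical helper)."""

    def __init__(self, window: int = 50):
        # deque with maxlen drops the oldest outcome automatically
        self._outcomes = deque(maxlen=window)

    def record(self, escalated: bool) -> None:
        """Record one finished task: True if it was escalated to a human."""
        self._outcomes.append(bool(escalated))

    @property
    def rate_pct(self) -> float:
        """Escalated tasks as a percentage of tasks in the window."""
        if not self._outcomes:
            return 0.0
        return 100.0 * sum(self._outcomes) / len(self._outcomes)


her = RollingHER(window=4)
for escalated in [True, False, False, False, True]:
    her.record(escalated)
print(her.rate_pct)  # only the last four outcomes count
```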
None of these metrics are new. They're from Post 4 and Post 10 of this series. What's new is using them as gates before action, not just as observability signals after the fact.
# agentsre/pre_action_gate.py

from dataclasses import dataclass
from typing import Optional
from datetime import datetime, timezone
import json


@dataclass
class SREGateResult:
    """
    Result of a Pre-Action SRE Gate check.

    If approved is False, the agent must not proceed with
    autonomous action — escalate to human owner per ARO record.

    Attributes:
        approved: Whether autonomous action is cleared
        blocking_check: Which check blocked (if any)
        error_budget_pct: Current error budget remaining (0-100)
        aqdd_depth: Current approval queue depth
        her_trend: Recent HER rate (0-100)
        recommendation: What the agent should do
        checked_at: Timestamp of gate check
    """
    approved: bool
    blocking_check: Optional[str]
    error_budget_pct: float
    aqdd_depth: int
    her_trend: float
    recommendation: str
    checked_at: str

class PreActionSREGate:
    """
    Pre-Action SRE Gate — checks your SRE signal state before
    an agent executes any autonomous write, remediation, scale
    event, or config change.

    This is the chaos engineering judgment call, formalized.
    A human engineer checks these things before running an experiment.
    Your agent should check them before acting autonomously.

    Thresholds should be calibrated per agent and task class
    in shadow mode — same protocol as HER and RTD baselines.
    """

    def __init__(self,
                 error_budget_min_pct: float = 20.0,
                 aqdd_max_depth: int = 3,
                 her_max_trend_pct: float = 15.0):
        """
        Args:
            error_budget_min_pct: Minimum error budget % required
                for autonomous action. Below this = escalate.
                Default 20% — agent should not consume budget
                that's already critically low.
            aqdd_max_depth: Max approval queue depth before
                autonomous action is blocked. Above this,
                humans can't course-correct fast enough.
            her_max_trend_pct: Max recent HER rate before
                autonomous action is blocked. Elevated HER
                means agent is already outside reliable envelope.
        """
        self.error_budget_min_pct = error_budget_min_pct
        self.aqdd_max_depth = aqdd_max_depth
        self.her_max_trend_pct = her_max_trend_pct

    def check(self,
              agent_id: str,
              intended_action: str,
              error_budget_pct: float,
              aqdd_depth: int,
              her_trend_pct: float) -> SREGateResult:
        """
        Run pre-action SRE gate check.

        Call this before any autonomous state-changing action.
        If result.approved is False — escalate, do not act.

        Args:
            agent_id: Agent requesting action clearance
            intended_action: Description of what agent plans to do
            error_budget_pct: Current error budget remaining (0-100)
            aqdd_depth: Current approval queue depth
            her_trend_pct: Agent's recent HER rate (0-100)

        Returns:
            SREGateResult with approval decision and reasoning
        """
        # Check 1: Error budget headroom
        if error_budget_pct < self.error_budget_min_pct:
            return SREGateResult(
                approved=False,
                blocking_check="error_budget",
                error_budget_pct=error_budget_pct,
                aqdd_depth=aqdd_depth,
                her_trend=her_trend_pct,
                recommendation=(
                    f"Error budget at {error_budget_pct:.1f}% — "
                    f"below {self.error_budget_min_pct}% minimum. "
                    "Escalate to human owner. Do not act autonomously."
                ),
                checked_at=datetime.now(timezone.utc).isoformat()
            )

        # Check 2: Approval queue state
        if aqdd_depth > self.aqdd_max_depth:
            return SREGateResult(
                approved=False,
                blocking_check="aqdd",
                error_budget_pct=error_budget_pct,
                aqdd_depth=aqdd_depth,
                her_trend=her_trend_pct,
                recommendation=(
                    f"Approval queue depth {aqdd_depth} exceeds "
                    f"maximum {self.aqdd_max_depth}. "
                    "Human oversight is backed up — autonomous action "
                    "cannot be safely course-corrected. Hold."
                ),
                checked_at=datetime.now(timezone.utc).isoformat()
            )

        # Check 3: Agent's own HER trend
        if her_trend_pct > self.her_max_trend_pct:
            return SREGateResult(
                approved=False,
                blocking_check="her_trend",
                error_budget_pct=error_budget_pct,
                aqdd_depth=aqdd_depth,
                her_trend=her_trend_pct,
                recommendation=(
                    f"HER at {her_trend_pct:.1f}% — "
                    f"above {self.her_max_trend_pct}% threshold. "
                    "Agent is operating outside reliable envelope. "
                    "Escalate rather than act autonomously."
                ),
                checked_at=datetime.now(timezone.utc).isoformat()
            )

        # All checks passed
        return SREGateResult(
            approved=True,
            blocking_check=None,
            error_budget_pct=error_budget_pct,
            aqdd_depth=aqdd_depth,
            her_trend=her_trend_pct,
            recommendation="Autonomous action cleared. Proceed within blast radius.",
            checked_at=datetime.now(timezone.utc).isoformat()
        )

    def to_audit_log(self, agent_id: str,
                     intended_action: str,
                     result: SREGateResult) -> dict:
        """
        Structured audit log entry for every gate check.
        Every autonomous action attempt — approved or blocked —
        should be logged. This is your agent action audit trail.
        """
        return {
            "trace_type": "pre_action_gate",
            "agent_id": agent_id,
            "intended_action": intended_action,
            "gate_approved": result.approved,
            "blocking_check": result.blocking_check,
            "sre_signals": {
                "error_budget_pct": result.error_budget_pct,
                "aqdd_depth": result.aqdd_depth,
                "her_trend_pct": result.her_trend,
            },
            "recommendation": result.recommendation,
            "checked_at": result.checked_at,
        }
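
To show where a gate like this sits in an agent's control flow, here is a small self-contained sketch. The names `run_guarded`, `gate_check`, and `escalate` are hypothetical and not part of agentsre; the gate is reduced to a callable that returns an approval flag and a reason so the example runs on its own.

```python
from typing import Callable, List, Tuple

# A gate check returns (approved, reason). Hypothetical shape for this sketch.
GateCheck = Callable[[], Tuple[bool, str]]


def run_guarded(action: Callable[[], str],
                gate_check: GateCheck,
                escalate: Callable[[str], None]) -> str:
    """Run a state-changing action only if the pre-action gate approves it."""
    approved, reason = gate_check()
    if not approved:
        # Blocked: hand the decision to a human instead of acting.
        escalate(reason)
        return "escalated"
    return action()


escalations: List[str] = []

blocked = run_guarded(
    action=lambda: "scaled payment-service to 6 replicas",
    gate_check=lambda: (False, "error budget at 12.0%, below 20% minimum"),
    escalate=escalations.append,
)
cleared = run_guarded(
    action=lambda: "scaled payment-service to 6 replicas",
    gate_check=lambda: (True, "all checks passed"),
    escalate=escalations.append,
)
print(blocked, "|", cleared)
```

The point of the wrapper is that the action callable is never invoked on a blocked gate, so the escalation path is the only side effect.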

How This Connects to the Full Arc
Post 4 introduced DQR, TIE, HER, AQDD as observability SLIs — things you watch.
Post 10 introduced ARO — who owns the agent when those SLIs breach.
Post 11 introduced RTD — the reasoning observability layer.
Post 12 introduced CUR — context budget as a reliability ceiling.
This post introduces the Pre-Action SRE Gate — where all of those signals become decision inputs rather than observability outputs. The agent reads your SRE state before acting, not just after.
Resilience requires explicit investment in circuit breakers, graceful degradation, and clear failure modes that preserve system integrity. Teams building agents must invest in resilience infrastructure before pushing to higher-criticality workloads. (SourceForge)
The Pre-Action Gate is that infrastructure. It's your agent's circuit breaker — not on retry loops or cost, but on system-level reliability state.
The Postmortem Template Gap
79% of organizations now have AI agents in production. Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, or inadequate risk controls. The incidents happening in that gap don't fit existing postmortem templates because current templates ask: what changed? who deployed? what failed? (Kore.ai)
They don't ask: what was the error budget state when the agent acted? Was AQDD elevated, meaning the approval layer was already overwhelmed? Had the agent's HER been trending up, meaning it was already in unreliable territory?
Those questions need to be in your postmortem template. Add a section: Agent Pre-Action State — error budget at time of action, AQDD depth, HER trend. If your postmortem can't answer those three questions, you don't have the data to prevent the same incident from happening again.
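If gate checks are logged in the audit shape shown earlier (`agent_id`, `checked_at`, `sre_signals`), the Agent Pre-Action State section can be filled in from the logs. The sketch below assumes that entry shape and ISO 8601 timestamps; `pre_action_state` is a hypothetical helper name.

```python
from datetime import datetime
from typing import List, Optional


def pre_action_state(audit_entries: List[dict],
                     agent_id: str,
                     action_time: str) -> Optional[dict]:
    """Return the latest gate check for agent_id at or before action_time."""
    cutoff = datetime.fromisoformat(action_time)
    candidates = [
        entry for entry in audit_entries
        if entry["agent_id"] == agent_id
        and datetime.fromisoformat(entry["checked_at"]) <= cutoff
    ]
    if not candidates:
        return None  # no gate data: the postmortem cannot answer the questions

    latest = max(candidates,
                 key=lambda entry: datetime.fromisoformat(entry["checked_at"]))
    signals = latest["sre_signals"]
    return {
        "error_budget_pct": signals["error_budget_pct"],
        "aqdd_depth": signals["aqdd_depth"],
        "her_trend_pct": signals["her_trend_pct"],
        "gate_approved": latest["gate_approved"],
        "checked_at": latest["checked_at"],
    }


entries = [
    {"agent_id": "payment-1", "checked_at": "2026-05-26T10:00:00+00:00",
     "gate_approved": True,
     "sre_signals": {"error_budget_pct": 40.0, "aqdd_depth": 1, "her_trend_pct": 3.0}},
    {"agent_id": "payment-1", "checked_at": "2026-05-26T11:00:00+00:00",
     "gate_approved": True,
     "sre_signals": {"error_budget_pct": 22.0, "aqdd_depth": 3, "her_trend_pct": 9.0}},
    {"agent_id": "payment-1", "checked_at": "2026-05-26T12:00:00+00:00",
     "gate_approved": False,
     "sre_signals": {"error_budget_pct": 8.0, "aqdd_depth": 7, "her_trend_pct": 18.0}},
]
state = pre_action_state(entries, "payment-1", "2026-05-26T11:30:00+00:00")
print(state)
```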
The code is in agentsre/pre_action_gate.py on GitHub. MIT licensed, zero external dependencies.
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer
github.com/Ajay150313/agentsre

Why Your AI Agent Monitoring is Wrong (And How to Fix It)

Ajay Devineni — Mon, 25 May 2026 11:35:24 +0000

As I discussed in my SLO Design article, traditional reliability metrics fail for agentic AI systems. Now let's look at how to actually implement semantic monitoring in production.
Your AI agent is running in production.
HTTP 200. Uptime 99.9%. All dashboards are green.
But it's making the wrong decision 30% of the time.
Your monitoring won't tell you.
The Gap
I spent six months figuring this out the hard way. Traditional SRE monitoring measures infrastructure. Network latency. Error rates. Uptime. It's designed for services that crash when they break. But agents don't crash. They degrade. Slowly. Silently.
An agent can be:

94% accurate (still 94%)
But losing confidence (0.92 to 0.41)
Compensating by calling tools 3x more (1.1x to 3.1x)
While humans reject more of its output (1% to 19%)
As work piles up waiting for approval (8 to 340 items)

Your monitoring sees "everything is fine."
You see $2M impact by the time you notice.
What We Actually Need to Measure
Not infrastructure metrics. Semantic metrics.
Four things:
Decision Quality Rate (DQR)
Is the agent picking the right tool?
Healthy: 92%+
Threshold for action: <85%
Tool Invocation Efficiency (TIE)
Is it over-compensating by calling tools more than normal?
Healthy: 1.0-1.2x baseline
Threshold for action: >1.5x
Human Escalation Rate (HER)
Are humans rejecting its decisions?
Healthy: <2%
Threshold for action: >5%
Approval Queue Depth Drift (AQDD)
Is work piling up waiting for approval?
Healthy: <20 pending
Threshold for action: >50 pending
When any of these drifts past its threshold, semantic failure may be roughly 48 hours away.
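The four thresholds above can be expressed as a small classifier. This is a sketch, not the agentsre implementation: the function names are hypothetical, and the "watch" band for values between the healthy range and the action threshold is an assumption, since the post only names the two ends.

```python
def _band(is_healthy: bool, needs_action: bool) -> str:
    """Map a metric to 'healthy', 'watch' (assumed middle band), or 'action'."""
    if needs_action:
        return "action"
    return "healthy" if is_healthy else "watch"


def classify_slis(dqr_pct: float, tie_ratio: float,
                  her_pct: float, aqdd_pending: int) -> dict:
    """Classify the four semantic SLIs against the thresholds listed above."""
    return {
        "DQR": _band(dqr_pct >= 92.0, dqr_pct < 85.0),
        "TIE": _band(tie_ratio <= 1.2, tie_ratio > 1.5),
        "HER": _band(her_pct < 2.0, her_pct > 5.0),
        "AQDD": _band(aqdd_pending < 20, aqdd_pending > 50),
    }


early = classify_slis(dqr_pct=88.0, tie_ratio=1.4, her_pct=1.0, aqdd_pending=8)
late = classify_slis(dqr_pct=62.0, tie_ratio=3.1, her_pct=19.0, aqdd_pending=340)
print(early)
print(late)
```

The first call uses early-drift values where nothing has crossed an action threshold yet; the second uses fully degraded values where every metric has.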
Real Scenario
Tuesday 2pm: Agent starts degrading. DQR drops from 94% to 88%. TIE increases from 1.1x to 1.4x. Nothing alarming yet by traditional metrics.
Your infrastructure monitoring stays green.
Thursday 10am: DQR at 62%. TIE at 3.1x. Queue at 340 items.
Your first alert finally fires - from your infrastructure monitoring noticing error rates creeping up.
You've just lost 40+ hours of bad decisions.
With semantic SLIs, you would have known Tuesday at 2:15pm.
How We Built It
We built a semantic SLI monitoring system that:

Tracks what matters - DQR, TIE, HER, AQDD (not uptime)
Detects degradation early - 48 hours before traditional SLIs
Suggests remediation - Not just "something's wrong"
Automates response - Progressive autonomy constraints

When degradation is detected:

Agent autonomy automatically constrained (FULL → GUIDED → SUPERVISED → BLOCKED)
Slack notification sent with context
Remediation steps suggested (prioritized by success rate)
Everything tracked for audit and learning
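
The progressive constraint ladder can be sketched as an ordered enum. The four level names come from the list above; the step-down rule (one level per breached SLI) is an assumption for illustration and may not match the policy agentsre actually applies.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """Autonomy levels, ordered from least to most constrained."""
    FULL = 0
    GUIDED = 1
    SUPERVISED = 2
    BLOCKED = 3


def constrain(current: Autonomy, breached_slis: int) -> Autonomy:
    """Step down one level per breached SLI, never past BLOCKED (assumed rule)."""
    steps = max(0, breached_slis)
    return Autonomy(min(int(current) + steps, int(Autonomy.BLOCKED)))


print(constrain(Autonomy.FULL, 2).name)    # two breaches from FULL
print(constrain(Autonomy.GUIDED, 4).name)  # clamps at the most constrained level
```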

Code Example
from agentsre.orchestration import FintechSREOrchestrator, AgentRole, AlertManager

# Initialize orchestrator
orch = FintechSREOrchestrator()
orch.register_agent("payment-1", AgentRole.PAYMENT_PROCESSOR)

# Initialize alerts
alerts = AlertManager()

def on_critical_alert(alert_dict):
    # send_to_slack is your own notification helper; it is not defined here
    send_to_slack(alert_dict)

alerts.slack_handler = on_critical_alert

# Update metrics as agent runs
orch.update_metrics(
    agent_id="payment-1",
    dqr=62.0,         # Decision quality degraded
    tie=2.8,          # Tool calls increased
    her=15.0,         # Escalations up
    aqd=180,          # Queue growing
    confidence=0.42,
    cost=0.0003
)

# Create alert with remediation suggestions
alert = alerts.create_alert(
    agent_id="payment-1",
    reason="Semantic degradation detected",
    triggered_metrics=["DQR", "TIE", "HER"],
    current_values={
        "dqr": 62.0,
        "tie": 2.8,
        "her": 15.0,
        "aqd": 180
    }
)

# Get remediation steps
for step in alert.suggested_remediations[:3]:
    print(f"→ {step.action} ({step.estimated_time_minutes}min)")

Output:
→ Review latest 10 agent decisions - identify pattern (15min)
→ Check upstream service - likely returning bad data (10min)
→ Agent over-compensating - check confidence scores (10min)
What This Means for SRE
You're not just detecting problems. You're understanding them.
Instead of:

"Error rate is high"
"Latency is up"
"Something's wrong"

You get:

"Agent decision quality dropped 15%, tool calls increased 2.8x, humans rejecting 15% of output, 180 items pending approval"
Suggested fix: Check upstream service (likely corrupting data)
Severity: CRITICAL

That's the difference between reactive and proactive reliability.
Open Source
Built all this open source. MIT licensed.
Tested in production at scale. Works with LangChain, CrewAI, Bedrock.
GitHub: https://github.com/Ajay150313/agentsre
For Your Team
If you're running agents in production, you probably have this problem too. You just don't know it yet.
Try semantic SLIs. If you catch something you didn't know was degrading (most teams do), you'll know it was worth it.
The cost of not knowing? Sometimes it's $2M.

The Context Window Is RAM — Why Your Agent's SLIs Are Telling You It's Full

Ajay Devineni — Fri, 22 May 2026 02:18:21 +0000

The Microsoft team that built the Azure SRE Agent published something in January that I keep coming back to.
Six months into building it, they realized they weren't building an SRE agent. They were building a context engineering system that happens to do site reliability engineering. Better models were table stakes, but what moved the needle was what they controlled: disciplined context management. (Kore.ai)
That framing is exactly right. And it has a reliability implication that I haven't seen anyone write about directly.
The Problem
Your agent's context window is volatile working memory. Fast, expensive, and non-persistent. It's RAM, not storage. When the session ends, it's gone. When it fills up, quality degrades — not linearly, but in ways that are hard to predict and easy to miss.
As you fill the context window, model quality drops non-linearly. "Lost in the middle," "not adhering to my instructions," and long-context degradation show up well before the advertised token limits. More tokens don't just cost latency; they quietly erode accuracy. (Kore.ai)
That quiet erosion is the reliability failure mode. It doesn't throw an exception. It doesn't spike your error rate. Your agent keeps running. It just makes progressively worse decisions as the context fills.
And here's the part I want to be specific about: you already have the SLIs to catch this. You just haven't connected them to context state yet.
What Context Overflow Looks Like in Your SLIs
When an agent's context fills beyond its effective working range, three things happen in order:
DQR (Decision Quality Rate) drops first. The agent's decisions get worse because early instructions are now competing with thousands of tokens of recent tool output. An instruction from turn 3 gets buried under content that arrived after it. The agent isn't ignoring it; it's attending more reliably to recent content as the session grows. This is a passive decay process, not a model bug. (incident.io)
RTD (Reasoning Trace Depth) climbs next. The agent re-plans more because its earlier context — what it already established about the problem — is partially decayed. It's not re-planning because something changed. It's re-planning because it partially forgot what it already figured out.
TIE (Tool Invocation Efficiency) degrades last. The agent starts calling tools to reconstruct context it already had. It queries the same data sources again. It re-fetches runbooks it already read. Tool call count per task climbs above baseline while task quality continues to fall.
By the time TIE is visibly elevated, you're already well into the degradation window. DQR was the earlier signal. And DQR dropping in a long-running session, without any external trigger, is your context overflow signature.
The Architecture Fix
Mem0's 2026 benchmarks quantify the difference clearly: full-context baseline (everything packed into the window) scored 72.9% accuracy using 26,000 tokens per query at 17 second p95 latency. A two-layer memory architecture scored 91.6% accuracy using under 7,000 tokens at 1.4 second p95 latency. That's an 18.7-point accuracy improvement while using roughly 4x fewer tokens and cutting p95 latency by about 92%. (Yahoo Finance)
The two-layer architecture is straightforward:
Working memory (context window): Only what's needed for the current decision. Active task state, recent tool results, current instructions. Managed actively — compressed, summarized, or paged out as the session grows.
Persistent memory (external store): Facts that persist across decisions and sessions. User preferences, established system state, prior investigation findings, runbook contents. Fetched into context when relevant, not kept resident the whole time.
The discipline is knowing what belongs in each layer and managing the boundary actively.
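A toy version of the two-layer split makes the boundary concrete. Everything in this sketch is an assumption for illustration: the `TwoLayerMemory` class, the whitespace-based token estimate, and the oldest-first eviction policy. A real system would use the model's tokenizer and summarize rather than move entries verbatim.

```python
from collections import OrderedDict
from typing import Optional


class TwoLayerMemory:
    """Working memory with a token budget, backed by a persistent store."""

    def __init__(self, working_budget_tokens: int):
        self.working_budget_tokens = working_budget_tokens
        self.working: "OrderedDict[str, str]" = OrderedDict()  # oldest first
        self.persistent: dict = {}                              # external store stand-in

    @staticmethod
    def _tokens(text: str) -> int:
        # Crude estimate: one token per whitespace-separated word.
        return len(text.split())

    def working_tokens(self) -> int:
        return sum(self._tokens(text) for text in self.working.values())

    def remember(self, key: str, text: str) -> None:
        """Add to working memory, paging the oldest entries out when over budget."""
        self.working[key] = text
        self.working.move_to_end(key)
        # Keep at least the newest entry resident even if it alone is over budget.
        while self.working_tokens() > self.working_budget_tokens and len(self.working) > 1:
            old_key, old_text = self.working.popitem(last=False)
            self.persistent[old_key] = old_text

    def recall(self, key: str) -> Optional[str]:
        """Read from working memory, or page the entry back in from the store."""
        if key in self.working:
            return self.working[key]
        if key in self.persistent:
            text = self.persistent[key]
            self.remember(key, text)
            return text
        return None


mem = TwoLayerMemory(working_budget_tokens=5)
mem.remember("runbook", "restart the payment service")
mem.remember("result", "latency is high")      # pushes the runbook out
print(list(mem.working), list(mem.persistent))
print(mem.recall("runbook"))                   # paged back in on demand
```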
Connecting This to Your Production Readiness Checklist
Before a long-running agent goes to production, two questions need answers:
What is the expected context budget for a typical session? Not the model's maximum. The budget at which you've measured DQR starting to degrade for this specific agent on this specific task class. That number is your operational ceiling, not the advertised token limit.
What happens when the agent approaches that ceiling? Does it compress? Summarize and page out? Escalate to human? Or does it silently continue with degrading accuracy until something downstream notices?
If the answer to the second question is "it keeps going," that's your reliability gap. The context ceiling needs the same circuit breaker thinking as your token budget ceiling from the cost post.
# agentsre/context_budget.py

from dataclasses import dataclass, field
from typing import Optional
import json
from datetime import datetime, timezone


@dataclass
class ContextBudgetTracker:
    """
    Track context utilization against operational DQR ceiling.

    The model's advertised token limit is not your operational limit.
    Your operational limit is the token count at which DQR starts
    to degrade for this agent on this task class. Establish that
    baseline in shadow mode. Set your ceiling below it.

    Attributes:
        agent_id: Agent being tracked
        task_class: Task type (DQR ceiling varies by task complexity)
        operational_ceiling_tokens: Tokens at which DQR degrades
            for this agent/task combination. NOT the model's max.
        warning_threshold_pct: Fraction of ceiling triggering warning
        current_tokens: Current context utilization
    """
    agent_id: str
    task_class: str
    operational_ceiling_tokens: int
    warning_threshold_pct: float = 0.75
    current_tokens: int = 0
    session_id: str = ""
    compression_events: int = 0

    @property
    def utilization_pct(self) -> float:
        """Current context utilization as fraction of operational ceiling."""
        return self.current_tokens / self.operational_ceiling_tokens

    @property
    def budget_status(self) -> str:
        """
        OK — within safe operating range
        WARNING — approaching DQR degradation ceiling
        CRITICAL — at or above operational ceiling, DQR degrading
        """
        u = self.utilization_pct
        if u < self.warning_threshold_pct:
            return "OK"
        elif u < 1.0:
            return "WARNING"
        return "CRITICAL"

    def update(self, current_tokens: int) -> dict:
        """
        Update current context utilization and return status record.
        Call this after each tool call or model response.

        Returns status record for logging to CloudWatch / Datadog.
        """
        self.current_tokens = current_tokens
        record = {
            "agent_id": self.agent_id,
            "session_id": self.session_id,
            "task_class": self.task_class,
            "current_tokens": self.current_tokens,
            "operational_ceiling": self.operational_ceiling_tokens,
            "utilization_pct": round(self.utilization_pct, 3),
            "budget_status": self.budget_status,
            "compression_events": self.compression_events,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        return record

    def record_compression(self) -> None:
        """Call when context compression or summarization fires."""
        self.compression_events += 1

    def should_compress(self) -> bool:
        """True when context is approaching DQR degradation ceiling."""
        return self.utilization_pct >= self.warning_threshold_pct

    def should_escalate(self) -> bool:
        """
        True when context is at or above operational ceiling.
        At this point DQR is actively degrading.
        Escalate to human or terminate session cleanly.
        """
        return self.utilization_pct >= 1.0

The Practical Baseline Protocol
Before you can set an operational context ceiling, you need to know where DQR actually starts to degrade for your specific agent on your specific task class. The steps:

Run the agent in shadow mode on a representative sample of tasks.
Record DQR at 25%, 50%, 75%, and 100% of the model's advertised context limit.
Find the inflection point, where DQR starts dropping.
Set your operational ceiling at 80% of that inflection point. That's your warning threshold.
At the ceiling, trigger compression or escalation, not continuation.

This is the same baseline protocol as HER and RTD. Thirty days of shadow mode, measure the metric, set the threshold. The only difference is that context budget degradation is session-scoped rather than task-scoped.
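The baseline steps can be turned into a small calculation over shadow-mode samples. This is a sketch under stated assumptions: `operational_ceiling` is a hypothetical helper, the inflection point is approximated as the first sampled token count where DQR falls more than a tolerance below the smallest-context reading, and the sample numbers are made up for illustration.

```python
from typing import List, Optional, Tuple


def operational_ceiling(samples: List[Tuple[int, float]],
                        tolerance_pts: float = 2.0,
                        safety_pct: int = 80) -> Optional[int]:
    """Derive a context ceiling from (context_tokens, dqr_pct) samples.

    Returns safety_pct percent of the first token count where DQR has dropped
    more than tolerance_pts below the baseline, or None if no drop was seen.
    """
    if not samples:
        raise ValueError("need at least one (tokens, dqr) sample")
    ordered = sorted(samples)
    baseline_dqr = ordered[0][1]  # DQR at the smallest measured context size
    for tokens, dqr in ordered:
        if baseline_dqr - dqr > tolerance_pts:
            # Integer arithmetic keeps the ceiling an exact token count.
            return tokens * safety_pct // 100
    return None


# Illustrative shadow-mode readings at 25/50/75/100% of a 128k-token limit.
shadow = [(32_000, 93.0), (64_000, 92.5), (96_000, 88.0), (128_000, 80.0)]
print(operational_ceiling(shadow))
```

With only four sample points the inflection estimate is coarse; sampling more densely around the first drop would tighten it.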
Why This Post Belongs in This Series
Post 4 established DQR as your output quality SLI. Post 9 established token budget as a cost circuit breaker. Post 11 introduced RTD as your reasoning observability layer.
This post connects all three: context window mismanagement is the common cause that degrades DQR, elevates RTD, and burns your token budget simultaneously. Fix the memory architecture and you see improvement across all three SLIs. That's not a coincidence — they're measuring the same failure from different angles.
The code is in agentsre/context_budget.py on GitHub. MIT licensed, zero external dependencies.
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer
github.com/Ajay150313/agentsre

Your OTel Traces Are Lying to You: Observability for the Reasoning Layer

Ajay Devineni — Tue, 19 May 2026 02:12:12 +0000

Three weeks ago someone on the AWS Builders Slack posted something that stopped me cold. Their production AI agent had been running for six hours. CPU normal. Memory stable. Latency within SLO. Zero error rate in CloudWatch.
The agent was re-planning on every single task. One tool kept returning stale data. The agent recognized it, switched tools, got a different failure, re-planned again. It completed tasks — slowly, expensively, with degrading output quality. Nothing in the dashboard moved.
This is not an edge case. This is the default failure mode of agentic AI in production, and your current observability stack cannot see it.
Why OTel Misses the Problem
OpenTelemetry is the best thing that's happened to observability in a decade. Traces, metrics, logs — stable across all three signal types as of the 2026 CNCF milestone. Auto-instrumentation is production-grade. The ecosystem is mature.
And for agent reasoning behavior, it is the wrong level of abstraction.
OTel traces infrastructure execution. A trace shows you: this request arrived, it called this service, that service called this database, the database returned in 42ms, the response went back. Perfect for distributed systems.
An agent doesn't execute a fixed call graph. An agent reasons. It evaluates state, picks a tool, observes the result, decides whether to continue or re-plan, picks another tool. The reasoning path is dynamic. The same input can produce different call graphs on different runs depending on what the tools return.
The key shift is that once agent reasoning is exported into your observability stack, traces stop showing infrastructure execution and start showing reasoning behavior, but only if you're emitting the right data. (Kore.ai)
Most teams aren't. They're emitting infrastructure spans. The reasoning is invisible.

The Pattern: Silent Degradation via Re-Planning Loops
Here's what silent agent degradation looks like in a trace when you're not capturing reasoning:
span: agent-task-processor duration: 4.2s status: OK
span: tool-call-cloudwatch duration: 0.8s status: OK
span: tool-call-s3 duration: 0.3s status: OK
span: tool-call-cloudwatch duration: 0.8s status: OK
span: tool-call-dynamodb duration: 0.4s status: OK
Looks fine. Four tool calls, all successful, task completed.
Here's what's actually happening:
agent receives task
→ plans: use CloudWatch metric X
→ calls CloudWatch: returns stale data (tool succeeds, data is wrong)
→ agent evaluates result: doesn't match expected state
→ RE-PLANS: try DynamoDB instead
→ calls DynamoDB: schema mismatch (tool succeeds, data wrong format)
→ RE-PLANS: back to CloudWatch, different metric
→ calls CloudWatch: stale again
→ RE-PLANS: escalate to human
Four successful spans. Two re-planning cycles. One HER escalation. Zero errors in your monitoring.
This is your RSI (Retry Storm Index) in action — not at the HTTP retry level, but at the reasoning level.

Introducing Reasoning Trace Depth
I want to introduce a new observable to pair with RSI: Reasoning Trace Depth (RTD).
RTD = the number of re-planning cycles an agent goes through before either completing a task or escalating.
Baseline for a healthy agent on routine tasks: 0–1 re-planning cycles.
Warning threshold: 3+ re-planning cycles.
Critical threshold: 5+ re-planning cycles (agent is effectively stuck).
RTD is your earliest signal. It rises before HER (because the agent is still trying before escalating), before latency becomes visible to users, and before cost metrics show anomalous spend.
from dataclasses import dataclass, field
from typing import List, Optional
import time


@dataclass
class AgentDecisionTrace:
    """
    Structured reasoning trace for a single agent task execution.
    Emitted once per task — NOT once per tool call.
    This is your reasoning observability layer.
    """
    agent_id: str
    session_id: str
    task_id: str
    timestamp: str

    # Reasoning behavior
    initial_plan: str
    tools_called: List[str] = field(default_factory=list)
    replan_count: int = 0           # RTD — Reasoning Trace Depth
    replan_reasons: List[str] = field(default_factory=list)

    # Outcome
    task_completed: bool = False
    human_escalated: bool = False   # HER signal

    # Cost signals
    total_tool_calls: int = 0
    latency_ms: int = 0

    # Quality proxy (if available)
    confidence_proxy: Optional[float] = None


def emit_decision_trace(trace: AgentDecisionTrace) -> dict:
    """
    Emit structured decision trace to your log aggregator.
    This sits ABOVE your OTel infrastructure spans.
    One entry per agent task — your reasoning observability layer.
    """
    record = {
        "trace_type": "agent_decision",
        "agent_id": trace.agent_id,
        "session_id": trace.session_id,
        "task_id": trace.task_id,
        "timestamp": trace.timestamp,
        "reasoning": {
            "initial_plan": trace.initial_plan,
            "replan_count": trace.replan_count,  # RTD
            "replan_reasons": trace.replan_reasons,
            "tools_sequence": trace.tools_called
        },
        "outcome": {
            "completed": trace.task_completed,
            "human_escalated": trace.human_escalated,  # HER
        },
        "cost": {
            "tool_calls_total": trace.total_tool_calls,
            "latency_ms": trace.latency_ms
        }
    }

    # Flag for immediate attention
    if trace.replan_count >= 3:
        record["alert"] = "RTD_WARNING"
    if trace.replan_count >= 5:
        record["alert"] = "RTD_CRITICAL"

    return record
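
If your agent framework exposes an ordered list of step events, RTD can be derived by counting re-plan events. The event labels below are assumptions for illustration; the 3+ and 5+ thresholds are the ones defined above.

```python
from typing import List


def reasoning_trace_depth(events: List[str]) -> int:
    """Count re-planning cycles in an ordered list of agent step events."""
    return sum(1 for event in events if event == "replan")


def rtd_status(rtd: int) -> str:
    """Classify RTD using the warning (3+) and critical (5+) thresholds."""
    if rtd >= 5:
        return "RTD_CRITICAL"
    if rtd >= 3:
        return "RTD_WARNING"
    return "OK"


# Stale-data scenario: two re-plans, then an escalation to a human.
events = [
    "plan", "tool:cloudwatch",
    "replan", "tool:dynamodb",
    "replan", "tool:cloudwatch",
    "escalate",
]
rtd = reasoning_trace_depth(events)
print(rtd, rtd_status(rtd))
```

A single task with two re-plans stays under the warning threshold, which is why the baseline and the trend across tasks matter more than any one trace.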

The Three-Layer Observability Model for Agents
Your current stack has two layers. You need three.
Layer 1 — Infrastructure (you already have this)
OTel traces, Prometheus metrics, structured logs. Tool call latency, error rates, resource utilization. This is what Datadog, Grafana, and CloudWatch show you. It's correct and necessary. It just doesn't see reasoning.
Layer 2 — Control Plane (from Post 7 — RAR, RSI, DCS)
Routing accuracy, retry patterns at the orchestration level, decomposition quality. This is your agent behavior at the workflow level — are tasks being routed correctly? Is the orchestrator stable?
Layer 3 — Reasoning (what's missing)
RTD (Reasoning Trace Depth), re-plan reasons, plan-to-execution delta, decision confidence proxies. One structured log entry per agent task. This is the layer your dashboards don't have.
The diagnostic flow when something feels wrong but dashboards are green:

Check Layer 1: Is infrastructure healthy?
→ Yes → move to Layer 2
Check Layer 2: Is RSI elevated? Is RAR degraded?
→ RSI elevated → move to Layer 3
Check Layer 3: Is RTD above baseline?
→ RTD > 3 → agent is re-planning, find the tool/data source causing it
→ RTD normal, HER elevated → agent is escalating cleanly, check decision envelope
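The flow above can be read as a small decision function. This is a minimal sketch, assuming the layer signals are already computed elsewhere; the function name, the parameter names, and the return strings are illustrative, and the default threshold of 3 is taken from the flow itself.

```python
def diagnose(infra_healthy: bool, rsi_elevated: bool, rtd: int,
             her_elevated: bool, rtd_threshold: int = 3) -> str:
    """Walk the three layers top-down and return the first finding."""
    # Layer 1: infrastructure
    if not infra_healthy:
        return "layer1: infrastructure unhealthy, start there"
    # Layer 2: control plane (the flow only descends when RSI is elevated)
    if not rsi_elevated:
        return "layer2: RSI at baseline, check RAR and routing"
    # Layer 3: reasoning
    if rtd > rtd_threshold:
        return "layer3: agent is re-planning, find the tool or data source causing it"
    if her_elevated:
        return "layer3: agent is escalating cleanly, check the decision envelope"
    return "layer3: RTD and HER at baseline, no reasoning-layer finding"


finding = diagnose(infra_healthy=True, rsi_elevated=True, rtd=4, her_elevated=False)
```

Returning a string per branch keeps the sketch easy to log next to the decision trace; a real implementation would more likely return an enum or open a ticket.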

What This Looks Like in CloudWatch
import boto3

cw = boto3.client('cloudwatch', region_name='us-east-1')

def publish_rtd_metric(agent_id: str, rtd_value: int) -> None:
    """
    Publish Reasoning Trace Depth to CloudWatch.
    Alert when RTD exceeds 3: the agent is re-planning excessively.
    """
    cw.put_metric_data(
        Namespace='AgentSRE/Reasoning',
        MetricData=[{
            'MetricName': 'ReasoningTraceDepth',
            'Dimensions': [{'Name': 'AgentId', 'Value': agent_id}],
            'Value': float(rtd_value),
            'Unit': 'Count'
        }]
    )
Set your alarm at RTD > 3 sustained over a 5-minute window. That's your early warning before HER spikes, before users feel latency, before cost anomalies appear in your billing dashboard.
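One way to express that alarm is to build the arguments for CloudWatch's `put_metric_alarm` call. The sketch below only constructs the parameters so it runs without AWS credentials. The alarm name, the choice of `Average` as the statistic, and the missing-data handling are assumptions, not something the metric code above prescribes.

```python
def rtd_alarm_params(agent_id: str, threshold: float = 3.0) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"AgentRTDHigh-{agent_id}",   # hypothetical naming scheme
        "Namespace": "AgentSRE/Reasoning",
        "MetricName": "ReasoningTraceDepth",
        "Dimensions": [{"Name": "AgentId", "Value": agent_id}],
        "Statistic": "Average",                    # assumption: average RTD per window
        "Period": 300,                             # one 5-minute window
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",        # assumption: idle agents should not alarm
    }


params = rtd_alarm_params("payment-processor")
# With AWS credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```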

The Connection to Your Existing SLI Framework
If you've been following this series:

Post 4 introduced HER — your human escalation signal. HER is what happens after the agent gives up re-planning.
Post 7 introduced RSI — your retry storm signal at the control plane level.
This post introduces RTD — the earlier, reasoning-level signal that predicts both RSI and HER before they breach.

RTD → feeds → RSI → feeds → HER
The three form a causal chain. If you're only watching HER, you're watching the end of the chain. RTD gives you the front.

The Practical Checklist
Before your next agent ships, add to your production-readiness checklist:
☐ Decision trace structured logging configured (one JSON entry per task, not per span)
☐ RTD metric emitting to CloudWatch / Prometheus
☐ RTD baseline established (30-day shadow mode — same as HER baseline protocol)
☐ RTD alarm set at threshold > 3
☐ RTD correlated to HER in your dashboards — rising RTD without rising HER means the agent is struggling but not yet escalating
Your OTel traces are correct. They're just answering the wrong question.
LinkedIn discussion: https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-observability-activity-7462294037518159872-iF29?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer | github.com/Ajay150313/agentsre

The AI Agent Cost Ceiling Problem: Why Your AWS Bill Is Your Reliability Alert

Ajay Devineni — Mon, 11 May 2026 21:16:09 +0000

Production AI agents fail on tool calls 3–15% of the time. That's not a failure rate you fix — it's a reality you design around.

The teams that have designed around it have circuit breakers: token budgets, retry limits, cost anomaly alerts wired to incident response.

The teams that haven't find out from their AWS bill.

This article is about the reliability infrastructure between those two outcomes.

The Retry Loop Failure Mode

When an AI agent calls a tool and gets an ambiguous response — not an error, not a success, just something unexpected — most agents do what they're designed to do: they try again. And again. And again.

Without a hard retry limit, this becomes a loop. Without a token budget cap, the loop has no ceiling. Without observability instrumentation specific to retry signatures, your standard dashboards show nothing unusual until the cost spike appears.

In documented production deployments, the cost spike is the first operational signal that something has gone wrong. By that point, if the agent has write permissions and has queued remediation actions, the incident may have worsened before anyone noticed the loop.

This is the reliability problem behind the cost problem. The bill is the symptom. The missing circuit breaker is the cause.

Why Standard SLIs Don't Catch It

Request latency: normal. The agent is responding within SLO. Error rate: zero. Every call returns something — just not what the agent expected. Availability: 100%. The agent is up and running.

The retry loop produces none of the infrastructure-layer signals your existing alerts are watching.

What it does produce is a Tool Invocation Efficiency (TIE) anomaly — your agent is making 4, 6, 8 tool calls per task when its baseline is 2. That ratio climbing is your early warning. It fires before the billing cycle closes. It fires before the incident escalates.

This is why TIE is a first-class SLI in the agentsre library. It catches what latency and error rate miss.
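As a rough illustration of the ratio described above, here is a minimal sketch that compares recent tool calls per task against a baseline. The function name and the 2x breach factor are assumptions for the example; this is not the agentsre implementation.

```python
from statistics import mean


def tie_ratio(tool_calls_per_task: list[int], baseline: float) -> float:
    """Average tool calls per task, expressed as a multiple of baseline."""
    if not tool_calls_per_task:
        raise ValueError("need at least one completed task")
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return mean(tool_calls_per_task) / baseline


recent = [4, 6, 8]                        # tool calls for the last three tasks
ratio = tie_ratio(recent, baseline=2.0)   # 6 calls on average vs. a baseline of 2
breached = ratio > 2.0                    # assumed breach rule: more than 2x baseline
```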

The Three Circuit Breakers

Every production AI agent needs three reliability controls specifically for the retry loop failure mode:

1. Hard Token Budget Per Session

Set a maximum token count per agent session. Not a soft recommendation in the system prompt — a hard limit enforced at the infrastructure layer. When the agent hits the limit, it stops executing and routes to your escalation path.

Size the budget at 3x your P95 task token usage. A task class whose P95 usage is 2,000 tokens gets a 6,000-token ceiling. Anything above that is a signal, not normal operation.

from agentsre import AgentSLICollector, TaskRecord

# Assumption: a collector with default settings; configure it for your environment
collector = AgentSLICollector()

# Track token usage as part of your task record
collector.record(TaskRecord(
    task_id="t-001",
    task_class="incident-analysis",
    tool_calls=8,               # elevated — baseline is 2.3
    decision_confidence=0.71,
    completed=True,
))

# TIE will catch the retry signature before the bill does
results = collector.collect("incident-analysis")
for r in results:
    if r.breached:
        trigger_circuit_breaker(r)   # your own handler, not part of agentsre
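The collector example above watches tool calls; the hard token ceiling itself needs to live outside the prompt. Here is a minimal sketch of that ceiling, assuming your agent runtime can report tokens consumed per step. `TokenBudget` and `TokenBudgetExceeded` are hypothetical names, not part of agentsre.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a session crosses its hard token ceiling."""


class TokenBudget:
    def __init__(self, p95_tokens: int, multiplier: float = 3.0) -> None:
        self.ceiling = int(p95_tokens * multiplier)   # 3x P95 by default
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record usage and raise as soon as the ceiling is crossed."""
        self.used += tokens
        if self.used > self.ceiling:
            raise TokenBudgetExceeded(f"used {self.used} of {self.ceiling} tokens")


budget = TokenBudget(p95_tokens=2000)   # ceiling: 6,000 tokens
budget.charge(2500)
budget.charge(2500)                     # 5,000 used, still within budget
try:
    budget.charge(2500)                 # 7,500 used, over the ceiling
    escalated = False
except TokenBudgetExceeded:
    escalated = True                    # stop executing and route to the escalation path
```

Raising an exception rather than returning a flag means the agent loop cannot ignore the limit by accident.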

2. Retry Loop Signature in Observability

A retry loop has a distinctive signature: tool call count per task climbing above baseline, task completion time extending beyond P99, and decision confidence declining across sequential attempts.

Configure a CloudWatch alarm on TIE drift: when tool calls per task exceed 2x baseline for 10 consecutive minutes, fire an alert. This is your early warning before the cost spike and before the incident escalates.

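# CloudWatch alarm for retry loop detection
# (threshold 2.0 assumes the metric is published as a ratio to baseline;
#  a 300 s period over 2 evaluation periods covers 10 minutes)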
aws cloudwatch put-metric-alarm \
  --alarm-name "AgentRetryLoopDetected" \
  --metric-name "ToolInvocationEfficiency" \
  --namespace "AgentReliability" \
  --statistic Average \
  --period 300 \
  --threshold 2.0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:REGION:ACCOUNT:AgentAlerts

3. Cost Anomaly as Incident Trigger

Wire your AWS Cost Anomaly Detection to your incident management system. An AI agent whose cost per hour doubles is experiencing a reliability event — treat it as one.

Set a cost anomaly threshold at 150% of your rolling 7-day average for the relevant Lambda functions and Bedrock invocations. When it fires, it routes to the same on-call channel as your availability alerts — because it is an availability signal.
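The 150% rule is simple enough to check directly against daily cost figures. The sketch below is a standalone check, not how AWS Cost Anomaly Detection works internally; the function name and the sample figures are made up for illustration.

```python
from statistics import mean


def is_cost_anomaly(last_7_days: list[float], current: float,
                    factor: float = 1.5) -> bool:
    """True when current cost exceeds factor x the rolling 7-day average."""
    if len(last_7_days) != 7:
        raise ValueError("expected exactly 7 daily cost samples")
    return current > factor * mean(last_7_days)


history = [10.0, 12.0, 11.0, 9.0, 10.0, 13.0, 12.0]   # average is 11.0, threshold is 16.5
spike = is_cost_anomaly(history, current=17.0)
normal_day = is_cost_anomaly(history, current=16.0)
```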

The Numbers Behind This

Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. Cost overruns and weak risk controls are not independent failure modes: they're the same failure mode at different stages of the same incident.

The retry loop causes the cost overrun. The missing circuit breaker lets the retry loop run unbounded. And the circuit breaker is missing because teams treat AI agent reliability as an application problem rather than an infrastructure problem requiring SRE governance.

What To Do Before Your Next Agent Goes Live

Three checks before any AI agent touches production:

Check 1: Does this agent have a hard token budget enforced at the infrastructure layer? Not a prompt instruction — a hard limit.

Check 2: Is TIE instrumented per task class with a 2x-baseline breach alert configured?

Check 3: Is cost anomaly detection wired to your incident management system for this agent's associated AWS resources?

If any answer is no — the agent is not production-ready. It is demo-ready.

The circuit breaker for the retry loop costs an afternoon to build. The absence of it costs the project.

Open-source implementation: github.com/Ajay150313/agentsre — the agentsre library instruments TIE, DQR, HER, and AQDD out of the box with AWS CloudWatch integration.

LinkedIn discussion: https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-ugcPost-7459711021738307584-x6cv?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

What ceiling do you have today when an agent starts looping?

The Double-Exposure Problem: When AI Agents and AI-Generated Code Fail Together

Ajay Devineni — Fri, 08 May 2026 02:02:45 +0000

Amazon's March 2026 AI outages — two separate incidents within three days, totaling more than 6 million lost orders — have done something unusual for the SRE community: they've made a failure mode visible that most teams have been quietly carrying in their production systems without acknowledging.

The incidents were traced to AI-generated code changes deployed without adequate approval gates. Amazon's response was a 90-day code safety reset across 335 critical systems, with a new requirement that AI-assisted code changes be reviewed by senior engineers before deployment.

That response is SRE discipline. Applied reactively. This article is about applying it proactively — and about a compounding failure mode most teams haven't modeled yet.

The Double-Exposure Problem

The SRE concept of blast radius asks: when a component fails, what is the maximum scope of impact? Most blast radius models assume that the failing component is one thing — a service, a database, a network partition.

In 2026 production environments, a new blast radius scenario is emerging that most models don't account for:

What happens when your AI agent and the AI-generated code it runs on fail simultaneously?

This is the double-exposure problem. It has three components:

Exposure 1 — AI runtime behavior. Your AI agent operates non-deterministically. Its decisions, tool selections, and reasoning paths vary across invocations. Standard observability — latency, error rate, availability — does not instrument this layer. The semantic failure modes (wrong decisions, context drift, tool compensation) are invisible to your dashboards.

Exposure 2 — AI-generated code changes. Your CI/CD pipeline uses AI assistance to generate infrastructure changes, configuration updates, or application code. According to Lightrun's 2026 survey of 200 senior SRE and DevOps leaders, 43% of these changes require manual debugging in production even after passing QA. Not a single survey respondent said they were "very confident" that AI-generated code would behave correctly in production.

Exposure 3 — The interaction. When an AI-generated code change deploys to the same environment your agent is operating in, you have two non-deterministic systems interacting. The code change may alter the agent's tool environment, context window, or available action space in ways that manifest as behavioral drift, which your current instrumentation will miss because it's measuring infrastructure, not agent behavior.

The result: a production incident that looks like agent degradation. The root cause is a code change. The RCA takes hours because the investigation starts at the wrong layer.

Why Standard Observability Misses This

IEEE Spectrum described this failure class in their recent article on quiet AI failures: every monitoring dashboard reads healthy while users report that system decisions are becoming wrong.

This is structurally identical to what happens in the double-exposure scenario. A code change that subtly alters an agent's tool environment produces no infrastructure-layer signal. The agent's HTTP responses stay at 200. Latency stays within SLO. Error budget stays unburned.

What changes is the agent's Decision Quality Rate — the percentage of decisions falling within expected behavioral bounds. And Tool Invocation Efficiency — the ratio of tool calls per task completion. And eventually Human Escalation Rate — the percentage of tasks requiring intervention.

None of these are instrumented in a standard observability stack. All of them detect the double-exposure failure mode before it reaches user impact.

The Governance Framework

Amazon's 90-day reset is a retroactive version of what proactive SRE governance looks like. Here are the four components that matter, drawn from first principles rather than post-incident response:

1. The AI Code Change Approval Gate

Every code change touching an AI agent's runtime environment — its tools, configuration, action space, or infrastructure — should require explicit approval before deployment. Not because AI code generation is untrustworthy, but because non-deterministic code changes interacting with non-deterministic runtime systems have a compounding failure surface that standard CI/CD testing cannot fully cover.

This is not a new concept. Amazon has now required it. The cost of implementing it proactively is hours. The cost of discovering it's missing is incidents.

Implementation: A dedicated approval stage in your deployment pipeline for changes flagged as AI-generated or agent-environment-adjacent. This is distinct from your standard peer review — it specifically evaluates: does this change touch any agent's tool environment, context configuration, or action space?

2. Behavioral Baseline Snapshots Around Code Deployments

Apply the same framework version governance pattern to AI code changes: snapshot your agent's behavioral baselines before the change deploys, and compare post-deployment behavior against them.

Specifically, capture per-task-class TIE and DQR baselines immediately before any deployment that touches your agent's environment. Run the deployment in a shadow environment for a minimum review period. If TIE drifts more than 15% or DQR drops more than 15%, flag for human review before promoting to production.

This is the instrumentation that would have surfaced Amazon's failure earlier in the pipeline — not at the infrastructure layer, but at the behavioral layer where the actual impact manifested.

from agentsre import AgentSLICollector, TaskRecord
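# Assumption: UpgradeDecision is exported from agentsre.sprawl alongside the governance class
from agentsre.sprawl import FrameworkVersionGovernance, UpgradeDecision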

# Capture baseline before deployment
gov = FrameworkVersionGovernance(
    tie_drift_threshold=1.15,
    dqr_drift_threshold=0.85,
    min_shadow_samples=30,
)

gov.snapshot_baseline(
    agent_id="your-agent",
    task_class="your-task-class",
    framework_version="pre-ai-code-change",
    tie_values=current_tie_samples,
    dqr_values=current_dqr_samples,
)

# After shadow deployment — evaluate before promoting
result = gov.evaluate_upgrade(
    agent_id="your-agent",
    task_class="your-task-class",
    production_version="pre-ai-code-change",
    shadow_version="post-ai-code-change",
)

if result.decision == UpgradeDecision.BLOCK:
    block_deployment(result.block_reason)

3. A Blast Radius Model for Double-Exposure

Most blast radius models assume one failing component. Run the double-exposure calculation explicitly:

Which of your production services depend on AI agents?
Which code paths in those services are AI-generated?
If both the agent's semantic behavior and the underlying code fail simultaneously, what is the maximum scope of user impact?
What is the safe degradation sequence — which agent capabilities can you reduce autonomously, and in what order?

This calculation should exist as a named document, owned by a named person, reviewed quarterly. It is the blast radius equivalent of a fire drill — done in advance so the answer is known before the incident.
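The first two questions in the list above reduce to an intersection: services that depend on an agent and also contain AI-generated code paths. This is a minimal sketch with made-up service names and a hypothetical data shape; the real inputs would come from your service catalog and commit metadata.

```python
def double_exposed_services(agents_by_service: dict[str, set[str]],
                            ai_paths_by_service: dict[str, set[str]]) -> list[str]:
    """Services with at least one agent dependency AND one AI-generated code path."""
    return sorted(
        service
        for service, agents in agents_by_service.items()
        if agents and ai_paths_by_service.get(service)
    )


agents_by_service = {
    "checkout": {"payment-agent"},
    "support": {"triage-agent"},
    "search": set(),
}
ai_paths_by_service = {
    "checkout": {"pricing/rules.py"},
    "reporting": {"export.py"},
    "support": set(),
}

exposed = double_exposed_services(agents_by_service, ai_paths_by_service)
```

Here only `checkout` carries both exposures, so it is where the maximum-impact and degradation-sequence questions should be answered first.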

4. A Proactive Runbook — Not Amazon's Retroactive Reset

Amazon's 90-day reset is a retroactive runbook. Write yours proactively. A minimum viable AI code reliability runbook covers:

Detection: Which metrics signal that an AI code change has degraded agent behavior? (Answer: TIE drift, DQR drop, HER increase — not latency or error rate)
Attribution: How do you determine whether the degradation is agent behavior, code change, or model drift? (Answer: compare against behavioral baseline snapshots captured pre-deployment)
Containment: What is the fastest path to reverting the code change while maintaining partial agent operation? (Answer: the progressive autonomy constraint ladder — not a binary kill switch)
Recovery criteria: When is it safe to redeploy? (Answer: shadow behavioral baselines within ±15% of production baseline for 30 consecutive minutes)
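The recovery criterion can be checked mechanically. The sketch below assumes one shadow sample per minute for a single metric such as TIE; the function name and sampling interval are assumptions for the example.

```python
def safe_to_redeploy(shadow_samples: list[float], baseline: float,
                     tolerance: float = 0.15, required_minutes: int = 30) -> bool:
    """True if the most recent samples all sit within +/- tolerance of baseline."""
    if len(shadow_samples) < required_minutes:
        return False                      # not enough history yet
    recent = shadow_samples[-required_minutes:]
    return all(abs(value - baseline) <= tolerance * baseline for value in recent)


steady = safe_to_redeploy([2.1] * 30, baseline=2.0)
late_spike = safe_to_redeploy([2.1] * 29 + [2.5], baseline=2.0)
too_short = safe_to_redeploy([2.1] * 29, baseline=2.0)
```

A single out-of-band sample resets the clock, which is the point of requiring consecutive minutes.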

The SRE Perspective on AI Code Generation

The Lightrun finding that 88% of SRE leaders need two to three redeploy cycles to verify an AI-generated fix suggests something straightforward: the testing and verification frameworks for AI-generated code have not kept pace with the adoption of AI code generation.

This is the same lag that produced Amazon's incidents. And it's the same lag that the SRE community has closed before — with microservices, with Kubernetes, with cloud-native architectures. Each time, capability arrived before governance. The SRE discipline developed the governance.

The governance for AI-generated code in agent environments exists. Error budgets, blast radius models, approval gates, behavioral baseline comparison — these are standard SRE tools. They need to be applied to a new layer of the stack.

The open-source implementation is at github.com/Ajay150313/agentsre. The FrameworkVersionGovernance module handles behavioral baseline capture and comparison. The progressive constraint ladder handles safe degradation. Both work for AI code change governance as directly as they do for framework upgrades.

Learning this lesson cost Amazon 6.3 million orders. Most teams can learn it for the cost of an afternoon.

What To Do This Week

If you're running AI agents in production and using AI-assisted code generation in the same environment:

Today: Identify which code changes in your last 30 days touched your agent's tool environment, configuration, or action space. Determine whether any were AI-generated. If yes — were they reviewed specifically for agent-environment impact?

This week: Add an AI code change flag to your deployment pipeline. Start capturing TIE and DQR baselines around any deployment flagged as agent-environment-adjacent.

This month: Run the double-exposure blast radius calculation. Document the result. Assign an owner. Review it with your team.

The Amazon incidents happened in March. The Lightrun survey data was collected in January. IEEE Spectrum is calling quiet failure one of the defining challenges of the year.

The signal is clear. The governance frameworks exist.

Open-source implementation: github.com/Ajay150313/agentsre
LinkedIn discussion: https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-activity-7458330530212835328-36__?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

What does your current approval gate for AI-generated code look like? Or is this the first time you've run the double-exposure calculation?

The Agent Control Plane is an SRE Problem: Governing the Orchestration Layer Nobody is Watching

Ajay Devineni — Mon, 04 May 2026 23:57:35 +0000

IBM's Distinguished Engineer Chris Hay declared this week that "agent control planes and multi-agent dashboards become real in 2026." Gartner projects that 40% of enterprise applications will use task-specific AI agents by 2026. The orchestration infrastructure to manage all of those agents — the control plane — is becoming the most critical and least governed layer in production AI.

This article applies SRE discipline to the agent control plane: what it is, what failure modes it introduces, and what instrumentation it requires before it goes to production.

What Is an Agent Control Plane?

In 2026, an agent control plane is the orchestration layer that:

Receives tasks from humans or upstream systems
Decomposes them into subtasks
Routes subtasks to specialist agents
Manages retry, rescheduling, and priority queues across the agent fleet
Makes autonomous decisions about resource allocation when demand spikes

The control plane is distinct from the agents it manages. It is infrastructure — the same way a Kubernetes control plane is distinct from the pods it schedules.

This distinction matters for reliability: when the control plane degrades, it does not degrade one agent. It degrades the entire fleet simultaneously.

The Control Plane Failure Taxonomy

Control plane failures are uniquely difficult to detect because they do not look like single-agent failures. They look like correlated degradation across multiple agents — which standard observability interprets as coincidence or noise.

Failure Class 1: Routing Drift

The control plane misassigns tasks to suboptimal agents — sending high-complexity reasoning tasks to agents specialized for retrieval, or routing compliance-sensitive tasks through agents without the required tool access. Each individual agent appears healthy. The control plane's routing logic is the failure.

Observable signal: fleet-wide DQR drops across unrelated task classes simultaneously.

Failure Class 2: Retry Storms

When multiple downstream agents fail simultaneously, the control plane retries across its full routing table. Each retry generates additional tool calls. If the control plane does not implement backoff and circuit breaking at the routing layer, a partial agent outage generates a retry storm that saturates the entire MCP tool layer.

Observable signal: fleet-wide TIE spike not attributable to any single agent or task class.

Failure Class 3: Priority Queue Starvation

Under load, control planes must prioritize. If the priority algorithm fails — or if it was never set — low-priority tasks consume resources that high-priority tasks need. Users of business-critical workflows experience silent slowdown while batch jobs consume capacity.

Observable signal: AQDD breaches across multiple task classes with no corresponding error rate increase.

Failure Class 4: Decomposition Accuracy Degradation

As task complexity increases, the control plane's decomposition logic produces subtask sets that are incomplete, redundant, or contradictory. Individual agents execute their subtasks correctly. The composed result is wrong because the decomposition was wrong.

Observable signal: HER climbs fleet-wide — humans are intervening not because agents failed, but because the task decomposition produced nonsensical results.

The Three SLIs Your Control Plane Needs

I extend the agentsre SLI framework with three control plane-specific measurements:

1. Routing Accuracy Rate (RAR)

The percentage of task assignments that match the optimal agent for the task class, measured against a labeled evaluation set.

RAR(t, w) = (correct_assignments / total_assignments) × 100

Baseline during a 30-day calibration window. Alert when RAR drops >15% from baseline — this is the signal that routing logic has drifted, usually because a new task class was added without updating routing rules.
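A minimal sketch of the RAR formula and its alert rule follows. It reads "drops >15% from baseline" as a relative drop, which is an assumption; if you mean 15 percentage points, change the comparison accordingly. The function names are illustrative.

```python
def rar(correct_assignments: int, total_assignments: int) -> float:
    """Routing Accuracy Rate as a percentage."""
    if total_assignments <= 0:
        raise ValueError("total_assignments must be positive")
    return 100 * correct_assignments / total_assignments


def rar_alert(current: float, baseline: float, max_drop: float = 0.15) -> bool:
    """True when RAR has fallen more than max_drop relative to baseline."""
    return current < baseline * (1 - max_drop)


baseline_rar = rar(180, 200)                          # 90.0
drifted = rar_alert(rar(150, 200), baseline_rar)      # 75.0 vs. a floor of 76.5
healthy = rar_alert(rar(160, 200), baseline_rar)      # 80.0 is above the floor
```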

2. Retry Storm Index (RSI)

The ratio of retry-generated tool calls to primary-invocation tool calls across the fleet in a rolling window.

RSI(t, w) = retry_tool_calls / primary_tool_calls

Normal RSI baseline is typically 0.05–0.15 (5–15% of tool calls are retries). RSI > 0.50 indicates retry storm conditions. RSI > 1.0 means more retry traffic than primary traffic — the control plane is in a positive feedback loop.
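The thresholds above map onto a small classifier. In this sketch the "elevated" band between 0.15 and 0.50 is my own label for the gap the text leaves unnamed, and the function names are illustrative.

```python
def rsi(retry_tool_calls: int, primary_tool_calls: int) -> float:
    """Retry Storm Index: retry traffic as a fraction of primary traffic."""
    if primary_tool_calls <= 0:
        raise ValueError("primary_tool_calls must be positive")
    return retry_tool_calls / primary_tool_calls


def classify_rsi(value: float) -> str:
    if value > 1.0:
        return "feedback_loop"    # more retry traffic than primary traffic
    if value > 0.50:
        return "retry_storm"
    if value > 0.15:
        return "elevated"         # assumed label for the band above normal
    return "normal"


status = classify_rsi(rsi(retry_tool_calls=60, primary_tool_calls=100))
```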

3. Decomposition Completeness Score (DCS)

The percentage of decomposed subtask sets that, when executed, produce outputs covering all requirements of the original task.

DCS requires a completeness validator per task class.

This is the hardest to instrument — it requires semantic understanding of task requirements. Start with a rule-based validator for your highest-volume task classes before attempting ML-based validation.
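A rule-based validator can start as a set-coverage check: each task declares the requirements it must satisfy, and each subtask output declares which ones it covers. This is a sketch under that assumption; the requirement tags and function names are hypothetical.

```python
def covers_all(requirements: set[str], subtask_outputs: list[set[str]]) -> bool:
    """True if the union of subtask outputs covers every requirement."""
    covered = set().union(*subtask_outputs)
    return requirements <= covered


def dcs(tasks: list[tuple[set[str], list[set[str]]]]) -> float:
    """Decomposition Completeness Score as a percentage of complete tasks."""
    if not tasks:
        raise ValueError("need at least one task")
    complete = sum(covers_all(reqs, outputs) for reqs, outputs in tasks)
    return 100 * complete / len(tasks)


tasks = [
    ({"refund", "notify"}, [{"refund"}, {"notify"}]),   # fully covered
    ({"refund", "notify"}, [{"refund"}]),               # "notify" was never decomposed
]
score = dcs(tasks)
```

The tagging itself is the hard part; this only shows how the score falls out once tags exist.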

The Control Plane Governance Model

Separate SLO Ownership

The control plane is not owned by the same person who owns the agents. It is a separate system with a separate error budget. The control plane SLO owner:

Is paged when RAR drops >15% from baseline
Is paged when RSI exceeds 0.50 for 10+ minutes
Owns the retry storm runbook
Reviews control plane decomposition logic on every new task class addition

The Retry Storm Runbook (minimum viable version)

Every production control plane needs this runbook before launch:

Detection: RSI > 0.50 sustained 10 minutes → page control plane owner
Immediate action: Reduce control plane retry limit from default (3) to 1
Circuit breaking: Identify failing agents via fleet-wide TIE spike attribution. Apply a circuit breaker to each (open when the agent's semantic validation rate falls below 85%)
Recovery: Restore retry limit only after RSI returns to < 0.20 for 15 consecutive minutes
Postmortem trigger: Any RSI > 1.0 event requires a postmortem within 48 hours
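The detection, immediate-action, and recovery steps of the runbook can be sketched as one function over a per-minute RSI history. The one-sample-per-minute cadence and the function names are assumptions; paging and circuit breaking are left out.

```python
def next_retry_limit(rsi_history: list[float], current_limit: int,
                     default_limit: int = 3) -> int:
    """rsi_history holds one RSI sample per minute, oldest first."""
    # Detection + immediate action: RSI > 0.50 sustained for 10 minutes
    if len(rsi_history) >= 10 and all(v > 0.50 for v in rsi_history[-10:]):
        return 1
    # Recovery: RSI < 0.20 for 15 consecutive minutes
    if len(rsi_history) >= 15 and all(v < 0.20 for v in rsi_history[-15:]):
        return default_limit
    return current_limit          # otherwise leave the limit where it is


def needs_postmortem(rsi_history: list[float]) -> bool:
    """Any RSI > 1.0 event triggers a postmortem."""
    return any(v > 1.0 for v in rsi_history)


storm = [0.6] * 10
during_storm = next_retry_limit(storm, current_limit=3)
still_recovering = next_retry_limit(storm + [0.1] * 5, current_limit=1)
recovered = next_retry_limit(storm + [0.1] * 15, current_limit=1)
```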

Control Plane Version Governance

Apply the same framework upgrade governance to control plane versions as to agent framework versions: snapshot RAR, RSI, and DCS baselines before any control plane update. Run shadow traffic. Block promotion if any metric drifts beyond threshold.

Implementation on AWS

The three control plane SLIs instrument naturally on Bedrock's orchestration layer:

RAR: Evaluate routing decisions by comparing agentId in Bedrock orchestration traces against a task-class-to-optimal-agent mapping in DynamoDB
RSI: Count RETRY events vs INVOKE events in Bedrock CloudWatch logs, published as a ratio metric per 5-minute window
DCS: Lambda validator comparing subtask outputs against original task requirements, triggered by task completion events via EventBridge

Full implementation is in the agentsre library: https://github.com/Ajay150313/agentsre

Connecting the Arc

This is the fourth layer of the AI-SRE reliability framework:

Single-agent SLIs (DQR, TIE, HER, AQDD)
A2A semantic boundary validation + circuit breaker
Agent Sprawl governance (fleet inventory, framework canary, deprecation alerting)
Agent Control Plane SLIs (RAR, RSI, DCS) — this article

Each layer adds governance to the next abstraction level of the same infrastructure problem: autonomous AI operating in production without adequate reliability discipline.

LinkedIn discussion:
https://www.linkedin.com/posts/ajay-devineni_agenticai-sre-controlplane-share-7457213748500475904-yi9g?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

What's the biggest control plane reliability gap in your environment?

Agent Sprawl is Your Next Production Incident: An SRE Response to Datadog's State of AI Engineering 2026

Ajay Devineni — Fri, 01 May 2026 01:20:45 +0000

Datadog published the State of AI Engineering 2026 report this week — real telemetry from over a thousand production environments. Read it. It is the most comprehensive look at AI in production available right now.

I want to respond from the reliability engineering perspective, because the data reveals a problem the report names but doesn't fully resolve: agent sprawl is now a production reliability crisis, and the SRE discipline does not yet have governance frameworks for it.

What the Data Shows

Three findings stand out from an SRE perspective:

Framework adoption doubled year over year. LangChain, LangGraph, Pydantic AI, Vercel AI SDK — up from 9% of organizations in early 2025 to nearly 18% by 2026. Services using agentic frameworks: more than doubled.

70%+ of organizations run three or more models. The share running more than six models nearly doubled. Teams are building model portfolios rather than committing to a single provider.

Teams add models faster than they retire them. Datadog calls this "LLM tech debt." Each overlapping model introduces its own quality, latency, and cost profile. The report is explicit: this becomes a governance problem.

These three findings combine to describe an environment growing faster than it can be governed. I call this Agent Sprawl.

Defining Agent Sprawl

Agent Sprawl — the condition where AI agent infrastructure complexity (frameworks, models, tool layers, orchestration patterns) grows faster than your ability to measure and govern its reliability.

It is structurally identical to the microservices sprawl problem SRE teams faced between 2015 and 2020. Teams added services faster than they added SLOs. The result: production incidents nobody could attribute because the dependency graph was too complex to observe.

Agent Sprawl has three specific manifestations:

1. Framework-Invisible Call Complexity

When you add LangChain, LangGraph, or any orchestration framework, it adds steps and paths you did not write — retry logic, fallback handlers, context window management, tool routing. All of this happens between your application code and your observability layer.

Your SLIs measure at the application boundary. Framework-added calls are invisible.

This means your Tool Invocation Efficiency (TIE) baseline — tool calls per task completion — is measuring a mix of your agent's behavior and your framework's behavior. When you upgrade the framework, both change simultaneously. You cannot separate them.

In practice, across regulated production environments I've studied: TIE baselines can drift 30–40% after a framework major version upgrade with no corresponding change in the agent's task logic. The baseline shift looks like agent degradation. It's actually framework overhead. Teams spend hours on a false RCA.

The fix: Instrument at the framework output layer, not the application layer. Capture tool invocations after framework processing. Then freeze your TIE baseline before any upgrade and compare shadow traffic before promoting.

2. Multi-Model SLO Orphaning

If 70% of organizations run three or more models and SLOs were set only when the first model shipped, each of those organizations has at least two models with no SLO owner, a gap most have not acknowledged.

SLOs are set once — typically when the first model is deployed. As models 2, 3, 4, 5, 6 are added for specific task classes, latency profiles, or cost tiers, nobody revisits the SLO ownership model. Models run in production with no named owner, no baseline, no error budget.

When model 3 degrades, there is no owner to page, no baseline to compare against, no runbook to execute. The degradation surfaces as a customer complaint, not an alert.

The fix: Treat every model in your fleet like a microservice. Each model gets: a named owner (not a team — a person), a task-class-specific SLO, and a 30-day observation baseline before the SLO is enforced.

3. LLM Tech Debt as a Reliability Liability

Deprecated models running in agent chains create silent compatibility risks. When a provider announces deprecation, teams with models buried inside multi-step chains often miss the migration window. The model ages. Safety training falls behind. Decision Quality Rate declines slowly — too slowly to trigger a threshold alert — until accumulated drift surfaces as a production incident.

The fix: Treat model deprecation notices the same way you treat dependency CVEs. Automate alerts at 60, 30, and 7 days before end-of-life. Build the migration ticket at announcement time, not at expiry.
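The alert schedule is plain date arithmetic. A minimal sketch, assuming you store each model's end-of-life date; the function name is illustrative and the example date is arbitrary.

```python
from datetime import date, timedelta


def deprecation_alert_dates(end_of_life: date,
                            lead_days: tuple[int, ...] = (60, 30, 7)) -> list[date]:
    """Dates on which to alert ahead of a model's end-of-life."""
    return [end_of_life - timedelta(days=days) for days in lead_days]


alerts = deprecation_alert_dates(date(2027, 6, 1))
```

For an end-of-life of 2027-06-01 this yields alerts on 2027-04-02, 2027-05-02, and 2027-05-25.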

The Governance Framework Agent Sprawl Needs

The Agent Fleet Inventory

Before you can govern sprawl, you need to know what you're governing. Maintain a living inventory with, for each component: framework and version, model(s) used, task classes handled, named SLO owner, current TIE/DQR baselines, and deprecation dates.

from agentsre.sprawl import AgentFleetInventory, FleetComponent, ComponentType

inventory = AgentFleetInventory()
inventory.register(FleetComponent(
    component_id="anthropic.claude-sonnet-4-6",
    component_type=ComponentType.MODEL,
    agent_id="payment-processor",
    task_classes=["payment-routing", "fraud-detection"],
    slo_owner="owner@team.com",                    # named human — not a team
    baseline_established_at="2026-04-01",
    deprecation_date="2027-06-01",
    last_slo_review="2026-04-01",
    current_tie_baseline=2.4,
    current_dqr_baseline=91.2,
))

report = inventory.quarterly_review_report()
print(f"Fleet governance score: {report['fleet_governance_score']}/100")

Framework Version Governance — Canary Before Promotion

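# Assumption: UpgradeDecision is exported from agentsre.sprawl alongside the governance class
from agentsre.sprawl import FrameworkVersionGovernance, UpgradeDecision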

gov = FrameworkVersionGovernance(
    tie_drift_threshold=1.15,   # block if TIE drifts >15%
    dqr_drift_threshold=0.85,   # block if DQR drops >15%
    min_shadow_samples=50,
)

# Before upgrade: snapshot production baseline
gov.snapshot_baseline(
    agent_id="payment-processor",
    task_class="payment-routing",
    framework_version="langchain-0.2.x",
    tie_values=production_tie_samples,
    dqr_values=production_dqr_samples,
)

# After 48hrs shadow traffic:
result = gov.evaluate_upgrade(
    agent_id="payment-processor",
    task_class="payment-routing",
    production_version="langchain-0.2.x",
    shadow_version="langchain-0.3.x",
)

if result.decision == UpgradeDecision.BLOCK:
    rollback()   # framework added hidden overhead — don't promote
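To make the two drift thresholds concrete, here is a standalone sketch of the comparison they imply. It is not the agentsre implementation. It assumes that drift is measured as the ratio of mean shadow values to mean baseline values, and that too few shadow samples should block promotion. The function name `should_block_upgrade` is hypothetical.

```python
from statistics import mean

TIE_DRIFT_THRESHOLD = 1.15  # block if shadow TIE is more than 15% above baseline
DQR_DRIFT_THRESHOLD = 0.85  # block if shadow DQR falls below 85% of baseline
MIN_SHADOW_SAMPLES = 50     # same value as min_shadow_samples above


def should_block_upgrade(base_tie, base_dqr, shadow_tie, shadow_dqr) -> bool:
    """Compare mean shadow metrics to the production baseline (assumed rule)."""
    if min(len(shadow_tie), len(shadow_dqr)) < MIN_SHADOW_SAMPLES:
        return True  # not enough shadow evidence to promote
    tie_ratio = mean(shadow_tie) / mean(base_tie)
    dqr_ratio = mean(shadow_dqr) / mean(base_dqr)
    return tie_ratio > TIE_DRIFT_THRESHOLD or dqr_ratio < DQR_DRIFT_THRESHOLD
```

With a baseline TIE of 2.4, a shadow mean of 3.0 is a ratio of 1.25 and blocks the upgrade, while 2.5 (a ratio of about 1.04) does not.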

The Quarterly Multi-Model SLO Review

The review should take 30–60 minutes per quarter. For every model in the fleet:

Verify named owner exists
Verify baseline is current (< 90 days old)
Check deprecation schedule against provider announcements
Review TIE per-model — models with rising TIE relative to task class baseline are drifting

Models scoring below 70 on the governance health score are flagged as governance debt requiring a 30-day remediation window.
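The review checklist can be expressed as a function that returns findings per model. This is a sketch under assumptions: the `ModelRecord` fields and the 15% TIE tolerance are hypothetical, and it deliberately does not compute the governance health score, since the scoring formula is not given here.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

MAX_BASELINE_AGE_DAYS = 90  # "baseline is current" rule from the checklist
TIE_RISE_TOLERANCE = 1.15   # assumption: flag TIE more than 15% above baseline


@dataclass
class ModelRecord:
    model_id: str
    owner: Optional[str]              # a named person, not a team alias
    baseline_established_at: date
    deprecation_date: Optional[date]
    baseline_tie: float
    current_tie: float


def review_findings(rec: ModelRecord, today: date) -> List[str]:
    """Return the checklist items this model fails; an empty list means healthy."""
    findings = []
    if not rec.owner:
        findings.append("no named owner")
    if (today - rec.baseline_established_at).days >= MAX_BASELINE_AGE_DAYS:
        findings.append("baseline older than 90 days")
    if rec.deprecation_date is not None and rec.deprecation_date <= today:
        findings.append("model is past its deprecation date")
    if rec.current_tie > rec.baseline_tie * TIE_RISE_TOLERANCE:
        findings.append("TIE rising relative to baseline")
    return findings
```

Returning findings rather than a single number keeps the review actionable: each string maps to one remediation task.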

The Datadog Report's Implicit Challenge

The State of AI Engineering 2026 describes an industry in rapid expansion. What it does not fully resolve is the SRE question: who governs all of this, and what does that look like in practice?

The SRE community has solved exactly this class of problem before — in distributed systems, in microservices, in cloud infrastructure. The discipline already exists. It needs to be applied to the AI agent layer now, before agent sprawl becomes agent chaos.

The Datadog data tells us the window is closing. Framework adoption doubles in a year. Multi-model fleets become the norm. Model debt accumulates.

Build the governance layer before the production incidents start.

Open-source implementation: [https://github.com/Ajay150313/agentsre]
LinkedIn discussion: [https://www.linkedin.com/posts/ajay-devineni_agenticai-sre-reliability-ugcPost-7455786901673902080-BCRM?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU]

What's your biggest agent sprawl challenge right now?

RAG vs MCP is the wrong debate — here's the right framing for production AI systems

Ajay Devineni — Tue, 28 Apr 2026 21:13:56 +0000

The question I keep seeing in every AI engineering forum right now:

"Should we use RAG or MCP?"

It's the wrong question. And the fact that it's being asked at all tells me the field hasn't yet settled on a shared mental model for agentic AI architecture.

Here's the framing I use — and why getting this wrong has real production consequences.

RAG and MCP operate at different layers

RAG (Retrieval-Augmented Generation) and MCP (Model Context Protocol) are not alternatives. They are not competitors. They solve fundamentally different problems in an agentic system.

Think of it this way:

RAG answers: what does the agent know?
MCP answers: what can the agent do?

One is a knowledge pattern. The other is an execution protocol. Comparing them is a category error — like asking whether you should use a database or an API. The answer is almost always: both, at the right layer.

What RAG actually is (and isn't)

RAG is a memory pattern. Before the model reasons, you fill its context window with relevant information retrieved from an external store — documents, knowledge bases, runbooks, historical data.

RAG is appropriate when:

The agent needs domain knowledge that isn't in its training data
The information is relatively stable (changes on the order of days or weeks, not seconds)
The query is about "what do we know" not "what is happening right now"

RAG is not appropriate when:

The agent needs to know the current state of a live system
The information changes faster than your retrieval pipeline can refresh
The agent needs to take an action, not just retrieve information

This last point is where teams get into trouble. Embedding stale infrastructure docs into a RAG pipeline and treating them as a substitute for live system data is one of the most common architecture mistakes I see in agentic AI deployments.

What MCP actually is (and isn't)

MCP is an execution protocol. It gives agents the ability to invoke tools, call external APIs, read live system state, and take actions in the world — all in a standardized, auditable way.

MCP is appropriate when:

The agent needs to act, not just reason
The information required is live — current system state, real-time data, dynamic context
You need auditability of what the agent did and why (decision lineage)

MCP is not appropriate as a substitute for knowledge retrieval. Routing every context-building query through a live MCP tool call introduces unnecessary latency, increases blast radius surface area, and creates tool dependency chains that are hard to reason about under failure.

The production architecture that actually works

RAG and MCP compose. They don't compete. Here is the pattern I recommend for agentic systems that need both knowledge and action:

User goal / trigger
       |
       v
RAG retrieval layer
  - Fetch relevant runbook sections
  - Fetch historical incident context
  - Fetch policy and compliance docs
       |
       v
Agent reasoning
  - Synthesize retrieved context
  - Classify decision (blast radius tier)
  - Determine required action
       |
       v
MCP execution layer
  - Invoke appropriate tool
  - Apply validation gates (LOW / HIGH / CRITICAL)
  - Emit decision lineage trace
  - Execute or route for human review

The boundary between RAG and MCP is the boundary between knowing and doing. Design it intentionally.

The SRE reliability implications

From a reliability engineering perspective, conflating RAG and MCP creates two distinct failure modes:

Failure mode 1: using RAG where MCP belongs
The agent makes decisions based on stale retrieved data about a live system. The information looked correct at retrieval time. By execution time, the system state has changed. The agent acts on a false picture of reality.

This is particularly dangerous in infrastructure automation, where a runbook that was accurate six months ago may describe a system that no longer exists in that form.

Failure mode 2: using MCP where RAG belongs
Every knowledge query goes through a live tool call. Latency climbs. Tool dependencies multiply. Each MCP call is a potential blast radius event. The agent becomes slow, brittle, and expensive to operate — not because it's doing more, but because it's routing the wrong workload through the wrong layer.

The SLO implications

If you've read my previous posts on agentic SLO design, this connects directly. Your SLOs need to be aware of which layer a failure occurred in:

A RAG retrieval failure (stale data, embedding drift, retrieval miss) has different blast radius than an MCP execution failure (wrong tool invoked, action taken on bad context).
Human Escalation Rate (HER) needs to be segmented by failure layer. Rising HER from RAG staleness looks different from rising HER from MCP tool errors — and the runbook responses are completely different.
Decision lineage traces should capture which documents were retrieved via RAG and which tool calls were made via MCP, so post-mortems can identify which layer caused a bad decision.

The decision framework

Before your team debates RAG vs MCP, answer these questions:

Is the agent retrieving knowledge or taking action? Knowledge → RAG. Action → MCP.
How fast does the information change? Stable → RAG. Live → MCP.
What is the blast radius if this goes wrong? High blast radius operations belong behind MCP validation gates regardless of how the context was retrieved.
Do you need an audit trail? MCP gives you decision lineage natively. RAG retrieval should be logged separately and linked to the agent's reasoning trace.
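The first three questions reduce to a small routing rule. The sketch below encodes them directly; the names `Layer`, `choose_layer`, and `needs_validation_gate` are hypothetical, and the tier labels reuse the LOW / HIGH / CRITICAL tiers from the architecture diagram.

```python
from enum import Enum


class Layer(Enum):
    RAG = "rag"
    MCP = "mcp"


def choose_layer(takes_action: bool, data_is_live: bool) -> Layer:
    """Stable knowledge goes to RAG; actions or live state go to MCP."""
    return Layer.MCP if (takes_action or data_is_live) else Layer.RAG


def needs_validation_gate(blast_radius_tier: str) -> bool:
    """High blast radius operations sit behind a gate regardless of layer."""
    return blast_radius_tier in {"HIGH", "CRITICAL"}
```

A runbook lookup resolves to `Layer.RAG`, a live health check or a restart resolves to `Layer.MCP`, and a CRITICAL-tier restart additionally requires the gate.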

Closing thought

The RAG vs MCP debate is a sign that the field is still building its shared vocabulary for agentic AI architecture. That's fine — this is early. But the teams shipping production agents today can't wait for consensus.

Design the boundary between knowing and doing intentionally. SLO it separately. Trace both layers in your observability stack.

The question isn't which one to use. It's whether you've thought carefully about where each one belongs.

This post is part of an ongoing series on AI-SRE: applying production reliability engineering principles to agentic AI systems.

SLO design for agentic AI systems — beyond uptime metrics
MCP decision-lineage observability in production
Human Escalation Rate (HER) as a reliability signal for agentic systems

https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-rag-share-7454971617409150976--nbK?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

A2A + MCP in Production: The SRE Reliability Framework Nobody Has Written Yet

Ajay Devineni — Thu, 23 Apr 2026 18:51:21 +0000

Google's A2A Protocol turned one year old on April 9, 2026. Over 150 organizations are running it in production. It is live inside Amazon Bedrock AgentCore and Azure AI Foundry. IBM's competing Agent Communication Protocol merged into A2A rather than fighting it. The Linux Foundation now governs the spec.

The protocol is production-grade. The reliability engineering discipline for it is not.

I have spent the past year building SRE frameworks for single-agent + MCP deployments in regulated financial services environments. When A2A entered the picture, I realized the failure surface I had been managing had changed completely. This article documents the new failure modes A2A introduces and the SRE patterns I believe are required to manage them.

The Two-Layer Stack and Why It Changes Everything

MCP and A2A solve different problems at different layers of the agent stack. This is well understood by now. What is not yet well understood is what the two-layer combination means for reliability engineering.

MCP (Model Context Protocol): the vertical layer. An agent connects to tools and data sources. The failure modes are familiar to any distributed systems engineer: tool unavailability, degraded response quality, latency spikes, authentication failures. The blast radius is bounded. One agent, one tool layer, one error budget.

A2A (Agent-to-Agent Protocol): the horizontal layer. Agents communicate with other agents across organizational and platform boundaries. An orchestrator agent delegates subtasks to specialist agents via JSON-RPC over HTTP. Those specialist agents may be built by different teams, running on different vendors, governed by different SLOs.

The reliability engineering challenge A2A creates is not technical — the protocol itself is well-designed. It is organizational and observational. When an orchestrator agent delegates to a sub-agent via A2A, and that sub-agent fails silently, who carries the error budget? How do you instrument the boundary? What does safe degradation look like when an entire reasoning capability disappears because a downstream agent is unavailable?

These questions have no consensus answers yet. This article is my attempt to start building them.

The A2A Failure Mode Taxonomy

After studying multi-agent failure patterns across production deployments, I categorize A2A-specific failures into four classes. The first two are detectable with existing tooling. The last two are not.

Class 1: Sub-Agent Unavailability

The downstream agent returns a 503 or connection timeout. This is the easiest failure to handle — it looks like a standard HTTP failure and can be caught by existing circuit breaker patterns. Your orchestrator agent should treat sub-agent unavailability exactly as it treats MCP tool unavailability: fall back to a degraded capability or route to a human escalation path.

Instrumentation: standard HTTP error rate monitoring at the A2A client layer.

Class 2: Sub-Agent Latency Degradation

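The downstream agent responds, but slowly. In a multi-agent chain (Agent A → Agent B → Agent C), latency accumulates along the call path. A 2-second degradation at Agent C adds at least 2 seconds to Agent A's response time, and more if intermediate agents retry or degrade as well: 2 seconds of added latency at each of the three agents becomes 6 seconds at the orchestrator. Users experience this as the orchestrator being slow, but the problem is buried further down the chain.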

Instrumentation: distributed tracing across A2A boundaries. Each A2A task invocation should carry a trace ID propagated from the orchestrator. Without this, your latency SLI for the orchestrator tells you nothing useful about where the latency is originating.

Class 3: Silent Task Result Corruption — ⚠️ Not detectable with standard tooling

The downstream agent returns HTTP 200 with a syntactically valid A2A task result, but the result is semantically wrong — incomplete reasoning, missing context fields, hallucinated data treated as factual output. The orchestrator agent receives this as a successful response and incorporates it into its own output.

Your error rate SLI stays at zero. Your latency SLI stays normal. Your user receives incorrect output from a system that reported 100% success.

This is the failure mode that existing observability stacks cannot detect. It requires what I call an A2A Semantic Boundary Validator — a lightweight evaluation function that runs at the A2A client layer on every incoming task result, checking the result against expected behavioral bounds for that sub-agent's task class.

The implementation pattern mirrors my Decision Quality Rate (DQR) SLI for single-agent systems: maintain a behavioral baseline per sub-agent per task class, and flag results that fall outside expected bounds as potential corruptions before they propagate upstream.

Class 4: Cascading Autonomy Amplification — ⚠️ The most dangerous failure mode

Agent A delegates to Agent B. Agent B, uncertain about the task, makes additional autonomous decisions to resolve the ambiguity — invoking more MCP tools than its baseline, delegating further to Agent C, modifying its task interpretation. Agent C does the same.

By the time a result returns to Agent A, the original task intent has been substantially transformed by a chain of autonomous interpretations — none of which were visible to the orchestrator, none of which crossed any error threshold, and none of which can be reconstructed without end-to-end decision lineage capture.

This failure mode is unique to multi-agent systems. Single-agent + MCP deployments cannot produce it. It requires agents talking to agents, each adding their own layer of autonomous interpretation to a task that was never explicitly respecified.

The SRE Framework for A2A: Five Additions to Your Existing Stack

If you have followed my previous work on SLOs for agentic AI, you already have Decision Quality Rate, Tool Invocation Efficiency, and Human Escalation Rate instrumented for your single-agent deployments. A2A requires five additional capabilities on top of that foundation.

1. A2A Boundary Tracing

Every A2A task delegation must carry a distributed trace ID originating from the orchestrator. This is not optional — without it, you cannot attribute latency, errors, or behavioral drift to the correct agent in a multi-agent chain.

Implementation: Propagate an x-trace-id header on every A2A HTTP request. Store the full delegation tree (which agent delegated to which, with what task parameters, at what timestamp) in your centralized trace store. On AWS, I use X-Ray for the distributed trace and a DynamoDB table for the delegation tree. X-Ray captures the HTTP-level trace; DynamoDB captures the semantic-level task delegation structure.
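A minimal sketch of the propagation step follows. It uses an in-memory list where a real deployment would use a trace store or table, and makes no AWS calls. The helper names are hypothetical, and the header lookup assumes keys have already been lowercased, since HTTP header names are case-insensitive but dict keys are not.

```python
import uuid
from datetime import datetime, timezone

TRACE_HEADER = "x-trace-id"


def delegation_headers(incoming_headers: dict) -> dict:
    """Reuse the caller's trace ID, or mint one if this agent is the orchestrator."""
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return {TRACE_HEADER: trace_id}


def record_delegation(store: list, trace_id: str, parent: str,
                      child: str, task_params: dict) -> None:
    """Append one edge of the delegation tree; `store` stands in for a real table."""
    store.append({
        "trace_id": trace_id,
        "parent_agent": parent,
        "child_agent": child,
        "task_params": task_params,
        "delegated_at": datetime.now(timezone.utc).isoformat(),
    })
```

Each agent calls `delegation_headers` with the headers it received and attaches the result to its outgoing A2A request, so every edge recorded for one user request shares a single trace ID.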

2. Per-Sub-Agent SLO Ownership

Every A2A sub-agent your orchestrator calls must have a designated SLO owner — a named human or team who is paged when that sub-agent's reliability degrades. In practice, this means:

For internal sub-agents: assign SLO ownership the same way you assign ownership to microservices
For external/third-party sub-agents: define a sub-agent reliability budget. If a third-party A2A agent degrades, your orchestrator should treat it as a dependency failure and activate your degraded-mode runbook — not wait for the vendor to page you

The org chart question — who owns the SLO when agents from different vendors collaborate via A2A — is the most important unresolved governance question in multi-agent reliability today.

3. A2A Semantic Boundary Validation

For each sub-agent your orchestrator calls, define the expected output schema and behavioral bounds. Implement a validator function that runs on every incoming A2A task result before the orchestrator acts on it.

Minimum validation layer:

Schema validation: does the result match the expected A2A task result structure?
Completeness check: are required fields populated?
Behavioral bound check: does the result fall within the baseline distribution for this sub-agent's task class?

Results that fail validation should not be silently dropped — they should trigger your escalation path and log the full task context for postmortem analysis.
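The three checks can be sketched as one function that returns a list of failure reasons. This is an illustration under assumptions: the result is treated as a plain dict rather than the actual A2A task result schema, and the behavioral bound is approximated as a z-score on one numeric field against a stored baseline. The function and parameter names are hypothetical.

```python
def validate_task_result(result, required_fields, metric_key,
                         baseline_mean, baseline_std, max_z=3.0):
    """Return a list of validation failures; an empty list means the result passed."""
    # 1. Schema check (simplified): the result must be a JSON object.
    if not isinstance(result, dict):
        return ["schema: result is not an object"]

    failures = []

    # 2. Completeness check: required fields must be present and non-empty.
    for field in required_fields:
        if result.get(field) in (None, "", [], {}):
            failures.append(f"completeness: missing or empty '{field}'")

    # 3. Behavioral bound check: flag values far outside the baseline distribution.
    value = result.get(metric_key)
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if is_number and baseline_std > 0:
        z = abs(value - baseline_mean) / baseline_std
        if z > max_z:
            failures.append(f"behavioral: '{metric_key}' is {z:.1f} std devs from baseline")

    return failures
```

A non-empty return value should route the task to the escalation path with the full task context logged, rather than being dropped.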

4. The Agent Chain Circuit Breaker

In traditional microservices, a circuit breaker opens when downstream failure rate exceeds a threshold, preventing cascade failures. Multi-agent systems need an equivalent pattern, adapted for the non-deterministic nature of agent communication.

My implementation: an agent chain circuit breaker that tracks the running success rate of each A2A sub-agent invocation over a 15-minute rolling window. When the validated success rate drops below 85% (accounting for semantic validation failures, not just HTTP errors), the circuit opens and the orchestrator routes that task class to a degraded-mode handler — typically a simplified version of the task that can be completed with MCP tools the orchestrator controls directly, or an immediate human escalation.
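The rolling-window logic can be sketched in a few lines. This is not the production implementation described above: the class name and the `min_samples` guard are assumptions, timestamps are passed in explicitly to keep the example deterministic, and there is no half-open recovery state.

```python
from collections import deque


class AgentChainCircuitBreaker:
    """Opens when the validated success rate over a rolling window is too low."""

    def __init__(self, window_seconds=900.0, min_success_rate=0.85, min_samples=20):
        self.window_seconds = window_seconds      # 15-minute rolling window
        self.min_success_rate = min_success_rate  # 85% validated success rate
        self.min_samples = min_samples            # assumption: avoid tripping on tiny samples
        self._events = deque()                    # (timestamp, validated_ok)

    def _evict(self, now: float) -> None:
        # Drop invocations that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def record(self, validated_ok: bool, now: float) -> None:
        """Record one sub-agent invocation; validated_ok includes semantic validation."""
        self._events.append((now, validated_ok))
        self._evict(now)

    def is_open(self, now: float) -> bool:
        self._evict(now)
        if len(self._events) < self.min_samples:
            return False
        ok = sum(1 for _, good in self._events if good)
        return ok / len(self._events) < self.min_success_rate
```

The orchestrator would call `record` after validating each task result and check `is_open` before delegating, routing to the degraded-mode handler when it returns True.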

5. End-to-End Decision Lineage for Multi-Agent Chains

In single-agent systems, decision lineage is the record of what tools an agent invoked and what reasoning it applied. In A2A multi-agent systems, decision lineage must span the entire delegation chain — capturing not just what the orchestrator decided, but what each sub-agent decided on its behalf.

This is the audit trail that satisfies SOC 2 Type II requirements for autonomous decision-making in regulated environments. Without it, you cannot demonstrate to auditors that you have oversight of decisions made by agents you deployed but didn't directly control.

Implementation: each A2A task result must include a decision_lineage field containing the sub-agent's tool invocations, reasoning path, and confidence metadata. The orchestrator appends this to its own decision lineage before logging the full chain to the immutable audit store.
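The append step might look like the sketch below. It assumes the task result is a dict carrying a `decision_lineage` field as described above. The helper name and the keys inside each lineage entry are hypothetical, and writing to the immutable audit store is left out.

```python
def append_sub_agent_lineage(orchestrator_lineage: list, sub_agent_id: str,
                             task_result: dict) -> list:
    """Return a new lineage list with the sub-agent's lineage appended.

    Raises ValueError when the sub-agent returned no lineage, so the caller
    can escalate instead of acting on an unauditable result."""
    lineage = task_result.get("decision_lineage")
    if not lineage:
        raise ValueError(f"{sub_agent_id} returned no decision_lineage")
    entry = {"agent_id": sub_agent_id, "decision_lineage": lineage}
    return orchestrator_lineage + [entry]
```

Returning a new list instead of mutating the input keeps earlier lineage snapshots intact, which matters if the same chain is logged at more than one point.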

The Organizational Question A2A Forces

Every SRE framework I've described above requires answers to an organizational question the industry hasn't resolved:

When an orchestrator agent delegates to a third-party sub-agent via A2A, and the sub-agent produces output that causes downstream harm — who is operationally responsible?

This is not a legal question (yet). It is an operational ownership question that every multi-agent team will face in 2026.

My position: the orchestrator owner carries responsibility for validating and acting on sub-agent output. The A2A protocol handles communication. It does not handle accountability. An orchestrator that blindly trusts A2A task results without semantic validation is the operational equivalent of a service that calls its dependencies with no error handling: it should not exist in any production-grade form.

Build the semantic boundary validation. Own the chain.

Where to Start

If you are moving from single-agent + MCP to multi-agent + A2A, I recommend this progression:

Week 1: Implement A2A boundary tracing with distributed trace ID propagation. You cannot debug what you cannot trace.

Week 2: Assign explicit SLO ownership to every A2A sub-agent your orchestrator calls. Even a spreadsheet with named owners is better than none.

Week 3-4: Build the semantic boundary validator for your highest-volume A2A task class. Start with schema and completeness validation before attempting behavioral bound checks.

Month 2: Instrument the agent chain circuit breaker. Set your initial threshold conservatively (85% validated success rate) and adjust based on 30 days of baseline data.

Month 3+: Build end-to-end decision lineage capture. This is the hardest piece and the most important for regulated environments.

Connecting the Arc

This article is part of a series on applying SRE discipline to agentic AI in production:

Why SRE Principles Are the Missing Layer in MCP Security
SLOs for Agentic AI: The Reliability Framework Production Teams Are Missing (published this week)
A2A + MCP in Production: The SRE Reliability Framework Nobody Has Written Yet (this article)

I shared the core argument on LinkedIn: https://www.linkedin.com/posts/ajay-devineni_agenticai-a2a-mcp-share-7453145380822605824-pMta?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

The SRE community spent a decade learning to run distributed microservices reliably. We're at day one for multi-agent systems with A2A. The failure modes are different. The organizational questions are harder. The instrumentation doesn't exist yet.

Build it now — before your agent chains are running at a scale where these gaps become production incidents.

SLO Design for Agentic AI Systems — Why Traditional Reliability Metrics Break (and What to Use Instead)

Ajay Devineni — Tue, 21 Apr 2026 18:05:55 +0000

The problem with applying traditional SLOs to AI agents

SLOs work beautifully when "good" is observable.

An API either returns 200 or it doesn't. Latency is measurable. Availability is binary. You instrument, you baseline, you commit to a number, and you burn down an error budget when reality diverges.

AI agents break every one of these assumptions.

After a quarter of running agentic systems against production infrastructure, here are the three failure modes I keep running into when teams apply traditional SLO thinking to agents.

Failure mode 1: Correctness is not observable at the response layer

A REST service fails loudly. A 500, a timeout, a malformed payload — your existing observability catches it.

An agent can produce a response that:

Parses correctly
Passes schema validation
Triggers no alerts

...and still be wrong in a way that compounds silently for hours.

Traditional error rate SLOs have zero visibility into this. Your dashboards stay green. The blast radius is growing.

What to do instead: Add a behavioral correctness signal. For every agent decision class, define a human-reviewable sample rate and track the delta between agent judgment and human override. That delta is data. It belongs in your SLO.

Failure mode 2: Latency SLOs punish safe agent behavior

A p99 latency SLO makes perfect sense for a stateless service.

It is actively dangerous for an agent.

Agents that pause to verify context, escalate ambiguous decisions to a human, or refuse to act on a poisoned tool output are doing exactly what you want them to do. A latency SLO penalizes them for it.

If you optimize against a latency target, you are implicitly optimizing for speed over safety. In agentic systems, that's how you get silent degradation and runbook violations at 2am.

What to do instead: Track decision latency distribution separately from response latency. Escalation paths should be excluded from latency SLO calculations or governed by a separate, explicitly higher target.
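One way to separate the two populations is to compute latency percentiles per path. The sketch below assumes each decision is recorded as a `(latency_ms, escalated)` pair and uses a nearest-rank percentile. The function names are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_slis(decisions):
    """Split p99 latency by path so escalations do not count against the fast path."""
    decisions = list(decisions)
    autonomous = [ms for ms, escalated in decisions if not escalated]
    escalated = [ms for ms, was_escalated in decisions if was_escalated]
    return {
        "autonomous_p99_ms": percentile(autonomous, 99) if autonomous else None,
        "escalated_p99_ms": percentile(escalated, 99) if escalated else None,
    }
```

With 99 autonomous decisions at 100 ms and one escalation at 45 seconds, the autonomous p99 stays at 100 ms and the escalation is reported against its own, higher target.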

Failure mode 3: You cannot commit to a number you haven't earned

This one keeps coming up in conversations with other SRE leads.

Teams instrument an agent, run it for a week, and immediately try to commit to a 99.5% reliability target. Then they burn their error budget in the first real incident because the baseline was built on demo traffic.

Rule I enforce on my team: Minimum 30-day behavioral baseline before any agentic SLO is ratified. No exceptions. The baseline must cover:

Tool failure scenarios
Context window edge cases
At least one simulated prompt drift event
Real production traffic, not synthetic load

You cannot reliability-engineer what you have not yet measured.

What an agentic SLO actually looks like

After iterating on this for a quarter, I'm building agentic SLOs around three signal types that traditional SLOs don't capture:

Signal 1: Human Escalation Rate (HER)

HER = (decisions requiring human override) / (total agent decisions) × 100

This is your canary metric. Rising HER is often the first observable signal of:

Model drift
Context degradation
Prompt decay
Tool output poisoning

Set a threshold. Wire it to your on-call rotation. Page on it.

My current target: HER ≤ 8% over any 24-hour rolling window
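A rolling-window version of the HER formula can be sketched as follows. It assumes decisions are stored as `(timestamp_seconds, human_override)` pairs. The function names are hypothetical, and the 8% target is the one stated above.

```python
WINDOW_SECONDS = 24 * 60 * 60  # 24-hour rolling window
HER_TARGET_PCT = 8.0


def rolling_her(events, now):
    """HER as a percentage over the last 24 hours; 0.0 if there were no decisions."""
    recent = [override for ts, override in events
              if now - WINDOW_SECONDS < ts <= now]
    if not recent:
        return 0.0
    return 100 * sum(recent) / len(recent)


def her_breached(events, now):
    """True when rolling HER exceeds the target and on-call should be paged."""
    return rolling_her(events, now) > HER_TARGET_PCT
```

Overrides older than 24 hours fall out of the window automatically, so a bad day last week does not keep paging today.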

Signal 2: Decision confidence distribution

Don't track a single average confidence score. Track the distribution.

When an agent is operating normally, confidence tends to be bimodal — high confidence on routine decisions, lower on edge cases. When the distribution collapses from bimodal to flat, something has shifted in the agent's environment.

That shift may not produce errors yet. But it will.

My current target: Decision confidence p10 ≥ 0.65
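The p10 check can be computed with the standard library. This sketch assumes confidence scores are floats in [0, 1] collected over the evaluation window. The function names are hypothetical, and `statistics.quantiles` needs at least two samples on most Python versions.

```python
from statistics import quantiles

CONFIDENCE_P10_TARGET = 0.65


def confidence_p10(confidences):
    """10th percentile of decision confidence."""
    # quantiles(..., n=10) returns the nine decile cut points; index 0 is p10.
    return quantiles(confidences, n=10)[0]


def p10_breached(confidences):
    """True when the low tail of the confidence distribution has sagged."""
    return confidence_p10(confidences) < CONFIDENCE_P10_TARGET
```

Tracking p10 instead of the mean is what makes the low tail visible: a window where most decisions are still confident but the bottom tenth has dropped will breach here while the average barely moves.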

Signal 3: Blast radius exposure rate

BRER = (HIGH + CRITICAL tier changes per hour)

You can have a green error rate and a dangerous blast radius exposure rate at the same time.

This metric captures risk velocity — how fast your agent is accumulating unreversed high-impact changes. It belongs in your SLO alongside uptime.

My current target: CRITICAL tier changes ≤ 2/hour without explicit approval gate

The SLO I'm piloting

agent_slo:
  baseline_period: 30d
  signals:
    human_escalation_rate:
      threshold: "≤ 8%"
      window: "24h rolling"
      alert: page_on_call
    decision_confidence_p10:
      threshold: "≥ 0.65"
      window: "1h rolling"
      alert: warn
    critical_blast_radius_rate:
      threshold: "≤ 2/hour"
      gate: explicit_approval_required
  error_budget:
    calculated_from: [HER, confidence_p10, blast_radius_rate]
    not_from: [uptime, latency]
  review_cadence: weekly_baseline_review

The mindset shift

Traditional SLO: Is the system up?

Agentic SLO: Is the system trustworthy?

These are not the same question. Uptime is necessary but not sufficient. An agent can be 100% available and producing wrong decisions at scale.

The SRE community has the tooling, the culture, and the postmortem discipline to solve this. But we have to resist the temptation to copy-paste our existing SLO playbook onto a fundamentally different kind of system.

What's next

In the next post in this series, I'll walk through how I'm wiring these signals into OpenTelemetry alongside the decision-lineage layer from my previous MCP observability write-up — so a single trace can answer both "what happened" and "why the agent decided to do it."

If you're running agentic AI against production infrastructure and have built your own reliability signals, I'd genuinely like to hear what you're measuring. Drop it in the comments.

This post is part of an ongoing series on AI-SRE: applying production reliability engineering principles to agentic AI systems in regulated cloud-native environments.
LinkedIn discussion: https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-ugcPost-7452416001553567744-BPgq?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

Your AI Agent Doesn't Have a Feature Problem. It Has an On-Call Rotation Problem.

Ajay Devineni — Thu, 16 Apr 2026 16:06:25 +0000

Applying SRE principles to AI agents in production — ownership, observability, SLOs, runbooks, and the kill switch pattern.
I've spent a year closely studying how AI agents fail in the wild — across incidents, postmortems, and real operational patterns — and what I keep noticing is a gap nobody talks about. Teams celebrate capability. Nobody builds operational readiness.
Here's what that gap costs, and how to close it.

The Gap: AI Agents Are Treated Like Features, Not Services
In traditional SRE, every production service has:

✅ A named owner who carries the pager
✅ A defined SLO
✅ An on-call rotation
✅ A runbook
✅ A postmortem process

Most AI agents have a demo video and a Slack channel.
This is a category error. An agent is not a feature. It is an autonomous decision-making service operating at the speed of your automation. When it fails, it doesn't fail quietly like a broken button. It fails at the rate of your automation — and often with external side effects: emails sent, APIs called, records written.

The Failure Nobody Talks About
The failure everyone prepares for is the hard failure: an exception thrown, a timeout, a 500 error. These are easy to catch. CloudWatch alarm, SNS notification, done.
The failure nobody prepares for is the silent degradation.

The agent completes tasks. Dashboards stay green. But for the last 6 hours, its reasoning has been subtly wrong — selecting the wrong tools, misinterpreting scope, producing outputs that look correct and aren't.

This is the worst case. Not failure. Plausible, undetected, incorrect action at scale.
Traditional observability doesn't catch this. You need a new layer.

Introducing HER: Human Escalation Rate
The most useful signal I've seen for agent health is one most teams don't track:
HER = (decisions requiring human override / total decisions) × 100
HER is to AI agents what error rate is to APIs. It tells you whether the agent's judgment is holding up.
Here's a simple implementation:
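# metrics, THRESHOLD, and alert_oncall_owner are placeholders for your own
# metrics client, alert threshold (in percent), and paging hook.
def publish_her_metric(agent_id: str, human_overrides: int, total_decisions: int) -> float:
    # Guard against division by zero before any decisions are recorded
    her = (human_overrides / total_decisions) * 100 if total_decisions > 0 else 0.0

    # Push to your metrics store
    metrics.gauge(
        "agent.human_escalation_rate",
        her,
        tags=[f"agent_id:{agent_id}"],
    )

    # Alert if above threshold
    if her > THRESHOLD:
        alert_oncall_owner(agent_id, her)

    return her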

When HER exceeds your threshold, a named human gets paged. Not a team. Not a Slack channel. A person.

Three Requirements Before Any Agent Goes to Production
Based on everything I've observed and learned, here's what I consider non-negotiable.

1. A Named Human Owner Who Gets Paged

The ownership model matters more than the tooling. Every agent must have a named individual who is accountable when HER exceeds threshold. Shared ownership is no ownership. "The AI team owns it" means nobody owns it. Write it down:

agent:
  name: document-processor-v2
  owner: ajay.devineni@company.com
  pager: +1-xxx-xxx-xxxx
  slack_handle: "@ajay"
  escalation_policy: p1-sre-rotation

2. A Runbook That Covers At Least Four Failure Modes

Before any agent ships, a runbook must exist. Minimum coverage:

| Failure mode | What to look for | Immediate action |
| Tool failure | Tool error rate spikes | Check dependency health, assess in-flight tasks |
| Context degradation | Output length increases, HER spikes | Inspect conversation history, rollback prompt |
| Prompt drift | Behavioral baseline deviation | Freeze deploys, compare prompt versions |
| Blast radius event | Agent operating outside defined scope | Invoke kill switch, audit side effects |

A runbook doesn't need to be 20 pages. It needs to be right and reachable at 2am.

3. A 30-Day Behavioral Baseline Before Any SLO Is Set

This is the one most teams skip because it feels slow. You cannot commit to reliability you have not measured. Run your agent in shadow mode for 30 days: processing real inputs, generating real outputs, but reviewed before action. During that window, measure everything:

Task completion rate
Human escalation rate (baseline HER)
Tool call accuracy
Decision latency (p50/p95/p99)
Context window utilization
Output quality score variance across identical inputs

Only after 30 days do you write an SLO. The baseline IS the SLO foundation.
```yaml
# Example SLO written after baseline
agent_slo:
  valid_from: "after-30d-baseline"
  objectives:
    - metric: task_completion_rate
      target: 99.2%
      baseline_observed: 99.6%  # headroom built in intentionally

    - metric: human_escalation_rate
      target: "< 3%"
      baseline_observed: 1.8%
      alert_threshold: 5%
```
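
To make the headroom in that example concrete, here is a rough sketch of how the numbers might be evaluated. The function names and the three-state result (`ok`, `slo_breach`, `page_owner`) are hypothetical; rates are expressed as fractions, so 99.2% is `0.992`.

```python
def budget_consumed(target: float, observed: float) -> float:
    """Fraction of the error budget used: observed failures / allowed failures."""
    allowed_failure = 1.0 - target
    if allowed_failure <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return (1.0 - observed) / allowed_failure


def evaluate_her(observed: float, target: float, alert_threshold: float) -> str:
    """Classify an observed human escalation rate against the SLO."""
    if observed > alert_threshold:
        return "page_owner"   # past the alert threshold: page the named owner
    if observed >= target:
        return "slo_breach"   # target is "< 3%", so 3% or more is a breach
    return "ok"
```

With the example values, a 99.2% target allows 0.8% failures and the 99.6% baseline produces 0.4%, so `budget_consumed(0.992, 0.996)` comes out at about half the budget. That is the headroom the SLO builds in.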

The Kill Switch Pattern
Every production agent needs a kill switch — a mechanism to halt execution immediately, without a code deployment.
```python
class AgentHaltException(Exception):
    """Raised when the kill switch stops an agent mid-run."""


# config_store, log_halt, and execute are placeholders for your own
# config client, audit logger, and task executor.
def check_kill_switch(agent_id: str) -> bool:
    """
    Checks a config store for kill switch status.
    Works with SSM Parameter Store, LaunchDarkly,
    or any feature flag system.
    """
    status = config_store.get(f"agents/{agent_id}/kill-switch")
    return status == "ACTIVE"


def agent_task_loop(agent_id: str, tasks: list):
    for task in tasks:
        # Check before EVERY decision, not just at startup
        if check_kill_switch(agent_id):
            log_halt(agent_id, task)
            raise AgentHaltException("Kill switch active")

        execute(task)
```

The kill switch should be:

Flippable without a deployment (config store, not code)
Checked before every decision, not just at startup
Audited — log every check and every activation
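
The three properties above can be combined in one small wrapper. This is a sketch under stated assumptions: `InMemoryConfigStore` is a hypothetical stand-in for a real config store, and `AuditedKillSwitch` with its in-memory `audit_log` list is an illustrative name, not an existing library.

```python
from datetime import datetime, timezone


class InMemoryConfigStore:
    """Stand-in for a real config store (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._values: dict[str, str] = {}

    def get(self, key: str, default: str = "INACTIVE") -> str:
        return self._values.get(key, default)

    def set(self, key: str, value: str) -> None:
        self._values[key] = value


class AuditedKillSwitch:
    """Kill switch that records every check and every activation."""

    def __init__(self, store: InMemoryConfigStore) -> None:
        self._store = store
        self.audit_log: list[dict] = []

    @staticmethod
    def _key(agent_id: str) -> str:
        return f"agents/{agent_id}/kill-switch"

    def _record(self, event: str, agent_id: str, active: bool, actor: str) -> None:
        self.audit_log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": event,
            "agent_id": agent_id,
            "active": active,
            "actor": actor,
        })

    def is_active(self, agent_id: str) -> bool:
        """Read the flag from the store on every call; nothing is cached."""
        active = self._store.get(self._key(agent_id)) == "ACTIVE"
        self._record("check", agent_id, active, actor="agent-runtime")
        return active

    def activate(self, agent_id: str, actor: str) -> None:
        """Flip the flag in the store; no code deployment involved."""
        self._store.set(self._key(agent_id), "ACTIVE")
        self._record("activation", agent_id, True, actor=actor)
```

Because `is_active` reads the store on every call, a flag flipped by an operator takes effect at the agent's next decision. In a real system the audit entries would go to durable storage rather than a Python list, and you would need to decide how the check behaves when the config store itself is unreachable.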

What the Observability Stack Actually Looks Like
```text
Agent Runtime
│
├──▶ Structured logs (JSON, one entry per decision)
│      └── Fields: session_id, tool_calls, human_override, confidence, latency
│
├──▶ Custom metrics
│      └── HER, tool error rate, context utilization, decision latency
│
├──▶ Distributed traces
│      └── End-to-end: input → LLM → tool calls → output
│
├──▶ Event stream (one event per agent decision)
│      └── Powers alerting rules and downstream audit
│
└──▶ Decision audit log (immutable)
       └── S3 / blob store, retained for postmortem analysis
```
Every agent decision should emit a structured log entry:
```json
{
  "timestamp": "2025-01-15T14:23:01Z",
  "agent_id": "doc-processor-v2",
  "session_id": "sess_abc123",
  "tools_called": ["search", "summarize"],
  "tool_success": [true, true],
  "human_override": false,
  "context_utilization_pct": 47.1,
  "latency_ms": 3420,
  "task_completed": true
}
```
This is your audit trail. This is what you bring to a postmortem.
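
One way to produce an entry in that shape is a small builder that serializes each decision to a single JSON line. This is a minimal sketch: `build_decision_entry` and its parameters are hypothetical names, and only the field names are taken from the example entry.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_decision_entry(
    agent_id: str,
    session_id: str,
    tool_results: list[tuple[str, bool]],   # (tool name, succeeded) per call, in order
    human_override: bool,
    context_utilization_pct: float,
    latency_ms: int,
    task_completed: bool,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one agent decision as a single-line JSON log entry."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    entry = {
        "timestamp": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "agent_id": agent_id,
        "session_id": session_id,
        "tools_called": [name for name, _ in tool_results],
        "tool_success": [ok for _, ok in tool_results],
        "human_override": human_override,
        "context_utilization_pct": context_utilization_pct,
        "latency_ms": latency_ms,
        "task_completed": task_completed,
    }
    return json.dumps(entry)
```

Emitting one line per decision keeps the entries easy to ship to a log pipeline and easy to filter by `session_id` during a postmortem.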

The Postmortem Question Nobody Asks
After an incident with a traditional service, postmortems ask:

What failed?
Why did it fail?
How do we prevent recurrence?

For AI agents, there's a fourth question that almost nobody asks:
Was there a window where the agent was wrong, and we didn't know?
Silent degradation periods are invisible in traditional postmortems because the dashboards were green. Adding a behavioral baseline comparison to every postmortem template forces this question into the open.
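
A behavioral baseline comparison can be as simple as scanning the incident timeline for periods where a metric drifted from baseline while no alert was firing. The sketch below assumes HER is the metric being compared; the `Sample` schema, the `tolerance` parameter, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int          # minutes since the start of the review window
    her: float           # observed human escalation rate at that time
    alert_firing: bool   # whether any alert was active at that time


def silent_degradation_windows(
    samples: list[Sample], baseline_her: float, tolerance: float
) -> list[tuple[int, int]]:
    """Return (start_minute, end_minute) spans where HER was above
    baseline + tolerance and no alert was firing."""
    windows: list[tuple[int, int]] = []
    start = None
    last = None
    for s in sorted(samples, key=lambda s: s.minute):
        silent = s.her > baseline_her + tolerance and not s.alert_firing
        if silent:
            if start is None:
                start = s.minute
            last = s.minute
        elif start is not None:
            windows.append((start, last))
            start = None
    if start is not None:
        windows.append((start, last))
    return windows
```

Any non-empty result is a direct answer to the fourth question: the agent was behaving outside its baseline and nothing told you.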

Is Your Agent Production-Ready or Demo-Ready?
The SRE community spent 20 years learning how to operate distributed systems reliably. Those lessons — ownership, observability, SLOs, runbooks, postmortems — weren't invented in conference rooms. They were earned through outages.
AI agents are distributed systems with an additional dimension of unpredictability: they make decisions.
Before your next agent ships, run this checklist:

Named human owner with pager assigned
Runbook covering tool failure, context degradation, prompt drift, blast radius
HER metric instrumented and alerting
Kill switch implemented and tested
30-day shadow mode baseline completed
SLO written and derived from baseline data
Postmortem template updated to include behavioral baseline comparison

If any box is unchecked, your agent is demo-ready. Not production-ready.
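
If you want the checklist enforced rather than remembered, it can be encoded as a release gate. This is only a sketch: the `ReadinessChecklist` field names are hypothetical labels for the seven items above, and how each flag gets set is left to your own process.

```python
from dataclasses import dataclass, fields


@dataclass
class ReadinessChecklist:
    """One flag per checklist item; everything starts unchecked."""
    named_owner_with_pager: bool = False
    runbook_covers_four_failure_modes: bool = False
    her_instrumented_and_alerting: bool = False
    kill_switch_implemented_and_tested: bool = False
    shadow_baseline_30d_completed: bool = False
    slo_derived_from_baseline: bool = False
    postmortem_template_updated: bool = False


def unchecked_items(checklist: ReadinessChecklist) -> list[str]:
    """Names of the checklist items that are still unchecked."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def readiness(checklist: ReadinessChecklist) -> str:
    """A single unchecked item is enough to stay demo-ready."""
    return "demo-ready" if unchecked_items(checklist) else "production-ready"
```

A deploy pipeline could call `readiness` and refuse to promote an agent while `unchecked_items` returns anything.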
Author: Ajay Devineni