<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Siddharth Singh</title>
    <description>The latest articles on Forem by Siddharth Singh (@siddharth_singh_409bd5267).</description>
    <link>https://forem.com/siddharth_singh_409bd5267</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3836164%2Fed12b658-4232-401b-be5c-924bb828c22f.png</url>
      <title>Forem: Siddharth Singh</title>
      <link>https://forem.com/siddharth_singh_409bd5267</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/siddharth_singh_409bd5267"/>
    <language>en</language>
    <item>
      <title>Aurora Actions: User-Defined Background Automations for Incident Response</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Mon, 11 May 2026 17:49:20 +0000</pubDate>
      <link>https://forem.com/siddharth_singh_409bd5267/aurora-actions-user-defined-background-automations-for-incident-response-1591</link>
      <guid>https://forem.com/siddharth_singh_409bd5267/aurora-actions-user-defined-background-automations-for-incident-response-1591</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aurora Actions are reusable, natural-language automations&lt;/strong&gt; that Aurora's agent executes in the background using all 22+ connected integrations. Available today on the main branch of &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three trigger types out of the box&lt;/strong&gt;: manual ("run now"), on incident completion (chain follow-up work after every RCA), and recurring schedule (Celery Beat–driven intervals).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same agent, same tools, different prompt scaffolding.&lt;/strong&gt; Actions reuse Aurora's existing LangGraph agent and 30+ tools (kubectl, aws, gcloud, az, Terraform, Confluence, Slack, GitHub) — they just run as background chat sessions with eager-loaded skills and no RCA mandate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/action &amp;lt;name&amp;gt;&lt;/code&gt; is a first-class chat primitive.&lt;/strong&gt; Slash-command autocomplete in the chat input, "Run Action" dropdown on completed incidents, and full RBAC-gated CRUD UI in Settings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora Actions turn the agent into a programmable platform.&lt;/strong&gt; This is the building block for CI/CD auto-remediation, scheduled audits, and post-incident health checks — covered in &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;our CI/CD Auto-Remediation guide&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;We shipped one of the most-requested features in &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;'s history: &lt;strong&gt;Aurora Actions — user-defined background automations that run on Aurora's agent.&lt;/strong&gt; &lt;strong&gt;An Aurora Action is a named, natural-language instruction the user writes once and then triggers manually, on incident completion, or on a recurring schedule; Aurora's agent executes it as a background task with full access to every connected integration.&lt;/strong&gt; Where traditional incident management tools force you to pick from a fixed catalog of "automations" (close incident, post to Slack, run runbook), Actions are written in plain English and inherit the full reasoning capability of the agent.&lt;/p&gt;

&lt;p&gt;This post is for SRE and platform teams already running Aurora — or evaluating it — who want to understand what Actions actually do, where they fit on the agentic spectrum, and how to use them safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an Aurora Action?
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;Aurora Action&lt;/strong&gt; has four parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A name&lt;/strong&gt; — used as the slash-command handle (&lt;code&gt;/action &amp;lt;name&amp;gt;&lt;/code&gt;) and as the dropdown label on incident cards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A natural-language instruction&lt;/strong&gt; — the prompt the agent will execute. The same instruction the user would type into chat, except it can reference incident context placeholders when triggered post-incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A trigger type&lt;/strong&gt; — manual, on-incident-completion, or on-schedule (interval-based via &lt;a href="https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html" rel="noopener noreferrer"&gt;Celery Beat&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An on/off toggle&lt;/strong&gt; — actions can be disabled without deletion, with full RBAC for who can create, edit, or trigger them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The implementation is a thin layer over Aurora's existing chat agent. When an Action triggers, the executor service creates a background chat session with the action's instruction as the user message, runs it through the same LangGraph workflow that powers interactive chat, and persists the run history. The agent has full tool access (kubectl, cloud CLIs, Terraform, Slack, GitHub, Confluence, Memgraph, Weaviate) and eager-loaded skills — the only differences from interactive chat are scaffolded prompts and the absence of any RCA mandate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Most incident management automation today is &lt;strong&gt;workflow automation&lt;/strong&gt;: PagerDuty fires, Slack channel is created, status page is updated, runbook link is posted. The "automation" is a directed graph of static actions. There is no reasoning, no investigation, no judgment. Tools like Rootly, FireHydrant, and incident.io are excellent at this — but they don't &lt;em&gt;do&lt;/em&gt; anything an SRE wouldn't have to manually verify after the fact.&lt;/p&gt;

&lt;p&gt;Aurora's bet has always been the opposite: &lt;strong&gt;automate the investigation itself.&lt;/strong&gt; Aurora Actions extend that bet from one-shot incident investigations to recurring or post-incident workflows. A few concrete examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Noisy alert tuning&lt;/strong&gt; — "Every Friday at 5pm, review which Datadog alerts fired more than 20 times this week with mean time-to-acknowledge over 10 minutes. Open a Terraform PR to widen the thresholds or move them to a warning channel."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-incident health check&lt;/strong&gt; — "After every completed RCA, run a 15-minute observation on the affected service: check error rate, p99 latency, and pod restart count. Post results to #incident-followup."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled infrastructure audit&lt;/strong&gt; — "Every Monday at 9am, audit IAM roles in the production AWS account that have not been used in 90 days. List candidates for removal in a Confluence page."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are runbook automation. Each requires the agent to query infrastructure, reason about results, and produce a structured output. Each one was previously the job of an on-call engineer doing follow-up between pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Actions sit on the agentic capability spectrum
&lt;/h2&gt;

&lt;p&gt;In our &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE comparison&lt;/a&gt;, we proposed a four-level spectrum for AI SRE capability. Actions don't change the level — they change &lt;em&gt;when the agent runs.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;When the agent runs&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Pre-Actions example&lt;/th&gt;
&lt;th&gt;With Actions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On alert&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Webhook from PagerDuty / Datadog / Grafana&lt;/td&gt;
&lt;td&gt;Aurora investigates the alert and produces an RCA&lt;/td&gt;
&lt;td&gt;Same — investigation flow is unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On user request&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Engineer asks a question in chat&lt;/td&gt;
&lt;td&gt;Aurora answers using tools&lt;/td&gt;
&lt;td&gt;Same — plus &lt;code&gt;/action &amp;lt;name&amp;gt;&lt;/code&gt; shortcuts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;After every incident&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Incident state transitions to "resolved"&lt;/td&gt;
&lt;td&gt;Postmortem generated; engineer manually does follow-up checks&lt;/td&gt;
&lt;td&gt;Action runs automatically with incident context in scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On a schedule&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Celery Beat cron&lt;/td&gt;
&lt;td&gt;No equivalent — required external scheduler + custom code&lt;/td&gt;
&lt;td&gt;Single source of truth: agent runs the prompt on cadence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The post-incident and scheduled triggers are the genuinely new capability. Before Actions, anything recurring or post-incident required gluing Aurora to an external scheduler, an external prompt store, and bespoke trigger code. Actions collapse all three into the product surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Actions work under the hood
&lt;/h2&gt;

&lt;p&gt;This is for the technically curious. A few architecturally interesting things from the implementation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Background chat sessions, not a separate runtime.&lt;/strong&gt; When an Action triggers, the executor service creates a regular chat session with the action's instruction as the seed message and dispatches it as a background Celery task. The agent doesn't know it's running an Action — it just runs the workflow. This means every capability the interactive agent has (tool calls, RAG, graph traversal, sub-agent orchestration) is available inside Actions for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Eager-loaded skills, no RCA mandate.&lt;/strong&gt; Interactive chat lazy-loads skills based on the user message. Background actions eager-load all skills because there is no human to clarify ambiguity. The system prompt also strips the "your job is to find root cause" framing — Actions can do anything the agent can do, not just investigate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. RLS context is preserved.&lt;/strong&gt; Aurora uses &lt;a href="https://www.postgresql.org/docs/current/ddl-rowsecurity.html" rel="noopener noreferrer"&gt;PostgreSQL row-level security&lt;/a&gt; for multi-tenancy. The executor explicitly sets RLS context (&lt;code&gt;org_id&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;) before running so background tasks see only their own org's data — even though they run under a service identity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Stale run cleanup is integrated.&lt;/strong&gt; Aurora's existing background-chat janitor already handles orphaned chat sessions from crashed pods. Action runs go through the same path, so a worker pod dying mid-action doesn't leave the run state inconsistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. RBAC is enforced at the route layer.&lt;/strong&gt; Action CRUD is gated by Aurora's Casbin-based RBAC. Org admins can restrict which roles can create or trigger actions — important because an Action with cloud-CLI access has real blast radius.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trigger types in detail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Manual triggers
&lt;/h3&gt;

&lt;p&gt;The simplest case. An admin creates the action, an engineer triggers it from the Actions page or via &lt;code&gt;/action &amp;lt;name&amp;gt;&lt;/code&gt; in chat. Useful for codifying common operational tasks ("rotate ECS task definitions for service X", "scan Confluence for stale runbooks") into named, repeatable commands.&lt;/p&gt;

&lt;p&gt;The chat integration is worth calling out: &lt;code&gt;/action&lt;/code&gt; is implemented as an LLM tool call using the same pattern as Aurora's &lt;code&gt;/rca&lt;/code&gt; slash command. The agent processes the action dispatch and then continues responding to the rest of the user's message — so you can write "kick off the IAM audit and tell me what changed since last week" and the agent will dispatch the audit action &lt;em&gt;and&lt;/em&gt; answer your question in the same turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  On-incident-completion triggers
&lt;/h3&gt;

&lt;p&gt;When an incident transitions to "resolved", any action with this trigger type runs against the incident context. The incident's metadata, RCA, and timeline are available to the action's agent without the user having to paste anything in. This is the trigger that turns Aurora from a reactive tool ("investigate this page") into a continuous one ("investigate, then run health checks, then file the postmortem").&lt;/p&gt;

&lt;h3&gt;
  
  
  Scheduled triggers
&lt;/h3&gt;

&lt;p&gt;Interval-based, driven by &lt;a href="https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html" rel="noopener noreferrer"&gt;Celery Beat&lt;/a&gt;. Choose a cadence (every N minutes / hours / days), and the action runs without user involvement. This is the building block for the CI/CD auto-remediation and scheduled audit use cases — and it's why we're calling this post and the &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;CI/CD Auto-Remediation guide&lt;/a&gt; sister posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actions don't do (and why)
&lt;/h2&gt;

&lt;p&gt;A few capability decisions worth being explicit about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No external webhook triggers&lt;/strong&gt; in this release. We could have added "trigger on arbitrary webhook" but it overlaps with the existing alert-triggered investigation flow. We may add it if we see demand for triggers from systems that don't go through PagerDuty / Datadog / Grafana.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No agent-authored Actions&lt;/strong&gt; yet. The agent can't create or modify Actions on its own. Self-modification is a serious security boundary; we'd want approval gating and audit logging before opening that door. (See our &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety guide&lt;/a&gt; for the threat model.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No conditional / DAG composition&lt;/strong&gt; in this release. Actions are single-prompt for now. If you need a multi-step workflow, write a single prompt that describes the steps — the agent is good at sequencing. We'll add explicit composition if the natural-language form proves limiting.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Safety: what to think about before enabling
&lt;/h2&gt;

&lt;p&gt;Every Action is a small program with access to your cloud environment. A few rules we use ourselves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start read-only.&lt;/strong&gt; Actions inherit Aurora's tool permissions. If your tool config restricts write actions (no &lt;code&gt;kubectl apply&lt;/code&gt;, no &lt;code&gt;aws ec2 terminate-instances&lt;/code&gt;), Actions inherit that posture. Keep it that way for the first few weeks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use scheduled triggers conservatively.&lt;/strong&gt; A daily audit is cheap. A 5-minute polling loop with cloud CLI calls is not. Watch the LLM bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit who can create Actions.&lt;/strong&gt; RBAC defaults to org-admin-only creation. Leave it there unless you have a clear reason to widen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pin the model.&lt;/strong&gt; Action prompts can be sensitive to model behavior. Pin a known-good model per action (gpt-5.5, claude-sonnet-4.6, opus-4.7, etc.) using Aurora's per-org model dropdown until you have confidence in cross-model stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review action runs weekly.&lt;/strong&gt; Every action has a run-history view. Spend 10 minutes a week reading the agent's traces for your scheduled actions — anomalous reasoning is the leading indicator of prompt drift or tool drift.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to ship your first Action
&lt;/h2&gt;

&lt;p&gt;A six-step recipe.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Pick a recurring task you currently do manually
&lt;/h3&gt;

&lt;p&gt;Anything you do every week or after every incident. Examples: stale-PR review, alert-noise audit, on-call handover summary. The smaller and more deterministic, the better for v1.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Write the prompt as if you were typing it into chat
&lt;/h3&gt;

&lt;p&gt;Don't translate to "automation language." Write it the way you would write a chat message to a smart junior SRE. "Look at..." "Check whether..." "Open a PR that..."&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Create the Action with a manual trigger
&lt;/h3&gt;

&lt;p&gt;Settings → Actions → New Action. Paste the prompt, set trigger = manual, leave it disabled if you want to review before enabling. Trigger it once and watch the run.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Inspect the run trace
&lt;/h3&gt;

&lt;p&gt;Click the run in the history view. Read every tool call. Look for: tool misuse (wrong cloud account), excessive tool calls (3 attempts at the same thing), hallucinated paths or resource IDs. Iterate on the prompt until the trace is clean for three consecutive runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Promote to the right trigger type
&lt;/h3&gt;

&lt;p&gt;If the action makes sense after every incident → on-incident-completion. If it's a routine sweep → on-schedule with the longest cadence that still meets your need. Only use short cadences when you have a clear cost and blast-radius understanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Add it to your team's incident review
&lt;/h3&gt;

&lt;p&gt;Treat agent runs the same way you treat human runs: include them in your weekly incident review. Look for actions that produced wrong output, actions that nobody read the output of, and actions that produced output nobody acted on. Delete or downgrade as needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aurora Actions vs traditional incident-management automation
&lt;/h2&gt;

&lt;p&gt;The category most people compare us to is "workflow automation in incident-management SaaS" — Rootly, FireHydrant, incident.io. The comparison is informative but ultimately category-different:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Aurora Actions&lt;/th&gt;
&lt;th&gt;Rootly / FireHydrant / incident.io workflows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Natural language&lt;/td&gt;
&lt;td&gt;DSL or visual builder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes — LLM agent&lt;/td&gt;
&lt;td&gt;No — fixed conditional graph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool reach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud CLIs, kubectl, Terraform, Slack, Confluence, GitHub, RAG, infra graph&lt;/td&gt;
&lt;td&gt;Slack, status pages, Zoom, runbook links, ticket creation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scheduled execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Celery Beat)&lt;/td&gt;
&lt;td&gt;Limited (some support timed reminders)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Post-incident chaining&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes — full incident context available&lt;/td&gt;
&lt;td&gt;Yes — but limited to workflow actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Apache 2.0, self-hosted)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (self-hosted; LLM tokens only)&lt;/td&gt;
&lt;td&gt;Per-user SaaS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest framing: traditional incident-management tools automate the &lt;em&gt;process around&lt;/em&gt; the incident. Aurora Actions automate &lt;em&gt;what happens inside the agent&lt;/em&gt;. Both have value; they cover non-overlapping work. If you live in PagerDuty and use Rootly for incident channels, Aurora Actions sit alongside that — they don't replace it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Aurora Actions is the foundation for several capabilities on our roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DAG composition&lt;/strong&gt; — explicit multi-step Action chains where each step is itself an Action.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval gates&lt;/strong&gt; — Actions that pause for human approval before destructive tool calls (already supported in chat; explicit Action-level gating coming).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD auto-remediation hooks&lt;/strong&gt; — first-class integration with GitHub Actions, Jenkins, and ArgoCD so a failing pipeline becomes a triggered Aurora investigation. (Background and detailed write-up in our &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;CI/CD Auto-Remediation guide&lt;/a&gt;.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action marketplace&lt;/strong&gt; — community-contributed Actions you can install with one click. Bring-your-own prompt store.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll publish each of these as they ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Aurora
&lt;/h2&gt;

&lt;p&gt;Aurora is fully open source under Apache 2.0. Self-host with &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Docker Compose or Helm&lt;/a&gt;. Actions ship in the next tagged release after &lt;a href="https://github.com/Arvo-AI/aurora/releases" rel="noopener noreferrer"&gt;aurora-oss-1.2.15&lt;/a&gt; (April 15, 2026); the feature is available on &lt;code&gt;main&lt;/code&gt; today.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;arvo-ai.github.io/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare against alternatives:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/aurora-vs-traditional-incident-management-tools" rel="noopener noreferrer"&gt;Aurora vs traditional incident-management tools&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>CI/CD Auto-Remediation: The Complete Guide for SRE and Platform Teams (2026)</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Mon, 11 May 2026 17:32:08 +0000</pubDate>
      <link>https://forem.com/siddharth_singh_409bd5267/cicd-auto-remediation-the-complete-guide-for-sre-and-platform-teams-2026-3f70</link>
      <guid>https://forem.com/siddharth_singh_409bd5267/cicd-auto-remediation-the-complete-guide-for-sre-and-platform-teams-2026-3f70</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Most teams do not yet auto-remediate inside CI/CD.&lt;/strong&gt; Per &lt;a href="https://blog.jetbrains.com/teamcity/2026/04/ai-in-devops/" rel="noopener noreferrer"&gt;JetBrains' AI Pulse coverage (April 2026)&lt;/a&gt;, &lt;strong&gt;78.2% of respondents don't use AI in CI/CD workflows at all&lt;/strong&gt; — even though AI is now widely used elsewhere in the development lifecycle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD auto-remediation is an architectural pattern, not a product category.&lt;/strong&gt; It combines progressive delivery (canary, blue-green), automated metric-driven rollback, and AI-assisted root-cause-and-fix. Owned components, not a single SKU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three layers, four maturity levels.&lt;/strong&gt; We propose the &lt;strong&gt;CI/CD Auto-Remediation Maturity Spectrum (CARM)&lt;/strong&gt;: L0 (manual), L1 (rollback), L2 (rollback + diagnostic), L3 (rollback + diagnostic + remediation), L4 (closed-loop with policy gates).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source stack is mature.&lt;/strong&gt; &lt;a href="https://argoproj.github.io/argo-rollouts/" rel="noopener noreferrer"&gt;Argo Rollouts&lt;/a&gt;, &lt;a href="https://fluxcd.io/flagger/" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt;, and metric-driven &lt;code&gt;AnalysisTemplates&lt;/code&gt; cover L1–L2 with no AI. AI agents like &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; extend to L3 with Actions-based remediation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DORA's bar is real.&lt;/strong&gt; Top-performing teams keep change failure rate low and recover from failed deployments in under one hour (&lt;a href="https://dora.dev/guides/dora-metrics/" rel="noopener noreferrer"&gt;DORA program guidance&lt;/a&gt;). Auto-remediation is how non-elite teams close the gap.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of the &lt;a href="https://rootly.com/ai-sre-guide" rel="noopener noreferrer"&gt;46+ AI SRE products&lt;/a&gt; and dozens of progressive-delivery tools shipping in 2026, only a handful explicitly target the pattern this guide is about. &lt;strong&gt;CI/CD auto-remediation is the practice of having your delivery pipeline automatically detect, diagnose, and recover from failure — without paging a human — using a combination of progressive-delivery primitives, metric-driven rollback policies, and (increasingly) AI agents that propose or apply fixes.&lt;/strong&gt; It is not the same as auto-deploy. It is not the same as canary rollout. It is the closing of the loop between "the pipeline noticed something is wrong" and "the system is back in a good state" — without an engineer in the middle.&lt;/p&gt;

&lt;p&gt;This guide is for SRE and platform teams who already run continuous delivery and want to push toward the auto-remediation end of the spectrum. By the end, you should be able to: position your current setup on the CARM maturity spectrum, identify the next concrete step, and pick a credible tool stack to get there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why auto-remediation matters in 2026
&lt;/h2&gt;

&lt;p&gt;Three numbers explain the demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. AI is shipping more code, faster.&lt;/strong&gt; Per &lt;a href="https://blog.jetbrains.com/teamcity/2026/04/ai-in-devops/" rel="noopener noreferrer"&gt;JetBrains' AI Pulse coverage on the TeamCity blog (April 2026)&lt;/a&gt;, AI tools are now used by a large majority of developers in their daily work. The &lt;a href="https://getdx.com/blog/change-failure-rate/" rel="noopener noreferrer"&gt;DX 2026 change-failure-rate analysis&lt;/a&gt; puts a number on it: with 91% of developers having adopted AI and 20%+ of merged code now AI-authored, &lt;strong&gt;code velocity has gone up while quality has gone in the opposite direction.&lt;/strong&gt; More deployments per day means more chances to break production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The pipeline itself is the new bottleneck.&lt;/strong&gt; &lt;a href="https://blog.jetbrains.com/teamcity/2025/10/the-state-of-cicd/" rel="noopener noreferrer"&gt;JetBrains' 2025 State of CI/CD survey&lt;/a&gt; documents widespread frustration with slow and unreliable CI/CD pipelines as a leading contributor to developer burnout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. AI in CI/CD specifically lags adoption.&lt;/strong&gt; Per &lt;a href="https://blog.jetbrains.com/teamcity/2026/04/ai-in-devops/" rel="noopener noreferrer"&gt;JetBrains' AI Pulse coverage (April 2026)&lt;/a&gt;, &lt;strong&gt;78.2% of respondents don't use AI in CI/CD workflows at all&lt;/strong&gt; — even though most use AI everywhere else in the development lifecycle. The gap isn't capability; it's trust and integration. AI in IDEs is low-risk; AI in pipelines touches production. Teams want the impact but won't take the blast radius until the architecture is right.&lt;/p&gt;

&lt;p&gt;Auto-remediation is the architecture that closes that gap. It bounds the agent's reach (only inside the delivery pipeline), gives it deterministic guardrails (progressive delivery and metric-driven rollback), and produces a clear contract: detect, diagnose, fix-or-rollback, log.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "auto-remediation" actually means
&lt;/h2&gt;

&lt;p&gt;It is easiest to define by negation. Auto-remediation is &lt;strong&gt;not&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-deploy.&lt;/strong&gt; Auto-deploy ships code on merge. Auto-remediation is what happens &lt;em&gt;after&lt;/em&gt; a problem appears.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Canary release.&lt;/strong&gt; Canary is the &lt;em&gt;detection mechanism&lt;/em&gt; — it surfaces problems early by shifting traffic gradually. Remediation is the &lt;em&gt;response&lt;/em&gt; — rolling back, hotfixing, or reverting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-healing infrastructure.&lt;/strong&gt; Self-healing systems like Kubernetes restart pods. Auto-remediation includes that plus &lt;em&gt;change-driven&lt;/em&gt; failure recovery: rolling back a bad deploy, rolling forward a fix, or pausing the pipeline while a human investigates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AIOps.&lt;/strong&gt; AIOps platforms surface alerts and correlations. Auto-remediation closes the loop by &lt;em&gt;acting&lt;/em&gt; on them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The minimum viable definition: &lt;strong&gt;a pipeline transition from a degraded state back to a healthy state, triggered by automated detection, executed by automated action, observed and logged for human review.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The CI/CD Auto-Remediation Maturity Spectrum (CARM)
&lt;/h2&gt;

&lt;p&gt;There is no single industry-standard maturity model for auto-remediation. We use the following five-level spectrum — derived from how teams actually evolve.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;What happens on failed deploy&lt;/th&gt;
&lt;th&gt;Tools that get you here&lt;/th&gt;
&lt;th&gt;Trust required&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L0 — Manual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipeline fails. PagerDuty pages the on-call. Engineer investigates, decides to roll back or hotfix, executes manually.&lt;/td&gt;
&lt;td&gt;None — this is the default for most teams.&lt;/td&gt;
&lt;td&gt;None — humans do everything.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1 — Automated Rollback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipeline detects health-check failure (error rate, latency, smoke test). Automatically rolls back to the previous version. Pages a human after the fact.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://argoproj.github.io/argo-rollouts/" rel="noopener noreferrer"&gt;Argo Rollouts&lt;/a&gt;, &lt;a href="https://fluxcd.io/flagger/" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt;, &lt;a href="https://spinnaker.io/" rel="noopener noreferrer"&gt;Spinnaker&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Trust that the health metric reflects user-visible failure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2 — Rollback + Diagnostic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;L1 plus: AI agent runs an investigation when rollback fires. Produces an RCA before the human starts. Page goes out with context, not blank.&lt;/td&gt;
&lt;td&gt;L1 stack + &lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;, &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Trust that the diagnostic is right enough to bias human reasoning.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3 — Rollback + Diagnostic + Remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;L2 plus: agent proposes (or in some cases applies) a fix — a PR, a config change, an alert threshold update. Human reviews and merges.&lt;/td&gt;
&lt;td&gt;L2 stack + Aurora Actions, HolmesGPT Operator mode&lt;/td&gt;
&lt;td&gt;Trust that the agent's fix is correct, scoped, and reviewable.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L4 — Closed-loop with policy gates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;L3 plus: certain &lt;em&gt;low-risk, well-understood&lt;/em&gt; fixes are applied automatically inside policy guardrails (alert threshold widening, log-only changes, retry loops). Destructive or high-risk changes still gated.&lt;/td&gt;
&lt;td&gt;L3 stack + policy engine (OPA, Casbin, Kyverno) + audit logging&lt;/td&gt;
&lt;td&gt;Trust the policy gate definitions more than the agent.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most teams in 2026 are at &lt;strong&gt;L0 or L1&lt;/strong&gt;. The leap from L1 to L2 is the single highest-leverage move available because it preserves human-in-the-loop decision-making while removing the "blank-page" delay that explains a large share of MTTR. The 2024-2025 DORA research &lt;a href="https://dora.dev/guides/dora-metrics/" rel="noopener noreferrer"&gt;renamed MTTR to Failed Deployment Recovery Time (FDRT)&lt;/a&gt; precisely because the metric is more meaningful when scoped to change-driven failures — which is exactly the failure mode auto-remediation targets.&lt;/p&gt;

&lt;h2&gt;
  
  
  L1: Automated rollback (where most serious teams should be)
&lt;/h2&gt;

&lt;p&gt;This is the foundation. Without L1, AI-assisted remediation at L2-L3 has nowhere to act.&lt;/p&gt;

&lt;p&gt;The two Apache 2.0 incumbents are &lt;strong&gt;Argo Rollouts&lt;/strong&gt; and &lt;strong&gt;Flagger.&lt;/strong&gt; Both run in Kubernetes; both implement metric-driven progressive delivery with automated rollback. They differ in invasiveness.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;&lt;a href="https://argoproj.github.io/argo-rollouts/" rel="noopener noreferrer"&gt;Argo Rollouts&lt;/a&gt;&lt;/th&gt;
&lt;th&gt;&lt;a href="https://fluxcd.io/flagger/" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNCF status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Part of &lt;a href="https://www.cncf.io/projects/argo/" rel="noopener noreferrer"&gt;Argo&lt;/a&gt; (Graduated, Dec 2022)&lt;/td&gt;
&lt;td&gt;Part of &lt;a href="https://www.cncf.io/projects/flux/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt; (Graduated, Nov 2022)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Replaces &lt;code&gt;Deployment&lt;/code&gt; with &lt;code&gt;Rollout&lt;/code&gt; CRD&lt;/td&gt;
&lt;td&gt;Wraps existing &lt;code&gt;Deployment&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitOps pairing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ArgoCD&lt;/td&gt;
&lt;td&gt;FluxCD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Analysis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;AnalysisTemplate&lt;/code&gt; querying Prometheus, Datadog, CloudWatch, etc.&lt;/td&gt;
&lt;td&gt;Service-mesh metrics + custom webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Automated rollback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metric-threshold breach → revert&lt;/td&gt;
&lt;td&gt;Metric-threshold breach → revert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Traffic shaping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native + ingress + service mesh&lt;/td&gt;
&lt;td&gt;Service-mesh first (Istio, Linkerd, App Mesh)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Invasiveness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher (changes resource type)&lt;/td&gt;
&lt;td&gt;Lower (transparent wrapper)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Webhooks for custom logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Experiment&lt;/code&gt; resource + analysis runs&lt;/td&gt;
&lt;td&gt;Pre-/post-/during-rollout hooks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Pick Argo Rollouts&lt;/strong&gt; if you already use ArgoCD and want explicit per-step canary control. &lt;strong&gt;Pick Flagger&lt;/strong&gt; if you use a service mesh and want progressive delivery to be transparent to existing manifests.&lt;/p&gt;

&lt;p&gt;For non-Kubernetes pipelines, equivalent capability lives in &lt;strong&gt;Spinnaker&lt;/strong&gt; (multi-cloud, mature), &lt;strong&gt;Harness&lt;/strong&gt; (commercial), and feature-flag platforms like &lt;strong&gt;LaunchDarkly&lt;/strong&gt; (when "rollback" can be a flag flip).&lt;/p&gt;

&lt;p&gt;A minimal Argo Rollouts AnalysisTemplate for HTTP error rate, simplified from the &lt;a href="https://argoproj.github.io/argo-rollouts/features/analysis/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AnalysisTemplate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error-rate&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service-name&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error-rate&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;successCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result[0] &amp;lt;= &lt;/span&gt;&lt;span class="m"&gt;0.01&lt;/span&gt;
      &lt;span class="na"&gt;failureLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus.monitoring:9090&lt;/span&gt;
          &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[1m]))&lt;/span&gt;
            &lt;span class="s"&gt;/ sum(rate(http_requests_total{service="{{args.service-name}}"}[1m]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three failed 30-second windows → rollback. This is L1 in 30 lines of YAML.&lt;/p&gt;

&lt;h2&gt;
  
  
  L2: Rollback + automated diagnostic
&lt;/h2&gt;

&lt;p&gt;L1 gets you out of an outage fast. It does not tell you &lt;em&gt;why&lt;/em&gt; the deploy failed. The human gets paged with a rollback notification and starts from zero.&lt;/p&gt;

&lt;p&gt;L2 fills that gap with an AI agent that runs when rollback fires. The agent queries the cluster state, the application logs, the rollout metrics, and the changed code — and produces an RCA before the human starts typing.&lt;/p&gt;

&lt;p&gt;Three credible open-source options exist as of 2026 (compared in detail in our &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt; guide):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt;&lt;/strong&gt; — rule-based scanner with LLM explanations. Best for low-blast-radius first deployment; explains &lt;em&gt;why&lt;/em&gt; a resource is unhealthy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;&lt;/strong&gt; — ReAct-loop AI agent (CNCF Sandbox). 30+ observability integrations. Read-only by default. Strong fit for cluster-scoped investigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;&lt;/strong&gt; — LangGraph supervisor agent. Multi-cloud (AWS / Azure / GCP / OVH / Scaleway). Generates postmortems. Opens remediation PRs with human approval.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wiring up L2 is straightforward: configure your AI SRE's webhook to receive the rollback event (Argo Rollouts emits Kubernetes events; you can route them via &lt;a href="https://argoproj.github.io/argo-rollouts/features/notifications/" rel="noopener noreferrer"&gt;Argo Notifications&lt;/a&gt; to the agent). The agent investigates and posts results to the on-call Slack channel before the human acknowledges the page.&lt;/p&gt;

&lt;h2&gt;
  
  
  L3: Diagnostic + agent-proposed remediation
&lt;/h2&gt;

&lt;p&gt;L3 is where AI starts proposing fixes, not just diagnosis. The pattern that works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pipeline fails → automated rollback (L1).&lt;/li&gt;
&lt;li&gt;Agent investigates → RCA produced (L2).&lt;/li&gt;
&lt;li&gt;Agent proposes a fix as a &lt;strong&gt;pull request&lt;/strong&gt;, with the RCA as the PR description, the diff scoped to one file, and tests where possible.&lt;/li&gt;
&lt;li&gt;Human reviews PR. If correct, merges. If wrong, comments and rejects.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works because the pull request is the natural human-review surface. The agent doesn't touch production directly; it touches the repository, which already has CI, code review, and a merge gate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; is built precisely for this pattern. A post-incident-completion Action with a prompt like "Open a PR widening alert thresholds for the three noisiest alerts in this incident" converts the human follow-up step into automated PR creation. The human review surface stays exactly the same as for human-authored PRs.&lt;/p&gt;

&lt;p&gt;The HolmesGPT equivalent ships as &lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;"Operator mode"&lt;/a&gt; — the agent can write to GitHub when explicitly enabled.&lt;/p&gt;

&lt;h2&gt;
  
  
  L4: Closed-loop with policy gates
&lt;/h2&gt;

&lt;p&gt;L4 is the contentious one. It involves the agent making changes &lt;em&gt;without&lt;/em&gt; human approval — but only inside a tightly scoped policy.&lt;/p&gt;

&lt;p&gt;The pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;policy engine&lt;/strong&gt; (&lt;a href="https://www.openpolicyagent.org/" rel="noopener noreferrer"&gt;Open Policy Agent&lt;/a&gt;, &lt;a href="https://kyverno.io/" rel="noopener noreferrer"&gt;Kyverno&lt;/a&gt;, Casbin) defines which classes of remediation can run automatically.&lt;/li&gt;
&lt;li&gt;The agent proposes a fix. The policy engine evaluates whether the fix matches a permitted class.&lt;/li&gt;
&lt;li&gt;If yes → apply automatically with audit logging. If no → route to L3 (PR for human review).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Permitted classes that are usually safe at L4: widening an alert threshold by less than 2x, restarting a pod, scaling a deployment within preset bounds, adding a retry loop to a network call, suppressing a noisy log line.&lt;/p&gt;

&lt;p&gt;Permitted classes that are usually &lt;em&gt;not&lt;/em&gt; safe at L4: any data-plane change, any production traffic routing change, any secret or RBAC change, any change touching the policy engine itself.&lt;/p&gt;

&lt;p&gt;The reason L4 is contentious is that the policy gate is now a high-value target. An attacker who can broaden the policy can broaden the agent's blast radius. The same threat model we walk through in our &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety guide&lt;/a&gt; applies, plus an additional layer: the policy engine must be operated with the same rigor as the orchestration plane itself.&lt;/p&gt;

&lt;p&gt;Almost no production teams in 2026 run pure L4. The credible deployments are &lt;strong&gt;L3 with hardcoded L4 exceptions&lt;/strong&gt; for two or three well-understood remediation classes. That's where to aim.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common pitfalls
&lt;/h2&gt;

&lt;p&gt;A short list of failure modes we have seen — in our own work and in customer deployments.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Auto-remediating &lt;em&gt;into&lt;/em&gt; a worse state.&lt;/strong&gt; The classic failure is auto-scaling a service to handle elevated error rates that are themselves caused by a downstream dependency. The service scales, hammers the dependency harder, and the dependency collapses. &lt;strong&gt;Fix:&lt;/strong&gt; never auto-remediate without dependency-graph awareness. Aurora uses &lt;a href="https://memgraph.com/" rel="noopener noreferrer"&gt;Memgraph&lt;/a&gt; for this; HolmesGPT uses its toolset structure; pure-L1 stacks should require manual escalation when the failure crosses service boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trusting the AnalysisTemplate metric too much.&lt;/strong&gt; A 1% error rate threshold on a P99-tail service is meaningless if your real failure mode is request-stalled-not-failed. &lt;strong&gt;Fix:&lt;/strong&gt; model what user-visible failure actually looks like, not what the cleanest Prometheus query produces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Letting the agent run unbounded retries.&lt;/strong&gt; AI agents that hit a "this didn't work" signal will often retry — sometimes thousands of times — burning tokens and triggering downstream rate limits. &lt;strong&gt;Fix:&lt;/strong&gt; cap the agent's tool-call budget. Aurora's executor enforces this by default; verify your agent does the same.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skipping the post-mortem.&lt;/strong&gt; Auto-remediation that "just worked" without a clear human review of what happened is a slow path to brittleness. Every auto-remediation event should produce a postmortem the on-call reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflating auto-remediation with "self-healing infra".&lt;/strong&gt; Kubernetes pod restarts are not auto-remediation. They are a runtime affordance. Auto-remediation is the response to a &lt;em&gt;change-driven&lt;/em&gt; failure — the deploy, the config push, the schema migration. Keep the categories separate.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  A pragmatic 90-day path to auto-remediation
&lt;/h2&gt;

&lt;p&gt;For a team currently at L0 or L1.&lt;/p&gt;

&lt;h3&gt;
  
  
  Days 1–14: instrument and detect
&lt;/h3&gt;

&lt;p&gt;Pick your three highest-traffic services. Add or harden:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synthetic checks that exercise the user-visible path.&lt;/li&gt;
&lt;li&gt;One Prometheus error-rate metric per service with a clear threshold.&lt;/li&gt;
&lt;li&gt;A canary or blue-green rollout primitive (&lt;a href="https://argoproj.github.io/argo-rollouts/" rel="noopener noreferrer"&gt;Argo Rollouts&lt;/a&gt; or &lt;a href="https://fluxcd.io/flagger/" rel="noopener noreferrer"&gt;Flagger&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Goal at end of week 2: a controlled bad deploy auto-rolls back without human intervention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Days 15–45: wire in the agent
&lt;/h3&gt;

&lt;p&gt;Deploy one of &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, &lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;, or &lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt; in read-only mode. Configure rollback events to webhook the agent. Have it post an RCA to your incident channel within five minutes of rollback.&lt;/p&gt;

&lt;p&gt;Goal at end of week 6: every rollback comes with a written diagnostic before the human acknowledges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Days 46–75: add agent-proposed remediation
&lt;/h3&gt;

&lt;p&gt;Enable PR-creation for the agent (&lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt; on-incident-completion trigger, or HolmesGPT Operator mode). Constrain initial scope to one repo and one class of fix (alert thresholds, retry loops, log suppression). Review every PR for the first two weeks.&lt;/p&gt;

&lt;p&gt;Goal at end of week 11: agent opens correct PRs in 70%+ of fired rollbacks. False-positive PRs are caught at code review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Days 76–90: policy-gate one fix class for L4
&lt;/h3&gt;

&lt;p&gt;Pick the safest class — usually alert threshold widening when an alert fired more than N times in M hours with mean TTA above some bound. Define an OPA / Kyverno policy that permits &lt;em&gt;only that class.&lt;/em&gt; Wire the agent to apply directly when the policy permits, raise a PR otherwise.&lt;/p&gt;

&lt;p&gt;Goal at end of week 12: one L4 lane open for one fix class with full audit trail.&lt;/p&gt;

&lt;p&gt;This is the conservative path. Aggressive teams have moved faster, but we have not seen anyone skip steps successfully.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DORA reality check
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://dora.dev/guides/dora-metrics/" rel="noopener noreferrer"&gt;DORA program's published guidance&lt;/a&gt; is blunt about what good looks like. Historical State of DevOps Reports have consistently shown the same shape of distribution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Change Failure Rate&lt;/strong&gt;: top performers maintain low single-digit percentages; lower performers see substantially higher rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed Deployment Recovery Time (FDRT)&lt;/strong&gt;: top performers recover in under one hour; lower performers can take days to weeks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DORA's research has also consistently found that &lt;strong&gt;speed and stability reinforce each other rather than trade off&lt;/strong&gt; — the fastest teams are also the most stable, per &lt;a href="https://dora.dev/insights/dora-metrics-history/" rel="noopener noreferrer"&gt;DORA's history of metrics&lt;/a&gt; and successive State of DevOps Reports. Auto-remediation is one of the small number of capabilities that moves teams across these tiers without requiring deeper organizational change. The L1→L2 jump alone reduces FDRT meaningfully because the human is no longer reconstructing context — the agent has already done it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this is heading
&lt;/h2&gt;

&lt;p&gt;Two predictions, each with a reasonable evidence base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The L2 → L3 transition becomes table-stakes within 18 months.&lt;/strong&gt; AI-authored PRs from agents are already merging in production at multiple companies in our network. Once the review surface is the same as for human-authored PRs (which it already is via GitHub / Bitbucket / GitLab), there is no organizational reason not to use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. L4 stays narrow.&lt;/strong&gt; The threat surface of agent-applied changes is genuinely scary, and the per-incident savings of going from L3 to L4 are smaller than the savings from L1 to L2. Expect L4 to be the place where one or two well-understood fix classes get automated, while everything else stays L3.&lt;/p&gt;

&lt;p&gt;The teams who win in 2026-2027 are the ones who get to credible L3 first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Aurora fits
&lt;/h2&gt;

&lt;p&gt;Aurora is the AI agent layer of an auto-remediation stack — it covers L2 (investigation), L3 (PR-based remediation via &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions&lt;/a&gt;), and the agent half of L4 (policy-gated remediation). It does not replace Argo Rollouts or Flagger at L1; those remain the foundation. Aurora is the difference between rolling back blind and rolling back with a written RCA and a draft PR.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;arvo-ai.github.io/aurora&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora Actions launch:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/aurora-actions-background-automations" rel="noopener noreferrer"&gt;Aurora Actions: User-Defined Background Automations&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OSS comparison:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;Aurora vs HolmesGPT vs K8sGPT&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety architecture:&lt;/strong&gt; &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI Agent kubectl Safety&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/cicd-auto-remediation-complete-guide" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>AI Agent kubectl Safety: Sandboxed Execution for Production</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Wed, 06 May 2026 20:44:12 +0000</pubDate>
      <link>https://forem.com/siddharth_singh_409bd5267/ai-agent-kubectl-safety-sandboxed-execution-for-production-48d0</link>
      <guid>https://forem.com/siddharth_singh_409bd5267/ai-agent-kubectl-safety-sandboxed-execution-for-production-48d0</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Giving an AI agent kubectl access is an architecture decision, not a permission flag.&lt;/strong&gt; Per-permission gates fail under prompt injection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OWASP ranks "Excessive Agency" as LLM06 in the &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/" rel="noopener noreferrer"&gt;2025 Top 10 for LLM Applications&lt;/a&gt;&lt;/strong&gt; and "Tool Misuse and Exploitation" as ASI02 in the &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/" rel="noopener noreferrer"&gt;2026 Top 10 for Agentic Applications&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Kubernetes ecosystem already has an answer&lt;/strong&gt;: &lt;a href="https://github.com/kubernetes-sigs/agent-sandbox" rel="noopener noreferrer"&gt;k8s-sigs/agent-sandbox&lt;/a&gt; provides a declarative API for isolated agent runtimes using gVisor or Kata Containers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real precedent exists.&lt;/strong&gt; &lt;a href="https://thehackernews.com/2025/06/zero-click-ai-vulnerability-exposes.html" rel="noopener noreferrer"&gt;EchoLeak (CVE-2025-32711)&lt;/a&gt;, CVSS 9.3, was the first publicly documented zero-click prompt-injection data exfiltration in a production LLM system. The kubectl analogue would be cluster-wide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora runs every &lt;code&gt;kubectl&lt;/code&gt; command in a pod-isolated process&lt;/strong&gt; via its &lt;code&gt;terminal_run&lt;/code&gt; primitive, with an environment-variable allowlist that strips secrets, signature-matcher and LLM-judge guardrails, and per-invocation cloud credentials.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of the &lt;a href="https://rootly.com/ai-sre-guide" rel="noopener noreferrer"&gt;46+ products marketed as "AI SRE" in 2026&lt;/a&gt;, only a handful publicly document their kubectl execution architecture — and the gap between vendors that handle this well and vendors that handle it badly is the single largest unspoken risk in the category. &lt;strong&gt;AI agent kubectl safety is the architectural discipline of letting an AI agent run &lt;code&gt;kubectl&lt;/code&gt; (or any cloud CLI) against production without inheriting cluster-wide blast radius if the agent is compromised.&lt;/strong&gt; It is not the same as RBAC scoping, and it is not the same as a human approval prompt — both are necessary but neither is sufficient on its own.&lt;/p&gt;

&lt;p&gt;When &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/" rel="noopener noreferrer"&gt;OWASP published its 2025 Top 10 for LLM Applications&lt;/a&gt;, it ranked &lt;strong&gt;Prompt Injection (LLM01)&lt;/strong&gt; as the top risk and &lt;strong&gt;Excessive Agency (LLM06)&lt;/strong&gt; as one of the most consequential — defining it across three root causes: excessive functionality, excessive permissions, and excessive autonomy. In December 2025, OWASP followed up with a &lt;a href="https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/" rel="noopener noreferrer"&gt;dedicated Top 10 for Agentic Applications&lt;/a&gt; that names &lt;strong&gt;Tool Misuse and Exploitation (ASI02)&lt;/strong&gt; and &lt;strong&gt;Identity and Privilege Abuse (ASI03)&lt;/strong&gt; as primary attack surfaces.&lt;/p&gt;

&lt;p&gt;Translation: if you give an AI agent the ability to run &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, or &lt;code&gt;gcloud&lt;/code&gt; commands against production, you have a security architecture problem — not a permissions problem. This guide walks through the threat model, the emerging Kubernetes sandboxing standard, and how to evaluate any AI SRE on its kubectl safety.&lt;/p&gt;

&lt;h2&gt;
  
  
  What can go wrong when AI agents run kubectl?
&lt;/h2&gt;

&lt;p&gt;Any LLM-driven agent that executes commands inherits the security properties of the LLM, the harness, and the runtime. Three real-world precedents illustrate the failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EchoLeak (CVE-2025-32711)&lt;/strong&gt; — Microsoft 365 Copilot, CVSS 9.3 critical, &lt;a href="https://thehackernews.com/2025/06/zero-click-ai-vulnerability-exposes.html" rel="noopener noreferrer"&gt;patched in June 2025&lt;/a&gt;. Discovered by Aim Security, it was the first publicly documented zero-click indirect prompt-injection data exfiltration in a production LLM system. A crafted email sat in Outlook; when the user later asked Copilot for an unrelated summary, the email's hidden instructions fired and exfiltrated SharePoint, OneDrive, and Teams data. Research paper: &lt;a href="https://arxiv.org/abs/2509.10540" rel="noopener noreferrer"&gt;arXiv:2509.10540&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MITRE ATLAS prompt-injection techniques&lt;/strong&gt; — &lt;a href="https://atlas.mitre.org/" rel="noopener noreferrer"&gt;MITRE ATLAS&lt;/a&gt; catalogues real-world adversary techniques against AI systems, including indirect prompt injection that turns an LLM with tool access into an attacker-controlled execution surface. The framework specifically documents techniques for exfiltration via AI agent tool invocation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Session Smuggling&lt;/strong&gt; — Palo Alto Unit 42 (November 2025) demonstrated rogue agents exploiting trust in the Agent-to-Agent (A2A) protocol with multi-turn manipulation. Documented in OWASP's Agentic Top 10.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these specifically targeted kubectl-running agents in production — but the class is the same and the blast radius would be larger. An agent that can run &lt;code&gt;kubectl delete&lt;/code&gt; is one prompt-injection payload away from a cluster-wide outage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Attack Surfaces of Agentic kubectl
&lt;/h2&gt;

&lt;p&gt;Most teams think of kubectl agent safety as a single problem ("can the agent be tricked?"). It's actually four distinct attack surfaces, each requiring its own mitigation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Surface&lt;/th&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;th&gt;Why permission-scoping alone fails&lt;/th&gt;
&lt;th&gt;Mitigation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Prompt injection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hidden instructions in logs, alerts, runbooks, or chat coerce the agent&lt;/td&gt;
&lt;td&gt;Compromised agent acts within its granted permissions, which is exactly what permission-scoping permits&lt;/td&gt;
&lt;td&gt;Sandboxed runtime; never trust LLM output derived from data the LLM read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Credential leakage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Executed command reads &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;, &lt;code&gt;VAULT_TOKEN&lt;/code&gt;, &lt;code&gt;KUBECONFIG&lt;/code&gt; from inherited env&lt;/td&gt;
&lt;td&gt;Permissions live on credentials; if the credential leaks, the permission set leaks with it&lt;/td&gt;
&lt;td&gt;Per-invocation short-lived credentials (STS, Service Principal); explicit env allowlist that strips secrets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Blast radius escalation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Legitimate command runs against wrong namespace, region, or cluster&lt;/td&gt;
&lt;td&gt;Permissions don't model "right action, wrong target"&lt;/td&gt;
&lt;td&gt;Default read-only; dependency-graph awareness; human approval for destructive writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Audit trail gaps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Logs capture commands without the agent's reasoning&lt;/td&gt;
&lt;td&gt;Permission systems audit "who ran what," not "why"&lt;/td&gt;
&lt;td&gt;Per-investigation transcripts that link reasoning → tool calls → outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Attack Surface 1: Prompt injection
&lt;/h3&gt;

&lt;p&gt;The agent reads a log line, alert payload, runbook, or chat message that contains hidden instructions. The LLM cannot reliably distinguish data from instructions in the same channel — this is the fundamental property OWASP's LLM01 captures. Even frontier models do not eliminate it. Anthropic has publicly stated that "no browser agent is immune to prompt injection" and publishes &lt;a href="https://www.anthropic.com/news/prompt-injection-defenses" rel="noopener noreferrer"&gt;defense benchmarks&lt;/a&gt; showing measurable but imperfect attack-prevention rates across computer-use, bash tool use, and MCP workflows. The implication for kubectl-running agents is clear: &lt;strong&gt;the LLM is not the security boundary. The runtime is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mitigation: never trust LLM output that originates from data the LLM also read. Sandbox the execution layer so even a successful injection has limited blast radius.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attack Surface 2: Credential leakage
&lt;/h3&gt;

&lt;p&gt;If the agent runs commands with credentials inherited from the host process environment (&lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;, &lt;code&gt;KUBECONFIG&lt;/code&gt;, &lt;code&gt;VAULT_TOKEN&lt;/code&gt;), a successful command-injection or shell escape exposes everything the agent process has access to. Long-lived static credentials make this catastrophic.&lt;/p&gt;

&lt;p&gt;Mitigation: per-invocation credential scoping. AWS STS AssumeRole, Azure Service Principal sessions, GCP short-lived tokens. Strip everything else from the child process environment with an explicit allowlist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attack Surface 3: Blast radius escalation
&lt;/h3&gt;

&lt;p&gt;Even legitimate, non-injected commands can have outsized effects. &lt;code&gt;kubectl delete pod&lt;/code&gt; on the wrong namespace. &lt;code&gt;aws ec2 terminate-instances&lt;/code&gt; against a misidentified region. The agent doesn't need to be compromised — it just needs to be wrong.&lt;/p&gt;

&lt;p&gt;Mitigation: read-only by default, write actions behind explicit human approval, and dependency-graph awareness so the agent can compute blast radius before acting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attack Surface 4: Audit trail gaps
&lt;/h3&gt;

&lt;p&gt;When an investigation runs across 20+ tool invocations, traditional audit systems (CloudTrail, Kubernetes audit logs) record what was run but not why. A reviewer six months later cannot tell whether a &lt;code&gt;kubectl scale&lt;/code&gt; was a legitimate response to a load spike or an injected instruction.&lt;/p&gt;

&lt;p&gt;Mitigation: structured per-investigation transcripts that capture agent reasoning alongside tool calls. The right log isn't "kubectl was run" — it's "in response to alert X, the agent hypothesized Y, ran kubectl Z, and observed W."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "human approval" alone is not enough
&lt;/h2&gt;

&lt;p&gt;The most common safety story in the AI SRE space is "the agent suggests; humans approve." That is necessary but not sufficient.&lt;/p&gt;

&lt;p&gt;The problem with approval gates as the only line of defense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decision fatigue.&lt;/strong&gt; An agent that handles 50 alerts a week generates dozens of approval prompts. Humans rubber-stamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval ≠ understanding.&lt;/strong&gt; Engineers approve commands they don't fully understand because the agent's reasoning sounds plausible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Injected intent looks legitimate.&lt;/strong&gt; A prompt-injection payload can produce a recommendation that &lt;em&gt;reads&lt;/em&gt; exactly like a normal RCA. The approver has no signal that the underlying instruction came from an attacker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Approval gates are critical, but they need to sit on top of an already-sandboxed runtime — not be the only protection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Permission scoping vs sandboxed execution: what's the difference?
&lt;/h2&gt;

&lt;p&gt;These two terms get conflated. They aren't the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permission scoping&lt;/strong&gt; restricts what an agent's identity can do. RBAC roles, IAM policies, kubeconfig contexts. It's necessary, but it operates at the cluster-API layer — meaning a successful prompt injection can still use every permission the agent has.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sandboxed execution&lt;/strong&gt; isolates the &lt;em&gt;runtime&lt;/em&gt; in which commands execute. If the agent's process is compromised, the sandbox limits what the compromised process can do regardless of the credentials it holds. The compromised process can't read other pods' files, can't reach other nodes, can't escalate to the host kernel.&lt;/p&gt;

&lt;p&gt;The defensible architecture combines both: tight permission scoping (small RBAC role, short-lived credentials) + runtime isolation (sandboxed execution).&lt;/p&gt;

&lt;h2&gt;
  
  
  How sandboxed kubectl actually works
&lt;/h2&gt;

&lt;p&gt;The Kubernetes ecosystem standardized on this pattern in 2025–2026.&lt;/p&gt;

&lt;h3&gt;
  
  
  k8s-sigs/agent-sandbox
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/kubernetes-sigs/agent-sandbox" rel="noopener noreferrer"&gt;k8s-sigs/agent-sandbox&lt;/a&gt; is a formal Kubernetes SIG Apps subproject that launched at KubeCon Atlanta in November 2025. It provides a declarative Kubernetes API for "isolated, stateful, singleton workloads" — built specifically for AI agent runtimes that may execute untrusted, LLM-generated code.&lt;/p&gt;

&lt;p&gt;Core CRDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Sandbox&lt;/code&gt; — an isolated pod-equivalent with stronger boundaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SandboxTemplate&lt;/code&gt; — reusable configuration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SandboxClaim&lt;/code&gt; — request a sandbox for a workload&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SandboxWarmPool&lt;/code&gt; — pre-created sandboxes that bring cold-start under one second&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://kubernetes.io/blog/2026/03/20/running-agents-on-kubernetes-with-agent-sandbox/" rel="noopener noreferrer"&gt;Kubernetes blog post from March 2026&lt;/a&gt; makes the architectural claim explicit: "Isolation achieved via runtime-level sandboxing (gVisor/Kata), not just container-level namespaces."&lt;/p&gt;

&lt;h3&gt;
  
  
  gVisor
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://gvisor.dev/" rel="noopener noreferrer"&gt;gVisor&lt;/a&gt; is a Google-maintained user-space application kernel that provides kernel-level isolation without full virtualization. Architecture: &lt;strong&gt;Sentry&lt;/strong&gt; (a kernel emulator written in Go) intercepts roughly 200 Linux syscalls; &lt;strong&gt;Gofer&lt;/strong&gt; brokers filesystem access over 9P. The OCI runtime is &lt;code&gt;runsc&lt;/code&gt;, drop-in compatible with &lt;code&gt;runc&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;gVisor runs in production at Google for App Engine standard, Cloud Functions, Cloud Run, and Cloud ML Engine. GKE Sandbox productizes it for GKE node pools. It is one of two named isolation backends in agent-sandbox (the other being Kata Containers, which uses lightweight VMs).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this matters for AI SRE
&lt;/h3&gt;

&lt;p&gt;An AI SRE that runs &lt;code&gt;kubectl&lt;/code&gt; against production is exactly the kind of workload agent-sandbox was built for. It executes LLM-generated commands. It needs file system isolation, syscall isolation, and per-invocation credential scoping. It benefits enormously from a warm pool that reduces cold-start latency.&lt;/p&gt;

&lt;p&gt;If you are evaluating an AI SRE in 2026, this is one of the right questions to ask: &lt;em&gt;what isolation backend does the agent use when it executes commands?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Aurora's pod-isolated execution works
&lt;/h2&gt;

&lt;p&gt;Aurora's approach predates agent-sandbox and follows the same architectural principles.&lt;/p&gt;

&lt;p&gt;When Aurora's agent runs a &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, or &lt;code&gt;gcloud&lt;/code&gt; command, it doesn't use &lt;code&gt;subprocess.run()&lt;/code&gt; directly. It uses an internal primitive called &lt;code&gt;terminal_run&lt;/code&gt;, defined in &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;&lt;code&gt;server/utils/terminal/terminal_run.py&lt;/code&gt;&lt;/a&gt;. The module's docstring is explicit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Drop-in replacement for subprocess.run() that executes in terminal pods. This module provides a terminal_run() function that mimics subprocess.run() API but executes commands in isolated terminal pods via kubectl exec. Safety guardrails (signature matcher + LLM judge) run automatically unless the caller passes &lt;code&gt;trusted=True&lt;/code&gt; for known-safe internal operations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three properties matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Pod-isolated execution.&lt;/strong&gt; When the &lt;code&gt;ENABLE_POD_ISOLATION&lt;/code&gt; flag is set (the default in Kubernetes deployments), every external command runs inside a separate terminal pod via &lt;code&gt;kubectl exec&lt;/code&gt;. The agent's own process never executes the command directly. A successful command-injection in the agent's reasoning loop does not give an attacker access to the agent host.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Two-stage safety guardrails.&lt;/strong&gt; Before any non-trusted command runs, two checks fire automatically: a deterministic signature matcher that rejects known-dangerous patterns, and an LLM judge that evaluates the proposed command against the investigation context. The &lt;code&gt;trusted=True&lt;/code&gt; flag bypasses both — used only for known-safe internal operations like configured connector calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Sanitized environment allowlist.&lt;/strong&gt; Aurora's &lt;code&gt;terminal_exec_tool&lt;/code&gt; module defines an explicit &lt;code&gt;_SAFE_ENV_KEYS&lt;/code&gt; set: &lt;code&gt;PATH&lt;/code&gt;, &lt;code&gt;HOME&lt;/code&gt;, &lt;code&gt;USER&lt;/code&gt;, &lt;code&gt;SHELL&lt;/code&gt;, &lt;code&gt;TERM&lt;/code&gt;, &lt;code&gt;LANG&lt;/code&gt;, &lt;code&gt;TMPDIR&lt;/code&gt;, &lt;code&gt;SSL_CERT_FILE&lt;/code&gt;, plus &lt;code&gt;ENABLE_POD_ISOLATION&lt;/code&gt; itself. Everything else — including &lt;code&gt;VAULT_TOKEN&lt;/code&gt;, &lt;code&gt;DATABASE_URL&lt;/code&gt;, &lt;code&gt;SECRET_KEY&lt;/code&gt;, and any cloud credentials — is stripped from the child process environment. A compromised command cannot read the agent's secrets via &lt;code&gt;env&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Cloud credentials are handled separately. Aurora calls &lt;code&gt;generate_contextual_access_token&lt;/code&gt; and &lt;code&gt;generate_azure_access_token&lt;/code&gt; per invocation. AWS uses STS AssumeRole via cross-account roles (&lt;a href="https://github.com/Arvo-AI/aurora/tree/main/server/connectors/aws_connector" rel="noopener noreferrer"&gt;&lt;code&gt;aurora-cross-account-role.yaml&lt;/code&gt;&lt;/a&gt;) — short-lived credentials, not long-lived access keys. Azure uses Service Principal sessions. GCP uses OAuth-derived tokens.&lt;/p&gt;

&lt;p&gt;For agents that need to reach customer Kubernetes clusters Aurora can't access directly, a separate &lt;a href="https://github.com/Arvo-AI/aurora/tree/main/kubectl-agent" rel="noopener noreferrer"&gt;&lt;code&gt;kubectl-agent&lt;/code&gt;&lt;/a&gt; binary deploys via Helm into the customer's cluster and connects outbound over WebSocket. No inbound network access required, no kubeconfig sharing, no static credentials at rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to evaluate an AI SRE's kubectl safety model
&lt;/h2&gt;

&lt;p&gt;Eight questions to ask any AI SRE vendor or open-source project before enabling production access:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Where does the command actually execute?&lt;/strong&gt; Same process as the agent? Same host? Separate container? Sandboxed runtime (gVisor/Kata)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What credentials does the command inherit from the host environment?&lt;/strong&gt; Specifically: can the executed command read your agent's vault token, database URL, or other host secrets?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are credentials short-lived or static?&lt;/strong&gt; STS / Service Principal sessions, or long-lived access keys?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the default read-only?&lt;/strong&gt; What flag, configuration, or RBAC role enables write access?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens between "agent decides to run X" and "X runs"?&lt;/strong&gt; Is there a deterministic policy check? An LLM judge? A human approval prompt? All three?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are destructive actions specifically gated?&lt;/strong&gt; What's the definition of "destructive" — vendor-defined or operator-configurable?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What does the audit trail capture?&lt;/strong&gt; Just the commands, or the agent's reasoning + the commands together?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's the blast radius of a single successful prompt injection?&lt;/strong&gt; Walk through the worst case explicitly with the vendor.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a vendor can't answer these clearly, the architecture isn't ready for production write access.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions in 2026
&lt;/h2&gt;

&lt;p&gt;This is a young problem space. Several questions are not yet resolved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standardization.&lt;/strong&gt; k8s-sigs/agent-sandbox is the leading candidate for a standard, but Knative Sandbox, container-level approaches, and microVM-based runtimes (Firecracker) are all in play.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cloud isolation.&lt;/strong&gt; Sandboxing a Kubernetes pod is a solved problem. Sandboxing a process that calls &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, &lt;code&gt;gcloud&lt;/code&gt; across cloud APIs from a single agent is harder — the credentials and trust boundaries change per provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval UX at scale.&lt;/strong&gt; Engineers can't approve 200 actions per week. The right UI for batch approval, policy-based pre-approval, and rollback-only autonomy is still being figured out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Expect significant movement on all three through 2026 and into 2027.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aurora's approach in summary
&lt;/h2&gt;

&lt;p&gt;If you operate an AI SRE in production, the safety questions are non-negotiable. Aurora's answer is: pod-isolated execution by default, deterministic + LLM-judge guardrails before any non-trusted command, environment-variable allowlist that strips secrets, per-invocation cloud credentials via STS/Service Principal/short-lived tokens, and human approval for destructive write operations. The full architecture is open source under Apache 2.0 — auditable in the &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For background on the agent and tool model, see the &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;complete guide to AI SRE&lt;/a&gt;, the &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;open-source AI SRE comparison&lt;/a&gt;, or the explainer on &lt;a href="https://www.arvoai.ca/blog/what-is-agentic-incident-management" rel="noopener noreferrer"&gt;agentic incident management&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>opensource</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT (2026)</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Wed, 06 May 2026 20:38:19 +0000</pubDate>
      <link>https://forem.com/siddharth_singh_409bd5267/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt-2026-5g26</link>
      <guid>https://forem.com/siddharth_singh_409bd5267/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt-2026-5g26</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Three credible open-source AI SREs exist in 2026&lt;/strong&gt;: Aurora (Arvo AI), HolmesGPT (Robusta + Microsoft, CNCF Sandbox), and K8sGPT (CNCF Sandbox). All three are Apache 2.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only one is a true multi-step agent.&lt;/strong&gt; HolmesGPT runs an iterative ReAct loop. K8sGPT is a rule-based scanner that uses an LLM only to explain findings. Aurora is a multi-step LangGraph agent with cross-cloud execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only Aurora handles multi-cloud&lt;/strong&gt; out of the box (AWS, Azure, GCP, OVH, Scaleway, plus Kubernetes). HolmesGPT covers Kubernetes plus 30+ observability integrations. K8sGPT is Kubernetes-only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only Aurora generates remediation pull requests.&lt;/strong&gt; HolmesGPT can open PRs with suggested fixes in Operator mode; K8sGPT is strictly read-only with no write actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All three support BYO LLM&lt;/strong&gt;, including local inference via Ollama for air-gapped deployments — the differentiator over commercial AI SREs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of the &lt;a href="https://rootly.com/ai-sre-guide" rel="noopener noreferrer"&gt;46+ companies offering "AI SRE" products in 2026&lt;/a&gt;, only a handful are open source — and only three are credible enough to deploy in production: &lt;strong&gt;Aurora&lt;/strong&gt;, &lt;strong&gt;HolmesGPT&lt;/strong&gt;, and &lt;strong&gt;K8sGPT&lt;/strong&gt;. &lt;strong&gt;An open-source AI SRE is an AI agent that performs incident investigation, root cause analysis, and (sometimes) remediation under a permissive license that allows self-hosting, source-code audit, and modification.&lt;/strong&gt; They get lumped together in marketing, but architecturally these three are different products solving different parts of the incident response problem.&lt;/p&gt;

&lt;p&gt;This guide compares them on the things that actually matter: agent architecture, execution model, integration scope, and where you can deploy them. By the end, you should be able to pick the right one for your stack — or know whether you need all three.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an open-source AI SRE?
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;open-source AI SRE&lt;/strong&gt; is an AI agent that performs site reliability engineering work — alert triage, incident investigation, root cause analysis, remediation — under a permissive license that allows self-hosting, source-code audit, and modification. Three properties are non-negotiable:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: Apache 2.0, MIT, or equivalent. Source-available licenses (BSL, SSPL) do not count for most production teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hostable&lt;/strong&gt;: runs entirely inside your environment without phoning home to a vendor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-driven&lt;/strong&gt;: uses large language models, not just static rules or regex. (This is what separates "AI SRE" from older AIOps tools.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The reason this category matters: incident data is some of the most sensitive telemetry an organization produces. Self-hosted, audit-able AI is the only model that works for regulated industries, air-gapped environments, or any team that doesn't want production telemetry leaving their perimeter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why open source matters for AI SRE
&lt;/h2&gt;

&lt;p&gt;Three reasons buyers in 2026 are explicitly asking for open-source AI SRE:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data sovereignty.&lt;/strong&gt; Incident telemetry includes log lines, configuration values, deployment IDs, and sometimes payloads. SaaS AI SREs send all of it to their backend and to a third-party LLM. Self-hosted means it stays in your VPC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit transparency.&lt;/strong&gt; Regulators and security teams want to know exactly what the agent does on production systems. Source code answers that question; vendor marketing does not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost predictability.&lt;/strong&gt; Per-user or per-incident pricing can balloon quickly. Open-source costs scale with infrastructure and LLM tokens — and Ollama-local inference can flatten the LLM bill entirely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is real: you operate the system yourself. For teams already operating Kubernetes and observability stacks, that's marginal effort. For teams without that operational maturity, a commercial AI SRE is often the right call.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the three compare
&lt;/h2&gt;

&lt;p&gt;This is the only table you need. Verified from each project's GitHub repo, official docs, and source as of May 2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Aurora&lt;/th&gt;
&lt;th&gt;HolmesGPT&lt;/th&gt;
&lt;th&gt;K8sGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub stars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;201&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;2,366&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;7,737&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latest release&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/Arvo-AI/aurora/releases" rel="noopener noreferrer"&gt;v1.1.1 (Mar 2026)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt/releases" rel="noopener noreferrer"&gt;0.26.0 (Apr 2026)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt/releases" rel="noopener noreferrer"&gt;v0.4.32 (Apr 2026)&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CNCF status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.cncf.io/projects/holmesgpt/" rel="noopener noreferrer"&gt;Sandbox (Oct 2025)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Sandbox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Built by&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Arvo AI&lt;/td&gt;
&lt;td&gt;Robusta + Microsoft&lt;/td&gt;
&lt;td&gt;k8sgpt-ai community&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LangGraph supervisor + sub-agents&lt;/td&gt;
&lt;td&gt;ReAct loop (&lt;code&gt;ToolCallingLLM&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Rule-based scanner + LLM explainer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-step reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (single-shot per analyzer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud providers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS, Azure, GCP, OVH, Scaleway&lt;/td&gt;
&lt;td&gt;Kubernetes + AWS via MCP&lt;/td&gt;
&lt;td&gt;Kubernetes only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kubernetes execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;kubectl&lt;/code&gt; in sandboxed pods&lt;/td&gt;
&lt;td&gt;Read-only &lt;code&gt;kubectl get&lt;/code&gt;/&lt;code&gt;describe&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Read-only via Kube API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Other integrations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22+ (PagerDuty, Datadog, Grafana, Slack, Confluence, Bitbucket, Jenkins, etc.)&lt;/td&gt;
&lt;td&gt;30+ toolsets (Prometheus, Grafana, Datadog, Loki, Jira, etc.)&lt;/td&gt;
&lt;td&gt;None — Kubernetes-only by design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge base / RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Weaviate vector search over runbooks + postmortems&lt;/td&gt;
&lt;td&gt;Yes (via toolsets)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependency graph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Memgraph (cross-cloud blast radius)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Postmortem generation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, exports to Confluence&lt;/td&gt;
&lt;td&gt;Investigation reports only&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pull request remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GitHub + Bitbucket with human approval gate&lt;/td&gt;
&lt;td&gt;GitHub PRs in Operator mode&lt;/td&gt;
&lt;td&gt;None — strictly read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP server&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (340+ endpoints, 6 named tools)&lt;/td&gt;
&lt;td&gt;Yes (consumes MCP servers)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM providers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OpenAI, Anthropic, Google, Vertex, OpenRouter, Ollama&lt;/td&gt;
&lt;td&gt;OpenAI, Anthropic, Azure OpenAI, Bedrock, Gemini, Vertex, Ollama&lt;/td&gt;
&lt;td&gt;OpenAI, Azure, Cohere, Bedrock, SageMaker, Gemini, Vertex, HuggingFace, WatsonX, LocalAI, Ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Air-gapped support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Ollama + image tarballs)&lt;/td&gt;
&lt;td&gt;Yes (Ollama)&lt;/td&gt;
&lt;td&gt;Yes (LocalAI / Ollama)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Docker Compose or Helm&lt;/td&gt;
&lt;td&gt;Binary, API server, K8s Operator, Python SDK&lt;/td&gt;
&lt;td&gt;Go binary, K8s operator&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The OSS AI SRE Maturity Spectrum
&lt;/h2&gt;

&lt;p&gt;A useful way to position these tools is on a four-level spectrum of agent capability. Each level is strictly more capable than the one below — and each requires more architectural work to deploy safely.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;What the agent does&lt;/th&gt;
&lt;th&gt;Tools at this level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1 — Diagnostic Explainer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads system state, finds anomalies via deterministic rules, uses an LLM only to explain findings in natural language. No multi-step reasoning. Strictly read-only.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;K8sGPT&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2 — Read-Only Investigator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs an iterative ReAct loop. Picks tools dynamically. Investigates across multiple data sources (metrics, logs, traces, K8s state). Read-only by design.&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;HolmesGPT&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3 — Investigation + Suggestion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Everything in L2, plus opens pull requests with suggested fixes. Humans review and merge. No autonomous writes to infrastructure.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;HolmesGPT (Operator mode)&lt;/strong&gt;, &lt;strong&gt;Aurora&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L4 — Investigation + Approved Remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Everything in L3, plus can execute approved remediation actions (rollbacks, restarts, scale changes) inside guardrails — typically a sandboxed runtime with explicit human approval for destructive operations.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Aurora&lt;/strong&gt; (with Bitbucket connector's &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;human approval gate&lt;/a&gt; for destructive actions)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No open-source tool today operates as a fully autonomous L5 (closed-loop remediation without human approval) — and that's by design. Most serious teams want explicit gates before agents touch production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aurora vs HolmesGPT — which should you choose?
&lt;/h2&gt;

&lt;p&gt;Aurora and HolmesGPT are the two genuinely agentic options. The choice depends on your blast radius.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick HolmesGPT when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your stack is heavily Kubernetes + Prometheus + Grafana and your incidents live there.&lt;/li&gt;
&lt;li&gt;You want a tool that already integrates with 30+ observability sources, including Loki, AlertManager, NewRelic, Datadog APM, OpsGenie, and Slack.&lt;/li&gt;
&lt;li&gt;You value CNCF governance and a steep ecosystem velocity.&lt;/li&gt;
&lt;li&gt;You don't need cross-cloud (AWS APIs, Azure resources, GCP services) reasoning out of the box.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick Aurora when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You operate across multiple clouds (AWS + Azure, GCP + AWS, etc.) and need an agent that can correlate incidents across providers.&lt;/li&gt;
&lt;li&gt;You want auto-generated postmortems exported to Confluence.&lt;/li&gt;
&lt;li&gt;You want the agent to draft remediation PRs against your codebase.&lt;/li&gt;
&lt;li&gt;You need a graph-based blast radius model (Memgraph) for dependency analysis.&lt;/li&gt;
&lt;li&gt;You want an MCP server so your IDE assistants (Cursor, Claude Desktop, Windsurf) can query live incident state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, some teams run both: HolmesGPT for in-cluster Kubernetes triage, Aurora for cross-cloud investigation and postmortem generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aurora vs K8sGPT — which should you choose?
&lt;/h2&gt;

&lt;p&gt;This is closer to "which tool category do you need?" than a head-to-head.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick K8sGPT when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want the absolute simplest entry point to AI for Kubernetes — a single Go binary you can install with Homebrew and run as &lt;code&gt;k8sgpt analyze --explain&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Your needs stop at "explain why this pod is broken" rather than multi-step incident investigation.&lt;/li&gt;
&lt;li&gt;You want the maturity of a 7.7k-star CNCF Sandbox project with rule-based analyzers that won't hallucinate causes (because they are deterministic before the LLM ever sees them).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick Aurora when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need agentic investigation, not just diagnostic explanation.&lt;/li&gt;
&lt;li&gt;You operate beyond Kubernetes — cloud APIs, Terraform, monitoring tools, runbooks.&lt;/li&gt;
&lt;li&gt;You want auto-generated postmortems and remediation PRs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two are complements, not competitors. Many teams run K8sGPT as a lightweight first-line scanner and Aurora (or HolmesGPT) for full incident investigation.&lt;/p&gt;

&lt;h2&gt;
  
  
  HolmesGPT vs K8sGPT — head-to-head
&lt;/h2&gt;

&lt;p&gt;Despite both being CNCF Sandbox projects targeting Kubernetes, these are different categories.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;HolmesGPT&lt;/th&gt;
&lt;th&gt;K8sGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-step AI agent&lt;/td&gt;
&lt;td&gt;Rule-based scanner with LLM explanations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When it shines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Investigating an alert end-to-end across signals&lt;/td&gt;
&lt;td&gt;Diagnosing why a specific resource is unhealthy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds to minutes (multi-step)&lt;/td&gt;
&lt;td&gt;Sub-second per analyzer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher (multiple calls per investigation)&lt;/td&gt;
&lt;td&gt;Lower (one explanation per finding)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hallucination risk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher (agent reasons across signals)&lt;/td&gt;
&lt;td&gt;Lower (deterministic before LLM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best fit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;On-call engineers handling alerts&lt;/td&gt;
&lt;td&gt;Platform teams running periodic cluster audits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;K8sGPT's anonymization feature (which masks resource names and labels before sending to the LLM) is a meaningful privacy advantage that HolmesGPT does not match.&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to use open-source AI SRE
&lt;/h2&gt;

&lt;p&gt;Honest take: open-source AI SRE is the right answer for most engineering-led, security-conscious teams. It's the wrong answer when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don't have the operational capacity to run another stateful service in production.&lt;/li&gt;
&lt;li&gt;You want vendor support with SLAs and a phone number to call at 3 AM.&lt;/li&gt;
&lt;li&gt;Your team is small enough that the LLM-API bill of an investigation-heavy agent will exceed the per-seat price of a SaaS AI SRE.&lt;/li&gt;
&lt;li&gt;You need certifications (SOC2, ISO 27001) at the AI-vendor layer rather than at the cloud-provider layer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to pilot an open-source AI SRE in your team
&lt;/h2&gt;

&lt;p&gt;A six-step, low-risk pilot for any of the three tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick one cluster and one observability source.&lt;/strong&gt; Don't try to cover everything at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install in read-only mode first.&lt;/strong&gt; All three tools default to read-only — keep it that way for the first two weeks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connect one alert source.&lt;/strong&gt; PagerDuty, Datadog, or Grafana — pick the one that's already firing real alerts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run for two weeks alongside human on-call.&lt;/strong&gt; Compare the agent's RCA conclusions to what your engineers determined. Track accuracy and time-to-RCA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feed it your historical context.&lt;/strong&gt; Aurora and HolmesGPT both support runbook + postmortem ingestion. Agents become dramatically more useful with organizational memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand carefully.&lt;/strong&gt; Add more clusters, then enable remediation suggestions, then (only after trust) approved automated actions for specific low-risk patterns.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting started with Aurora
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; is the multi-cloud, multi-tool option among open-source AI SREs. To run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init
make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aurora supports any LLM provider — OpenAI, Anthropic, Google, OpenRouter, or local models via Ollama for air-gapped deployments.&lt;/p&gt;

&lt;p&gt;For the technical side of running an agent that executes &lt;code&gt;kubectl&lt;/code&gt; against production, see the companion piece on &lt;a href="https://www.arvoai.ca/blog/ai-agent-kubectl-safety" rel="noopener noreferrer"&gt;AI agent kubectl safety and sandboxed execution&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/open-source-ai-sre-aurora-vs-holmesgpt-vs-k8sgpt" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>kubernetes</category>
      <category>sre</category>
    </item>
    <item>
      <title>AI SRE: The Complete Guide for Engineering Teams in 2026</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Fri, 24 Apr 2026 21:37:36 +0000</pubDate>
      <link>https://forem.com/siddharth_singh_409bd5267/ai-sre-the-complete-guide-for-engineering-teams-in-2026-51ba</link>
      <guid>https://forem.com/siddharth_singh_409bd5267/ai-sre-the-complete-guide-for-engineering-teams-in-2026-51ba</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; An &lt;strong&gt;AI SRE (AI Site Reliability Engineer)&lt;/strong&gt; is an autonomous AI agent that triages alerts, investigates incidents, performs root cause analysis, and generates postmortems without step-by-step human direction. &lt;a href="https://www.gartner.com/en/documents/7242030" rel="noopener noreferrer"&gt;Gartner projects&lt;/a&gt; that by 2029, 70% of enterprises will deploy agentic AI agents to operate their IT infrastructure, up from less than 5% in 2025. This guide explains what an AI SRE actually does, how it differs from AIOps and traditional SRE, and how to evaluate the commercial and open-source tools available in 2026.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An &lt;strong&gt;AI SRE&lt;/strong&gt; is an autonomous software agent that performs site reliability engineering work — alert triage, incident investigation, root cause analysis, postmortem generation, and in some cases guided remediation — using large language models and production tooling to operate with minimal human direction. Unlike chatbots or copilots, an AI SRE decides what to investigate, which systems to query, and how to synthesize findings into actionable outcomes.&lt;/p&gt;

&lt;p&gt;The category crystallized in 2026. Microsoft made &lt;a href="https://techcommunity.microsoft.com/blog/appsonazureblog/announcing-general-availability-for-the-azure-sre-agent/4500682" rel="noopener noreferrer"&gt;Azure SRE Agent generally available on March 10, 2026&lt;/a&gt;. Komodor reports being named a Representative Vendor in Gartner's 2026 Market Guide for AI SRE Tooling. Open-source options like &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, K8sGPT, and HolmesGPT emerged as credible alternatives to commercial platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an AI SRE?
&lt;/h2&gt;

&lt;p&gt;An AI SRE (AI Site Reliability Engineer) is an autonomous AI agent that performs SRE work — alert triage, incident investigation, root cause analysis, postmortem generation, and guided remediation — without requiring step-by-step human direction.&lt;/p&gt;

&lt;p&gt;Three characteristics distinguish an AI SRE from earlier generations of operations tooling:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Autonomy.&lt;/strong&gt; An AI SRE decides which tools to use and what data to gather. It is not a runbook that executes predefined steps; it is an agent that plans a multi-step investigation based on the specific alert.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access to production.&lt;/strong&gt; An AI SRE reads real infrastructure signals — metrics, logs, traces, Kubernetes events, cloud API responses, deployment history — rather than working only from summaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesis.&lt;/strong&gt; An AI SRE produces structured outputs: a root cause analysis, a timeline, a blast radius assessment, a postmortem, or a remediation PR. It does not stop at "the error rate is elevated."&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why AI SRE Emerged in 2026
&lt;/h2&gt;

&lt;p&gt;The conditions that made AI SRE viable came together between 2024 and 2026:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert volume outpaced human capacity.&lt;/strong&gt; PagerDuty's State of Digital Operations data shows the average on-call engineer receives roughly 50 alerts per week, with only 2–5% requiring real human intervention. A &lt;a href="https://oneuptime.com/blog/post/2026-03-05-alert-fatigue-ai-on-call/view" rel="noopener noreferrer"&gt;2024 Catchpoint study cited by OneUptime&lt;/a&gt; found that 70% of SRE teams list alert fatigue as a top-three operational concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-cloud became the default.&lt;/strong&gt; According to the &lt;a href="https://resources.flexera.com/web/pdf/Flexera-State-of-the-Cloud-Report-2025.pdf" rel="noopener noreferrer"&gt;Flexera 2025 State of the Cloud Report&lt;/a&gt;, organizations use an average of 2.4 public cloud providers, and 70% operate a hybrid cloud strategy. Correlating incidents across AWS, Azure, and GCP by hand is increasingly impractical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change velocity rose faster than reliability tooling.&lt;/strong&gt; The &lt;a href="https://cloud.google.com/devops/state-of-devops" rel="noopener noreferrer"&gt;2025 DORA State of AI-Assisted Software Development report&lt;/a&gt; found that incidents per PR increased 242.7% as AI coding assistants accelerated delivery — without a matching improvement in incident response capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM tool use matured.&lt;/strong&gt; Agent frameworks like LangGraph made it practical to give a language model 30+ tools and let it chain them into a coherent investigation. Claude, GPT-5, and Gemini 2.5+ reached enough reliability at structured tool use to be trusted with read-only production access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gartner codified the category.&lt;/strong&gt; In &lt;a href="https://www.gartner.com/en/documents/7242030" rel="noopener noreferrer"&gt;Predicts 2026: AI Agents Will Transform IT Infrastructure and Operations&lt;/a&gt;, Gartner projected that by 2029, 70% of enterprises will deploy agentic AI to operate IT infrastructure, up from less than 5% in 2025.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does an AI SRE Work?
&lt;/h2&gt;

&lt;p&gt;An AI SRE runs a repeatable loop for every alert it receives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Alert ingestion.&lt;/strong&gt; A monitoring tool (PagerDuty, Datadog, Grafana, BigPanda) fires a webhook. The AI SRE receives the payload and begins investigation without waiting for a human to acknowledge the page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context gathering.&lt;/strong&gt; The agent reads the recent state: pod status, metric trends, deployment history, recent configuration changes, related alerts within a time window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis formation.&lt;/strong&gt; Using the alert semantics plus the gathered context, the agent proposes one or more candidate causes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence collection.&lt;/strong&gt; The agent selects from its tool inventory — running &lt;code&gt;kubectl describe&lt;/code&gt;, querying metrics, searching a vector knowledge base of past postmortems — to test each hypothesis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause synthesis.&lt;/strong&gt; The agent produces a structured RCA: what failed, why, what the blast radius is, which services are affected, whether a recent change likely caused it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation (optional).&lt;/strong&gt; Some AI SREs stop at recommendations. Others generate a PR, roll back a deployment, or restart a service — typically behind a human approval gate for destructive actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem generation.&lt;/strong&gt; The agent assembles a draft postmortem with timeline, contributing factors, impact, and action items, ready for human review and export to Confluence or another docs system.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A trustworthy AI SRE is transparent about this loop — surfacing the evidence it considered, the hypotheses it ruled out, and its confidence in the final answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI SRE vs Traditional SRE vs AIOps
&lt;/h2&gt;

&lt;p&gt;The three categories are often conflated but address different problems.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional SRE&lt;/th&gt;
&lt;th&gt;AIOps&lt;/th&gt;
&lt;th&gt;AI SRE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary function&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human engineers manage reliability&lt;/td&gt;
&lt;td&gt;Anomaly detection, alert correlation&lt;/td&gt;
&lt;td&gt;Autonomous incident investigation and RCA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Investigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual (human reads logs, queries systems)&lt;/td&gt;
&lt;td&gt;Suggests related alerts&lt;/td&gt;
&lt;td&gt;Agent runs multi-step investigation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Root cause analysis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hours, depends on engineer's expertise&lt;/td&gt;
&lt;td&gt;Correlation hints, not causation&lt;/td&gt;
&lt;td&gt;Structured RCA in minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Engineer runs kubectl, aws CLI, dashboards&lt;/td&gt;
&lt;td&gt;Reads pre-ingested telemetry&lt;/td&gt;
&lt;td&gt;Dynamically selects from 20–40+ tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Remediation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human-driven&lt;/td&gt;
&lt;td&gt;Typically suggestions only&lt;/td&gt;
&lt;td&gt;Agentic execution, often with approval gates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge transfer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runbooks, tribal knowledge&lt;/td&gt;
&lt;td&gt;Alert correlation models&lt;/td&gt;
&lt;td&gt;RAG over runbooks and past postmortems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core technology&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Humans plus monitoring dashboards&lt;/td&gt;
&lt;td&gt;ML models for anomaly detection&lt;/td&gt;
&lt;td&gt;LLM agents with tool calling&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The short version: &lt;strong&gt;AIOps tells you what is anomalous. An AI SRE tells you why it is happening and, increasingly, fixes it.&lt;/strong&gt; Traditional SRE is the human discipline both categories augment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Capabilities Should an AI SRE Have?
&lt;/h2&gt;

&lt;p&gt;Serious AI SREs in 2026 share a consistent capability stack:&lt;/p&gt;

&lt;h3&gt;
  
  
  Autonomous multi-step investigation
&lt;/h3&gt;

&lt;p&gt;The agent must plan and execute investigations without requiring humans to choose tools or pass data between steps. Simple tool-calling is not enough — the agent needs memory across steps and the ability to revise hypotheses as evidence arrives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Broad tool access with safe execution
&lt;/h3&gt;

&lt;p&gt;kubectl, aws, az, gcloud, metric queries, log search, deployment history, IaC state. &lt;strong&gt;How tools are executed matters&lt;/strong&gt;: running kubectl on the agent host is a production risk. &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;, for example, runs CLI commands in sandboxed Kubernetes pods with per-invocation credential scoping, not on the agent host.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-cloud and cross-platform reach
&lt;/h3&gt;

&lt;p&gt;With the Flexera 2025 average at 2.4 public clouds per organization, an AI SRE that works only inside AWS or only inside Kubernetes will miss the majority of real incidents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Knowledge base retrieval
&lt;/h3&gt;

&lt;p&gt;Past postmortems, runbooks, and docs searchable by the agent via vector search (RAG). The knowledge your senior SRE built up should be available to the agent on day one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure dependency graph
&lt;/h3&gt;

&lt;p&gt;When a database fails, the agent needs to know which services depend on it. Graph databases like Memgraph are a common choice for modeling cross-service and cross-cloud relationships.&lt;/p&gt;

&lt;h3&gt;
  
  
  Postmortem generation
&lt;/h3&gt;

&lt;p&gt;Structured timeline, contributing factors, blast radius, action items — produced during the investigation, not written manually afterward.&lt;/p&gt;

&lt;h3&gt;
  
  
  Remediation with guardrails
&lt;/h3&gt;

&lt;p&gt;Generating PRs, rolling back deployments, restarting services. Destructive actions should require human approval. Aurora's Bitbucket connector, added in &lt;a href="https://github.com/Arvo-AI/aurora/releases/tag/v1.1.0" rel="noopener noreferrer"&gt;v1.1.0&lt;/a&gt;, requires explicit human approval before agents can write.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM flexibility
&lt;/h3&gt;

&lt;p&gt;OpenAI, Anthropic, Google, and local models via Ollama for air-gapped deployments. Vendor lock-in on LLM is a real risk as model quality and pricing evolve rapidly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI SRE Landscape in 2026
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Commercial platforms
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://azure.microsoft.com/en-us/products/sre-agent/" rel="noopener noreferrer"&gt;Azure SRE Agent&lt;/a&gt;&lt;/strong&gt; — Microsoft's first-party agent, &lt;a href="https://techcommunity.microsoft.com/blog/appsonazureblog/announcing-general-availability-for-the-azure-sre-agent/4500682" rel="noopener noreferrer"&gt;generally available since March 10, 2026&lt;/a&gt;. Deep Azure integration, adjustable autonomy from "review recommendations" to "fully automated," billed via Azure Agent Units on pay-as-you-go.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://rootly.com/ai-sre" rel="noopener noreferrer"&gt;Rootly AI SRE&lt;/a&gt;&lt;/strong&gt; — AI layer built on top of a mature incident management platform. Transparent chain-of-thought reasoning. SOC2 since January 2022. Depends on external observability tools for telemetry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://komodor.com/ai-sre-2026/" rel="noopener noreferrer"&gt;Komodor Klaudia&lt;/a&gt;&lt;/strong&gt; — Kubernetes-specialized AI SRE. Komodor reports Klaudia achieves 95% accuracy across real-world incident scenarios and that Komodor was named a Representative Vendor in Gartner's 2026 Market Guide for AI SRE Tooling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://incident.io/ai-sre" rel="noopener noreferrer"&gt;incident.io AI SRE&lt;/a&gt;&lt;/strong&gt; — Multi-agent AI investigation integrated into an incident response platform, with code fix suggestions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.traversal.com/" rel="noopener noreferrer"&gt;Traversal&lt;/a&gt;&lt;/strong&gt; — Focused on large distributed systems using causal ML. Traversal reports a 38% MTTR reduction at DigitalOcean. Supports on-prem and bring-your-own model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolve.ai&lt;/strong&gt; — Pushes toward high-autonomy resolution with guardrails.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Open-source AI SRE options
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt;&lt;/strong&gt; — Apache 2.0, self-hosted, multi-cloud (AWS via STS AssumeRole, Azure via Service Principal, GCP, OVH, Scaleway, Kubernetes). LangGraph-orchestrated agents with 30+ tools, Memgraph dependency graph, Weaviate RAG, postmortem export to Confluence, PR generation via GitHub and Bitbucket. Works with any LLM (OpenAI, Anthropic, Google, OpenRouter, Ollama).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/k8sgpt-ai/k8sgpt" rel="noopener noreferrer"&gt;K8sGPT&lt;/a&gt;&lt;/strong&gt; — Open-source CLI for scanning Kubernetes clusters and explaining failures in plain English. Narrower scope than a full AI SRE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/robusta-dev/holmesgpt" rel="noopener noreferrer"&gt;HolmesGPT&lt;/a&gt;&lt;/strong&gt; — Open-source cross-stack SRE agent covering Kubernetes, Prometheus, logs, and Slack workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coroot (Community Edition)&lt;/strong&gt; — Kubernetes observability plus AI-assisted RCA. Community Edition is free; commercial tier is priced transparently from $1 per monitored CPU core per month.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Open-Source vs Commercial AI SRE
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Consideration&lt;/th&gt;
&lt;th&gt;Open-Source&lt;/th&gt;
&lt;th&gt;Commercial&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data residency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully self-hosted; incident data stays in your environment&lt;/td&gt;
&lt;td&gt;Usually SaaS; incident data leaves your perimeter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free software; you pay for infra and LLM API usage&lt;/td&gt;
&lt;td&gt;Per-seat or per-incident pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM choice&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bring any provider, including local via Ollama&lt;/td&gt;
&lt;td&gt;Often bundled or restricted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audit transparency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Source code available; you can audit how the agent behaves&lt;/td&gt;
&lt;td&gt;Typically black-box&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Support and managed ops&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Community plus self-managed&lt;/td&gt;
&lt;td&gt;Vendor support, SLAs, managed infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time to deploy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Longer — self-hosting has setup cost&lt;/td&gt;
&lt;td&gt;Shorter — SaaS onboarding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Customization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fork, modify, add tools&lt;/td&gt;
&lt;td&gt;Limited to what the vendor exposes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For regulated industries (finance, healthcare, government), air-gapped deployments, or teams already operating their own Kubernetes, open-source AI SRE is often the right fit. For teams prioritizing fastest time to value, commercial platforms win.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Evaluate an AI SRE Tool
&lt;/h2&gt;

&lt;p&gt;If you are piloting an AI SRE in 2026, these are the questions to answer before committing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How does the agent actually execute commands?&lt;/strong&gt; Host process, container, sandboxed pod? Read-only or write? What credentials does it use?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Which alerts can it investigate today?&lt;/strong&gt; Ask for specific integrations by name (PagerDuty, Datadog, CloudWatch) and test with your own alert payloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens when it is wrong?&lt;/strong&gt; How does the agent surface low-confidence answers? Can you see the evidence it gathered?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can it handle multi-cloud?&lt;/strong&gt; If you run on more than one cloud, does it correlate across providers or investigate each in isolation?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does it learn from past incidents?&lt;/strong&gt; Does it ingest your existing runbooks and postmortems? How?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What is the remediation model?&lt;/strong&gt; Suggestions only? PRs with human approval? Direct execution? Where are the guardrails?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Which LLM does it use — and can you change it?&lt;/strong&gt; LLM cost and quality move quickly. Lock-in is a risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where does your incident data go?&lt;/strong&gt; Self-hosted, vendor cloud, LLM provider? Read the data flow carefully.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Limitations of AI SREs in 2026
&lt;/h2&gt;

&lt;p&gt;The category is real but not a silver bullet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Novel failure modes.&lt;/strong&gt; Agents excel at recognizing patterns similar to past incidents. Genuinely new failures still often require human judgment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organizational root causes.&lt;/strong&gt; "The deploy pipeline does not validate environment variables" is the kind of root cause an AI SRE can surface. "We do not have enough staff to maintain this service" is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM cost at scale.&lt;/strong&gt; Complex investigations can consume hundreds of LLM calls. Local inference via Ollama mitigates this but requires GPU infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool coverage gaps.&lt;/strong&gt; An AI SRE can only investigate systems it has tools for. Legacy systems, internal tooling, and unusual stacks require custom connectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust-building takes time.&lt;/strong&gt; Teams typically start with the agent in "observe" mode, graduate to "suggest," and only later enable autonomous remediation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://cloud.google.com/devops/state-of-devops" rel="noopener noreferrer"&gt;DORA 2025 report&lt;/a&gt; is instructive: AI improves throughput but can increase instability in teams without strong platform engineering foundations. AI SRE tools amplify existing practices more than they fix broken ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Pilot an AI SRE in Your Team
&lt;/h2&gt;

&lt;p&gt;A low-risk pilot follows six steps. Expect it to take four to six weeks end-to-end.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick one service and one alert source.&lt;/strong&gt; Do not try to cover everything at once. Choose a service your team knows well and a monitoring tool you already use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy the AI SRE in read-only mode.&lt;/strong&gt; Connect it to alerts, read-only cloud credentials, and your existing observability tools. Do not grant write permissions yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run for two weeks, compare to human RCA.&lt;/strong&gt; Let the agent investigate every incident that fires. Compare its root cause conclusions to what the on-call engineer eventually determined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure accuracy and time-to-RCA.&lt;/strong&gt; Two metrics matter: was the agent's root cause correct, and how much faster was it than the human?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand scope gradually.&lt;/strong&gt; Add more services, enable remediation suggestions, then (only after trust is established) approved automated actions for specific low-risk patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feed historical context.&lt;/strong&gt; Ingest your existing runbooks and past postmortems into the agent's knowledge base. Agents become dramatically more useful with organizational memory.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started with Aurora
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; is an open-source (Apache 2.0) AI SRE built by Arvo AI. It autonomously investigates incidents across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes, integrating with 22+ tools including PagerDuty, Datadog, Grafana, Slack, Bitbucket, and Confluence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init
make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aurora works with any LLM provider — OpenAI, Anthropic, Google Gemini, OpenRouter, or local models via Ollama for air-gapped deployments. See the &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;full documentation&lt;/a&gt; or the &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;original post on arvoai.ca&lt;/a&gt; for more context.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://www.arvoai.ca/blog/ai-sre-complete-guide" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>ai</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Opsgenie 2026: Features, Pricing, EOL &amp; Alternatives</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Tue, 21 Apr 2026 17:36:17 +0000</pubDate>
      <link>https://forem.com/siddharth_singh_409bd5267/opsgenie-2026-features-pricing-eol-alternatives-1bm0</link>
      <guid>https://forem.com/siddharth_singh_409bd5267/opsgenie-2026-features-pricing-eol-alternatives-1bm0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR — Opsgenie is ending.&lt;/strong&gt; Atlassian stopped new Opsgenie signups on &lt;strong&gt;June 4, 2025&lt;/strong&gt; and will shut the service down permanently on &lt;strong&gt;April 5, 2027&lt;/strong&gt;. Any data not migrated by that date will be deleted. Atlassian's official migration paths are Jira Service Management (JSM) Operations and Compass. Many teams are using the forced migration as a chance to evaluate alternatives — especially AI-powered options that weren't available when Opsgenie was originally adopted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Opsgenie is an alerting and on-call management platform that was acquired by Atlassian in 2018. For years it was one of the most widely adopted tools in the SRE stack, sitting alongside PagerDuty and xMatters. In March 2025 Atlassian announced that Opsgenie's capabilities would be absorbed into Jira Service Management and Compass, and that the standalone product would be retired.&lt;/p&gt;

&lt;p&gt;This guide covers what Opsgenie is, how it works, what it costs, the exact end-of-life timeline, what happens to your data when it shuts down, the official migration paths, and the current landscape of alternatives. Every claim is linked to an official source.&lt;/p&gt;

&lt;p&gt;Last updated: April 21, 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Opsgenie?
&lt;/h2&gt;

&lt;p&gt;Opsgenie is a cloud-based incident alerting and on-call management platform for DevOps and SRE teams. It routes alerts from 200+ monitoring tools to the right on-call responders via SMS, voice, email, push, Slack, and Microsoft Teams. &lt;a href="https://www.atlassian.com/software/opsgenie" rel="noopener noreferrer"&gt;Atlassian acquired Opsgenie in 2018&lt;/a&gt; and will retire the standalone product on April 5, 2027.&lt;/p&gt;

&lt;p&gt;The tool was founded in 2012 and its capabilities are being absorbed into &lt;a href="https://www.atlassian.com/software/jira/service-management" rel="noopener noreferrer"&gt;Jira Service Management&lt;/a&gt; and &lt;a href="https://www.atlassian.com/software/compass" rel="noopener noreferrer"&gt;Compass&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Opsgenie at a glance vs top alternatives
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Opsgenie (retiring)&lt;/th&gt;
&lt;th&gt;JSM Operations&lt;/th&gt;
&lt;th&gt;PagerDuty&lt;/th&gt;
&lt;th&gt;Aurora&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Available after April 2027&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting price&lt;/td&gt;
&lt;td&gt;N/A (closed)&lt;/td&gt;
&lt;td&gt;Per-agent&lt;/td&gt;
&lt;td&gt;$21/user/mo&lt;/td&gt;
&lt;td&gt;Free (OSS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in AI RCA&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Add-on ($699+/mo)&lt;/td&gt;
&lt;td&gt;Yes (agentic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-call + escalations&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Via integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Opsgenie End-of-Life Timeline (Official)
&lt;/h2&gt;

&lt;p&gt;Atlassian announced the end of Opsgenie in &lt;a href="https://www.atlassian.com/blog/announcements/evolution-of-it-operations" rel="noopener noreferrer"&gt;The Evolution of IT Operations&lt;/a&gt;. The three critical dates are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Milestone&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End of Sale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;June 4, 2025&lt;/td&gt;
&lt;td&gt;No new signups, upgrades, or downgrades on standalone Opsgenie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End of Support / Shutdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;April 5, 2027&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opsgenie service is turned off; REST APIs stop responding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Deletion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;April 5, 2027&lt;/td&gt;
&lt;td&gt;All unmigrated alerts, schedules, escalation policies, integrations, and incidents are permanently deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Existing customers can continue using Opsgenie through April 5, 2027, but cannot expand their footprint. After migration, Opsgenie and the new JSM or Compass instance can run in parallel for up to 120 days, after which Opsgenie is automatically switched off (&lt;a href="https://support.atlassian.com/opsgenie/docs/what-happens-when-opsgenie-is-turned-off/" rel="noopener noreferrer"&gt;official source&lt;/a&gt;).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Opsgenie REST APIs will continue to work until April 5, 2027. However, Atlassian recommends updating all API endpoints before Opsgenie is turned off to avoid any disruptions." — &lt;a href="https://support.atlassian.com/opsgenie/docs/what-happens-when-opsgenie-is-turned-off/" rel="noopener noreferrer"&gt;Atlassian Support&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Opsgenie Features
&lt;/h2&gt;

&lt;p&gt;Opsgenie's core feature set is mature — this is a 13-year-old product. Here is what it currently provides, verified from Atlassian's documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrations
&lt;/h3&gt;

&lt;p&gt;Opsgenie ships with &lt;a href="https://www.atlassian.com/software/opsgenie/integrations" rel="noopener noreferrer"&gt;over 200 integrations&lt;/a&gt; with monitoring, ticketing, chat, and ITSM tools. Most are bidirectional — alerts flow in, and acknowledgement or closure events flow back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Channel Notifications
&lt;/h3&gt;

&lt;p&gt;Supported notification channels, per &lt;a href="https://support.atlassian.com/opsgenie/docs/send-voice-and-sms-notifications/" rel="noopener noreferrer"&gt;Atlassian documentation&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SMS&lt;/strong&gt; — Aggregated at a minimum 1-minute interval; users can acknowledge or close alerts via reply&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice calls&lt;/strong&gt; — Capped at 2 minutes; dial-pad actions (1 = read, 2 = close, 3 = acknowledge, 4 = escalate)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email&lt;/strong&gt; — With inline action buttons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push notifications&lt;/strong&gt; — iOS and Android with swipe-to-ack/close&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack&lt;/strong&gt; — &lt;a href="https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-slack-app/" rel="noopener noreferrer"&gt;Bidirectional integration&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Teams&lt;/strong&gt; — &lt;a href="https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-microsoft-teams/" rel="noopener noreferrer"&gt;Bidirectional integration&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  On-Call Management
&lt;/h3&gt;

&lt;p&gt;Opsgenie supports daily, weekly, and custom rotation types including follow-the-sun, with ad-hoc overrides, "Take on-call for an hour" self-service, and a "No-One" participant for scheduled gaps (&lt;a href="https://support.atlassian.com/opsgenie/docs/manage-on-call-schedules-and-rotations/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Escalation Policies
&lt;/h3&gt;

&lt;p&gt;Default escalation is 5 minutes, then 10 minutes, repeatable up to 20 times per alert. Acknowledgement or closure stops the policy (&lt;a href="https://support.atlassian.com/opsgenie/docs/how-do-escalations-work-in-opsgenie/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Heartbeat Monitoring
&lt;/h3&gt;

&lt;p&gt;A "dead man's switch" — if an expected HTTP ping doesn't arrive within the configured interval (minimum 1 minute), Opsgenie fires an alert. Available on &lt;strong&gt;Standard and Enterprise plans only&lt;/strong&gt; (&lt;a href="https://support.atlassian.com/opsgenie/docs/check-system-health-with-opsgenie-heartbeats/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Alert Deduplication, Suppression, and Grouping
&lt;/h3&gt;

&lt;p&gt;Opsgenie uses an &lt;code&gt;alias&lt;/code&gt; field to deduplicate alerts — identical alias values increment a counter on the existing alert instead of creating a new one. The counter stops logging at 100 occurrences, but deduplication continues (&lt;a href="https://support.atlassian.com/opsgenie/docs/what-is-alert-de-duplication/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Delay policies can hold notifications for a fixed time, until a deduplication threshold is reached, or until an occurrence rate threshold triggers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Routing Rules
&lt;/h3&gt;

&lt;p&gt;Each team can have &lt;strong&gt;up to 100 routing rules&lt;/strong&gt;, evaluated top-down with first-match semantics. Free and Essentials plans are limited to &lt;strong&gt;1 routing rule&lt;/strong&gt; and can only route by priority or tags. Standard and Enterprise plans support full-field routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reporting by Plan
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Report&lt;/th&gt;
&lt;th&gt;Essentials&lt;/th&gt;
&lt;th&gt;Standard&lt;/th&gt;
&lt;th&gt;Enterprise&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Notifications + API Usage&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly Overview (Looker)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advanced reporting / MTTA / MTTR&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team Reports&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global Reports + Looker dashboards&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-Incident Analysis&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Source: &lt;a href="https://www.atlassian.com/software/opsgenie/advanced-reporting-and-analytics" rel="noopener noreferrer"&gt;Opsgenie Advanced Reporting&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mobile App
&lt;/h3&gt;

&lt;p&gt;Opsgenie's &lt;a href="https://www.atlassian.com/software/opsgenie/mobile-app" rel="noopener noreferrer"&gt;iOS and Android apps&lt;/a&gt; support swipe-to-acknowledge from the lock screen and iOS Critical Alerts that override Do Not Disturb and silent mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  SSO / SAML
&lt;/h3&gt;

&lt;p&gt;SSO is available on &lt;strong&gt;Standard and Enterprise plans only&lt;/strong&gt;, with supported providers including Google, Azure AD, Okta, OneLogin, Ping Identity, and Microsoft AD FS (&lt;a href="https://support.atlassian.com/opsgenie/docs/configure-sso-for-opsgenie/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Compliance
&lt;/h3&gt;

&lt;p&gt;Opsgenie is covered under Atlassian's Trust program with &lt;strong&gt;SOC 2 Type II (annual), ISO/IEC 27001, ISO/IEC 27018, CSA, and TISAX AL2&lt;/strong&gt; certifications, plus a pre-signed GDPR DPA (&lt;a href="https://www.atlassian.com/software/opsgenie/security" rel="noopener noreferrer"&gt;official page&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Residency
&lt;/h3&gt;

&lt;p&gt;Opsgenie is offered in &lt;strong&gt;US and EU&lt;/strong&gt; regions, both hosted on AWS (&lt;a href="https://support.atlassian.com/opsgenie/docs/opsgenies-data-residency/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  Who Should Use Opsgenie in 2026?
&lt;/h2&gt;

&lt;p&gt;With end-of-sale already behind us, Opsgenie is only relevant to &lt;strong&gt;existing subscribers&lt;/strong&gt; planning their exit. New teams cannot sign up. The question for existing subscribers is whether to stay with Atlassian (migrate to JSM or Compass) or evaluate alternatives.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stay with Atlassian (migrate to JSM Operations)&lt;/strong&gt; if you are already a Jira Service Management customer, need ITSM workflows (change, problem, incident), and are comfortable with the Premium-tier price increase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay with Atlassian (migrate to Compass)&lt;/strong&gt; if you are a DevOps or SRE team that wants alerting paired with a software component catalog and service ownership model, not ITSM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Switch to a dedicated alerting tool&lt;/strong&gt; (PagerDuty, ilert, Squadcast) if you want deeper alerting features and do not need Atlassian platform integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Switch to AI-powered incident management&lt;/strong&gt; (incident.io, Rootly, Aurora) if you want autonomous investigation and root cause analysis, not just alert routing.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Opsgenie Pricing (Standalone, 100-User Reference)
&lt;/h2&gt;

&lt;p&gt;Pricing below is for standalone Opsgenie with 100 users — sourced from the &lt;a href="https://www.atlassian.com/software/opsgenie/pricing" rel="noopener noreferrer"&gt;official Opsgenie pricing page&lt;/a&gt;. New signups are closed, so these numbers apply only to existing customers on legacy plans.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Monthly&lt;/th&gt;
&lt;th&gt;Annual&lt;/th&gt;
&lt;th&gt;Routing Rules&lt;/th&gt;
&lt;th&gt;Heartbeats&lt;/th&gt;
&lt;th&gt;SSO&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0 (up to 5 users)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Essentials&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$11.55/user/mo&lt;/td&gt;
&lt;td&gt;$9.45/user/mo&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~ $29/user/mo&lt;/td&gt;
&lt;td&gt;Discounted&lt;/td&gt;
&lt;td&gt;100 per team&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~ $39/user/mo&lt;/td&gt;
&lt;td&gt;Discounted&lt;/td&gt;
&lt;td&gt;100 per team&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Enterprise-exclusive features include Incident Command Center (built-in video chatroom tied to incidents), Stakeholders (notification-only users), Service Subscriptions, Incident Templates, and Post-Incident Analysis.&lt;/p&gt;

&lt;p&gt;Incoming call routing is charged separately: &lt;strong&gt;$0.10 per minute&lt;/strong&gt; for US/Canada and &lt;strong&gt;$0.35 per minute&lt;/strong&gt; internationally after the free tier.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happens When Opsgenie Is Turned Off
&lt;/h2&gt;

&lt;p&gt;On April 5, 2027, Atlassian will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Disable the Opsgenie web application, mobile apps, and REST APIs&lt;/li&gt;
&lt;li&gt;Delete all data that was &lt;strong&gt;not migrated&lt;/strong&gt; to JSM or Compass — alerts, on-call schedules, escalation policies, integrations, incidents, notes, attachments&lt;/li&gt;
&lt;li&gt;Stop accepting any incoming webhooks or notifications&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; unlike the legacy Opsgenie Enterprise plan, JSM automatically deletes alert data after a retention window. Once alert data is deleted in JSM, it cannot be recovered. Export anything you need for compliance or audit before migration (&lt;a href="https://support.atlassian.com/opsgenie/docs/what-happens-when-opsgenie-is-turned-off/" rel="noopener noreferrer"&gt;official source&lt;/a&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  Opsgenie Migration Paths: JSM vs Compass
&lt;/h2&gt;

&lt;p&gt;Atlassian offers two official migration destinations. Both share the same underlying Operations engine — schedules, alerts, and policies sync bidirectionally — but the wrapping product and pricing differ (&lt;a href="https://support.atlassian.com/opsgenie/docs/managing-operations-in-compass-and-jira-service-management-at-the-same-time/" rel="noopener noreferrer"&gt;managing operations across both&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Jira Service Management (JSM) Operations
&lt;/h3&gt;

&lt;p&gt;JSM Operations is the ITSM-centric path — alerts are paired with change, problem, and incident workflows. JSM pricing (&lt;a href="https://www.atlassian.com/software/jira/service-management/pricing" rel="noopener noreferrer"&gt;official page&lt;/a&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;JSM Plan&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Outbound Webhooks&lt;/th&gt;
&lt;th&gt;Incident Command Center&lt;/th&gt;
&lt;th&gt;Post-Incident Reviews&lt;/th&gt;
&lt;th&gt;99.9% SLA&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0 (up to 3 agents)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-agent&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Premium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-agent&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Contact sales&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;99.95%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Opsgenie features that &lt;strong&gt;do not carry over&lt;/strong&gt; to JSM Operations, per &lt;a href="https://support.atlassian.com/jira-service-management-cloud/docs/start-shifting-from-opsgenie-to-jira-service-management/" rel="noopener noreferrer"&gt;Atlassian's shifting guide&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incoming Call Routing integration is not supported&lt;/li&gt;
&lt;li&gt;Stakeholder role — custom Opsgenie roles default to User&lt;/li&gt;
&lt;li&gt;Alert creation rules from Opsgenie do not migrate&lt;/li&gt;
&lt;li&gt;Legacy &lt;code&gt;api.opsgenie.com/v1/services&lt;/code&gt; endpoint stops working&lt;/li&gt;
&lt;li&gt;Chat integrations must be reconnected manually&lt;/li&gt;
&lt;li&gt;The old Opsgenie mobile app stops working — responders switch to the Jira mobile app&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Compass
&lt;/h3&gt;

&lt;p&gt;Compass is positioned as a software component catalog + alerting platform aimed at DevOps, SRE, and Platform Engineering teams rather than ITSM. Compass pricing (&lt;a href="https://www.atlassian.com/software/compass/pricing" rel="noopener noreferrer"&gt;official page&lt;/a&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compass Plan&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Alerting&lt;/th&gt;
&lt;th&gt;Heartbeats&lt;/th&gt;
&lt;th&gt;99.9% SLA&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0 (up to 3 full users)&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$8/user/mo&lt;/td&gt;
&lt;td&gt;Yes (150+ integrations)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Premium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$25/user/mo&lt;/td&gt;
&lt;td&gt;Advanced&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Migration Friction
&lt;/h3&gt;

&lt;p&gt;Real complaints from the &lt;a href="https://community.atlassian.com/forums/Jira-Service-Management/Replacement-for-Opsgenie/qaq-p/2967670" rel="noopener noreferrer"&gt;Atlassian Community&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Price increases&lt;/strong&gt; — JSM Premium is widely reported as more expensive than standalone Opsgenie Standard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature parity gaps&lt;/strong&gt; — some users need JSM &lt;em&gt;and&lt;/em&gt; Compass together to match Opsgenie's alert processing depth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;120-day forced cutover&lt;/strong&gt; — Opsgenie auto-shuts-down 120 days after migration begins; Atlassian has &lt;a href="https://community.atlassian.com/forums/Jira-Service-Management/Extend-120-day-window-to-shutdown-of-Opsgenie-after-migration/qaq-p/3084093" rel="noopener noreferrer"&gt;declined requests&lt;/a&gt; to extend the window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Split paths confusion&lt;/strong&gt; — some features only exist in JSM, others only in Compass, forcing customers to choose or buy both&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One user put it bluntly: "Switching to Compass seems like buying a new car just to listen to the radio."&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Teams Are Evaluating Alternatives Instead of Migrating
&lt;/h2&gt;

&lt;p&gt;The forced migration has created a rare evaluation moment. Teams that adopted Opsgenie in 2018 are re-evaluating the entire category with three shifts in mind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI-native incident management has arrived.&lt;/strong&gt; Products like Aurora, incident.io AI SRE, Rootly AI, and PagerDuty Advance didn't exist when most Opsgenie contracts were signed. Per &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-10-29-gartner-survey-54-percent-of-infrastructure-and-operations-leaders-are-adopting-artificial-intelligence-to-cut-costs" rel="noopener noreferrer"&gt;Gartner (October 2025)&lt;/a&gt;, 54% of I&amp;amp;O leaders are now adopting AI in operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-call burnout is a hiring and retention problem.&lt;/strong&gt; The &lt;a href="https://www.catchpoint.com/learn/sre-report-2025" rel="noopener noreferrer"&gt;Catchpoint SRE Report 2025&lt;/a&gt; found that roughly 70% of SREs cite on-call stress as a direct cause of burnout, and toil rose to 30% of SRE work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Downtime costs have climbed.&lt;/strong&gt; &lt;a href="https://www.pagerduty.com/newsroom/study-cost-of-incidents/" rel="noopener noreferrer"&gt;PagerDuty's 2024 research&lt;/a&gt; put the average cost of a major incident at $794,000, or $4,537 per minute. &lt;a href="https://itic-corp.com/itic-2024-hourly-cost-of-downtime-part-2/" rel="noopener noreferrer"&gt;ITIC's 2024 survey&lt;/a&gt; found 97% of large enterprises say an hour of downtime costs them over $100,000.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Against this backdrop, "like-for-like Opsgenie replacement" is no longer the only question — many teams are asking whether the replacement should also do autonomous investigation, not just alerting.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"By 2030, 75% of IT work will be human plus AI, 25% will be AI-only, and zero percent will be human-only." — &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-11-10-gartner-survey-finds-artificial-intelligence-will-touch-all-information-technology-work-by-2030" rel="noopener noreferrer"&gt;Gartner CIO survey of 700+ CIOs, 2025&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Top Opsgenie Alternatives in 2026
&lt;/h2&gt;

&lt;p&gt;Verified pricing and capabilities from each vendor's official site. Last checked April 2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Starting price&lt;/th&gt;
&lt;th&gt;Free plan&lt;/th&gt;
&lt;th&gt;Open source&lt;/th&gt;
&lt;th&gt;AI-native&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aurora by Arvo AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0 self-hosted&lt;/td&gt;
&lt;td&gt;Yes (OSS)&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Yes (agentic)&lt;/td&gt;
&lt;td&gt;OSS teams wanting alerting + autonomous RCA in one stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PagerDuty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.pagerduty.com/pricing/incident-management/" rel="noopener noreferrer"&gt;$21/user/mo&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;14-day trial&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (PagerDuty Advance, $415+/mo)&lt;/td&gt;
&lt;td&gt;Enterprises wanting the incumbent with AI add-ons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ilert&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to €49/user/mo Scale&lt;/td&gt;
&lt;td&gt;Yes (5 responders)&lt;/td&gt;
&lt;td&gt;Partial (MCP server)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;EU-based teams requiring GDPR data residency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Squadcast&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.squadcast.com/pricing" rel="noopener noreferrer"&gt;$9/user/mo Pro&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Yes (5 users)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Small SRE teams on tight budgets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rootly OnCall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;From $20/user/mo&lt;/td&gt;
&lt;td&gt;Trial&lt;/td&gt;
&lt;td&gt;Partial (MCP, Agents JSON)&lt;/td&gt;
&lt;td&gt;Yes (AI SRE standalone)&lt;/td&gt;
&lt;td&gt;Teams wanting modular IR + on-call + AI SRE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;incident.io On-call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$19 base + $10 add-on&lt;/td&gt;
&lt;td&gt;Trial&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (AI SRE)&lt;/td&gt;
&lt;td&gt;Slack-native incident coordination with AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FireHydrant Signals&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Usage-based&lt;/td&gt;
&lt;td&gt;Trial&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (AI Copilot)&lt;/td&gt;
&lt;td&gt;Teams preferring pay-per-alert over per-seat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;xMatters&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.xmatters.com/pricing" rel="noopener noreferrer"&gt;$39/user/mo base&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Yes (10 users)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Everbridge customers needing codeless workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana OnCall OSS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;AGPLv3 (archived)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Not recommended&lt;/strong&gt; — archived March 24, 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Product Notes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PagerDuty&lt;/strong&gt; — Most mature alerting product. PagerDuty Advance adds AI agents (SRE, Scribe, Shift) but requires a paid base plan and a separate $415+/mo Advance subscription. AIOps features require a $699+/mo add-on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ilert&lt;/strong&gt; — EU-hosted with a clear GDPR and data-sovereignty story; the &lt;a href="https://www.ilert.com/product/ilert-ai" rel="noopener noreferrer"&gt;AI SRE&lt;/a&gt; opts out of LLM training on customer data. Free tier includes 5 responders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Squadcast&lt;/strong&gt; — &lt;a href="https://www.solarwinds.com/company/newsroom/press-releases/solarwinds-acquires-squadcast-unifying-observability-and-incident-response" rel="noopener noreferrer"&gt;Acquired by SolarWinds on March 3, 2025&lt;/a&gt;. Roadmap now driven by SolarWinds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rootly&lt;/strong&gt; — Rootly AI Labs launched February 20, 2026; Rootly MCP GA April 2, 2026. Rootly sells IR, On-Call, and AI SRE as standalone products.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;incident.io&lt;/strong&gt; — &lt;a href="https://incident.io/blog/introducing-ai-sre" rel="noopener noreferrer"&gt;$62M Series B&lt;/a&gt; funded the launch of AI SRE — an always-on agent that investigates alerts, drafts PRs, and can autoresolve incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FireHydrant&lt;/strong&gt; — &lt;a href="https://firehydrant.com/blog/firehydrant-to-be-acquired-by-freshworks/" rel="noopener noreferrer"&gt;Acquisition by Freshworks expected to close Q1 2026&lt;/a&gt;; FireHydrant will become the incident layer inside Freshservice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grafana OnCall&lt;/strong&gt; — &lt;a href="https://grafana.com/blog/grafana-oncall-maintenance-mode/" rel="noopener noreferrer"&gt;Entered maintenance mode March 11, 2025 and archived March 24, 2026&lt;/a&gt;. Do not start new deployments. Grafana is consolidating on a unified Cloud IRM app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Splunk On-Call (VictorOps)&lt;/strong&gt; — Pricing not publicly listed. &lt;a href="https://newsroom.cisco.com/c/r/newsroom/en/us/a/y2024/m03/cisco-completes-acquisition-of-splunk.html" rel="noopener noreferrer"&gt;Cisco completed its $28B Splunk acquisition in March 2024&lt;/a&gt;; no official EOL announcement as of April 2026, but the product has seen minimal public investment since.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Aurora Integrates with Opsgenie and JSM Operations
&lt;/h2&gt;

&lt;p&gt;Aurora is &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;open-source agentic incident management&lt;/a&gt; that works alongside Opsgenie (and the JSM Operations successor). Most AI incident tools have already deprecated Opsgenie support ahead of the 2027 shutdown — Aurora supports both so teams can run their migration on their own timeline. The integration is &lt;a href="https://arvo-ai.github.io/aurora/docs/integrations/opsgenie-jsm/" rel="noopener noreferrer"&gt;fully documented in Aurora's docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Aurora does with Opsgenie alerts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bidirectional authentication&lt;/strong&gt; — Accepts either a native Opsgenie GenieKey (US or EU region) or a JSM Operations Atlassian API token. Credentials are encrypted in HashiCorp Vault.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Webhook ingestion&lt;/strong&gt; — Receives Create, Acknowledge, Close, and custom alert actions. Only Create triggers an investigation, preventing duplicates from acknowledgement webhooks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert correlation&lt;/strong&gt; — Aurora's AlertCorrelator groups incoming alerts with existing incidents by service, title, and time proximity. Correlated alerts attach to the parent incident instead of spawning a new one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Priority mapping&lt;/strong&gt; — Opsgenie priorities map deterministically: P1 → critical, P2 → high, P3 → medium, P4/P5 → low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service extraction&lt;/strong&gt; — Aurora reads alerts for a &lt;code&gt;service:xxx&lt;/code&gt; tag first, then falls back to the source and entity fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous RCA&lt;/strong&gt; — On alert creation, Aurora creates an incident record, generates an AI summary, and launches a LangGraph-orchestrated agent that queries your cloud infrastructure to find the root cause.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bidirectional JSM commenting&lt;/strong&gt; — For JSM Operations users, Aurora posts an "RCA in progress" comment back onto the linked Jira incident and updates it with findings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chatbot query surface&lt;/strong&gt; — Engineers can ask Aurora in natural language: &lt;em&gt;"Who is on-call right now?"&lt;/em&gt;, &lt;em&gt;"Show me P1 alerts from the last 24 hours"&lt;/em&gt;, &lt;em&gt;"Get details for alert ABC-123"&lt;/em&gt;. Aurora queries 8 Opsgenie resource types (alerts, alert details, incidents, incident details, services, on-call, schedules, teams) via parallel API calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"Most AI investigation tools only work with PagerDuty. We built Aurora to meet SRE teams where they already live — including Opsgenie and JSM — so AI-powered RCA isn't gated on migrating your alerting stack first." — Noah Casarotto-Dinning, CEO at Arvo AI&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How to Migrate Off Opsgenie Before April 5, 2027
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; administrator access to your Opsgenie account, access to your monitoring stack, and a target destination decided (JSM Operations, Compass, or a third-party alternative).&lt;/p&gt;

&lt;h3&gt;
  
  
  If You Are Staying with Atlassian
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inventory your Opsgenie config.&lt;/strong&gt; Document integrations, escalation policies, routing rules, heartbeats, on-call schedules, and custom roles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose JSM Operations vs Compass.&lt;/strong&gt; Pick JSM if you need ITSM workflows (change, problem, incident); pick Compass if you want alerting tied to a service catalog.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify feature parity.&lt;/strong&gt; Review the &lt;a href="https://support.atlassian.com/jira-service-management-cloud/docs/start-shifting-from-opsgenie-to-jira-service-management/" rel="noopener noreferrer"&gt;Atlassian shifting guide&lt;/a&gt; for features that do not migrate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export historical data.&lt;/strong&gt; Alert data in JSM auto-deletes after a retention window — export anything needed for audit or compliance first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the in-product migration tool.&lt;/strong&gt; Atlassian provides a guided migration that copies your data to JSM or Compass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-authenticate chat integrations.&lt;/strong&gt; Re-authorize Slack and Microsoft Teams — OAuth grants do not transfer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update API endpoints.&lt;/strong&gt; Every consumer of the legacy Opsgenie REST API must be repointed to the new JSM Operations endpoints before April 5, 2027.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replan the mobile rollout.&lt;/strong&gt; The standalone Opsgenie mobile app stops working — responders move to the Jira mobile app.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Close Opsgenie within 120 days.&lt;/strong&gt; After migration, Opsgenie runs in parallel for up to 120 days, then auto-shuts down.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  If You Are Evaluating Alternatives
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shortlist two or three alternatives&lt;/strong&gt; using the comparison table above.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a 90-day parallel trial&lt;/strong&gt; alongside Opsgenie — most vendors offer free trials.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate the integrations that matter&lt;/strong&gt; — especially monitoring tool webhooks and your chat platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure MTTR and on-call satisfaction&lt;/strong&gt; against your Opsgenie baseline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decide before Atlassian's 120-day cutover window closes&lt;/strong&gt; on any migration you start with JSM or Compass.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When is Opsgenie being shut down?&lt;/strong&gt;&lt;br&gt;
Atlassian will shut down Opsgenie permanently on April 5, 2027. End of sale was June 4, 2025 — no new signups, upgrades, or downgrades are allowed. On April 5, 2027 the service will be disabled and any data that has not been migrated to Jira Service Management or Compass will be permanently deleted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I still buy Opsgenie in 2026?&lt;/strong&gt;&lt;br&gt;
No. Atlassian closed new Opsgenie sales on June 4, 2025. Existing customers can continue using their current Opsgenie subscription until April 5, 2027 but cannot upgrade, downgrade, or add new users beyond their existing plan limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the official Opsgenie migration paths?&lt;/strong&gt;&lt;br&gt;
Atlassian offers two paths: Jira Service Management (JSM) Operations for ITSM teams needing change, problem, and incident workflows, and Compass for DevOps/SRE teams wanting alerting paired with a service catalog. Both share the same Operations engine, so schedules, alerts, and policies sync if you use both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will my Opsgenie data be preserved after migration?&lt;/strong&gt;&lt;br&gt;
Only data you explicitly migrate through Atlassian's in-product migration tool is preserved. Unlike legacy Opsgenie Enterprise, JSM automatically deletes alert data after a retention window — so you must export anything needed for compliance or audit before migration. Some features like alert creation rules and custom roles do not carry over at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much does Opsgenie cost in 2026?&lt;/strong&gt;&lt;br&gt;
Existing standalone customers pay $9.45/user/month annual or $11.55/user/month monthly on Essentials at 100 users. Standard and Enterprise add full routing, SSO, heartbeats, and advanced reporting. Incoming call routing is billed separately at $0.10/minute (US/Canada) and $0.35/minute (international). New signups are no longer accepted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the best Opsgenie alternatives?&lt;/strong&gt;&lt;br&gt;
The strongest 2026 alternatives are PagerDuty (incumbent with AI add-ons), incident.io (Slack-native with AI SRE), ilert (EU-hosted, GDPR-focused), Squadcast (budget-friendly, SolarWinds-owned), Rootly (modular IR + on-call + AI SRE), and Aurora by Arvo AI (open-source agentic RCA with Opsgenie and JSM support). Grafana OnCall OSS was archived in March 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Opsgenie support AI-powered root cause analysis?&lt;/strong&gt;&lt;br&gt;
Standalone Opsgenie is an alerting and on-call product — it does not perform root cause analysis. Atlassian is adding AIOps features (alert grouping, automated resolutions) to JSM and Compass. Teams wanting autonomous multi-step RCA typically pair Opsgenie with a dedicated tool like Aurora, which ingests Opsgenie webhooks and investigates incidents automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens to my Opsgenie integrations after migration?&lt;/strong&gt;&lt;br&gt;
Monitoring integrations (Datadog, New Relic, Prometheus) migrate automatically via Atlassian's in-product tool. Chat integrations (Slack, Microsoft Teams) must be re-authorized manually because the OAuth grants do not transfer. Custom webhooks calling the legacy Opsgenie REST API must be repointed to the new JSM Operations endpoints before April 5, 2027.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can Aurora connect to Opsgenie and JSM?&lt;/strong&gt;&lt;br&gt;
Yes. Aurora supports both standalone Opsgenie (GenieKey authentication, US and EU regions) and JSM Operations (Atlassian API token). Aurora ingests alert webhooks, runs AI-powered alert correlation to group related alerts into incidents, and autonomously investigates the root cause. For JSM users, Aurora posts findings back as comments on the linked Jira incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Jira Service Management cheaper than Opsgenie?&lt;/strong&gt;&lt;br&gt;
No. JSM Premium is widely reported by Atlassian Community users as more expensive than standalone Opsgenie Standard. Real-time outbound webhooks require JSM Premium, and Incident Command Center requires JSM Enterprise. Many Opsgenie customers see a net price increase after migration, which is why teams use the forced migration to evaluate alternatives.&lt;/p&gt;




&lt;p&gt;Related reading: &lt;a href="https://www.arvoai.ca/blog/top-10-aiops-platforms-free-root-cause-analysis-2026" rel="noopener noreferrer"&gt;Top 10 AIOps Platforms Offering Free Root Cause Analysis&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/pagerduty-alternative-root-cause-analysis" rel="noopener noreferrer"&gt;PagerDuty Alternative: Open-Source Root Cause Analysis&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/what-is-agentic-incident-management" rel="noopener noreferrer"&gt;What is Agentic Incident Management?&lt;/a&gt; · &lt;a href="https://www.arvoai.ca/blog/open-source-incident-management" rel="noopener noreferrer"&gt;Open Source Incident Management: Why It Matters&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;All Opsgenie, JSM, Compass, and alternative-vendor claims verified from official sources in April 2026.&lt;/strong&gt; Last updated: April 21, 2026.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.arvoai.ca/blog/opsgenie-complete-guide-2026" rel="noopener noreferrer"&gt;arvoai.ca/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;By Team at Arvo AI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Top 10 AIOps Platforms Offering Free Root Cause Analysis in 2026</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Fri, 10 Apr 2026 17:06:02 +0000</pubDate>
      <link>https://forem.com/siddharth_singh_409bd5267/top-10-aiops-platforms-offering-free-root-cause-analysis-in-2026-2i3</link>
      <guid>https://forem.com/siddharth_singh_409bd5267/top-10-aiops-platforms-offering-free-root-cause-analysis-in-2026-2i3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; AIOps platforms now compete on the quality of AI-driven root cause analysis and the accessibility of free or open source entry points. Whether you need a full enterprise observability suite or a focused open source investigation tool, there's a platform with a free starting point for your team.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AIOps — Artificial Intelligence for IT Operations — combines AI/ML algorithms with big data analytics to automate IT operations and incident response across cloud and hybrid environments. In 2026, the landscape has matured significantly: platforms now offer autonomous investigation, deterministic AI, and agentic workflows that go far beyond basic alert correlation.&lt;/p&gt;

&lt;p&gt;This guide covers the 10 best AIOps platforms that offer free root cause analysis capabilities — either through free tiers, open source licenses, or trial access.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Comparison
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Platform / Type / Free Access / RCA Approach / Best For&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aurora by Arvo AI&lt;/strong&gt; — Open source (Apache 2.0) — Free forever (self-hosted) — Alert correlation + AI summarization + agentic autonomous investigation — SRE teams needing the full AIOps workflow in one free tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynatrace&lt;/strong&gt; — Enterprise SaaS — 15-day trial — Deterministic AI (Davis AI) — Large enterprises with complex microservice architectures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datadog&lt;/strong&gt; — SaaS — Free tier (5 hosts) — Watchdog anomaly detection — Teams wanting unified observability with easy onboarding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New Relic&lt;/strong&gt; — SaaS — Free tier (100 GB/month) — Applied Intelligence — Organizations seeking usage-based pricing flexibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenObserve&lt;/strong&gt; — Open source (AGPL-3.0) — Free forever (self-hosted) — Log/metric/trace analytics — Cost-conscious teams needing petabyte-scale observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Splunk ITSI&lt;/strong&gt; — Enterprise SaaS — Trial available — Predictive ML analytics — Enterprises with heavy log volumes and existing Splunk investment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana Cloud&lt;/strong&gt; — SaaS + Open source — Free tier (10k metrics) — ML-powered Sift diagnostics — Teams already using the Grafana/Prometheus stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metoro&lt;/strong&gt; — SaaS — Free tier (1 cluster) — AI SRE for Kubernetes — Kubernetes-native teams wanting automated deployment verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BigPanda&lt;/strong&gt; — Enterprise SaaS — Demo only — Open Box ML correlation — Large IT ops teams drowning in alert noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PagerDuty&lt;/strong&gt; — SaaS — Free tier (5 users) — AIOps add-on (paid) — Teams needing on-call + incident coordination&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Aurora by Arvo AI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; covers the full AIOps investigation workflow — from alert correlation and incident summarization all the way to autonomous multi-step root cause analysis. When alerts fire, Aurora's AlertCorrelator groups related alerts into incidents, generates AI summaries, and then triggers autonomous agents that query your cloud infrastructure directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Aurora does RCA:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Alert correlation&lt;/strong&gt; — groups related alerts into incidents by service and time proximity (AlertCorrelator service)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI incident summarization&lt;/strong&gt; — generates structured summaries with context and suggested next steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous multi-step investigation&lt;/strong&gt; — LangGraph-orchestrated agents dynamically select from 30+ tools per investigation&lt;/li&gt;
&lt;li&gt;Executes &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, &lt;code&gt;gcloud&lt;/code&gt; commands in sandboxed Kubernetes pods (non-root, read-only filesystem, seccomp enforced)&lt;/li&gt;
&lt;li&gt;Queries cloud APIs directly — AWS (STS AssumeRole), Azure (Service Principal), GCP (OAuth), OVH, Scaleway&lt;/li&gt;
&lt;li&gt;Traverses Memgraph infrastructure dependency graph for blast radius analysis&lt;/li&gt;
&lt;li&gt;Searches Weaviate knowledge base via vector search over runbooks and past postmortems&lt;/li&gt;
&lt;li&gt;Generates structured RCA with timeline, evidence citations, and remediation steps&lt;/li&gt;
&lt;li&gt;Suggests code fixes with diff preview — human approves and creates PR&lt;/li&gt;
&lt;li&gt;Auto-generates postmortems exportable to Confluence and Jira&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Free access:&lt;/strong&gt; Completely free. Apache 2.0 open source, self-hosted via Docker Compose or Helm chart. No per-seat pricing, no usage limits. Use any LLM provider including &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; for local models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integrations:&lt;/strong&gt; 25+ verified — PagerDuty, Datadog, Grafana, New Relic, Dynatrace, Splunk, BigPanda, Kubernetes, Terraform, GitHub, Confluence, Slack, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; SRE teams that need a single free platform covering alert correlation, AI summarization, AND deep autonomous cloud investigation — without paying for three separate tools.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We built Aurora to cover the full investigation workflow. It correlates alerts, summarizes incidents, then actually queries your AWS accounts, checks your Kubernetes pods, and traces the dependency chain — all autonomously." — Noah Casarotto-Dinning, CEO at Arvo AI&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2. Dynatrace
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.dynatrace.com" rel="noopener noreferrer"&gt;Dynatrace&lt;/a&gt; is an enterprise observability leader powered by its &lt;strong&gt;Davis AI&lt;/strong&gt; engine, which uses deterministic AI for precise root cause identification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RCA approach:&lt;/strong&gt; Deterministic AI that consistently produces the same result for the same input — as opposed to probabilistic models that may vary. Davis AI continuously auto-discovers your infrastructure and maps dependencies across microservice architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free access:&lt;/strong&gt; &lt;a href="https://www.dynatrace.com/trial/" rel="noopener noreferrer"&gt;15-day free trial&lt;/a&gt; plus a public sandbox environment. No permanent free tier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Usage-based. Infrastructure monitoring starts at &lt;a href="https://www.dynatrace.com/pricing/" rel="noopener noreferrer"&gt;$7/month per host&lt;/a&gt; (Foundation), $29/month (Infrastructure Monitoring), $58/month (Full-Stack).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Deep auto-discovery, topology mapping, precise deterministic RCA.&lt;br&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; Enterprise-oriented pricing, complex configuration for advanced features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Large enterprises with complex microservice architectures needing precise, repeatable RCA.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Datadog
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.datadoghq.com" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; provides a comprehensive observability ecosystem with a generous free tier for experimentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RCA approach:&lt;/strong&gt; &lt;a href="https://www.datadoghq.com/product/watchdog/" rel="noopener noreferrer"&gt;Watchdog&lt;/a&gt; — an AI engine that continuously analyzes billions of data points for automatic anomaly detection, root cause analysis, and contextual insights across metrics, logs, traces, and security data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free access:&lt;/strong&gt; &lt;a href="https://www.datadoghq.com/pricing/" rel="noopener noreferrer"&gt;$0 free tier&lt;/a&gt; for Infrastructure Monitoring — up to 5 hosts with 1-day metric retention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Pro starts at &lt;a href="https://www.datadoghq.com/pricing/" rel="noopener noreferrer"&gt;$15/host/month&lt;/a&gt; (billed annually). Modular pricing across 20+ products.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Unified platform, easy onboarding, broad integration ecosystem.&lt;br&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; Costs can scale quickly with multiple products and high cardinality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams wanting unified cloud monitoring with AI-assisted incident detection and easy experimentation via the free tier.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. New Relic
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://newrelic.com" rel="noopener noreferrer"&gt;New Relic&lt;/a&gt; offers telemetry-centric observability with built-in AI for incident analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RCA approach:&lt;/strong&gt; &lt;a href="https://newrelic.com/platform/applied-intelligence" rel="noopener noreferrer"&gt;Applied Intelligence&lt;/a&gt; — an AI module that deduplicates alerts, correlates incidents, and pinpoints root causes across cloud-native infrastructure using ML.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free access:&lt;/strong&gt; &lt;a href="https://newrelic.com/pricing" rel="noopener noreferrer"&gt;Free tier&lt;/a&gt; includes 100 GB/month data ingest, 1 full platform user, and 50+ capabilities. Usage-based pricing allows low-risk adoption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Usage-based — pay for data ingested and number of users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Flexible pricing, full-stack observability, large integration library.&lt;br&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; Advanced AI features may require higher tiers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Organizations seeking flexible, usage-based pricing with built-in AI for alert correlation and incident analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. OpenObserve
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://openobserve.ai" rel="noopener noreferrer"&gt;OpenObserve&lt;/a&gt; is an open source observability platform built in Rust for high-performance log, metric, and trace analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RCA approach:&lt;/strong&gt; Analytics-driven observability — fast search and correlation across logs, metrics, and traces. Not agentic AI, but provides the data foundation for manual or scripted RCA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free access:&lt;/strong&gt; Fully &lt;a href="https://github.com/openobserve/openobserve" rel="noopener noreferrer"&gt;open source under AGPL-3.0&lt;/a&gt;. Self-hosted is free forever with unlimited users. Cloud plan also offers a free tier. Self-hosted Enterprise is free up to 200 GB/day ingestion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Claims &lt;a href="https://openobserve.ai" rel="noopener noreferrer"&gt;140x lower storage cost&lt;/a&gt; vs Elasticsearch. Petabyte-scale. Written in Rust for performance.&lt;br&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; Observability platform, not a dedicated AIOps/RCA tool. Requires engineering effort for investigation workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Cost-conscious engineering teams needing high-performance observability as a foundation for RCA.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Splunk ITSI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.splunk.com/en_us/products/it-service-intelligence.html" rel="noopener noreferrer"&gt;Splunk ITSI&lt;/a&gt; (IT Service Intelligence) is an enterprise AIOps platform for organizations with heavy log volumes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RCA approach:&lt;/strong&gt; ML-powered predictive analytics — uses machine learning and historical data to detect future service degradations. Includes automated event aggregation with out-of-the-box ML policies and alert correlation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free access:&lt;/strong&gt; Trial available. No permanent free tier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Not publicly listed. ITSI is a premium add-on requiring a base Splunk Enterprise or Cloud license. Widely considered one of the most expensive options in the AIOps space — costs scale significantly with data volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Predictive alerting, deep service-level insights, mature ML capabilities.&lt;br&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; Significant cost at scale, proprietary query language (SPL), complex implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Mid-to-large enterprises with existing Splunk investment and heavy log volumes.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Grafana Cloud
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://grafana.com/products/cloud/" rel="noopener noreferrer"&gt;Grafana Cloud&lt;/a&gt; extends the popular open source Grafana ecosystem with cloud-hosted observability and ML-powered diagnostics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RCA approach:&lt;/strong&gt; ML-powered &lt;a href="https://grafana.com/products/cloud/" rel="noopener noreferrer"&gt;Sift&lt;/a&gt; for automated diagnostics, plus Correlations features that create interactive links between data sources. Application Observability auto-correlates metrics, logs, and traces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free access:&lt;/strong&gt; &lt;a href="https://grafana.com/pricing/" rel="noopener noreferrer"&gt;Permanent free tier&lt;/a&gt; — 10,000 active metric series/month, 50 GB logs/traces/profiles, 3 active users, 14-day retention. No credit card required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Strong community, extensible with thousands of dashboards and plugins, works with Prometheus/Loki/Tempo natively.&lt;br&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; Operational tuning may be required for effective RCA at scale. ML features are newer additions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams already using the Grafana/Prometheus stack who want cloud-hosted ML-powered diagnostics.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Metoro
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://metoro.io" rel="noopener noreferrer"&gt;Metoro&lt;/a&gt; is a developer/SRE-focused AIOps platform built specifically for Kubernetes environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RCA approach:&lt;/strong&gt; AI SRE for Kubernetes — autonomous deployment verification, AI issue detection, root cause analysis, and remediation suggestions. Uses eBPF for telemetry collection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free access:&lt;/strong&gt; &lt;a href="https://metoro.io" rel="noopener noreferrer"&gt;Hobby plan&lt;/a&gt; — free forever, includes 1 cluster, 1 user, 2 nodes, 200 GB ingested/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Kubernetes-native, automated deployment verification, APM + log management + infrastructure monitoring in one.&lt;br&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; Focused on Kubernetes — less suitable for non-containerized environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Kubernetes-native teams wanting an AI SRE that automates deployment verification and incident investigation.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. BigPanda
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.bigpanda.io" rel="noopener noreferrer"&gt;BigPanda&lt;/a&gt; specializes in transparent, explainable ML-based event correlation for large IT operations teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RCA approach:&lt;/strong&gt; &lt;a href="https://www.bigpanda.io" rel="noopener noreferrer"&gt;Open Box Machine Learning (OBML)&lt;/a&gt; — transparent ML where users can examine automation logic in plain English, edit it, and preview before deploying. Correlates alerts across time, topology, context, and alert type. Claims &lt;a href="https://www.bigpanda.io" rel="noopener noreferrer"&gt;95%+ IT noise reduction&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free access:&lt;/strong&gt; No free tier or self-serve trial. Access through &lt;a href="https://www.bigpanda.io" rel="noopener noreferrer"&gt;demo requests&lt;/a&gt; and sales engagement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Transparent/explainable AI (not black box), massive noise reduction, customizable correlation rules.&lt;br&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; Enterprise-only, no self-serve access, requires sales engagement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Large IT ops teams drowning in alert noise who need transparent, customizable AI correlation.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. PagerDuty
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.pagerduty.com" rel="noopener noreferrer"&gt;PagerDuty&lt;/a&gt; is the industry standard for incident response and on-call coordination, with AIOps capabilities available as add-ons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RCA approach:&lt;/strong&gt; &lt;a href="https://www.pagerduty.com/platform/aiops/" rel="noopener noreferrer"&gt;AIOps add-on&lt;/a&gt; provides alert noise reduction (claims 91% reduction), intelligent correlation, and "Probable Origin" for root cause suggestions. Note: RCA features are &lt;strong&gt;not included in the free tier&lt;/strong&gt; — they require the AIOps add-on (&lt;a href="https://www.pagerduty.com/pricing/" rel="noopener noreferrer"&gt;$699+/month&lt;/a&gt;) on top of a paid plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free access:&lt;/strong&gt; &lt;a href="https://www.pagerduty.com/pricing/" rel="noopener noreferrer"&gt;Free tier&lt;/a&gt; includes up to 5 users, 1 on-call schedule, basic incident management, and 700+ integrations. Basic alerting and response only — no RCA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Professional from &lt;a href="https://www.pagerduty.com/pricing/" rel="noopener noreferrer"&gt;$21/user/month&lt;/a&gt; (annual). AIOps add-on from $699/month. PagerDuty Advance (GenAI) from $415/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Industry-standard on-call, 700+ integrations, robust mobile app, strong ecosystem.&lt;br&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; RCA requires expensive add-ons, not included in base plans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that already use PagerDuty for on-call and want to add AI-powered correlation and noise reduction.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Choose the Right Platform
&lt;/h2&gt;

&lt;p&gt;When evaluating free AIOps RCA tools, prioritize these criteria:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;RCA approach&lt;/strong&gt; — Deterministic AI (Dynatrace), probabilistic ML (BigPanda), or agentic investigation (Aurora)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telemetry breadth&lt;/strong&gt; — Does it cover logs, metrics, traces, and infrastructure state?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud integration&lt;/strong&gt; — Does it work with your cloud providers and existing monitoring stack?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free tier limitations&lt;/strong&gt; — What's actually included? Some "free" plans exclude RCA entirely (PagerDuty).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted vs SaaS&lt;/strong&gt; — Do you need data sovereignty? Only Aurora and OpenObserve offer full self-hosted deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation depth&lt;/strong&gt; — Does it correlate alerts, or does it actually query your infrastructure?&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Start with a free tier or open source instance to validate whether automated RCA reduces your MTTR before scaling to paid plans.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Key Features to Look For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI/ML approach&lt;/strong&gt; — Deterministic vs probabilistic vs agentic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telemetry support&lt;/strong&gt; — Logs, metrics, traces, and infrastructure state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud provider integration&lt;/strong&gt; — Native connectors for AWS, Azure, GCP, Kubernetes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation guidance&lt;/strong&gt; — Does it just identify the cause, or suggest fixes?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem automation&lt;/strong&gt; — Auto-generated incident documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge base&lt;/strong&gt; — Search over runbooks and past incidents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt; — SOC 2, HIPAA, GDPR if required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mean Time to Repair (MTTR) — the average time to detect, diagnose, and resolve an incident — is the key metric. Research shows that AIOps root cause automation can &lt;a href="https://www.goworkwize.com/blog/best-aiops-tools" rel="noopener noreferrer"&gt;cut MTTR by up to 50%&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Learn more about automated RCA in our &lt;a href="https://dev.to/blog/root-cause-analysis-complete-guide-sres"&gt;Root Cause Analysis: The Complete Guide for SREs&lt;/a&gt; and explore how agentic investigation works in &lt;a href="https://dev.to/blog/what-is-agentic-incident-management"&gt;What is Agentic Incident Management?&lt;/a&gt;. For open source options, see &lt;a href="https://dev.to/blog/open-source-incident-management"&gt;Open Source Incident Management: Why It Matters&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;All platform claims verified from official vendor websites.&lt;/strong&gt; Last verified: April 2026.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>opensource</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>incident.io Alternative: Open Source AI Incident Management</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Mon, 06 Apr 2026 22:18:30 +0000</pubDate>
      <link>https://forem.com/siddharth_singh_409bd5267/incidentio-alternative-open-source-ai-incident-management-1ik0</link>
      <guid>https://forem.com/siddharth_singh_409bd5267/incidentio-alternative-open-source-ai-incident-management-1ik0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; incident.io is one of the strongest incident management platforms available — used by Netflix, Airbnb, and Etsy with a free Basic tier. But it's closed-source SaaS with no self-hosted option and undisclosed AI. Aurora is an open source (Apache 2.0) alternative focused on autonomous AI investigation with full infrastructure access — free, self-hosted, and works with any LLM.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is incident.io?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://incident.io" rel="noopener noreferrer"&gt;incident.io&lt;/a&gt; describes itself as "the all-in-one AI platform for on-call, incident response, and status pages — built for fast-moving teams." It's one of the most well-regarded tools in the space, with customers including &lt;a href="https://incident.io/customers" rel="noopener noreferrer"&gt;Netflix, Airbnb, Etsy, Intercom, and Vanta&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;incident.io offers four core products:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incident Response&lt;/strong&gt; — Slack-native workflows, catalog, post-mortems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-Call&lt;/strong&gt; — Schedules, escalation, alerting with &lt;a href="https://incident.io/on-call" rel="noopener noreferrer"&gt;40+ alert sources&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI SRE&lt;/strong&gt; — Autonomous investigation, code fix PRs, context search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status Pages&lt;/strong&gt; — Public, internal, and customer-specific pages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As Airbnb's Director of SRE &lt;a href="https://incident.io/customers" rel="noopener noreferrer"&gt;Nils Pommerien said&lt;/a&gt;: "If I could point to the single most impactful thing we did to change the culture at Airbnb, it would be rolling out incident.io."&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Aurora?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; is an open source (Apache 2.0) AI agent for automated incident investigation and root cause analysis. Aurora's LangGraph-orchestrated agents autonomously query infrastructure across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes — delivering structured RCA with remediation recommendations.&lt;/p&gt;

&lt;p&gt;Aurora is free, self-hosted, and works with any LLM provider including local models via Ollama.&lt;/p&gt;

&lt;h2&gt;
  
  
  How They Compare
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AI Investigation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;incident.io AI SRE&lt;/strong&gt; (&lt;a href="https://incident.io/ai-sre" rel="noopener noreferrer"&gt;incident.io/ai-sre&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Triages and investigates alerts, analyzes root cause&lt;/li&gt;
&lt;li&gt;Connects code changes, alerts, and past incidents to uncover what went wrong&lt;/li&gt;
&lt;li&gt;@incident chat in Slack — ask questions, get answers within seconds&lt;/li&gt;
&lt;li&gt;Spots failing pull requests behind incidents&lt;/li&gt;
&lt;li&gt;Searches through thousands of resources for relevant answers&lt;/li&gt;
&lt;li&gt;Pulls metrics from monitoring dashboards directly into Slack&lt;/li&gt;
&lt;li&gt;Scans public Slack channels for related discussions&lt;/li&gt;
&lt;li&gt;Drafts code fixes and opens pull requests directly from Slack&lt;/li&gt;
&lt;li&gt;Suggests next steps based on past incidents&lt;/li&gt;
&lt;li&gt;AI-native post-mortems&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://incident.io" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; (Beta) for IDE integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora AI:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-agent architecture via LangGraph with dynamic tool selection (30+ tools)&lt;/li&gt;
&lt;li&gt;Correlates alerts across services and dependencies&lt;/li&gt;
&lt;li&gt;Constructs investigation timelines linking deployments, infra events, and telemetry&lt;/li&gt;
&lt;li&gt;Generates structured RCA with evidence citations and remediation steps&lt;/li&gt;
&lt;li&gt;Human-in-the-loop for write/destructive actions — read-only commands run automatically&lt;/li&gt;
&lt;li&gt;Executes &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, &lt;code&gt;gcloud&lt;/code&gt; commands in &lt;strong&gt;sandboxed Kubernetes pods&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Queries cloud APIs directly — AWS (STS AssumeRole), Azure (Service Principal), GCP (OAuth), OVH, Scaleway&lt;/li&gt;
&lt;li&gt;Traverses Memgraph infrastructure dependency graph for blast radius&lt;/li&gt;
&lt;li&gt;Searches Weaviate knowledge base (vector search over runbooks and past incidents)&lt;/li&gt;
&lt;li&gt;Suggests code fixes with diff preview — human approves and creates PR&lt;/li&gt;
&lt;li&gt;Exports postmortems to Confluence and Jira&lt;/li&gt;
&lt;li&gt;Works with any LLM provider — choose your model&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Difference
&lt;/h3&gt;

&lt;p&gt;incident.io's AI SRE correlates data from monitoring tools, source control, and past incidents within Slack. Aurora's agents go deeper — they directly query cloud provider APIs and execute CLI commands in sandboxed pods to gather live infrastructure data during investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  On-Call &amp;amp; Alerting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;incident.io&lt;/strong&gt; has a full on-call product:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://incident.io/on-call" rel="noopener noreferrer"&gt;40+ alert sources&lt;/a&gt; ready to go&lt;/li&gt;
&lt;li&gt;Schedules: simple, shadow rotations, follow-the-sun&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://incident.io/on-call" rel="noopener noreferrer"&gt;99.99% delivery reliability&lt;/a&gt; claimed&lt;/li&gt;
&lt;li&gt;AI alert intelligence (noise reduction)&lt;/li&gt;
&lt;li&gt;Cover requests and easy overrides&lt;/li&gt;
&lt;li&gt;Holiday feeds, compensation calculator&lt;/li&gt;
&lt;li&gt;Migration tools from PagerDuty and Opsgenie&lt;/li&gt;
&lt;li&gt;Mobile app&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora&lt;/strong&gt; has no on-call capabilities. For on-call, use incident.io, PagerDuty, Grafana OnCall, or Opsgenie alongside Aurora.&lt;/p&gt;

&lt;h3&gt;
  
  
  Incident Coordination
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;incident.io&lt;/strong&gt; excels here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slack-native incident response with workflows&lt;/li&gt;
&lt;li&gt;Catalog for service ownership and context&lt;/li&gt;
&lt;li&gt;Post-mortems with AI drafts&lt;/li&gt;
&lt;li&gt;Status pages (public, internal, customer-specific)&lt;/li&gt;
&lt;li&gt;Insights and analytics&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://incident.io/integrations" rel="noopener noreferrer"&gt;~69 integrations&lt;/a&gt; across monitoring, ticketing, communication, HR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora&lt;/strong&gt; creates Slack incident channels, tracks action items with Jira sync, and generates postmortems. No status pages, no service catalog, no mobile app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;incident.io has, Aurora doesn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call scheduling, escalation, alerting (40+ sources)&lt;/li&gt;
&lt;li&gt;Microsoft Teams support&lt;/li&gt;
&lt;li&gt;Status pages (public, internal, customer-specific)&lt;/li&gt;
&lt;li&gt;Service catalog&lt;/li&gt;
&lt;li&gt;Insights and analytics&lt;/li&gt;
&lt;li&gt;Mobile app&lt;/li&gt;
&lt;li&gt;MCP server for IDEs (Beta)&lt;/li&gt;
&lt;li&gt;AI that searches Slack channels for context&lt;/li&gt;
&lt;li&gt;Metrics dashboard pulling from Slack&lt;/li&gt;
&lt;li&gt;HR system integrations (BambooHR, Rippling, etc.)&lt;/li&gt;
&lt;li&gt;~69 integrations&lt;/li&gt;
&lt;li&gt;SOC 2, HIPAA compliance&lt;/li&gt;
&lt;li&gt;Netflix, Airbnb, Etsy as customers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora has, incident.io doesn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct cloud infrastructure querying (AWS, Azure, GCP, OVH, Scaleway APIs)&lt;/li&gt;
&lt;li&gt;CLI execution in sandboxed Kubernetes pods&lt;/li&gt;
&lt;li&gt;Native vector search knowledge base (Weaviate RAG)&lt;/li&gt;
&lt;li&gt;Infrastructure dependency graph (Memgraph)&lt;/li&gt;
&lt;li&gt;Terraform/IaC state analysis&lt;/li&gt;
&lt;li&gt;Open source (Apache 2.0) — full codebase auditable&lt;/li&gt;
&lt;li&gt;Self-hosted deployment (Docker Compose, Helm)&lt;/li&gt;
&lt;li&gt;LLM provider flexibility (OpenAI, Anthropic, Google, Ollama for air-gapped)&lt;/li&gt;
&lt;li&gt;Free — no per-user pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Both have:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-powered root cause analysis&lt;/li&gt;
&lt;li&gt;AI-suggested code fixes and PR generation&lt;/li&gt;
&lt;li&gt;Slack incident channel management&lt;/li&gt;
&lt;li&gt;Automated postmortem generation&lt;/li&gt;
&lt;li&gt;GitHub and GitLab integration&lt;/li&gt;
&lt;li&gt;Datadog, Grafana integration&lt;/li&gt;
&lt;li&gt;Action item tracking&lt;/li&gt;
&lt;li&gt;RBAC and security controls&lt;/li&gt;
&lt;li&gt;Human-in-the-loop for destructive actions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;incident.io&lt;/strong&gt; (&lt;a href="https://incident.io/pricing" rel="noopener noreferrer"&gt;incident.io/pricing&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic: &lt;strong&gt;Free forever&lt;/strong&gt; (1 custom field, 1 workflow, 2 integrations)&lt;/li&gt;
&lt;li&gt;Team: &lt;strong&gt;$15/user/month&lt;/strong&gt; (annual) — add on-call for +$10/user/month&lt;/li&gt;
&lt;li&gt;Pro: &lt;strong&gt;$25/user/month&lt;/strong&gt; — add on-call for +$20/user/month, AI post-mortems included&lt;/li&gt;
&lt;li&gt;Enterprise: Custom pricing — unlimited everything, HIPAA, SCIM, custom RBAC&lt;/li&gt;
&lt;li&gt;Standalone On-Call: &lt;strong&gt;$20/user/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; — Apache 2.0, self-hosted&lt;/li&gt;
&lt;li&gt;Costs: infrastructure + LLM API usage&lt;/li&gt;
&lt;li&gt;$0 LLM cost with Ollama local models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example: 20-person team on incident.io Pro + On-Call:&lt;/strong&gt;&lt;br&gt;
$25 + $20 = $45/user/month x 20 = &lt;strong&gt;$900/month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aurora:&lt;/strong&gt; &lt;strong&gt;$0&lt;/strong&gt; + infrastructure + LLM API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source vs SaaS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;incident.io&lt;/strong&gt; is closed-source SaaS. You cannot self-host, audit the AI's reasoning, or choose your LLM provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aurora&lt;/strong&gt; is fully open source under Apache 2.0:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read every line of code the AI uses to investigate&lt;/li&gt;
&lt;li&gt;Self-host with zero data leaving your environment&lt;/li&gt;
&lt;li&gt;Use any LLM provider or run local models via Ollama&lt;/li&gt;
&lt;li&gt;Modify workflows, add custom tools, fork for your needs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Choose incident.io
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You want the best all-in-one SaaS platform&lt;/strong&gt; — incident.io is widely regarded as the best UX in the category&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack-native AI chat matters&lt;/strong&gt; — @incident in Slack is deeply integrated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need on-call + response + status pages&lt;/strong&gt; in one tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise customers are important&lt;/strong&gt; — Netflix, Airbnb, Etsy validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free tier works for you&lt;/strong&gt; — Basic plan is genuinely free forever&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance is critical&lt;/strong&gt; — SOC 2, HIPAA available&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Choose Aurora
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Investigation is your bottleneck&lt;/strong&gt; — you need AI that directly queries your cloud infrastructure, not just correlates monitoring data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source is required&lt;/strong&gt; — full transparency into how AI investigates your production systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted is required&lt;/strong&gt; — compliance, data sovereignty, or air-gapped environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cloud breadth&lt;/strong&gt; — you need OVH or Scaleway alongside AWS, Azure, GCP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM flexibility&lt;/strong&gt; — choose your own provider or run local models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget is limited&lt;/strong&gt; — Aurora is free; incident.io Pro + On-Call is $900+/month for 20 users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want a custom integration&lt;/strong&gt; — the Arvo AI team builds custom integrations at no cost. &lt;a href="https://cal.com/arvo-ai" rel="noopener noreferrer"&gt;Reach out&lt;/a&gt; and they'll build it with you.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Using incident.io + Aurora Together
&lt;/h2&gt;

&lt;p&gt;They complement each other well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Alert fires&lt;/strong&gt; → incident.io creates channel, pages on-call, updates status page&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same alert&lt;/strong&gt; → Aurora receives webhook, starts AI investigation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;incident.io&lt;/strong&gt; coordinates response (roles, workflows, comms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora&lt;/strong&gt; investigates in background (queries cloud, checks K8s, searches knowledge base)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-call SRE&lt;/strong&gt; finds Aurora's RCA in the incident channel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora&lt;/strong&gt; generates postmortem → exports to Confluence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;incident.io&lt;/strong&gt; tracks follow-up actions&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Limitations of Aurora
&lt;/h2&gt;

&lt;p&gt;Aurora focuses on investigation, not full incident lifecycle management:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No on-call scheduling&lt;/strong&gt; — use incident.io, PagerDuty, or Grafana OnCall alongside Aurora&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No status pages&lt;/strong&gt; — incident.io includes these on all tiers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack only&lt;/strong&gt; — no Microsoft Teams support currently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No mobile app&lt;/strong&gt; — incident.io has a polished mobile experience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fewer integrations&lt;/strong&gt; — Aurora has 25+ vs incident.io's ~69&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOC 2 Type II in progress&lt;/strong&gt; — not yet certified&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Slack-native AI chat&lt;/strong&gt; — Aurora's AI works through its web dashboard, not @mentions in Slack channels like incident.io&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"incident.io has the best UX in the category — we respect that. Aurora's strength is different: deep cloud infrastructure investigation. If your SRE team is spending hours querying AWS, kubectl, and Grafana manually after getting paged, that's the problem Aurora solves." — Noah Casarotto-Dinning, CEO at Arvo AI&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Getting Started with Aurora
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init
make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure your monitoring webhooks, add cloud credentials, and investigations start automatically. See the &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;full documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Learn more at &lt;a href="https://www.arvoai.ca" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;. For other comparisons, see &lt;a href="https://dev.to/blog/aurora-vs-traditional-incident-management-tools"&gt;Aurora vs Traditional Tools&lt;/a&gt;, &lt;a href="https://dev.to/blog/pagerduty-alternative-root-cause-analysis"&gt;PagerDuty Alternative&lt;/a&gt;, and &lt;a href="https://dev.to/blog/rootly-alternative-open-source-incident-management"&gt;Rootly Alternative&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;All claims sourced from official websites.&lt;/strong&gt; incident.io data from &lt;a href="https://incident.io" rel="noopener noreferrer"&gt;incident.io&lt;/a&gt;. Aurora data from &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;. Last verified: April 2026.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>FireHydrant Alternative: Open Source AI Incident Management</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Mon, 06 Apr 2026 22:05:16 +0000</pubDate>
      <link>https://forem.com/siddharth_singh_409bd5267/firehydrant-alternative-open-source-ai-incident-management-4adk</link>
      <guid>https://forem.com/siddharth_singh_409bd5267/firehydrant-alternative-open-source-ai-incident-management-4adk</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; FireHydrant is a solid incident management platform — but it was &lt;a href="https://firehydrant.com" rel="noopener noreferrer"&gt;acquired by Freshworks&lt;/a&gt; in December 2025, AI features are locked to the Enterprise tier, and there's no autonomous investigation. Aurora is an open source (Apache 2.0) alternative with AI agents that autonomously investigate root causes across your cloud infrastructure — completely free and self-hosted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is FireHydrant?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://firehydrant.com" rel="noopener noreferrer"&gt;FireHydrant&lt;/a&gt; is an all-in-one incident management platform that helps teams plan, respond to, and learn from incidents. Their tagline: "Fight Fires Faster." They claim teams resolve incidents &lt;a href="https://firehydrant.com" rel="noopener noreferrer"&gt;up to 90% faster&lt;/a&gt; with their platform.&lt;/p&gt;

&lt;p&gt;In December 2025, FireHydrant was &lt;a href="https://firehydrant.com" rel="noopener noreferrer"&gt;acquired by Freshworks&lt;/a&gt; (NASDAQ: FRSH). The platform will become the incident management and reliability layer inside Freshservice, Freshworks' ITSM product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notable customers:&lt;/strong&gt; &lt;a href="https://firehydrant.com/customer-stories" rel="noopener noreferrer"&gt;Backblaze&lt;/a&gt; (91% faster mitigation), &lt;a href="https://firehydrant.com/customer-stories" rel="noopener noreferrer"&gt;Bluecore&lt;/a&gt; (saving 30-90 minutes per incident), Snyk, LaunchDarkly, AuditBoard, Qlik, Avalara.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Aurora?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; is an open source (Apache 2.0) AI agent for automated incident investigation and root cause analysis. When an alert fires, Aurora's LangGraph-orchestrated agents autonomously query your infrastructure across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes — delivering a structured RCA with remediation recommendations.&lt;/p&gt;

&lt;p&gt;Aurora is free, self-hosted, and works with any LLM provider.&lt;/p&gt;

&lt;h2&gt;
  
  
  How They Compare
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AI Capabilities
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;FireHydrant AI&lt;/strong&gt; (&lt;a href="https://firehydrant.com/pricing" rel="noopener noreferrer"&gt;Enterprise tier only&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-generated incident summaries from Slack messages&lt;/li&gt;
&lt;li&gt;Automated event timelines&lt;/li&gt;
&lt;li&gt;Real-time call transcription (Zoom, Google Meet) with key point summarization&lt;/li&gt;
&lt;li&gt;AI-drafted retrospectives with contributing factors and suggested action items&lt;/li&gt;
&lt;li&gt;Stakeholder update generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FireHydrant's AI is &lt;strong&gt;documentation-focused&lt;/strong&gt; — it summarizes what happened, transcribes calls, and drafts retrospectives. It does not autonomously investigate root causes or query infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aurora AI:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-agent architecture via LangGraph with dynamic tool selection (30+ tools)&lt;/li&gt;
&lt;li&gt;Correlates alerts across services and dependencies&lt;/li&gt;
&lt;li&gt;Constructs investigation timelines linking deployments, infra events, and telemetry&lt;/li&gt;
&lt;li&gt;Generates structured RCA with evidence citations and remediation steps&lt;/li&gt;
&lt;li&gt;Human-in-the-loop for write/destructive actions — read-only commands run automatically&lt;/li&gt;
&lt;li&gt;Executes &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, &lt;code&gt;gcloud&lt;/code&gt; commands in sandboxed Kubernetes pods&lt;/li&gt;
&lt;li&gt;Queries cloud APIs directly — AWS, Azure, GCP, OVH, Scaleway&lt;/li&gt;
&lt;li&gt;Traverses Memgraph infrastructure dependency graph for blast radius&lt;/li&gt;
&lt;li&gt;Searches Weaviate knowledge base (vector search over runbooks and past incidents)&lt;/li&gt;
&lt;li&gt;Suggests code fixes with diff preview — human approves and creates PR&lt;/li&gt;
&lt;li&gt;Works with any LLM provider including local models via Ollama&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Incident Response &amp;amp; Coordination
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;FireHydrant&lt;/strong&gt; is strong at incident coordination:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slack and Microsoft Teams chatbot&lt;/li&gt;
&lt;li&gt;Automated runbooks (triggered by severity, service, or custom fields)&lt;/li&gt;
&lt;li&gt;Incident roles and assignments&lt;/li&gt;
&lt;li&gt;Service catalog with dependency mapping and deployment tracking&lt;/li&gt;
&lt;li&gt;&lt;a href="https://firehydrant.com/integrations" rel="noopener noreferrer"&gt;38+ integrations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;MTTx analytics (MTTD, MTTA, MTTR, MTTM)&lt;/li&gt;
&lt;li&gt;Mobile notifications (iOS, Android)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora&lt;/strong&gt; creates and manages Slack incident channels, tracks action items with Jira sync, and sends investigation notifications. Aurora does not have Microsoft Teams support, incident roles, service catalog, or mobile app.&lt;/p&gt;

&lt;h3&gt;
  
  
  On-Call &amp;amp; Alerting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;FireHydrant&lt;/strong&gt; (branded "Signals"):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Team-based on-call schedules with unlimited escalation policies&lt;/li&gt;
&lt;li&gt;SMS, voice, push, Slack, Teams, email, WhatsApp notifications&lt;/li&gt;
&lt;li&gt;Alert routing via Common Expression Language (CEL)&lt;/li&gt;
&lt;li&gt;Consumption-based alert pricing (not per-seat)&lt;/li&gt;
&lt;li&gt;Alert grouping (Enterprise only)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora&lt;/strong&gt; has no on-call capabilities. For on-call, use PagerDuty, Grafana OnCall, or Opsgenie alongside Aurora.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;FireHydrant has, Aurora doesn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Microsoft Teams support&lt;/li&gt;
&lt;li&gt;Incident roles and assignments&lt;/li&gt;
&lt;li&gt;Service catalog with dependency mapping&lt;/li&gt;
&lt;li&gt;Status pages (public and private)&lt;/li&gt;
&lt;li&gt;MTTx analytics dashboards&lt;/li&gt;
&lt;li&gt;Mobile notifications (iOS, Android)&lt;/li&gt;
&lt;li&gt;Deployment tracking&lt;/li&gt;
&lt;li&gt;Call transcription (Zoom, Google Meet)&lt;/li&gt;
&lt;li&gt;SOC 2 compliance&lt;/li&gt;
&lt;li&gt;38+ integrations&lt;/li&gt;
&lt;li&gt;Consumption-based alerting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora has, FireHydrant doesn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Autonomous AI investigation (FireHydrant AI is documentation-focused only)&lt;/li&gt;
&lt;li&gt;Direct cloud infrastructure querying (AWS, Azure, GCP, OVH, Scaleway)&lt;/li&gt;
&lt;li&gt;CLI execution in sandboxed Kubernetes pods&lt;/li&gt;
&lt;li&gt;Native vector search knowledge base (Weaviate RAG)&lt;/li&gt;
&lt;li&gt;Infrastructure dependency graph (Memgraph)&lt;/li&gt;
&lt;li&gt;Terraform/IaC state analysis&lt;/li&gt;
&lt;li&gt;AI-suggested code fixes with diff preview&lt;/li&gt;
&lt;li&gt;Open source (Apache 2.0) — full codebase auditable&lt;/li&gt;
&lt;li&gt;Self-hosted deployment (Docker Compose, Helm)&lt;/li&gt;
&lt;li&gt;LLM provider flexibility (OpenAI, Anthropic, Google, Ollama)&lt;/li&gt;
&lt;li&gt;Free — no licensing costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Both have:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slack incident channel management&lt;/li&gt;
&lt;li&gt;Automated postmortem/retrospective generation&lt;/li&gt;
&lt;li&gt;Action item tracking with Jira sync&lt;/li&gt;
&lt;li&gt;On-call integrations (PagerDuty, Opsgenie)&lt;/li&gt;
&lt;li&gt;Datadog, Grafana, New Relic monitoring integrations&lt;/li&gt;
&lt;li&gt;GitHub integration&lt;/li&gt;
&lt;li&gt;Runbook/workflow automation&lt;/li&gt;
&lt;li&gt;RBAC and security controls&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;FireHydrant&lt;/strong&gt; (&lt;a href="https://firehydrant.com/pricing" rel="noopener noreferrer"&gt;firehydrant.com/pricing&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free trial: 2 weeks, up to 10 responders&lt;/li&gt;
&lt;li&gt;Platform Pro: &lt;strong&gt;$9,600/year&lt;/strong&gt; (flat, up to 20 responders)&lt;/li&gt;
&lt;li&gt;Enterprise: Custom pricing (required for AI features)&lt;/li&gt;
&lt;li&gt;Alerting is consumption-based (separate from platform fee)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; — Apache 2.0, self-hosted&lt;/li&gt;
&lt;li&gt;Costs: infrastructure + LLM API usage&lt;/li&gt;
&lt;li&gt;$0 LLM cost with Ollama local models&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: FireHydrant AI features (summaries, transcripts, triage, retrospectives) are &lt;strong&gt;only available on the Enterprise tier&lt;/strong&gt;. Pro users do not get AI capabilities.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Freshworks Acquisition Factor
&lt;/h2&gt;

&lt;p&gt;FireHydrant was &lt;a href="https://firehydrant.com" rel="noopener noreferrer"&gt;acquired by Freshworks&lt;/a&gt; in December 2025. What this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The platform will be integrated into &lt;strong&gt;Freshservice&lt;/strong&gt; (Freshworks' ITSM product)&lt;/li&gt;
&lt;li&gt;Current accounts, pricing, and support stay the same during transition&lt;/li&gt;
&lt;li&gt;Long-term product direction is now under Freshworks' roadmap&lt;/li&gt;
&lt;li&gt;Some teams may want to evaluate alternatives before deeper Freshworks lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aurora is independently maintained open source — no acquisition risk, no vendor roadmap dependency.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Choose FireHydrant
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need full incident coordination&lt;/strong&gt; — roles, runbooks, status pages, service catalog, analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call transcription matters&lt;/strong&gt; — real-time Zoom/Google Meet transcription with AI summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Teams is required&lt;/strong&gt; — Aurora is Slack-only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want managed SaaS&lt;/strong&gt; — no infrastructure to maintain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're already in the Freshworks ecosystem&lt;/strong&gt; — Freshservice integration will be seamless&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Choose Aurora
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Investigation is your bottleneck&lt;/strong&gt; — you need AI that actually investigates, not just summarizes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need direct cloud querying&lt;/strong&gt; — AI agents that run commands on AWS, Azure, GCP, K8s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source is required&lt;/strong&gt; — audit how AI investigates your infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted is required&lt;/strong&gt; — compliance, data sovereignty, or air-gapped environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget is limited&lt;/strong&gt; — FireHydrant Enterprise (required for AI) is custom pricing; Aurora is free&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM flexibility&lt;/strong&gt; — choose your provider or run local models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're concerned about the acquisition&lt;/strong&gt; — Aurora has no vendor lock-in risk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want a custom integration&lt;/strong&gt; — the Arvo AI team builds custom integrations at no cost. &lt;a href="https://cal.com/arvo-ai" rel="noopener noreferrer"&gt;Reach out&lt;/a&gt; and they'll build it with you.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Limitations of Aurora
&lt;/h2&gt;

&lt;p&gt;Aurora is powerful for investigation but doesn't replace a full incident coordination platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No on-call scheduling&lt;/strong&gt; — use PagerDuty, Grafana OnCall, or Opsgenie alongside Aurora&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No status pages&lt;/strong&gt; — use Atlassian Statuspage, incident.io, or Instatus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack only&lt;/strong&gt; — no Microsoft Teams support currently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No mobile app&lt;/strong&gt; — investigation results are accessed via web dashboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOC 2 Type II in progress&lt;/strong&gt; — not yet certified (FireHydrant has SOC 2)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted requires infrastructure&lt;/strong&gt; — you maintain the Docker/K8s deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"We built Aurora for one job — investigating why incidents happen. We deliberately didn't build on-call or status pages because tools like PagerDuty and FireHydrant already do those well. Aurora is the investigation layer that plugs into your existing stack." — Noah Casarotto-Dinning, CEO at Arvo AI&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Getting Started with Aurora
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init
make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure your monitoring webhooks, add cloud credentials, and investigations start automatically. See the &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;full documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Learn more at &lt;a href="https://www.arvoai.ca" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;. For other comparisons, see &lt;a href="https://dev.to/blog/aurora-vs-traditional-incident-management-tools"&gt;Aurora vs Traditional Tools&lt;/a&gt;, &lt;a href="https://dev.to/blog/pagerduty-alternative-root-cause-analysis"&gt;PagerDuty Alternative&lt;/a&gt;, and &lt;a href="https://dev.to/blog/rootly-alternative-open-source-incident-management"&gt;Rootly Alternative&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;All claims sourced from official websites.&lt;/strong&gt; FireHydrant data from &lt;a href="https://firehydrant.com" rel="noopener noreferrer"&gt;firehydrant.com&lt;/a&gt;. Aurora data from &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;. Last verified: April 2026.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>open</category>
      <category>opensource</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Resolve.ai Alternative: Open Source AI for Incident Investigation</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Thu, 02 Apr 2026 21:44:19 +0000</pubDate>
      <link>https://forem.com/siddharth_singh_409bd5267/resolveai-alternative-open-source-ai-for-incident-investigation-347k</link>
      <guid>https://forem.com/siddharth_singh_409bd5267/resolveai-alternative-open-source-ai-for-incident-investigation-347k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; Resolve.ai is a $1B-valued AI SRE platform used by Coinbase, DoorDash, and Salesforce — but pricing requires contacting sales with no public pricing page. Aurora is an open source (Apache 2.0) alternative that delivers autonomous AI investigation with sandboxed cloud execution, infrastructure graphs, and knowledge base search — completely free and self-hosted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is Resolve.ai?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://resolve.ai" rel="noopener noreferrer"&gt;Resolve.ai&lt;/a&gt; is an AI-powered autonomous SRE platform founded in 2024 by Spiros Xanthos (former SVP at Splunk, co-creator of &lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt;) and Mayank Agarwal. It raised &lt;a href="https://resolve.ai" rel="noopener noreferrer"&gt;$125M in Series A&lt;/a&gt; at a &lt;a href="https://techcrunch.com" rel="noopener noreferrer"&gt;reported $1 billion valuation&lt;/a&gt;, backed by Lightspeed and Greylock with angels including Fei-Fei Li and Jeff Dean.&lt;/p&gt;

&lt;p&gt;Resolve.ai positions as "machines on call for humans" — a multi-agent AI system that autonomously investigates production incidents across code, infrastructure, and telemetry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notable customers:&lt;/strong&gt; Coinbase (73% faster time to root cause), DoorDash (87% faster investigations), Salesforce, MongoDB, Zscaler, Toast, Pinecone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Aurora?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; is an open source (Apache 2.0) AI agent for automated incident investigation and root cause analysis. When an alert fires, Aurora's LangGraph-orchestrated agents autonomously query your infrastructure across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes — correlating data from 25+ tools and delivering a structured RCA with remediation recommendations.&lt;/p&gt;

&lt;p&gt;Aurora is free, self-hosted, and works with any LLM provider including local models via Ollama.&lt;/p&gt;




&lt;h2&gt;
  
  
  How They Compare
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AI Investigation Approach
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Resolve.ai:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-agent architecture with parallel hypothesis testing&lt;/li&gt;
&lt;li&gt;Formulates multiple theories per incident, deploys sub-agents to investigate each simultaneously&lt;/li&gt;
&lt;li&gt;Correlates alerts across services and dependencies&lt;/li&gt;
&lt;li&gt;Constructs causal timelines linking code changes, infra events, and telemetry&lt;/li&gt;
&lt;li&gt;Generates root cause analysis with confidence scores&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://resolve.ai" rel="noopener noreferrer"&gt;Human-in-the-loop&lt;/a&gt; approval gates before automated actions&lt;/li&gt;
&lt;li&gt;Per-customer fine-tuned models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-agent architecture via LangGraph with dynamic tool selection (30+ tools)&lt;/li&gt;
&lt;li&gt;Correlates alerts across services and dependencies (AlertCorrelator + Memgraph graph)&lt;/li&gt;
&lt;li&gt;Constructs investigation timelines linking deployments, infra events, and telemetry&lt;/li&gt;
&lt;li&gt;Generates structured RCA with evidence citations and remediation steps&lt;/li&gt;
&lt;li&gt;Human-in-the-loop for write/destructive actions — read-only commands run automatically&lt;/li&gt;
&lt;li&gt;Executes &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, &lt;code&gt;gcloud&lt;/code&gt; commands in &lt;strong&gt;sandboxed Kubernetes pods&lt;/strong&gt; (non-root, read-only filesystem, capabilities dropped, seccomp enforced)&lt;/li&gt;
&lt;li&gt;Queries cloud APIs directly — AWS (STS AssumeRole), Azure (Service Principal), GCP (OAuth), OVH, Scaleway&lt;/li&gt;
&lt;li&gt;Traverses Memgraph infrastructure dependency graph for blast radius&lt;/li&gt;
&lt;li&gt;Searches Weaviate knowledge base (vector search over runbooks and past incidents)&lt;/li&gt;
&lt;li&gt;Works with any LLM provider — choose your own model&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cloud &amp;amp; Infrastructure
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Resolve.ai:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://resolve.ai/integrations" rel="noopener noreferrer"&gt;AWS and GCP confirmed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Azure is not listed on their integrations page&lt;/li&gt;
&lt;li&gt;Kubernetes support confirmed&lt;/li&gt;
&lt;li&gt;Deploys an on-premise "satellite" agent as a secure gateway — core platform runs in Resolve's cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS, Azure, GCP, OVH, Scaleway — all five with native authentication&lt;/li&gt;
&lt;li&gt;Deep Kubernetes integration via outbound WebSocket kubectl-agent&lt;/li&gt;
&lt;li&gt;Fully self-hosted — Docker Compose or Helm chart&lt;/li&gt;
&lt;li&gt;No data leaves your environment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Integrations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Resolve.ai&lt;/strong&gt; (&lt;a href="https://resolve.ai/integrations" rel="noopener noreferrer"&gt;resolve.ai/integrations&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring: Grafana, Datadog, Splunk, Prometheus, Dynatrace, Elastic, Chronosphere, Kloudfuse, OpenSearch&lt;/li&gt;
&lt;li&gt;Infrastructure: Kubernetes, AWS, GCP&lt;/li&gt;
&lt;li&gt;Code: GitHub&lt;/li&gt;
&lt;li&gt;Chat: Slack&lt;/li&gt;
&lt;li&gt;Knowledge: Notion&lt;/li&gt;
&lt;li&gt;Custom: MCP, APIs, Webhooks&lt;/li&gt;
&lt;li&gt;Total: ~17+ confirmed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora&lt;/strong&gt; (&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;github.com/Arvo-AI/aurora&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring: PagerDuty, Datadog, Grafana, New Relic, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, Splunk&lt;/li&gt;
&lt;li&gt;Cloud: AWS, Azure, GCP, OVH, Scaleway&lt;/li&gt;
&lt;li&gt;Infrastructure: Kubernetes, Terraform, Docker&lt;/li&gt;
&lt;li&gt;CI/CD: GitHub, Bitbucket, Jenkins, CloudBees, Spinnaker&lt;/li&gt;
&lt;li&gt;Docs: Confluence, Jira, SharePoint&lt;/li&gt;
&lt;li&gt;Network: Cloudflare, Tailscale&lt;/li&gt;
&lt;li&gt;Communication: Slack&lt;/li&gt;
&lt;li&gt;Total: 25+ confirmed&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Knowledge &amp;amp; Learning
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Resolve.ai:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learns from runbooks, wikis, chats, and historical incidents&lt;/li&gt;
&lt;li&gt;Builds a knowledge graph of infrastructure components&lt;/li&gt;
&lt;li&gt;Captures tribal knowledge from production systems&lt;/li&gt;
&lt;li&gt;Per-customer fine-tuned models that improve from feedback (thumbs up/down)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built-in Weaviate vector store for semantic search over runbooks, postmortems, and documentation&lt;/li&gt;
&lt;li&gt;Memgraph infrastructure dependency graph maps relationships across all cloud providers&lt;/li&gt;
&lt;li&gt;Learns from past investigations stored in the knowledge base&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Code Fixes &amp;amp; Remediation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Resolve.ai:&lt;/strong&gt; Generates remediation PRs via GitHub with supporting context. Suggests kubectl commands and scripts. All actions require human approval before execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aurora:&lt;/strong&gt; Suggests code fixes with diff preview — human reviews and creates PR with one click via GitHub and Bitbucket. Executes read-only CLI commands in sandboxed pods. Generates postmortems exportable to Confluence and Jira.&lt;/p&gt;




&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resolve.ai has, Aurora doesn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic JIRA ticket updates during investigation&lt;/li&gt;
&lt;li&gt;Enterprise support with SLAs&lt;/li&gt;
&lt;li&gt;Available on AWS Marketplace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora has, Resolve.ai doesn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure, OVH, and Scaleway cloud support&lt;/li&gt;
&lt;li&gt;Open source (Apache 2.0) — full codebase auditable&lt;/li&gt;
&lt;li&gt;Self-hosted deployment (Docker Compose, Helm)&lt;/li&gt;
&lt;li&gt;LLM provider flexibility (OpenAI, Anthropic, Google, Ollama for air-gapped)&lt;/li&gt;
&lt;li&gt;Slack incident channel creation and management&lt;/li&gt;
&lt;li&gt;PagerDuty, New Relic, BigPanda, ThousandEyes, Coroot integrations&lt;/li&gt;
&lt;li&gt;Terraform/IaC state analysis&lt;/li&gt;
&lt;li&gt;Bitbucket, Jenkins, CloudBees, Spinnaker integrations&lt;/li&gt;
&lt;li&gt;Confluence and SharePoint integration&lt;/li&gt;
&lt;li&gt;Network integrations (Cloudflare, Tailscale)&lt;/li&gt;
&lt;li&gt;Free — no licensing costs whatsoever&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Both have:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Autonomous AI incident investigation&lt;/li&gt;
&lt;li&gt;Multi-agent architecture&lt;/li&gt;
&lt;li&gt;Root cause analysis with evidence&lt;/li&gt;
&lt;li&gt;AI-suggested code fixes (human-approved PRs)&lt;/li&gt;
&lt;li&gt;Infrastructure dependency/knowledge graph&lt;/li&gt;
&lt;li&gt;Knowledge base search (runbooks, wikis, past incidents)&lt;/li&gt;
&lt;li&gt;Kubernetes investigation&lt;/li&gt;
&lt;li&gt;AWS and GCP support&lt;/li&gt;
&lt;li&gt;Datadog, Grafana, Splunk, Dynatrace integrations&lt;/li&gt;
&lt;li&gt;Slack integration&lt;/li&gt;
&lt;li&gt;RBAC and security controls&lt;/li&gt;
&lt;li&gt;AI that learns from user feedback&lt;/li&gt;
&lt;li&gt;Causal timeline construction with dependency chain mapping&lt;/li&gt;
&lt;li&gt;Human-in-the-loop for destructive actions&lt;/li&gt;
&lt;li&gt;Per-customer tuning (Resolve.ai via fine-tuned models; Aurora via open source customization)&lt;/li&gt;
&lt;li&gt;SOC 2 Type II compliance (Resolve.ai: certified; Aurora: in progress)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resolve.ai:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No public pricing page&lt;/li&gt;
&lt;li&gt;Custom enterprise pricing (contact sales)&lt;/li&gt;
&lt;li&gt;No free tier or self-service signup&lt;/li&gt;
&lt;li&gt;Target: large enterprise SRE teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; — Apache 2.0, self-hosted&lt;/li&gt;
&lt;li&gt;Costs: infrastructure (VM or K8s cluster) + LLM API usage&lt;/li&gt;
&lt;li&gt;$0 LLM cost with Ollama local models&lt;/li&gt;
&lt;li&gt;No contracts, no sales calls, no per-user pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The price difference is the core story. Resolve.ai delivers enterprise AI investigation for enterprise budgets. Aurora delivers open source AI investigation for everyone else.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Open Source vs Enterprise SaaS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resolve.ai&lt;/strong&gt; is a closed-source, cloud-hosted enterprise platform. You cannot audit the AI's reasoning, choose your own LLM, or self-host. Your incident data flows through Resolve's infrastructure (they state they don't persist raw data or train across customers).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aurora&lt;/strong&gt; is fully open source. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read every line of code the AI uses to investigate your infrastructure&lt;/li&gt;
&lt;li&gt;Self-host with zero data leaving your environment&lt;/li&gt;
&lt;li&gt;Use any LLM provider — or run local models for fully air-gapped operation&lt;/li&gt;
&lt;li&gt;Modify investigation workflows, add custom tools, fork for your needs&lt;/li&gt;
&lt;li&gt;Contribute back to the project&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When to Choose Resolve.ai
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You're a large enterprise company&lt;/strong&gt; with budget for enterprise AI tooling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed fine-tuned models&lt;/strong&gt; — you want the vendor to handle per-customer model training rather than customizing open source yourself&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need certified compliance today&lt;/strong&gt; — SOC 2 Type II, HIPAA, GDPR already certified (Aurora's SOC 2 is in progress)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed service preferred&lt;/strong&gt; — you don't want to maintain AI infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Choose Aurora
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Budget matters&lt;/strong&gt; — you can't justify custom enterprise pricing for AI investigation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source is required&lt;/strong&gt; — you need full transparency into how AI investigates your production systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted is required&lt;/strong&gt; — compliance, data sovereignty, or air-gapped environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cloud breadth&lt;/strong&gt; — you need Azure, OVH, or Scaleway alongside AWS and GCP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM flexibility&lt;/strong&gt; — you want to choose your own provider or run models locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're a startup or mid-market&lt;/strong&gt; — Resolve.ai has no mid-market pricing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want a custom integration&lt;/strong&gt; — the Arvo AI team actively builds custom integrations for companies at no cost. If there's a feature gap, &lt;a href="https://cal.com/arvo-ai" rel="noopener noreferrer"&gt;reach out&lt;/a&gt; and they'll build it with you.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Getting Started with Aurora
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init
make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure your monitoring webhooks (PagerDuty, Datadog, Grafana), add cloud provider credentials, and investigations start automatically. See the &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;full documentation&lt;/a&gt; for deployment guides.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/aurora-vs-traditional-incident-management-tools" rel="noopener noreferrer"&gt;Aurora vs Traditional Incident Management Tools&lt;/a&gt; — Comparison with Rootly, FireHydrant, incident.io&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/pagerduty-alternative-root-cause-analysis" rel="noopener noreferrer"&gt;PagerDuty Alternative for Root Cause Analysis&lt;/a&gt; — PagerDuty vs Aurora deep dive&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/rootly-alternative-open-source-incident-management" rel="noopener noreferrer"&gt;Rootly Alternative: Open Source AI Incident Management&lt;/a&gt; — Rootly vs Aurora&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;Aurora Documentation&lt;/a&gt; — Full setup and configuration guides&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/resolve-ai-alternative-open-source" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;&lt;/em&gt; by team arvoai.ca&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Rootly Alternative: Open Source AI Incident Management</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Thu, 02 Apr 2026 21:28:21 +0000</pubDate>
      <link>https://forem.com/siddharth_singh_409bd5267/rootly-alternative-open-source-ai-incident-management-4o89</link>
      <guid>https://forem.com/siddharth_singh_409bd5267/rootly-alternative-open-source-ai-incident-management-4o89</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; Rootly is an AI-native incident management platform with on-call, workflows, and AI SRE agents — starting at $20/user/month with AI SRE priced separately. Aurora is an open source (Apache 2.0) AI agent focused purely on autonomous incident investigation and root cause analysis. Rootly orchestrates your entire incident lifecycle. Aurora automates the hardest part — figuring out &lt;em&gt;why&lt;/em&gt; something broke.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is Rootly?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://rootly.com" rel="noopener noreferrer"&gt;Rootly&lt;/a&gt; describes itself as an "AI-native incident management platform" — an all-in-one tool for detecting, managing, learning from, and resolving incidents. Founded in 2021, it's used by teams at Replit, NVIDIA, LinkedIn, Figma, and &lt;a href="https://rootly.com/customers" rel="noopener noreferrer"&gt;hundreds more&lt;/a&gt;, with a &lt;a href="https://www.g2.com/products/rootly/reviews" rel="noopener noreferrer"&gt;4.8/5 rating on G2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Rootly offers three products:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incident Response&lt;/strong&gt; — Slack/Teams-native workflows, playbooks, roles, status pages, retrospectives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-Call&lt;/strong&gt; — Schedules, escalation policies, alert routing, live call routing, mobile app&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI SRE&lt;/strong&gt; — Autonomous AI agents for root cause analysis, remediation, and alert triage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is Aurora?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; is an open source AI agent that automates incident investigation. When a monitoring tool fires an alert, Aurora's LangGraph-orchestrated agents autonomously query your infrastructure across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes — correlating data from 25+ tools and delivering a structured root cause analysis with remediation recommendations.&lt;/p&gt;

&lt;p&gt;Aurora doesn't manage your incident lifecycle. It investigates the root cause.&lt;/p&gt;




&lt;h2&gt;
  
  
  How They Compare
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Incident Response &amp;amp; Coordination
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rootly&lt;/strong&gt; is a full incident lifecycle platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slack and Microsoft Teams native incident channels&lt;/li&gt;
&lt;li&gt;Automated workflows (create channels, page responders, update status)&lt;/li&gt;
&lt;li&gt;Incident roles (commander, communications lead, etc.)&lt;/li&gt;
&lt;li&gt;Playbooks and runbooks&lt;/li&gt;
&lt;li&gt;Status pages (internal and external)&lt;/li&gt;
&lt;li&gt;Action item tracking with Jira sync&lt;/li&gt;
&lt;li&gt;DORA metrics and advanced analytics&lt;/li&gt;
&lt;li&gt;Mobile app (iOS and Android)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora&lt;/strong&gt; is not a full incident coordination platform — no roles or status pages. However, Aurora does create and manage Slack incident channels, tracks action items with Jira sync, sends investigation notifications, and supports &lt;a class="mentioned-user" href="https://dev.to/aurora"&gt;@aurora&lt;/a&gt; mentions in any channel for conversational investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  On-Call Management
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rootly&lt;/strong&gt; has a full on-call product:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schedules with shadow rotations, holiday calendars, PTO overrides&lt;/li&gt;
&lt;li&gt;Escalation policies with gap detection&lt;/li&gt;
&lt;li&gt;SMS, voice, push notifications (bypass Do Not Disturb)&lt;/li&gt;
&lt;li&gt;Live call routing&lt;/li&gt;
&lt;li&gt;On-call pay calculator&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rootly.com" rel="noopener noreferrer"&gt;99.99% uptime claim&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora&lt;/strong&gt; has no on-call capabilities. No schedules, no paging, no escalation. For on-call, use Rootly, PagerDuty, Grafana OnCall, or Opsgenie alongside Aurora.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI Investigation
&lt;/h3&gt;

&lt;p&gt;This is where the tools diverge most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rootly AI SRE&lt;/strong&gt; (&lt;a href="https://rootly.com/ai-sre" rel="noopener noreferrer"&gt;rootly.com/ai-sre&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Correlates alerts with code changes, deploys, and config changes&lt;/li&gt;
&lt;li&gt;Generates root cause analysis with confidence scores&lt;/li&gt;
&lt;li&gt;Surfaces similar past incidents and proven solutions&lt;/li&gt;
&lt;li&gt;Drafts remediation steps and PRs with suggested fixes&lt;/li&gt;
&lt;li&gt;AI Meeting Bot that transcribes incident bridges in real time&lt;/li&gt;
&lt;li&gt;
&lt;a class="mentioned-user" href="https://dev.to/rootly"&gt;@rootly&lt;/a&gt; AI chat in Slack/Teams for summaries and task assignment&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://rootly.com/blog/rootly-mcp-goes-ga-up-to-95-less-tokens" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; for IDEs (Cursor, Windsurf, Claude Code)&lt;/li&gt;
&lt;li&gt;Chain-of-thought visibility ("see &lt;em&gt;why&lt;/em&gt; a root cause is flagged")&lt;/li&gt;
&lt;li&gt;Whether it directly queries cloud infrastructure APIs is unverified&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora AI Investigation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Autonomous multi-step investigation using LangGraph-orchestrated agents&lt;/li&gt;
&lt;li&gt;Dynamically selects from 30+ tools per investigation&lt;/li&gt;
&lt;li&gt;Executes &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, &lt;code&gt;gcloud&lt;/code&gt; commands in &lt;strong&gt;sandboxed Kubernetes pods&lt;/strong&gt; (non-root, read-only filesystem, capabilities dropped, seccomp enforced)&lt;/li&gt;
&lt;li&gt;Queries cloud APIs directly — AWS (STS AssumeRole), Azure (Service Principal), GCP (OAuth), OVH, Scaleway&lt;/li&gt;
&lt;li&gt;Traverses Memgraph infrastructure dependency graph for blast radius&lt;/li&gt;
&lt;li&gt;Searches Weaviate knowledge base (vector search over runbooks, past postmortems)&lt;/li&gt;
&lt;li&gt;Generates structured RCA with timeline, evidence citations, and remediation&lt;/li&gt;
&lt;li&gt;Generates code fix pull requests via GitHub and Bitbucket&lt;/li&gt;
&lt;li&gt;Exports postmortems to Confluence and Jira&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Knowledge Base
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rootly:&lt;/strong&gt; Surfaces similar past incidents during investigations. Integrates with &lt;a href="https://rootly.com/integrations" rel="noopener noreferrer"&gt;Glean&lt;/a&gt; for broader knowledge search. No native vector search product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aurora:&lt;/strong&gt; Built-in Weaviate-powered vector store. Upload runbooks, past postmortems, and documentation — the AI agent searches them using semantic similarity during every investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Postmortems
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rootly:&lt;/strong&gt; AI-generated retrospectives with context, timelines, and custom templates. Collaborative editing. Jira sync for action items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aurora:&lt;/strong&gt; AI-generated postmortems with timeline, root cause, impact assessment, and remediation steps. One-click export to Confluence and Jira.&lt;/p&gt;




&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rootly has, Aurora doesn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call scheduling, escalation policies, paging (SMS/voice/push)&lt;/li&gt;
&lt;li&gt;Microsoft Teams support (Aurora is Slack-only)&lt;/li&gt;
&lt;li&gt;Automated incident workflows (create channels, page responders, update status)&lt;/li&gt;
&lt;li&gt;Status pages (internal and external)&lt;/li&gt;
&lt;li&gt;Incident roles&lt;/li&gt;
&lt;li&gt;DORA metrics and analytics&lt;/li&gt;
&lt;li&gt;Mobile app (iOS, Android)&lt;/li&gt;
&lt;li&gt;MCP server for IDEs&lt;/li&gt;
&lt;li&gt;AI Meeting Bot for incident bridges&lt;/li&gt;
&lt;li&gt;SOC 2 Type II, HIPAA, GDPR, CCPA compliance&lt;/li&gt;
&lt;li&gt;70+ integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora has, Rootly doesn't (or is unverified):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct cloud infrastructure querying (AWS, Azure, GCP, OVH, Scaleway APIs)&lt;/li&gt;
&lt;li&gt;CLI command execution in sandboxed Kubernetes pods&lt;/li&gt;
&lt;li&gt;Native vector search knowledge base (Weaviate RAG)&lt;/li&gt;
&lt;li&gt;Infrastructure dependency graph (Memgraph)&lt;/li&gt;
&lt;li&gt;Terraform/IaC state analysis&lt;/li&gt;
&lt;li&gt;Open source (Apache 2.0) — full codebase auditable&lt;/li&gt;
&lt;li&gt;Self-hosted deployment (Docker Compose, Helm)&lt;/li&gt;
&lt;li&gt;LLM provider flexibility including local models (Ollama for air-gapped)&lt;/li&gt;
&lt;li&gt;Free — no per-user or per-incident pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Both have:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-powered root cause analysis&lt;/li&gt;
&lt;li&gt;Code fix PR generation&lt;/li&gt;
&lt;li&gt;Automated postmortem generation&lt;/li&gt;
&lt;li&gt;PagerDuty, Datadog, Grafana integrations&lt;/li&gt;
&lt;li&gt;GitHub integration&lt;/li&gt;
&lt;li&gt;Confluence integration&lt;/li&gt;
&lt;li&gt;HashiCorp Vault integration&lt;/li&gt;
&lt;li&gt;BYOK for LLM providers&lt;/li&gt;
&lt;li&gt;Slack incident channels&lt;/li&gt;
&lt;li&gt;Action item tracking with Jira sync&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rootly&lt;/strong&gt; (&lt;a href="https://rootly.com/pricing" rel="noopener noreferrer"&gt;rootly.com/pricing&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incident Response Essentials: &lt;strong&gt;$20/user/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;On-Call Essentials: &lt;strong&gt;$20/user/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;AI SRE: &lt;strong&gt;Contact sales&lt;/strong&gt; (no published price)&lt;/li&gt;
&lt;li&gt;Enterprise tiers: Contact sales&lt;/li&gt;
&lt;li&gt;Bundle discounts available for IR + On-Call + AI SRE&lt;/li&gt;
&lt;li&gt;Startup discount: up to 50% off (&amp;lt;100 employees, &amp;lt;$50M raised)&lt;/li&gt;
&lt;li&gt;Free 14-day trial&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Aurora:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt; — Apache 2.0, self-hosted&lt;/li&gt;
&lt;li&gt;Costs: infrastructure (VM or K8s cluster) + LLM API usage&lt;/li&gt;
&lt;li&gt;$0 LLM cost possible with Ollama local models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example: 20-person SRE team&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For Rootly IR + On-Call: $20 + $20 = $40/user/month x 20 = &lt;strong&gt;$800/month&lt;/strong&gt; (before AI SRE add-on, which is priced separately via sales).&lt;/p&gt;

&lt;p&gt;For Aurora: &lt;strong&gt;$0&lt;/strong&gt; + infrastructure + LLM API.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Rootly pricing from &lt;a href="https://rootly.com/pricing" rel="noopener noreferrer"&gt;rootly.com/pricing&lt;/a&gt;. AI SRE pricing is not publicly listed.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Open Source vs SaaS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rootly&lt;/strong&gt; is SaaS-only. The core platform is proprietary. They have &lt;a href="https://github.com/rootlyhq" rel="noopener noreferrer"&gt;open source tooling on GitHub&lt;/a&gt; (Terraform provider with 400,000+ downloads, Backstage plugin, CLI, SDKs) but not the platform itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aurora&lt;/strong&gt; is fully open source under Apache 2.0. The entire codebase — backend, frontend, agent orchestration — is on &lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audit exactly what the AI does on your infrastructure&lt;/li&gt;
&lt;li&gt;Modify investigation workflows and add custom tools&lt;/li&gt;
&lt;li&gt;Fork and customize for your organization&lt;/li&gt;
&lt;li&gt;Run fully air-gapped with local LLMs via Ollama&lt;/li&gt;
&lt;li&gt;Keep all incident data in your own environment&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When to Choose Rootly
&lt;/h2&gt;

&lt;p&gt;Rootly is the better choice when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need a full incident lifecycle platform&lt;/strong&gt; — on-call, workflows, status pages, roles, retrospectives, DORA metrics in one tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack/Teams-native workflows matter&lt;/strong&gt; — Rootly's incident channels and AI chat are deeply embedded in collaboration tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance requirements&lt;/strong&gt; — SOC 2 Type II, HIPAA, GDPR out of the box&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want managed SaaS&lt;/strong&gt; — no infrastructure to maintain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need a mobile app&lt;/strong&gt; — iOS and Android for on-call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise support&lt;/strong&gt; — dedicated support, SLAs, BAA for HIPAA&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Choose Aurora
&lt;/h2&gt;

&lt;p&gt;Aurora is the better choice when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Investigation is your bottleneck&lt;/strong&gt; — your team spends hours diagnosing incidents manually&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need deep cloud investigation&lt;/strong&gt; — AI agents that directly query AWS, Azure, GCP, and Kubernetes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want open source&lt;/strong&gt; — full transparency into how AI investigates your infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted is required&lt;/strong&gt; — compliance, data sovereignty, or air-gapped environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget is limited&lt;/strong&gt; — free forever, no per-user pricing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM flexibility matters&lt;/strong&gt; — bring any provider, including local models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You already have on-call&lt;/strong&gt; — PagerDuty, Grafana OnCall, or Opsgenie handles paging; you need the investigation layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want a custom integration&lt;/strong&gt; — Aurora is open source and the Arvo AI team actively builds custom integrations for companies that need them — at no cost. If there's a feature gap, &lt;a href="https://cal.com/arvo-ai" rel="noopener noreferrer"&gt;reach out&lt;/a&gt; and they'll build it with you.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Using Rootly + Aurora Together
&lt;/h2&gt;

&lt;p&gt;They're not mutually exclusive. Rootly manages your incident lifecycle; Aurora investigates the root cause:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Alert fires&lt;/strong&gt; → Rootly creates incident channel, pages on-call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same alert&lt;/strong&gt; → Aurora receives webhook, starts AI investigation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rootly&lt;/strong&gt; coordinates the response (roles, comms, status page)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora&lt;/strong&gt; investigates in the background (queries cloud, checks K8s, searches knowledge base)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-call SRE&lt;/strong&gt; finds Aurora's completed RCA with root cause and remediation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora&lt;/strong&gt; generates postmortem → exports to Confluence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rootly&lt;/strong&gt; tracks action items → syncs to Jira&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Getting Started with Aurora
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init
make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure your monitoring webhooks (PagerDuty, Datadog, Grafana), add cloud provider credentials, and investigations start automatically. See the &lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;full documentation&lt;/a&gt; for deployment guides.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/aurora-vs-traditional-incident-management-tools" rel="noopener noreferrer"&gt;Aurora vs Traditional Incident Management Tools&lt;/a&gt; — Comparison with Rootly, FireHydrant, incident.io&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/pagerduty-alternative-root-cause-analysis" rel="noopener noreferrer"&gt;PagerDuty Alternative for Root Cause Analysis&lt;/a&gt; — PagerDuty vs Aurora deep dive&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/open-source-incident-management" rel="noopener noreferrer"&gt;Open Source Incident Management: Why It Matters&lt;/a&gt; — The case for self-hosted tools&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;Aurora Documentation&lt;/a&gt; — Full setup and configuration guides&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/rootly-alternative-open-source-incident-management" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;&lt;/em&gt; by team arvoai.ca &lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>PagerDuty Alternative for Root Cause Analysis: Why SRE Teams Are Adding AI Investigation</title>
      <dc:creator>Siddharth Singh</dc:creator>
      <pubDate>Wed, 01 Apr 2026 21:36:15 +0000</pubDate>
      <link>https://forem.com/siddharth_singh_409bd5267/pagerduty-alternative-for-root-cause-analysis-why-sre-teams-are-adding-ai-investigation-3np2</link>
      <guid>https://forem.com/siddharth_singh_409bd5267/pagerduty-alternative-for-root-cause-analysis-why-sre-teams-are-adding-ai-investigation-3np2</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; PagerDuty is the industry standard for alerting and on-call management — but it doesn't investigate &lt;em&gt;why&lt;/em&gt; incidents happen. Aurora is an open source AI agent that plugs into PagerDuty via webhooks and autonomously investigates root causes across AWS, Azure, GCP, and Kubernetes. They're complementary tools, but for teams spending hours on manual RCA, Aurora fills the gap PagerDuty doesn't cover.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;PagerDuty has over &lt;a href="https://www.pagerduty.com" rel="noopener noreferrer"&gt;30,000 customers&lt;/a&gt; and dominates on-call management. It's excellent at what it does: detecting alerts, routing them to the right person, coordinating incident response, and tracking SLAs.&lt;/p&gt;

&lt;p&gt;But here's the problem: &lt;strong&gt;PagerDuty pages you. Then you're on your own.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The actual investigation — SSHing into servers, querying CloudWatch, checking Kubernetes pod logs, correlating deployments with error spikes — is still manual. According to the &lt;a href="https://www.thevoid.community/" rel="noopener noreferrer"&gt;VOID (Verica Open Incident Database)&lt;/a&gt;, the median incident involves 3.5 contributing factors, and the investigation phase consumes the majority of mean time to resolve (MTTR).&lt;/p&gt;

&lt;p&gt;This is the gap Aurora fills.&lt;/p&gt;




&lt;h2&gt;
  
  
  PagerDuty vs Aurora: Different Tools, Different Jobs
&lt;/h2&gt;

&lt;p&gt;This isn't a "which is better" comparison. PagerDuty and Aurora solve different problems:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;PagerDuty&lt;/th&gt;
&lt;th&gt;Aurora&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary job&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Alert routing, on-call, coordination&lt;/td&gt;
&lt;td&gt;Root cause investigation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Answers the question&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Who needs to know and how do we coordinate?"&lt;/td&gt;
&lt;td&gt;"Why did this happen and what should we fix?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trigger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monitoring tool fires alert&lt;/td&gt;
&lt;td&gt;PagerDuty webhook (or Datadog, Grafana, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Engineer gets paged, war room opens&lt;/td&gt;
&lt;td&gt;Structured RCA with timeline, root cause, remediation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;They work together.&lt;/strong&gt; Aurora ingests PagerDuty &lt;code&gt;incident.triggered&lt;/code&gt; webhooks. When PagerDuty pages your SRE, Aurora is already investigating in the background.&lt;/p&gt;




&lt;h2&gt;
  
  
  What PagerDuty Does Well
&lt;/h2&gt;

&lt;p&gt;PagerDuty's strengths are real and well-established:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On-call scheduling&lt;/strong&gt; — Flexible rotations, escalation policies, shift overrides&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert routing&lt;/strong&gt; — &lt;a href="https://www.pagerduty.com/integrations/" rel="noopener noreferrer"&gt;700+ integrations&lt;/a&gt; for ingesting alerts from every monitoring tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-channel paging&lt;/strong&gt; — SMS, phone, push notifications, email&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident coordination&lt;/strong&gt; — War rooms, stakeholder communications, status pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLA tracking&lt;/strong&gt; — Urgency-based alerting and escalation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI noise reduction&lt;/strong&gt; — &lt;a href="https://www.pagerduty.com/platform/aiops/" rel="noopener noreferrer"&gt;AIOps add-on&lt;/a&gt; claims 91% alert noise reduction via intelligent correlation and deduplication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PagerDuty has also added AI features through &lt;a href="https://www.pagerduty.com/platform/aiops/" rel="noopener noreferrer"&gt;PagerDuty Advance&lt;/a&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI incident summaries ("catch me up" in Slack)&lt;/li&gt;
&lt;li&gt;AI-generated status updates&lt;/li&gt;
&lt;li&gt;AI postmortem drafts (Beta)&lt;/li&gt;
&lt;li&gt;SRE Agent for triage and approved remediation actions&lt;/li&gt;
&lt;li&gt;Probable Origin for pattern-based root cause suggestions&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Where PagerDuty Stops
&lt;/h2&gt;

&lt;p&gt;Despite the AI additions, PagerDuty's investigation capabilities have limits:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No autonomous multi-step investigation.&lt;/strong&gt; PagerDuty's SRE Agent surfaces past incidents and patterns, but it doesn't autonomously query your AWS accounts, check Kubernetes pod status, correlate Terraform changes, or trace dependency graphs. The investigation itself is still on the engineer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No native cloud infrastructure querying.&lt;/strong&gt; PagerDuty receives alerts &lt;em&gt;from&lt;/em&gt; CloudWatch, Azure Monitor, etc. — it doesn't query them directly. It can't run &lt;code&gt;kubectl get pods&lt;/code&gt; or &lt;code&gt;aws cloudwatch get-metric-data&lt;/code&gt; on your behalf during an investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No knowledge base with vector search.&lt;/strong&gt; PagerDuty's RAG capability is partial — it requires configuring &lt;a href="https://www.pagerduty.com/integrations/" rel="noopener noreferrer"&gt;Amazon Q Business&lt;/a&gt; as an external integration. There's no native vector search over your runbooks and past postmortems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No code fix suggestions.&lt;/strong&gt; PagerDuty can surface recent code changes that may be related to an incident, but it doesn't generate remediation code or create pull requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI features are paid add-ons.&lt;/strong&gt; AIOps starts at &lt;a href="https://www.pagerduty.com/pricing/" rel="noopener noreferrer"&gt;$699/month&lt;/a&gt;. PagerDuty Advance starts at &lt;a href="https://www.pagerduty.com/pricing/" rel="noopener noreferrer"&gt;$415/month&lt;/a&gt;. These are on top of per-user pricing ($21-$41+/user/month depending on tier).&lt;/p&gt;




&lt;h2&gt;
  
  
  What Aurora Does Differently
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Arvo-AI/aurora" rel="noopener noreferrer"&gt;Aurora&lt;/a&gt; is an open source (Apache 2.0) AI agent that automates the investigation phase — the part that happens &lt;em&gt;after&lt;/em&gt; you get paged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autonomous Investigation
&lt;/h3&gt;

&lt;p&gt;When Aurora receives an alert webhook, its LangGraph-orchestrated AI agents:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Analyze the alert context (severity, service, timing)&lt;/li&gt;
&lt;li&gt;Dynamically select from 30+ tools to investigate&lt;/li&gt;
&lt;li&gt;Execute &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, &lt;code&gt;az&lt;/code&gt;, &lt;code&gt;gcloud&lt;/code&gt; commands in &lt;strong&gt;sandboxed Kubernetes pods&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Query logs, metrics, and recent deployments across cloud providers&lt;/li&gt;
&lt;li&gt;Search the knowledge base for relevant runbooks and past incidents&lt;/li&gt;
&lt;li&gt;Traverse the infrastructure dependency graph for blast radius&lt;/li&gt;
&lt;li&gt;Synthesize everything into a structured root cause analysis&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No human in the loop during investigation. The SRE gets paged by PagerDuty and finds a completed RCA waiting in Aurora.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Cloud Native
&lt;/h3&gt;

&lt;p&gt;Aurora connects directly to your cloud infrastructure:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Authentication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;STS AssumeRole (temporary credentials)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service Principal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OAuth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OVH&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaleway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kubeconfig via outbound WebSocket agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  25+ Verified Integrations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PagerDuty, Datadog, Grafana, New Relic, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, Splunk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS, Azure, GCP, OVH, Scaleway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kubernetes, Terraform, Docker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GitHub, Bitbucket, Jenkins, CloudBees, Spinnaker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docs &amp;amp; Knowledge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Confluence, Jira, SharePoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloudflare, Tailscale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Communication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slack&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Knowledge Base with RAG
&lt;/h3&gt;

&lt;p&gt;Aurora includes a built-in Weaviate-powered vector store. Upload your runbooks, past postmortems, and documentation — the AI agent searches them during every investigation using semantic similarity, not just keyword matching.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI Code Fix Suggestions
&lt;/h3&gt;

&lt;p&gt;Aurora can generate pull requests with remediation code via its GitHub and Bitbucket integrations. It doesn't just tell you what's wrong — it suggests how to fix it with actual code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated Postmortems
&lt;/h3&gt;

&lt;p&gt;Structured postmortem documents generated automatically with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incident timeline with timestamps&lt;/li&gt;
&lt;li&gt;Root cause identification with evidence and citations&lt;/li&gt;
&lt;li&gt;Impact assessment&lt;/li&gt;
&lt;li&gt;Remediation steps (taken and recommended)&lt;/li&gt;
&lt;li&gt;One-click export to Confluence or Jira&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;PagerDuty&lt;/th&gt;
&lt;th&gt;Aurora&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On-call scheduling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (core)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alert routing &amp;amp; escalation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (core)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SMS/phone/push paging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (core)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Status pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (add-on, &lt;a href="https://www.pagerduty.com/pricing/" rel="noopener noreferrer"&gt;from $89/mo&lt;/a&gt;)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SLA/SLO tracking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Autonomous AI investigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Partial (SRE Agent for triage)&lt;/td&gt;
&lt;td&gt;Yes (full multi-step)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Native cloud querying&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No (receives alerts)&lt;/td&gt;
&lt;td&gt;Yes (AWS, Azure, GCP, OVH, Scaleway)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CLI execution on infra&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Via &lt;a href="https://www.pagerduty.com/platform/automation/" rel="noopener noreferrer"&gt;Runbook Automation add-on&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Yes (sandboxed K8s pods)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge base (RAG)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Via Amazon Q Business integration&lt;/td&gt;
&lt;td&gt;Yes (native Weaviate)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure graph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (Memgraph)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI postmortems&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Beta (via Jeli)&lt;/td&gt;
&lt;td&gt;Yes (with Confluence export)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI code fix PRs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (GitHub, Bitbucket)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No (Rundeck only)&lt;/td&gt;
&lt;td&gt;Yes (Apache 2.0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No (SaaS only)&lt;/td&gt;
&lt;td&gt;Yes (Docker, Helm)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM provider choice&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No (undisclosed, fixed)&lt;/td&gt;
&lt;td&gt;Yes (OpenAI, Anthropic, Google, Ollama)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integrations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.pagerduty.com/integrations/" rel="noopener noreferrer"&gt;700+&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;25+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.pagerduty.com/pricing/" rel="noopener noreferrer"&gt;From $21/user/mo&lt;/a&gt; + AI add-ons ($415-$699/mo)&lt;/td&gt;
&lt;td&gt;Free (self-hosted)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Cost Comparison
&lt;/h2&gt;

&lt;p&gt;For a team of 20 SREs on PagerDuty Business with AI features:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Line Item&lt;/th&gt;
&lt;th&gt;PagerDuty&lt;/th&gt;
&lt;th&gt;Aurora&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Base platform&lt;/td&gt;
&lt;td&gt;$41/user/mo x 20 = &lt;strong&gt;$820/mo&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AIOps&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$699/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PagerDuty Advance (GenAI)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$415/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Status pages&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$89/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not included&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$2,023/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0 + infra + LLM API&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Aurora's costs are infrastructure (a VM or K8s cluster) and LLM API usage. With Ollama running local models, the LLM cost is also $0.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: PagerDuty pricing verified from &lt;a href="https://www.pagerduty.com/pricing/" rel="noopener noreferrer"&gt;pagerduty.com/pricing&lt;/a&gt; as of March 2026. Aurora is free under Apache 2.0.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  When to Use PagerDuty + Aurora Together
&lt;/h2&gt;

&lt;p&gt;The strongest setup is running both:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;PagerDuty&lt;/strong&gt; receives alerts from your monitoring tools (Datadog, CloudWatch, Grafana)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PagerDuty&lt;/strong&gt; pages the right on-call engineer via SMS/phone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora&lt;/strong&gt; receives the same alert via PagerDuty webhook (&lt;code&gt;incident.triggered&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora's AI agents&lt;/strong&gt; investigate autonomously in the background&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The on-call SRE&lt;/strong&gt; opens Aurora and finds a completed RCA with root cause, timeline, and remediation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora&lt;/strong&gt; generates the postmortem and exports it to Confluence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;PagerDuty handles the &lt;em&gt;who&lt;/em&gt; and &lt;em&gt;when&lt;/em&gt;. Aurora handles the &lt;em&gt;why&lt;/em&gt; and &lt;em&gt;how to fix it&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Aurora Alone Might Be Enough
&lt;/h2&gt;

&lt;p&gt;For smaller teams or budget-conscious organizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You don't need enterprise on-call&lt;/strong&gt; — Your team is small enough that a simple rotation works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You already have alerting&lt;/strong&gt; — Datadog, Grafana, or CloudWatch can send webhooks directly to Aurora&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation is your bottleneck&lt;/strong&gt; — You're spending more time diagnosing than coordinating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need self-hosted&lt;/strong&gt; — Compliance or security requires keeping incident data on-premise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget is limited&lt;/strong&gt; — PagerDuty + AI add-ons at $2,000+/mo isn't feasible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aurora can ingest webhooks directly from any monitoring tool — PagerDuty is not required.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Arvo-AI/aurora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aurora
make init
make prod-prebuilt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure your PagerDuty webhook to point at Aurora, add your cloud provider credentials, and investigations start automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/aurora-vs-traditional-incident-management-tools" rel="noopener noreferrer"&gt;Aurora vs Traditional Incident Management Tools&lt;/a&gt; — Comparison with Rootly, FireHydrant, incident.io&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/root-cause-analysis-complete-guide-sres" rel="noopener noreferrer"&gt;Root Cause Analysis: The Complete Guide for SREs&lt;/a&gt; — RCA techniques from manual to AI-powered&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.arvoai.ca/blog/open-source-incident-management" rel="noopener noreferrer"&gt;Open Source Incident Management: Why It Matters&lt;/a&gt; — The case for self-hosted tools&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arvo-ai.github.io/aurora/" rel="noopener noreferrer"&gt;Aurora Documentation&lt;/a&gt; — Full setup and configuration guides&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.pagerduty.com/pricing/" rel="noopener noreferrer"&gt;PagerDuty Pricing&lt;/a&gt; — Official PagerDuty pricing page&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.pagerduty.com/platform/aiops/" rel="noopener noreferrer"&gt;PagerDuty AIOps&lt;/a&gt; — PagerDuty's AI features&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.arvoai.ca/blog/pagerduty-alternative-root-cause-analysis" rel="noopener noreferrer"&gt;arvoai.ca&lt;/a&gt;&lt;/em&gt; by team &lt;a href="https://www.arvoai.ca" rel="noopener noreferrer"&gt;https://www.arvoai.ca&lt;/a&gt; &lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>opensource</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
