Forem: Siddharth Singh

Aurora Actions: User-Defined Background Automations for Incident Response

Siddharth Singh — Mon, 11 May 2026 17:49:20 +0000

Key Takeaways

Aurora Actions are reusable, natural-language automations that Aurora's agent executes in the background using all 22+ connected integrations. Available today on the main branch of Aurora.

Three trigger types out of the box: manual ("run now"), on incident completion (chain follow-up work after every RCA), and recurring schedule (Celery Beat–driven intervals).

Same agent, same tools, different prompt scaffolding. Actions reuse Aurora's existing LangGraph agent and 30+ tools (kubectl, aws, gcloud, az, Terraform, Confluence, Slack, GitHub) — they just run as background chat sessions with eager-loaded skills and no RCA mandate.

/action <name> is a first-class chat primitive. Slash-command autocomplete in the chat input, "Run Action" dropdown on completed incidents, and full RBAC-gated CRUD UI in Settings.

Aurora Actions turn the agent into a programmable platform. This is the building block for CI/CD auto-remediation, scheduled audits, and post-incident health checks — covered in our CI/CD Auto-Remediation guide.

We shipped one of the most-requested features in Aurora's history: Aurora Actions — user-defined background automations that run on Aurora's agent. An Aurora Action is a named, natural-language instruction the user writes once and then triggers manually, on incident completion, or on a recurring schedule; Aurora's agent executes it as a background task with full access to every connected integration. Where traditional incident management tools force you to pick from a fixed catalog of "automations" (close incident, post to Slack, run runbook), Actions are written in plain English and inherit the full reasoning capability of the agent.

This post is for SRE and platform teams already running Aurora — or evaluating it — who want to understand what Actions actually do, where they fit on the agentic spectrum, and how to use them safely.

What is an Aurora Action?

An Aurora Action has four parts:

A name — used as the slash-command handle (/action <name>) and as the dropdown label on incident cards.
A natural-language instruction — the prompt the agent will execute. The same instruction the user would type into chat, except it can reference incident context placeholders when triggered post-incident.
A trigger type — manual, on-incident-completion, or on-schedule (interval-based via Celery Beat).
An on/off toggle — actions can be disabled without deletion, with full RBAC for who can create, edit, or trigger them.

The implementation is a thin layer over Aurora's existing chat agent. When an Action triggers, the executor service creates a background chat session with the action's instruction as the user message, runs it through the same LangGraph workflow that powers interactive chat, and persists the run history. The agent has full tool access (kubectl, cloud CLIs, Terraform, Slack, GitHub, Confluence, Memgraph, Weaviate) and eager-loaded skills — the only differences from interactive chat are scaffolded prompts and the absence of any RCA mandate.

Why this matters

Most incident management automation today is workflow automation: PagerDuty fires, Slack channel is created, status page is updated, runbook link is posted. The "automation" is a directed graph of static actions. There is no reasoning, no investigation, no judgment. Tools like Rootly, FireHydrant, and incident.io are excellent at this — but they don't do anything an SRE wouldn't have to manually verify after the fact.

Aurora's bet has always been the opposite: automate the investigation itself. Aurora Actions extend that bet from one-shot incident investigations to recurring or post-incident workflows. A few concrete examples:

Noisy alert tuning — "Every Friday at 5pm, review which Datadog alerts fired more than 20 times this week with mean time-to-acknowledge over 10 minutes. Open a Terraform PR to widen the thresholds or move them to a warning channel."
Post-incident health check — "After every completed RCA, run a 15-minute observation on the affected service: check error rate, p99 latency, and pod restart count. Post results to #incident-followup."
Scheduled infrastructure audit — "Every Monday at 9am, audit IAM roles in the production AWS account that have not been used in 90 days. List candidates for removal in a Confluence page."

None of these are runbook automation. Each requires the agent to query infrastructure, reason about results, and produce a structured output. Each one was previously the job of an on-call engineer doing follow-up between pages.

Where Actions sit on the agentic capability spectrum

In our Open-Source AI SRE comparison, we proposed a four-level spectrum for AI SRE capability. Actions don't change the level — they change when the agent runs.

When the agent runs	Trigger	Pre-Actions example	With Actions
On alert	Webhook from PagerDuty / Datadog / Grafana	Aurora investigates the alert and produces an RCA	Same — investigation flow is unchanged
On user request	Engineer asks a question in chat	Aurora answers using tools	Same — plus `/action <name>` shortcuts
After every incident	Incident state transitions to "resolved"	Postmortem generated; engineer manually does follow-up checks	Action runs automatically with incident context in scope
On a schedule	Celery Beat cron	No equivalent — required external scheduler + custom code	Single source of truth: agent runs the prompt on cadence

The post-incident and scheduled triggers are the genuinely new capability. Before Actions, anything recurring or post-incident required gluing Aurora to an external scheduler, an external prompt store, and bespoke trigger code. Actions collapse all three into the product surface.

How Actions work under the hood

This is for the technically curious. A few architecturally interesting things from the implementation:

1. Background chat sessions, not a separate runtime. When an Action triggers, the executor service creates a regular chat session with the action's instruction as the seed message and dispatches it as a background Celery task. The agent doesn't know it's running an Action — it just runs the workflow. This means every capability the interactive agent has (tool calls, RAG, graph traversal, sub-agent orchestration) is available inside Actions for free.

2. Eager-loaded skills, no RCA mandate. Interactive chat lazy-loads skills based on the user message. Background actions eager-load all skills because there is no human to clarify ambiguity. The system prompt also strips the "your job is to find root cause" framing — Actions can do anything the agent can do, not just investigate.

3. RLS context is preserved. Aurora uses PostgreSQL row-level security for multi-tenancy. The executor explicitly sets RLS context (org_id, user_id) before running so background tasks see only their own org's data — even though they run under a service identity.

4. Stale run cleanup is integrated. Aurora's existing background-chat janitor already handles orphaned chat sessions from crashed pods. Action runs go through the same path, so a worker pod dying mid-action doesn't leave the run state inconsistent.

5. RBAC is enforced at the route layer. Action CRUD is gated by Aurora's Casbin-based RBAC. Org admins can restrict which roles can create or trigger actions — important because an Action with cloud-CLI access has real blast radius.

Trigger types in detail

Manual triggers

The simplest case. An admin creates the action, an engineer triggers it from the Actions page or via /action <name> in chat. Useful for codifying common operational tasks ("rotate ECS task definitions for service X", "scan Confluence for stale runbooks") into named, repeatable commands.

The chat integration is worth calling out: /action is implemented as an LLM tool call using the same pattern as Aurora's /rca slash command. The agent processes the action dispatch and then continues responding to the rest of the user's message — so you can write "kick off the IAM audit and tell me what changed since last week" and the agent will dispatch the audit action and answer your question in the same turn.

On-incident-completion triggers

When an incident transitions to "resolved", any action with this trigger type runs against the incident context. The incident's metadata, RCA, and timeline are available to the action's agent without the user having to paste anything in. This is the trigger that turns Aurora from a reactive tool ("investigate this page") into a continuous one ("investigate, then run health checks, then file the postmortem").

Scheduled triggers

Interval-based, driven by Celery Beat. Choose a cadence (every N minutes / hours / days), and the action runs without user involvement. This is the building block for the CI/CD auto-remediation and scheduled audit use cases — and it's why we're calling this post and the CI/CD Auto-Remediation guide sister posts.

What Actions don't do (and why)

A few capability decisions worth being explicit about:

No external webhook triggers in this release. We could have added "trigger on arbitrary webhook" but it overlaps with the existing alert-triggered investigation flow. We may add it if we see demand for triggers from systems that don't go through PagerDuty / Datadog / Grafana.
No agent-authored Actions yet. The agent can't create or modify Actions on its own. Self-modification is a serious security boundary; we'd want approval gating and audit logging before opening that door. (See our AI Agent kubectl Safety guide for the threat model.)
No conditional / DAG composition in this release. Actions are single-prompt for now. If you need a multi-step workflow, write a single prompt that describes the steps — the agent is good at sequencing. We'll add explicit composition if the natural-language form proves limiting.

Safety: what to think about before enabling

Every Action is a small program with access to your cloud environment. A few rules we use ourselves:

Start read-only. Actions inherit Aurora's tool permissions. If your tool config restricts write actions (no kubectl apply, no aws ec2 terminate-instances), Actions inherit that posture. Keep it that way for the first few weeks.
Use scheduled triggers conservatively. A daily audit is cheap. A 5-minute polling loop with cloud CLI calls is not. Watch the LLM bill.
Audit who can create Actions. RBAC defaults to org-admin-only creation. Leave it there unless you have a clear reason to widen.
Pin the model. Action prompts can be sensitive to model behavior. Pin a known-good model per action (gpt-5.5, claude-sonnet-4.6, opus-4.7, etc.) using Aurora's per-org model dropdown until you have confidence in cross-model stability.
Review action runs weekly. Every action has a run-history view. Spend 10 minutes a week reading the agent's traces for your scheduled actions — anomalous reasoning is the leading indicator of prompt drift or tool drift.

How to ship your first Action

A six-step recipe.

1. Pick a recurring task you currently do manually

Anything you do every week or after every incident. Examples: stale-PR review, alert-noise audit, on-call handover summary. The smaller and more deterministic, the better for v1.

2. Write the prompt as if you were typing it into chat

Don't translate to "automation language." Write it the way you would write a chat message to a smart junior SRE. "Look at..." "Check whether..." "Open a PR that..."

3. Create the Action with a manual trigger

Settings → Actions → New Action. Paste the prompt, set trigger = manual, leave it disabled if you want to review before enabling. Trigger it once and watch the run.

4. Inspect the run trace

Click the run in the history view. Read every tool call. Look for: tool misuse (wrong cloud account), excessive tool calls (3 attempts at the same thing), hallucinated paths or resource IDs. Iterate on the prompt until the trace is clean for three consecutive runs.

5. Promote to the right trigger type

If the action makes sense after every incident → on-incident-completion. If it's a routine sweep → on-schedule with the longest cadence that still meets your need. Only use short cadences when you have a clear cost and blast-radius understanding.

6. Add it to your team's incident review

Treat agent runs the same way you treat human runs: include them in your weekly incident review. Look for actions that produced wrong output, actions that nobody read the output of, and actions that produced output nobody acted on. Delete or downgrade as needed.

Aurora Actions vs traditional incident-management automation

The category most people compare us to is "workflow automation in incident-management SaaS" — Rootly, FireHydrant, incident.io. The comparison is informative but ultimately category-different:

Capability	Aurora Actions	Rootly / FireHydrant / incident.io workflows
Authoring	Natural language	DSL or visual builder
Reasoning	Yes — LLM agent	No — fixed conditional graph
Tool reach	Cloud CLIs, kubectl, Terraform, Slack, Confluence, GitHub, RAG, infra graph	Slack, status pages, Zoom, runbook links, ticket creation
Scheduled execution	Yes (Celery Beat)	Limited (some support timed reminders)
Post-incident chaining	Yes — full incident context available	Yes — but limited to workflow actions
Open source	Yes (Apache 2.0, self-hosted)	No
Pricing	Free (self-hosted; LLM tokens only)	Per-user SaaS

The honest framing: traditional incident-management tools automate the process around the incident. Aurora Actions automate what happens inside the agent. Both have value; they cover non-overlapping work. If you live in PagerDuty and use Rootly for incident channels, Aurora Actions sit alongside that — they don't replace it.

What's next

Aurora Actions is the foundation for several capabilities on our roadmap:

DAG composition — explicit multi-step Action chains where each step is itself an Action.
Approval gates — Actions that pause for human approval before destructive tool calls (already supported in chat; explicit Action-level gating coming).
CI/CD auto-remediation hooks — first-class integration with GitHub Actions, Jenkins, and ArgoCD so a failing pipeline becomes a triggered Aurora investigation. (Background and detailed write-up in our CI/CD Auto-Remediation guide.)
Action marketplace — community-contributed Actions you can install with one click. Bring-your-own prompt store.

We'll publish each of these as they ship.

Get Aurora

Aurora is fully open source under Apache 2.0. Self-host with Docker Compose or Helm. Actions ship in the next tagged release after aurora-oss-1.2.15 (April 15, 2026); the feature is available on main today.

GitHub: github.com/Arvo-AI/aurora
Docs: arvo-ai.github.io/aurora
Compare against alternatives: Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT · Aurora vs traditional incident-management tools

Originally published at arvoai.ca.

CI/CD Auto-Remediation: The Complete Guide for SRE and Platform Teams (2026)

Siddharth Singh — Mon, 11 May 2026 17:32:08 +0000

Key Takeaways

Most teams do not yet auto-remediate inside CI/CD. Per JetBrains' AI Pulse coverage (April 2026), 78.2% of respondents don't use AI in CI/CD workflows at all — even though AI is now widely used elsewhere in the development lifecycle.

CI/CD auto-remediation is an architectural pattern, not a product category. It combines progressive delivery (canary, blue-green), automated metric-driven rollback, and AI-assisted root-cause-and-fix. Owned components, not a single SKU.

Three layers, four maturity levels. We propose the CI/CD Auto-Remediation Maturity Spectrum (CARM): L0 (manual), L1 (rollback), L2 (rollback + diagnostic), L3 (rollback + diagnostic + remediation), L4 (closed-loop with policy gates).

Open-source stack is mature. Argo Rollouts, Flagger, and metric-driven AnalysisTemplates cover L1–L2 with no AI. AI agents like Aurora extend to L3 with Actions-based remediation.

DORA's bar is real. Top-performing teams keep change failure rate low and recover from failed deployments in under one hour (DORA program guidance). Auto-remediation is how non-elite teams close the gap.

Of the 46+ AI SRE products and dozens of progressive-delivery tools shipping in 2026, only a handful explicitly target the pattern this guide is about. CI/CD auto-remediation is the practice of having your delivery pipeline automatically detect, diagnose, and recover from failure — without paging a human — using a combination of progressive-delivery primitives, metric-driven rollback policies, and (increasingly) AI agents that propose or apply fixes. It is not the same as auto-deploy. It is not the same as canary rollout. It is the closing of the loop between "the pipeline noticed something is wrong" and "the system is back in a good state" — without an engineer in the middle.

This guide is for SRE and platform teams who already run continuous delivery and want to push toward the auto-remediation end of the spectrum. By the end, you should be able to: position your current setup on the CARM maturity spectrum, identify the next concrete step, and pick a credible tool stack to get there.

Why auto-remediation matters in 2026

Three numbers explain the demand.

1. AI is shipping more code, faster. Per JetBrains' AI Pulse coverage on the TeamCity blog (April 2026), AI tools are now used by a large majority of developers in their daily work. The DX 2026 change-failure-rate analysis puts a number on it: with 91% of developers having adopted AI and 20%+ of merged code now AI-authored, code velocity has gone up while quality has gone in the opposite direction. More deployments per day means more chances to break production.

2. The pipeline itself is the new bottleneck. JetBrains' 2025 State of CI/CD survey documents widespread frustration with slow and unreliable CI/CD pipelines as a leading contributor to developer burnout.

3. AI in CI/CD specifically lags adoption. Per JetBrains' AI Pulse coverage (April 2026), 78.2% of respondents don't use AI in CI/CD workflows at all — even though most use AI everywhere else in the development lifecycle. The gap isn't capability; it's trust and integration. AI in IDEs is low-risk; AI in pipelines touches production. Teams want the impact but won't take the blast radius until the architecture is right.

Auto-remediation is the architecture that closes that gap. It bounds the agent's reach (only inside the delivery pipeline), gives it deterministic guardrails (progressive delivery and metric-driven rollback), and produces a clear contract: detect, diagnose, fix-or-rollback, log.

What "auto-remediation" actually means

It is easiest to define by negation. Auto-remediation is not:

Auto-deploy. Auto-deploy ships code on merge. Auto-remediation is what happens after a problem appears.
Canary release. Canary is the detection mechanism — it surfaces problems early by shifting traffic gradually. Remediation is the response — rolling back, hotfixing, or reverting.
Self-healing infrastructure. Self-healing systems like Kubernetes restart pods. Auto-remediation includes that plus change-driven failure recovery: rolling back a bad deploy, rolling forward a fix, or pausing the pipeline while a human investigates.
AIOps. AIOps platforms surface alerts and correlations. Auto-remediation closes the loop by acting on them.

The minimum viable definition: a pipeline transition from a degraded state back to a healthy state, triggered by automated detection, executed by automated action, observed and logged for human review.

The CI/CD Auto-Remediation Maturity Spectrum (CARM)

There is no single industry-standard maturity model for auto-remediation. We use the following five-level spectrum — derived from how teams actually evolve.

Level	What happens on failed deploy	Tools that get you here	Trust required
L0 — Manual	Pipeline fails. PagerDuty pages the on-call. Engineer investigates, decides to roll back or hotfix, executes manually.	None — this is the default for most teams.	None — humans do everything.
L1 — Automated Rollback	Pipeline detects health-check failure (error rate, latency, smoke test). Automatically rolls back to the previous version. Pages a human after the fact.	Argo Rollouts, Flagger, Spinnaker	Trust that the health metric reflects user-visible failure.
L2 — Rollback + Diagnostic	L1 plus: AI agent runs an investigation when rollback fires. Produces an RCA before the human starts. Page goes out with context, not blank.	L1 stack + HolmesGPT, Aurora, K8sGPT	Trust that the diagnostic is right enough to bias human reasoning.
L3 — Rollback + Diagnostic + Remediation	L2 plus: agent proposes (or in some cases applies) a fix — a PR, a config change, an alert threshold update. Human reviews and merges.	L2 stack + Aurora Actions, HolmesGPT Operator mode	Trust that the agent's fix is correct, scoped, and reviewable.
L4 — Closed-loop with policy gates	L3 plus: certain low-risk, well-understood fixes are applied automatically inside policy guardrails (alert threshold widening, log-only changes, retry loops). Destructive or high-risk changes still gated.	L3 stack + policy engine (OPA, Casbin, Kyverno) + audit logging	Trust the policy gate definitions more than the agent.

Most teams in 2026 are at L0 or L1. The leap from L1 to L2 is the single highest-leverage move available because it preserves human-in-the-loop decision-making while removing the "blank-page" delay that explains a large share of MTTR. The 2024-2025 DORA research renamed MTTR to Failed Deployment Recovery Time (FDRT) precisely because the metric is more meaningful when scoped to change-driven failures — which is exactly the failure mode auto-remediation targets.

L1: Automated rollback (where most serious teams should be)

This is the foundation. Without L1, AI-assisted remediation at L2-L3 has nowhere to act.

The two Apache 2.0 incumbents are Argo Rollouts and Flagger. Both run in Kubernetes; both implement metric-driven progressive delivery with automated rollback. They differ in invasiveness.

Capability	Argo Rollouts	Flagger
CNCF status	Part of Argo (Graduated, Dec 2022)	Part of Flux (Graduated, Nov 2022)
Resource model	Replaces `Deployment` with `Rollout` CRD	Wraps existing `Deployment`
GitOps pairing	ArgoCD	FluxCD
Analysis	`AnalysisTemplate` querying Prometheus, Datadog, CloudWatch, etc.	Service-mesh metrics + custom webhooks
Automated rollback	Metric-threshold breach → revert	Metric-threshold breach → revert
Traffic shaping	Native + ingress + service mesh	Service-mesh first (Istio, Linkerd, App Mesh)
Invasiveness	Higher (changes resource type)	Lower (transparent wrapper)
Webhooks for custom logic	`Experiment` resource + analysis runs	Pre-/post-/during-rollout hooks

Pick Argo Rollouts if you already use ArgoCD and want explicit per-step canary control. Pick Flagger if you use a service mesh and want progressive delivery to be transparent to existing manifests.

For non-Kubernetes pipelines, equivalent capability lives in Spinnaker (multi-cloud, mature), Harness (commercial), and feature-flag platforms like LaunchDarkly (when "rollback" can be a flag flip).

A minimal Argo Rollouts AnalysisTemplate for HTTP error rate, simplified from the official docs:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 30s
      successCondition: result[0] <= 0.01
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[1m]))
            / sum(rate(http_requests_total{service="{{args.service-name}}"}[1m]))

Three failed 30-second windows → rollback. This is L1 in 30 lines of YAML.

L2: Rollback + automated diagnostic

L1 gets you out of an outage fast. It does not tell you why the deploy failed. The human gets paged with a rollback notification and starts from zero.

L2 fills that gap with an AI agent that runs when rollback fires. The agent queries the cluster state, the application logs, the rollout metrics, and the changed code — and produces an RCA before the human starts typing.

Three credible open-source options exist as of 2026 (compared in detail in our Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT guide):

K8sGPT — rule-based scanner with LLM explanations. Best for low-blast-radius first deployment; explains why a resource is unhealthy.
HolmesGPT — ReAct-loop AI agent (CNCF Sandbox). 30+ observability integrations. Read-only by default. Strong fit for cluster-scoped investigation.
Aurora — LangGraph supervisor agent. Multi-cloud (AWS / Azure / GCP / OVH / Scaleway). Generates postmortems. Opens remediation PRs with human approval.

Wiring up L2 is straightforward: configure your AI SRE's webhook to receive the rollback event (Argo Rollouts emits Kubernetes events; you can route them via Argo Notifications to the agent). The agent investigates and posts results to the on-call Slack channel before the human acknowledges the page.

L3: Diagnostic + agent-proposed remediation

L3 is where AI starts proposing fixes, not just diagnosis. The pattern that works:

Pipeline fails → automated rollback (L1).
Agent investigates → RCA produced (L2).
Agent proposes a fix as a pull request, with the RCA as the PR description, the diff scoped to one file, and tests where possible.
Human reviews PR. If correct, merges. If wrong, comments and rejects.

This works because the pull request is the natural human-review surface. The agent doesn't touch production directly; it touches the repository, which already has CI, code review, and a merge gate.

Aurora Actions is built precisely for this pattern. A post-incident-completion Action with a prompt like "Open a PR widening alert thresholds for the three noisiest alerts in this incident" converts the human follow-up step into automated PR creation. The human review surface stays exactly the same as for human-authored PRs.

The HolmesGPT equivalent ships as "Operator mode" — the agent can write to GitHub when explicitly enabled.

L4: Closed-loop with policy gates

L4 is the contentious one. It involves the agent making changes without human approval — but only inside a tightly scoped policy.

The pattern:

A policy engine (Open Policy Agent, Kyverno, Casbin) defines which classes of remediation can run automatically.
The agent proposes a fix. The policy engine evaluates whether the fix matches a permitted class.
If yes → apply automatically with audit logging. If no → route to L3 (PR for human review).

Permitted classes that are usually safe at L4: widening an alert threshold by less than 2x, restarting a pod, scaling a deployment within preset bounds, adding a retry loop to a network call, suppressing a noisy log line.

Permitted classes that are usually not safe at L4: any data-plane change, any production traffic routing change, any secret or RBAC change, any change touching the policy engine itself.

The reason L4 is contentious is that the policy gate is now a high-value target. An attacker who can broaden the policy can broaden the agent's blast radius. The same threat model we walk through in our AI Agent kubectl Safety guide applies, plus an additional layer: the policy engine must be operated with the same rigor as the orchestration plane itself.

Almost no production teams in 2026 run pure L4. The credible deployments are L3 with hardcoded L4 exceptions for two or three well-understood remediation classes. That's where to aim.

Common pitfalls

A short list of failure modes we have seen — in our own work and in customer deployments.

Auto-remediating into a worse state. The classic failure is auto-scaling a service to handle elevated error rates that are themselves caused by a downstream dependency. The service scales, hammers the dependency harder, and the dependency collapses. Fix: never auto-remediate without dependency-graph awareness. Aurora uses Memgraph for this; HolmesGPT uses its toolset structure; pure-L1 stacks should require manual escalation when the failure crosses service boundaries.
Trusting the AnalysisTemplate metric too much. A 1% error rate threshold on a P99-tail service is meaningless if your real failure mode is request-stalled-not-failed. Fix: model what user-visible failure actually looks like, not what the cleanest Prometheus query produces.
Letting the agent run unbounded retries. AI agents that hit a "this didn't work" signal will often retry — sometimes thousands of times — burning tokens and triggering downstream rate limits. Fix: cap the agent's tool-call budget. Aurora's executor enforces this by default; verify your agent does the same.
Skipping the post-mortem. Auto-remediation that "just worked" without a clear human review of what happened is a slow path to brittleness. Every auto-remediation event should produce a postmortem the on-call reads.
Conflating auto-remediation with "self-healing infra". Kubernetes pod restarts are not auto-remediation. They are a runtime affordance. Auto-remediation is the response to a change-driven failure — the deploy, the config push, the schema migration. Keep the categories separate.

A pragmatic 90-day path to auto-remediation

For a team currently at L0 or L1.

Days 1–14: instrument and detect

Pick your three highest-traffic services. Add or harden:

Synthetic checks that exercise the user-visible path.
One Prometheus error-rate metric per service with a clear threshold.
A canary or blue-green rollout primitive (Argo Rollouts or Flagger).

Goal at end of week 2: a controlled bad deploy auto-rolls back without human intervention.

Days 15–45: wire in the agent

Deploy one of Aurora, HolmesGPT, or K8sGPT in read-only mode. Configure rollback events to webhook the agent. Have it post an RCA to your incident channel within five minutes of rollback.

Goal at end of week 6: every rollback comes with a written diagnostic before the human acknowledges.

Days 46–75: add agent-proposed remediation

Enable PR-creation for the agent (Aurora Actions on-incident-completion trigger, or HolmesGPT Operator mode). Constrain initial scope to one repo and one class of fix (alert thresholds, retry loops, log suppression). Review every PR for the first two weeks.

Goal at end of week 11: agent opens correct PRs in 70%+ of fired rollbacks. False-positive PRs are caught at code review.

Days 76–90: policy-gate one fix class for L4

Pick the safest class — usually alert threshold widening when an alert fired more than N times in M hours with mean TTA above some bound. Define an OPA / Kyverno policy that permits only that class. Wire the agent to apply directly when the policy permits, raise a PR otherwise.

Goal at end of week 12: one L4 lane open for one fix class with full audit trail.

This is the conservative path. Aggressive teams have moved faster, but we have not seen anyone skip steps successfully.

The DORA reality check

The DORA program's published guidance is blunt about what good looks like. Historical State of DevOps Reports have consistently shown the same shape of distribution:

Change Failure Rate: top performers maintain low single-digit percentages; lower performers see substantially higher rates.
Failed Deployment Recovery Time (FDRT): top performers recover in under one hour; lower performers can take days to weeks.

DORA's research has also consistently found that speed and stability reinforce each other rather than trade off — the fastest teams are also the most stable, per DORA's history of metrics and successive State of DevOps Reports. Auto-remediation is one of the small number of capabilities that moves teams across these tiers without requiring deeper organizational change. The L1→L2 jump alone reduces FDRT meaningfully because the human is no longer reconstructing context — the agent has already done it.

Where this is heading

Two predictions, each with a reasonable evidence base.

1. The L2 → L3 transition becomes table-stakes within 18 months. AI-authored PRs from agents are already merging in production at multiple companies in our network. Once the review surface is the same as for human-authored PRs (which it already is via GitHub / Bitbucket / GitLab), there is no organizational reason not to use them.

2. L4 stays narrow. The threat surface of agent-applied changes is genuinely scary, and the per-incident savings of going from L3 to L4 are smaller than the savings from L1 to L2. Expect L4 to be the place where one or two well-understood fix classes get automated, while everything else stays L3.

The teams who win in 2026-2027 are the ones who get to credible L3 first.

Where Aurora fits

Aurora is the AI agent layer of an auto-remediation stack — it covers L2 (investigation), L3 (PR-based remediation via Aurora Actions), and the agent half of L4 (policy-gated remediation). It does not replace Argo Rollouts or Flagger at L1; those remain the foundation. Aurora is the difference between rolling back blind and rolling back with a written RCA and a draft PR.

GitHub: github.com/Arvo-AI/aurora
Docs: arvo-ai.github.io/aurora
Aurora Actions launch: Aurora Actions: User-Defined Background Automations
OSS comparison: Aurora vs HolmesGPT vs K8sGPT
Safety architecture: AI Agent kubectl Safety

Originally published at arvoai.ca.

AI Agent kubectl Safety: Sandboxed Execution for Production

Siddharth Singh — Wed, 06 May 2026 20:44:12 +0000

Key Takeaways

Giving an AI agent kubectl access is an architecture decision, not a permission flag. Per-permission gates fail under prompt injection.

OWASP ranks "Excessive Agency" as LLM06 in the 2025 Top 10 for LLM Applications and "Tool Misuse and Exploitation" as ASI02 in the 2026 Top 10 for Agentic Applications.

The Kubernetes ecosystem already has an answer: k8s-sigs/agent-sandbox provides a declarative API for isolated agent runtimes using gVisor or Kata Containers.

Real precedent exists. EchoLeak (CVE-2025-32711), CVSS 9.3, was the first publicly documented zero-click prompt-injection data exfiltration in a production LLM system. The kubectl analogue would be cluster-wide.

Aurora runs every kubectl command in a pod-isolated process via its terminal_run primitive, with an environment-variable allowlist that strips secrets, signature-matcher and LLM-judge guardrails, and per-invocation cloud credentials.

Of the 46+ products marketed as "AI SRE" in 2026, only a handful publicly document their kubectl execution architecture — and the gap between vendors that handle this well and vendors that handle it badly is the single largest unspoken risk in the category. AI agent kubectl safety is the architectural discipline of letting an AI agent run kubectl (or any cloud CLI) against production without inheriting cluster-wide blast radius if the agent is compromised. It is not the same as RBAC scoping, and it is not the same as a human approval prompt — both are necessary but neither is sufficient on its own.

When OWASP published its 2025 Top 10 for LLM Applications, it ranked Prompt Injection (LLM01) as the top risk and Excessive Agency (LLM06) as one of the most consequential — defining it across three root causes: excessive functionality, excessive permissions, and excessive autonomy. In December 2025, OWASP followed up with a dedicated Top 10 for Agentic Applications that names Tool Misuse and Exploitation (ASI02) and Identity and Privilege Abuse (ASI03) as primary attack surfaces.

Translation: if you give an AI agent the ability to run kubectl, aws, or gcloud commands against production, you have a security architecture problem — not a permissions problem. This guide walks through the threat model, the emerging Kubernetes sandboxing standard, and how to evaluate any AI SRE on its kubectl safety.

What can go wrong when AI agents run kubectl?

Any LLM-driven agent that executes commands inherits the security properties of the LLM, the harness, and the runtime. Three real-world precedents illustrate the failure modes:

EchoLeak (CVE-2025-32711) — Microsoft 365 Copilot, CVSS 9.3 critical, patched in June 2025. Discovered by Aim Security, it was the first publicly documented zero-click indirect prompt-injection data exfiltration in a production LLM system. A crafted email sat in Outlook; when the user later asked Copilot for an unrelated summary, the email's hidden instructions fired and exfiltrated SharePoint, OneDrive, and Teams data. Research paper: arXiv:2509.10540.
MITRE ATLAS prompt-injection techniques — MITRE ATLAS catalogues real-world adversary techniques against AI systems, including indirect prompt injection that turns an LLM with tool access into an attacker-controlled execution surface. The framework specifically documents techniques for exfiltration via AI agent tool invocation.
Agent Session Smuggling — Palo Alto Unit 42 (November 2025) demonstrated rogue agents exploiting trust in the Agent-to-Agent (A2A) protocol with multi-turn manipulation. Documented in OWASP's Agentic Top 10.

None of these specifically targeted kubectl-running agents in production — but the class is the same and the blast radius would be larger. An agent that can run kubectl delete is one prompt-injection payload away from a cluster-wide outage.

The Four Attack Surfaces of Agentic kubectl

Most teams think of kubectl agent safety as a single problem ("can the agent be tricked?"). It's actually four distinct attack surfaces, each requiring its own mitigation.

Surface	Failure mode	Why permission-scoping alone fails	Mitigation
1. Prompt injection	Hidden instructions in logs, alerts, runbooks, or chat coerce the agent	Compromised agent acts within its granted permissions, which is exactly what permission-scoping permits	Sandboxed runtime; never trust LLM output derived from data the LLM read
2. Credential leakage	Executed command reads `AWS_SECRET_ACCESS_KEY`, `VAULT_TOKEN`, `KUBECONFIG` from inherited env	Permissions live on credentials; if the credential leaks, the permission set leaks with it	Per-invocation short-lived credentials (STS, Service Principal); explicit env allowlist that strips secrets
3. Blast radius escalation	Legitimate command runs against wrong namespace, region, or cluster	Permissions don't model "right action, wrong target"	Default read-only; dependency-graph awareness; human approval for destructive writes
4. Audit trail gaps	Logs capture commands without the agent's reasoning	Permission systems audit "who ran what," not "why"	Per-investigation transcripts that link reasoning → tool calls → outputs

Attack Surface 1: Prompt injection

The agent reads a log line, alert payload, runbook, or chat message that contains hidden instructions. The LLM cannot reliably distinguish data from instructions in the same channel — this is the fundamental property OWASP's LLM01 captures. Even frontier models do not eliminate it. Anthropic has publicly stated that "no browser agent is immune to prompt injection" and publishes defense benchmarks showing measurable but imperfect attack-prevention rates across computer-use, bash tool use, and MCP workflows. The implication for kubectl-running agents is clear: the LLM is not the security boundary. The runtime is.

Mitigation: never trust LLM output that originates from data the LLM also read. Sandbox the execution layer so even a successful injection has limited blast radius.

Attack Surface 2: Credential leakage

If the agent runs commands with credentials inherited from the host process environment (AWS_SECRET_ACCESS_KEY, KUBECONFIG, VAULT_TOKEN), a successful command-injection or shell escape exposes everything the agent process has access to. Long-lived static credentials make this catastrophic.

Mitigation: per-invocation credential scoping. AWS STS AssumeRole, Azure Service Principal sessions, GCP short-lived tokens. Strip everything else from the child process environment with an explicit allowlist.

Attack Surface 3: Blast radius escalation

Even legitimate, non-injected commands can have outsized effects. kubectl delete pod on the wrong namespace. aws ec2 terminate-instances against a misidentified region. The agent doesn't need to be compromised — it just needs to be wrong.

Mitigation: read-only by default, write actions behind explicit human approval, and dependency-graph awareness so the agent can compute blast radius before acting.

Attack Surface 4: Audit trail gaps

When an investigation runs across 20+ tool invocations, traditional audit systems (CloudTrail, Kubernetes audit logs) record what was run but not why. A reviewer six months later cannot tell whether a kubectl scale was a legitimate response to a load spike or an injected instruction.

Mitigation: structured per-investigation transcripts that capture agent reasoning alongside tool calls. The right log isn't "kubectl was run" — it's "in response to alert X, the agent hypothesized Y, ran kubectl Z, and observed W."

Why "human approval" alone is not enough

The most common safety story in the AI SRE space is "the agent suggests; humans approve." That is necessary but not sufficient.

The problem with approval gates as the only line of defense:

Decision fatigue. An agent that handles 50 alerts a week generates dozens of approval prompts. Humans rubber-stamp.
Approval ≠ understanding. Engineers approve commands they don't fully understand because the agent's reasoning sounds plausible.
Injected intent looks legitimate. A prompt-injection payload can produce a recommendation that reads exactly like a normal RCA. The approver has no signal that the underlying instruction came from an attacker.

Approval gates are critical, but they need to sit on top of an already-sandboxed runtime — not be the only protection.

Permission scoping vs sandboxed execution: what's the difference?

These two terms get conflated. They aren't the same thing.

Permission scoping restricts what an agent's identity can do. RBAC roles, IAM policies, kubeconfig contexts. It's necessary, but it operates at the cluster-API layer — meaning a successful prompt injection can still use every permission the agent has.

Sandboxed execution isolates the runtime in which commands execute. If the agent's process is compromised, the sandbox limits what the compromised process can do regardless of the credentials it holds. The compromised process can't read other pods' files, can't reach other nodes, can't escalate to the host kernel.

The defensible architecture combines both: tight permission scoping (small RBAC role, short-lived credentials) + runtime isolation (sandboxed execution).

How sandboxed kubectl actually works

The Kubernetes ecosystem standardized on this pattern in 2025–2026.

k8s-sigs/agent-sandbox

k8s-sigs/agent-sandbox is a formal Kubernetes SIG Apps subproject that launched at KubeCon Atlanta in November 2025. It provides a declarative Kubernetes API for "isolated, stateful, singleton workloads" — built specifically for AI agent runtimes that may execute untrusted, LLM-generated code.

Core CRDs:

Sandbox — an isolated pod-equivalent with stronger boundaries
SandboxTemplate — reusable configuration
SandboxClaim — request a sandbox for a workload
SandboxWarmPool — pre-created sandboxes that bring cold-start under one second

The Kubernetes blog post from March 2026 makes the architectural claim explicit: "Isolation achieved via runtime-level sandboxing (gVisor/Kata), not just container-level namespaces."

gVisor

gVisor is a Google-maintained user-space application kernel that provides kernel-level isolation without full virtualization. Architecture: Sentry (a kernel emulator written in Go) intercepts roughly 200 Linux syscalls; Gofer brokers filesystem access over 9P. The OCI runtime is runsc, drop-in compatible with runc.

gVisor runs in production at Google for App Engine standard, Cloud Functions, Cloud Run, and Cloud ML Engine. GKE Sandbox productizes it for GKE node pools. It is one of two named isolation backends in agent-sandbox (the other being Kata Containers, which uses lightweight VMs).

Why this matters for AI SRE

An AI SRE that runs kubectl against production is exactly the kind of workload agent-sandbox was built for. It executes LLM-generated commands. It needs file system isolation, syscall isolation, and per-invocation credential scoping. It benefits enormously from a warm pool that reduces cold-start latency.

If you are evaluating an AI SRE in 2026, this is one of the right questions to ask: what isolation backend does the agent use when it executes commands?

How Aurora's pod-isolated execution works

Aurora's approach predates agent-sandbox and follows the same architectural principles.

When Aurora's agent runs a kubectl, aws, az, or gcloud command, it doesn't use subprocess.run() directly. It uses an internal primitive called terminal_run, defined in server/utils/terminal/terminal_run.py. The module's docstring is explicit:

Drop-in replacement for subprocess.run() that executes in terminal pods. This module provides a terminal_run() function that mimics subprocess.run() API but executes commands in isolated terminal pods via kubectl exec. Safety guardrails (signature matcher + LLM judge) run automatically unless the caller passes trusted=True for known-safe internal operations.

Three properties matter:

1. Pod-isolated execution. When the ENABLE_POD_ISOLATION flag is set (the default in Kubernetes deployments), every external command runs inside a separate terminal pod via kubectl exec. The agent's own process never executes the command directly. A successful command-injection in the agent's reasoning loop does not give an attacker access to the agent host.

2. Two-stage safety guardrails. Before any non-trusted command runs, two checks fire automatically: a deterministic signature matcher that rejects known-dangerous patterns, and an LLM judge that evaluates the proposed command against the investigation context. The trusted=True flag bypasses both — used only for known-safe internal operations like configured connector calls.

3. Sanitized environment allowlist. Aurora's terminal_exec_tool module defines an explicit _SAFE_ENV_KEYS set: PATH, HOME, USER, SHELL, TERM, LANG, TMPDIR, SSL_CERT_FILE, plus ENABLE_POD_ISOLATION itself. Everything else — including VAULT_TOKEN, DATABASE_URL, SECRET_KEY, and any cloud credentials — is stripped from the child process environment. A compromised command cannot read the agent's secrets via env.

Cloud credentials are handled separately. Aurora calls generate_contextual_access_token and generate_azure_access_token per invocation. AWS uses STS AssumeRole via cross-account roles (aurora-cross-account-role.yaml) — short-lived credentials, not long-lived access keys. Azure uses Service Principal sessions. GCP uses OAuth-derived tokens.

For agents that need to reach customer Kubernetes clusters Aurora can't access directly, a separate kubectl-agent binary deploys via Helm into the customer's cluster and connects outbound over WebSocket. No inbound network access required, no kubeconfig sharing, no static credentials at rest.

How to evaluate an AI SRE's kubectl safety model

Eight questions to ask any AI SRE vendor or open-source project before enabling production access:

Where does the command actually execute? Same process as the agent? Same host? Separate container? Sandboxed runtime (gVisor/Kata)?
What credentials does the command inherit from the host environment? Specifically: can the executed command read your agent's vault token, database URL, or other host secrets?
Are credentials short-lived or static? STS / Service Principal sessions, or long-lived access keys?
Is the default read-only? What flag, configuration, or RBAC role enables write access?
What happens between "agent decides to run X" and "X runs"? Is there a deterministic policy check? An LLM judge? A human approval prompt? All three?
Are destructive actions specifically gated? What's the definition of "destructive" — vendor-defined or operator-configurable?
What does the audit trail capture? Just the commands, or the agent's reasoning + the commands together?
What's the blast radius of a single successful prompt injection? Walk through the worst case explicitly with the vendor.

If a vendor can't answer these clearly, the architecture isn't ready for production write access.

Open questions in 2026

This is a young problem space. Several questions are not yet resolved:

Standardization. k8s-sigs/agent-sandbox is the leading candidate for a standard, but Knative Sandbox, container-level approaches, and microVM-based runtimes (Firecracker) are all in play.
Multi-cloud isolation. Sandboxing a Kubernetes pod is a solved problem. Sandboxing a process that calls aws, az, gcloud across cloud APIs from a single agent is harder — the credentials and trust boundaries change per provider.
Approval UX at scale. Engineers can't approve 200 actions per week. The right UI for batch approval, policy-based pre-approval, and rollback-only autonomy is still being figured out.

Expect significant movement on all three through 2026 and into 2027.

Aurora's approach in summary

If you operate an AI SRE in production, the safety questions are non-negotiable. Aurora's answer is: pod-isolated execution by default, deterministic + LLM-judge guardrails before any non-trusted command, environment-variable allowlist that strips secrets, per-invocation cloud credentials via STS/Service Principal/short-lived tokens, and human approval for destructive write operations. The full architecture is open source under Apache 2.0 — auditable in the Aurora repository.

For background on the agent and tool model, see the complete guide to AI SRE, the open-source AI SRE comparison, or the explainer on agentic incident management.

Originally published at arvoai.ca.

Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT (2026)

Siddharth Singh — Wed, 06 May 2026 20:38:19 +0000

Key Takeaways

Three credible open-source AI SREs exist in 2026: Aurora (Arvo AI), HolmesGPT (Robusta + Microsoft, CNCF Sandbox), and K8sGPT (CNCF Sandbox). All three are Apache 2.0.

Only one is a true multi-step agent. HolmesGPT runs an iterative ReAct loop. K8sGPT is a rule-based scanner that uses an LLM only to explain findings. Aurora is a multi-step LangGraph agent with cross-cloud execution.

Only Aurora handles multi-cloud out of the box (AWS, Azure, GCP, OVH, Scaleway, plus Kubernetes). HolmesGPT covers Kubernetes plus 30+ observability integrations. K8sGPT is Kubernetes-only.

Only Aurora generates remediation pull requests. HolmesGPT can open PRs with suggested fixes in Operator mode; K8sGPT is strictly read-only with no write actions.

All three support BYO LLM, including local inference via Ollama for air-gapped deployments — the differentiator over commercial AI SREs.

Of the 46+ companies offering "AI SRE" products in 2026, only a handful are open source — and only three are credible enough to deploy in production: Aurora, HolmesGPT, and K8sGPT. An open-source AI SRE is an AI agent that performs incident investigation, root cause analysis, and (sometimes) remediation under a permissive license that allows self-hosting, source-code audit, and modification. They get lumped together in marketing, but architecturally these three are different products solving different parts of the incident response problem.

This guide compares them on the things that actually matter: agent architecture, execution model, integration scope, and where you can deploy them. By the end, you should be able to pick the right one for your stack — or know whether you need all three.

What is an open-source AI SRE?

An open-source AI SRE is an AI agent that performs site reliability engineering work — alert triage, incident investigation, root cause analysis, remediation — under a permissive license that allows self-hosting, source-code audit, and modification. Three properties are non-negotiable:

License: Apache 2.0, MIT, or equivalent. Source-available licenses (BSL, SSPL) do not count for most production teams.
Self-hostable: runs entirely inside your environment without phoning home to a vendor.
LLM-driven: uses large language models, not just static rules or regex. (This is what separates "AI SRE" from older AIOps tools.)

The reason this category matters: incident data is some of the most sensitive telemetry an organization produces. Self-hosted, audit-able AI is the only model that works for regulated industries, air-gapped environments, or any team that doesn't want production telemetry leaving their perimeter.

Why open source matters for AI SRE

Three reasons buyers in 2026 are explicitly asking for open-source AI SRE:

Data sovereignty. Incident telemetry includes log lines, configuration values, deployment IDs, and sometimes payloads. SaaS AI SREs send all of it to their backend and to a third-party LLM. Self-hosted means it stays in your VPC.
Audit transparency. Regulators and security teams want to know exactly what the agent does on production systems. Source code answers that question; vendor marketing does not.
Cost predictability. Per-user or per-incident pricing can balloon quickly. Open-source costs scale with infrastructure and LLM tokens — and Ollama-local inference can flatten the LLM bill entirely.

The trade-off is real: you operate the system yourself. For teams already operating Kubernetes and observability stacks, that's marginal effort. For teams without that operational maturity, a commercial AI SRE is often the right call.

How the three compare

This is the only table you need. Verified from each project's GitHub repo, official docs, and source as of May 2026.

Dimension	Aurora	HolmesGPT	K8sGPT
License	Apache 2.0	Apache 2.0	Apache 2.0
GitHub stars	201	2,366	7,737
Latest release	v1.1.1 (Mar 2026)	0.26.0 (Apr 2026)	v0.4.32 (Apr 2026)
CNCF status	Independent	Sandbox (Oct 2025)	Sandbox
Built by	Arvo AI	Robusta + Microsoft	k8sgpt-ai community
Agent architecture	LangGraph supervisor + sub-agents	ReAct loop (`ToolCallingLLM`)	Rule-based scanner + LLM explainer
Multi-step reasoning	Yes	Yes	No (single-shot per analyzer)
Cloud providers	AWS, Azure, GCP, OVH, Scaleway	Kubernetes + AWS via MCP	Kubernetes only
Kubernetes execution	`kubectl` in sandboxed pods	Read-only `kubectl get`/`describe`	Read-only via Kube API
Other integrations	22+ (PagerDuty, Datadog, Grafana, Slack, Confluence, Bitbucket, Jenkins, etc.)	30+ toolsets (Prometheus, Grafana, Datadog, Loki, Jira, etc.)	None — Kubernetes-only by design
Knowledge base / RAG	Weaviate vector search over runbooks + postmortems	Yes (via toolsets)	No
Dependency graph	Memgraph (cross-cloud blast radius)	No	No
Postmortem generation	Yes, exports to Confluence	Investigation reports only	No
Pull request remediation	GitHub + Bitbucket with human approval gate	GitHub PRs in Operator mode	None — strictly read-only
MCP server	Yes (340+ endpoints, 6 named tools)	Yes (consumes MCP servers)	No
LLM providers	OpenAI, Anthropic, Google, Vertex, OpenRouter, Ollama	OpenAI, Anthropic, Azure OpenAI, Bedrock, Gemini, Vertex, Ollama	OpenAI, Azure, Cohere, Bedrock, SageMaker, Gemini, Vertex, HuggingFace, WatsonX, LocalAI, Ollama
Air-gapped support	Yes (Ollama + image tarballs)	Yes (Ollama)	Yes (LocalAI / Ollama)
Deployment	Docker Compose or Helm	Binary, API server, K8s Operator, Python SDK	Go binary, K8s operator

The OSS AI SRE Maturity Spectrum

A useful way to position these tools is on a four-level spectrum of agent capability. Each level is strictly more capable than the one below — and each requires more architectural work to deploy safely.

Level	What the agent does	Tools at this level
L1 — Diagnostic Explainer	Reads system state, finds anomalies via deterministic rules, uses an LLM only to explain findings in natural language. No multi-step reasoning. Strictly read-only.	K8sGPT
L2 — Read-Only Investigator	Runs an iterative ReAct loop. Picks tools dynamically. Investigates across multiple data sources (metrics, logs, traces, K8s state). Read-only by design.	HolmesGPT
L3 — Investigation + Suggestion	Everything in L2, plus opens pull requests with suggested fixes. Humans review and merge. No autonomous writes to infrastructure.	HolmesGPT (Operator mode), Aurora
L4 — Investigation + Approved Remediation	Everything in L3, plus can execute approved remediation actions (rollbacks, restarts, scale changes) inside guardrails — typically a sandboxed runtime with explicit human approval for destructive operations.	Aurora (with Bitbucket connector's human approval gate for destructive actions)

No open-source tool today operates as a fully autonomous L5 (closed-loop remediation without human approval) — and that's by design. Most serious teams want explicit gates before agents touch production.

Aurora vs HolmesGPT — which should you choose?

Aurora and HolmesGPT are the two genuinely agentic options. The choice depends on your blast radius.

Pick HolmesGPT when:

Your stack is heavily Kubernetes + Prometheus + Grafana and your incidents live there.
You want a tool that already integrates with 30+ observability sources, including Loki, AlertManager, NewRelic, Datadog APM, OpsGenie, and Slack.
You value CNCF governance and a steep ecosystem velocity.
You don't need cross-cloud (AWS APIs, Azure resources, GCP services) reasoning out of the box.

Pick Aurora when:

You operate across multiple clouds (AWS + Azure, GCP + AWS, etc.) and need an agent that can correlate incidents across providers.
You want auto-generated postmortems exported to Confluence.
You want the agent to draft remediation PRs against your codebase.
You need a graph-based blast radius model (Memgraph) for dependency analysis.
You want an MCP server so your IDE assistants (Cursor, Claude Desktop, Windsurf) can query live incident state.

In practice, some teams run both: HolmesGPT for in-cluster Kubernetes triage, Aurora for cross-cloud investigation and postmortem generation.

Aurora vs K8sGPT — which should you choose?

This is closer to "which tool category do you need?" than a head-to-head.

Pick K8sGPT when:

You want the absolute simplest entry point to AI for Kubernetes — a single Go binary you can install with Homebrew and run as k8sgpt analyze --explain.
Your needs stop at "explain why this pod is broken" rather than multi-step incident investigation.
You want the maturity of a 7.7k-star CNCF Sandbox project with rule-based analyzers that won't hallucinate causes (because they are deterministic before the LLM ever sees them).

Pick Aurora when:

You need agentic investigation, not just diagnostic explanation.
You operate beyond Kubernetes — cloud APIs, Terraform, monitoring tools, runbooks.
You want auto-generated postmortems and remediation PRs.

These two are complements, not competitors. Many teams run K8sGPT as a lightweight first-line scanner and Aurora (or HolmesGPT) for full incident investigation.

HolmesGPT vs K8sGPT — head-to-head

Despite both being CNCF Sandbox projects targeting Kubernetes, these are different categories.

Aspect	HolmesGPT	K8sGPT
What it is	Multi-step AI agent	Rule-based scanner with LLM explanations
When it shines	Investigating an alert end-to-end across signals	Diagnosing why a specific resource is unhealthy
Latency	Seconds to minutes (multi-step)	Sub-second per analyzer
LLM cost	Higher (multiple calls per investigation)	Lower (one explanation per finding)
Hallucination risk	Higher (agent reasons across signals)	Lower (deterministic before LLM)
Best fit	On-call engineers handling alerts	Platform teams running periodic cluster audits

K8sGPT's anonymization feature (which masks resource names and labels before sending to the LLM) is a meaningful privacy advantage that HolmesGPT does not match.

When NOT to use open-source AI SRE

Honest take: open-source AI SRE is the right answer for most engineering-led, security-conscious teams. It's the wrong answer when:

You don't have the operational capacity to run another stateful service in production.
You want vendor support with SLAs and a phone number to call at 3 AM.
Your team is small enough that the LLM-API bill of an investigation-heavy agent will exceed the per-seat price of a SaaS AI SRE.
You need certifications (SOC2, ISO 27001) at the AI-vendor layer rather than at the cloud-provider layer.

How to pilot an open-source AI SRE in your team

A six-step, low-risk pilot for any of the three tools:

Pick one cluster and one observability source. Don't try to cover everything at once.
Install in read-only mode first. All three tools default to read-only — keep it that way for the first two weeks.
Connect one alert source. PagerDuty, Datadog, or Grafana — pick the one that's already firing real alerts.
Run for two weeks alongside human on-call. Compare the agent's RCA conclusions to what your engineers determined. Track accuracy and time-to-RCA.
Feed it your historical context. Aurora and HolmesGPT both support runbook + postmortem ingestion. Agents become dramatically more useful with organizational memory.
Expand carefully. Add more clusters, then enable remediation suggestions, then (only after trust) approved automated actions for specific low-risk patterns.

Getting started with Aurora

Aurora is the multi-cloud, multi-tool option among open-source AI SREs. To run it:

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init
make prod-prebuilt

Aurora supports any LLM provider — OpenAI, Anthropic, Google, OpenRouter, or local models via Ollama for air-gapped deployments.

For the technical side of running an agent that executes kubectl against production, see the companion piece on AI agent kubectl safety and sandboxed execution.

Originally published at arvoai.ca.

AI SRE: The Complete Guide for Engineering Teams in 2026

Siddharth Singh — Fri, 24 Apr 2026 21:37:36 +0000

Key Takeaway: An AI SRE (AI Site Reliability Engineer) is an autonomous AI agent that triages alerts, investigates incidents, performs root cause analysis, and generates postmortems without step-by-step human direction. Gartner projects that by 2029, 70% of enterprises will deploy agentic AI agents to operate their IT infrastructure, up from less than 5% in 2025. This guide explains what an AI SRE actually does, how it differs from AIOps and traditional SRE, and how to evaluate the commercial and open-source tools available in 2026.

An AI SRE is an autonomous software agent that performs site reliability engineering work — alert triage, incident investigation, root cause analysis, postmortem generation, and in some cases guided remediation — using large language models and production tooling to operate with minimal human direction. Unlike chatbots or copilots, an AI SRE decides what to investigate, which systems to query, and how to synthesize findings into actionable outcomes.

The category crystallized in 2026. Microsoft made Azure SRE Agent generally available on March 10, 2026. Komodor reports being named a Representative Vendor in Gartner's 2026 Market Guide for AI SRE Tooling. Open-source options like Aurora, K8sGPT, and HolmesGPT emerged as credible alternatives to commercial platforms.

What is an AI SRE?

An AI SRE (AI Site Reliability Engineer) is an autonomous AI agent that performs SRE work — alert triage, incident investigation, root cause analysis, postmortem generation, and guided remediation — without requiring step-by-step human direction.

Three characteristics distinguish an AI SRE from earlier generations of operations tooling:

Autonomy. An AI SRE decides which tools to use and what data to gather. It is not a runbook that executes predefined steps; it is an agent that plans a multi-step investigation based on the specific alert.
Access to production. An AI SRE reads real infrastructure signals — metrics, logs, traces, Kubernetes events, cloud API responses, deployment history — rather than working only from summaries.
Synthesis. An AI SRE produces structured outputs: a root cause analysis, a timeline, a blast radius assessment, a postmortem, or a remediation PR. It does not stop at "the error rate is elevated."

Why AI SRE Emerged in 2026

The conditions that made AI SRE viable came together between 2024 and 2026:

Alert volume outpaced human capacity. PagerDuty's State of Digital Operations data shows the average on-call engineer receives roughly 50 alerts per week, with only 2–5% requiring real human intervention. A 2024 Catchpoint study cited by OneUptime found that 70% of SRE teams list alert fatigue as a top-three operational concern.

Multi-cloud became the default. According to the Flexera 2025 State of the Cloud Report, organizations use an average of 2.4 public cloud providers, and 70% operate a hybrid cloud strategy. Correlating incidents across AWS, Azure, and GCP by hand is increasingly impractical.

Change velocity rose faster than reliability tooling. The 2025 DORA State of AI-Assisted Software Development report found that incidents per PR increased 242.7% as AI coding assistants accelerated delivery — without a matching improvement in incident response capacity.

LLM tool use matured. Agent frameworks like LangGraph made it practical to give a language model 30+ tools and let it chain them into a coherent investigation. Claude, GPT-5, and Gemini 2.5+ reached enough reliability at structured tool use to be trusted with read-only production access.

Gartner codified the category. In Predicts 2026: AI Agents Will Transform IT Infrastructure and Operations, Gartner projected that by 2029, 70% of enterprises will deploy agentic AI to operate IT infrastructure, up from less than 5% in 2025.

How Does an AI SRE Work?

An AI SRE runs a repeatable loop for every alert it receives:

Alert ingestion. A monitoring tool (PagerDuty, Datadog, Grafana, BigPanda) fires a webhook. The AI SRE receives the payload and begins investigation without waiting for a human to acknowledge the page.
Context gathering. The agent reads the recent state: pod status, metric trends, deployment history, recent configuration changes, related alerts within a time window.
Hypothesis formation. Using the alert semantics plus the gathered context, the agent proposes one or more candidate causes.
Evidence collection. The agent selects from its tool inventory — running kubectl describe, querying metrics, searching a vector knowledge base of past postmortems — to test each hypothesis.
Root cause synthesis. The agent produces a structured RCA: what failed, why, what the blast radius is, which services are affected, whether a recent change likely caused it.
Remediation (optional). Some AI SREs stop at recommendations. Others generate a PR, roll back a deployment, or restart a service — typically behind a human approval gate for destructive actions.
Postmortem generation. The agent assembles a draft postmortem with timeline, contributing factors, impact, and action items, ready for human review and export to Confluence or another docs system.

A trustworthy AI SRE is transparent about this loop — surfacing the evidence it considered, the hypotheses it ruled out, and its confidence in the final answer.

AI SRE vs Traditional SRE vs AIOps

The three categories are often conflated but address different problems.

Aspect	Traditional SRE	AIOps	AI SRE
Primary function	Human engineers manage reliability	Anomaly detection, alert correlation	Autonomous incident investigation and RCA
Investigation	Manual (human reads logs, queries systems)	Suggests related alerts	Agent runs multi-step investigation
Root cause analysis	Hours, depends on engineer's expertise	Correlation hints, not causation	Structured RCA in minutes
Tool use	Engineer runs kubectl, aws CLI, dashboards	Reads pre-ingested telemetry	Dynamically selects from 20–40+ tools
Remediation	Human-driven	Typically suggestions only	Agentic execution, often with approval gates
Knowledge transfer	Runbooks, tribal knowledge	Alert correlation models	RAG over runbooks and past postmortems
Core technology	Humans plus monitoring dashboards	ML models for anomaly detection	LLM agents with tool calling

The short version: AIOps tells you what is anomalous. An AI SRE tells you why it is happening and, increasingly, fixes it. Traditional SRE is the human discipline both categories augment.

What Capabilities Should an AI SRE Have?

Serious AI SREs in 2026 share a consistent capability stack:

Autonomous multi-step investigation

The agent must plan and execute investigations without requiring humans to choose tools or pass data between steps. Simple tool-calling is not enough — the agent needs memory across steps and the ability to revise hypotheses as evidence arrives.

Broad tool access with safe execution

kubectl, aws, az, gcloud, metric queries, log search, deployment history, IaC state. How tools are executed matters: running kubectl on the agent host is a production risk. Aurora, for example, runs CLI commands in sandboxed Kubernetes pods with per-invocation credential scoping, not on the agent host.

Cross-cloud and cross-platform reach

With the Flexera 2025 average at 2.4 public clouds per organization, an AI SRE that works only inside AWS or only inside Kubernetes will miss the majority of real incidents.

Knowledge base retrieval

Past postmortems, runbooks, and docs searchable by the agent via vector search (RAG). The knowledge your senior SRE built up should be available to the agent on day one.

Infrastructure dependency graph

When a database fails, the agent needs to know which services depend on it. Graph databases like Memgraph are a common choice for modeling cross-service and cross-cloud relationships.

Postmortem generation

Structured timeline, contributing factors, blast radius, action items — produced during the investigation, not written manually afterward.

Remediation with guardrails

Generating PRs, rolling back deployments, restarting services. Destructive actions should require human approval. Aurora's Bitbucket connector, added in v1.1.0, requires explicit human approval before agents can write.

LLM flexibility

OpenAI, Anthropic, Google, and local models via Ollama for air-gapped deployments. Vendor lock-in on LLM is a real risk as model quality and pricing evolve rapidly.

The AI SRE Landscape in 2026

Commercial platforms

Azure SRE Agent — Microsoft's first-party agent, generally available since March 10, 2026. Deep Azure integration, adjustable autonomy from "review recommendations" to "fully automated," billed via Azure Agent Units on pay-as-you-go.
Rootly AI SRE — AI layer built on top of a mature incident management platform. Transparent chain-of-thought reasoning. SOC2 since January 2022. Depends on external observability tools for telemetry.
Komodor Klaudia — Kubernetes-specialized AI SRE. Komodor reports Klaudia achieves 95% accuracy across real-world incident scenarios and that Komodor was named a Representative Vendor in Gartner's 2026 Market Guide for AI SRE Tooling.
incident.io AI SRE — Multi-agent AI investigation integrated into an incident response platform, with code fix suggestions.
Traversal — Focused on large distributed systems using causal ML. Traversal reports a 38% MTTR reduction at DigitalOcean. Supports on-prem and bring-your-own model.
Resolve.ai — Pushes toward high-autonomy resolution with guardrails.

Open-source AI SRE options

Aurora — Apache 2.0, self-hosted, multi-cloud (AWS via STS AssumeRole, Azure via Service Principal, GCP, OVH, Scaleway, Kubernetes). LangGraph-orchestrated agents with 30+ tools, Memgraph dependency graph, Weaviate RAG, postmortem export to Confluence, PR generation via GitHub and Bitbucket. Works with any LLM (OpenAI, Anthropic, Google, OpenRouter, Ollama).
K8sGPT — Open-source CLI for scanning Kubernetes clusters and explaining failures in plain English. Narrower scope than a full AI SRE.
HolmesGPT — Open-source cross-stack SRE agent covering Kubernetes, Prometheus, logs, and Slack workflows.
Coroot (Community Edition) — Kubernetes observability plus AI-assisted RCA. Community Edition is free; commercial tier is priced transparently from $1 per monitored CPU core per month.

Open-Source vs Commercial AI SRE

Consideration	Open-Source	Commercial
Data residency	Fully self-hosted; incident data stays in your environment	Usually SaaS; incident data leaves your perimeter
Cost model	Free software; you pay for infra and LLM API usage	Per-seat or per-incident pricing
LLM choice	Bring any provider, including local via Ollama	Often bundled or restricted
Audit transparency	Source code available; you can audit how the agent behaves	Typically black-box
Support and managed ops	Community plus self-managed	Vendor support, SLAs, managed infrastructure
Time to deploy	Longer — self-hosting has setup cost	Shorter — SaaS onboarding
Customization	Fork, modify, add tools	Limited to what the vendor exposes

For regulated industries (finance, healthcare, government), air-gapped deployments, or teams already operating their own Kubernetes, open-source AI SRE is often the right fit. For teams prioritizing fastest time to value, commercial platforms win.

How to Evaluate an AI SRE Tool

If you are piloting an AI SRE in 2026, these are the questions to answer before committing:

How does the agent actually execute commands? Host process, container, sandboxed pod? Read-only or write? What credentials does it use?
Which alerts can it investigate today? Ask for specific integrations by name (PagerDuty, Datadog, CloudWatch) and test with your own alert payloads.
What happens when it is wrong? How does the agent surface low-confidence answers? Can you see the evidence it gathered?
Can it handle multi-cloud? If you run on more than one cloud, does it correlate across providers or investigate each in isolation?
Does it learn from past incidents? Does it ingest your existing runbooks and postmortems? How?
What is the remediation model? Suggestions only? PRs with human approval? Direct execution? Where are the guardrails?
Which LLM does it use — and can you change it? LLM cost and quality move quickly. Lock-in is a risk.
Where does your incident data go? Self-hosted, vendor cloud, LLM provider? Read the data flow carefully.

Limitations of AI SREs in 2026

The category is real but not a silver bullet:

Novel failure modes. Agents excel at recognizing patterns similar to past incidents. Genuinely new failures still often require human judgment.
Organizational root causes. "The deploy pipeline does not validate environment variables" is the kind of root cause an AI SRE can surface. "We do not have enough staff to maintain this service" is not.
LLM cost at scale. Complex investigations can consume hundreds of LLM calls. Local inference via Ollama mitigates this but requires GPU infrastructure.
Tool coverage gaps. An AI SRE can only investigate systems it has tools for. Legacy systems, internal tooling, and unusual stacks require custom connectors.
Trust-building takes time. Teams typically start with the agent in "observe" mode, graduate to "suggest," and only later enable autonomous remediation.

The DORA 2025 report is instructive: AI improves throughput but can increase instability in teams without strong platform engineering foundations. AI SRE tools amplify existing practices more than they fix broken ones.

How to Pilot an AI SRE in Your Team

A low-risk pilot follows six steps. Expect it to take four to six weeks end-to-end.

Pick one service and one alert source. Do not try to cover everything at once. Choose a service your team knows well and a monitoring tool you already use.
Deploy the AI SRE in read-only mode. Connect it to alerts, read-only cloud credentials, and your existing observability tools. Do not grant write permissions yet.
Run for two weeks, compare to human RCA. Let the agent investigate every incident that fires. Compare its root cause conclusions to what the on-call engineer eventually determined.
Measure accuracy and time-to-RCA. Two metrics matter: was the agent's root cause correct, and how much faster was it than the human?
Expand scope gradually. Add more services, enable remediation suggestions, then (only after trust is established) approved automated actions for specific low-risk patterns.
Feed historical context. Ingest your existing runbooks and past postmortems into the agent's knowledge base. Agents become dramatically more useful with organizational memory.

Getting Started with Aurora

Aurora is an open-source (Apache 2.0) AI SRE built by Arvo AI. It autonomously investigates incidents across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes, integrating with 22+ tools including PagerDuty, Datadog, Grafana, Slack, Bitbucket, and Confluence.

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init
make prod-prebuilt

Aurora works with any LLM provider — OpenAI, Anthropic, Google Gemini, OpenRouter, or local models via Ollama for air-gapped deployments. See the full documentation or the original post on arvoai.ca for more context.

This post was originally published on arvoai.ca.

Opsgenie 2026: Features, Pricing, EOL & Alternatives

Siddharth Singh — Tue, 21 Apr 2026 17:36:17 +0000

TL;DR — Opsgenie is ending. Atlassian stopped new Opsgenie signups on June 4, 2025 and will shut the service down permanently on April 5, 2027. Any data not migrated by that date will be deleted. Atlassian's official migration paths are Jira Service Management (JSM) Operations and Compass. Many teams are using the forced migration as a chance to evaluate alternatives — especially AI-powered options that weren't available when Opsgenie was originally adopted.

Opsgenie is an alerting and on-call management platform that was acquired by Atlassian in 2018. For years it was one of the most widely adopted tools in the SRE stack, sitting alongside PagerDuty and xMatters. In March 2025 Atlassian announced that Opsgenie's capabilities would be absorbed into Jira Service Management and Compass, and that the standalone product would be retired.

This guide covers what Opsgenie is, how it works, what it costs, the exact end-of-life timeline, what happens to your data when it shuts down, the official migration paths, and the current landscape of alternatives. Every claim is linked to an official source.

Last updated: April 21, 2026.

What is Opsgenie?

Opsgenie is a cloud-based incident alerting and on-call management platform for DevOps and SRE teams. It routes alerts from 200+ monitoring tools to the right on-call responders via SMS, voice, email, push, Slack, and Microsoft Teams. Atlassian acquired Opsgenie in 2018 and will retire the standalone product on April 5, 2027.

The tool was founded in 2012 and its capabilities are being absorbed into Jira Service Management and Compass.

Opsgenie at a glance vs top alternatives

	Opsgenie (retiring)	JSM Operations	PagerDuty	Aurora
Available after April 2027	No	Yes	Yes	Yes
Starting price	N/A (closed)	Per-agent	$21/user/mo	Free (OSS)
Built-in AI RCA	No	Partial	Add-on ($699+/mo)	Yes (agentic)
Open source	No	No	No	Apache 2.0
On-call + escalations	Yes	Yes	Yes	Via integration

Opsgenie End-of-Life Timeline (Official)

Atlassian announced the end of Opsgenie in The Evolution of IT Operations. The three critical dates are:

Milestone	Date	What it means
End of Sale	June 4, 2025	No new signups, upgrades, or downgrades on standalone Opsgenie
End of Support / Shutdown	April 5, 2027	Opsgenie service is turned off; REST APIs stop responding
Data Deletion	April 5, 2027	All unmigrated alerts, schedules, escalation policies, integrations, and incidents are permanently deleted

Existing customers can continue using Opsgenie through April 5, 2027, but cannot expand their footprint. After migration, Opsgenie and the new JSM or Compass instance can run in parallel for up to 120 days, after which Opsgenie is automatically switched off (official source).

"Opsgenie REST APIs will continue to work until April 5, 2027. However, Atlassian recommends updating all API endpoints before Opsgenie is turned off to avoid any disruptions." — Atlassian Support

Opsgenie Features

Opsgenie's core feature set is mature — this is a 13-year-old product. Here is what it currently provides, verified from Atlassian's documentation.

Integrations

Opsgenie ships with over 200 integrations with monitoring, ticketing, chat, and ITSM tools. Most are bidirectional — alerts flow in, and acknowledgement or closure events flow back.

Multi-Channel Notifications

Supported notification channels, per Atlassian documentation:

SMS — Aggregated at a minimum 1-minute interval; users can acknowledge or close alerts via reply
Voice calls — Capped at 2 minutes; dial-pad actions (1 = read, 2 = close, 3 = acknowledge, 4 = escalate)
Email — With inline action buttons
Push notifications — iOS and Android with swipe-to-ack/close
Slack — Bidirectional integration
Microsoft Teams — Bidirectional integration

On-Call Management

Opsgenie supports daily, weekly, and custom rotation types including follow-the-sun, with ad-hoc overrides, "Take on-call for an hour" self-service, and a "No-One" participant for scheduled gaps (official docs).

Escalation Policies

Default escalation is 5 minutes, then 10 minutes, repeatable up to 20 times per alert. Acknowledgement or closure stops the policy (official docs).

Heartbeat Monitoring

A "dead man's switch" — if an expected HTTP ping doesn't arrive within the configured interval (minimum 1 minute), Opsgenie fires an alert. Available on Standard and Enterprise plans only (official docs).

Alert Deduplication, Suppression, and Grouping

Opsgenie uses an alias field to deduplicate alerts — identical alias values increment a counter on the existing alert instead of creating a new one. The counter stops logging at 100 occurrences, but deduplication continues (official docs).

Delay policies can hold notifications for a fixed time, until a deduplication threshold is reached, or until an occurrence rate threshold triggers.

Routing Rules

Each team can have up to 100 routing rules, evaluated top-down with first-match semantics. Free and Essentials plans are limited to 1 routing rule and can only route by priority or tags. Standard and Enterprise plans support full-field routing.

Reporting by Plan

Report	Essentials	Standard	Enterprise
Notifications + API Usage	Yes	Yes	Yes
Monthly Overview (Looker)	No	Yes	Yes
Advanced reporting / MTTA / MTTR	No	Yes	Yes
Team Reports	No	No	Yes
Global Reports + Looker dashboards	No	No	Yes
Post-Incident Analysis	No	No	Yes

Source: Opsgenie Advanced Reporting.

Mobile App

Opsgenie's iOS and Android apps support swipe-to-acknowledge from the lock screen and iOS Critical Alerts that override Do Not Disturb and silent mode.

SSO / SAML

SSO is available on Standard and Enterprise plans only, with supported providers including Google, Azure AD, Okta, OneLogin, Ping Identity, and Microsoft AD FS (official docs).

Compliance

Opsgenie is covered under Atlassian's Trust program with SOC 2 Type II (annual), ISO/IEC 27001, ISO/IEC 27018, CSA, and TISAX AL2 certifications, plus a pre-signed GDPR DPA (official page).

Data Residency

Opsgenie is offered in US and EU regions, both hosted on AWS (official docs).

Who Should Use Opsgenie in 2026?

With end-of-sale already behind us, Opsgenie is only relevant to existing subscribers planning their exit. New teams cannot sign up. The question for existing subscribers is whether to stay with Atlassian (migrate to JSM or Compass) or evaluate alternatives.

Stay with Atlassian (migrate to JSM Operations) if you are already a Jira Service Management customer, need ITSM workflows (change, problem, incident), and are comfortable with the Premium-tier price increase.
Stay with Atlassian (migrate to Compass) if you are a DevOps or SRE team that wants alerting paired with a software component catalog and service ownership model, not ITSM.
Switch to a dedicated alerting tool (PagerDuty, ilert, Squadcast) if you want deeper alerting features and do not need Atlassian platform integration.
Switch to AI-powered incident management (incident.io, Rootly, Aurora) if you want autonomous investigation and root cause analysis, not just alert routing.

Opsgenie Pricing (Standalone, 100-User Reference)

Pricing below is for standalone Opsgenie with 100 users — sourced from the official Opsgenie pricing page. New signups are closed, so these numbers apply only to existing customers on legacy plans.

Plan	Monthly	Annual	Routing Rules	Heartbeats	SSO
Free	$0 (up to 5 users)	—	1	No	No
Essentials	$11.55/user/mo	$9.45/user/mo	1	No	No
Standard	~ $29/user/mo	Discounted	100 per team	Yes	Yes
Enterprise	~ $39/user/mo	Discounted	100 per team	Yes	Yes

Enterprise-exclusive features include Incident Command Center (built-in video chatroom tied to incidents), Stakeholders (notification-only users), Service Subscriptions, Incident Templates, and Post-Incident Analysis.

Incoming call routing is charged separately: $0.10 per minute for US/Canada and $0.35 per minute internationally after the free tier.

What Happens When Opsgenie Is Turned Off

On April 5, 2027, Atlassian will:

Disable the Opsgenie web application, mobile apps, and REST APIs
Delete all data that was not migrated to JSM or Compass — alerts, on-call schedules, escalation policies, integrations, incidents, notes, attachments
Stop accepting any incoming webhooks or notifications

Important: unlike the legacy Opsgenie Enterprise plan, JSM automatically deletes alert data after a retention window. Once alert data is deleted in JSM, it cannot be recovered. Export anything you need for compliance or audit before migration (official source).

Opsgenie Migration Paths: JSM vs Compass

Atlassian offers two official migration destinations. Both share the same underlying Operations engine — schedules, alerts, and policies sync bidirectionally — but the wrapping product and pricing differ (managing operations across both).

Jira Service Management (JSM) Operations

JSM Operations is the ITSM-centric path — alerts are paired with change, problem, and incident workflows. JSM pricing (official page):

JSM Plan	Price	Outbound Webhooks	Incident Command Center	Post-Incident Reviews	99.9% SLA
Free	$0 (up to 3 agents)	No	No	No	No
Standard	Per-agent	No	No	No	No
Premium	Per-agent	Yes	No	Yes	Yes
Enterprise	Contact sales	Yes	Yes	Yes	99.95%

Opsgenie features that do not carry over to JSM Operations, per Atlassian's shifting guide:

Incoming Call Routing integration is not supported
Stakeholder role — custom Opsgenie roles default to User
Alert creation rules from Opsgenie do not migrate
Legacy api.opsgenie.com/v1/services endpoint stops working
Chat integrations must be reconnected manually
The old Opsgenie mobile app stops working — responders switch to the Jira mobile app

Compass

Compass is positioned as a software component catalog + alerting platform aimed at DevOps, SRE, and Platform Engineering teams rather than ITSM. Compass pricing (official page):

Compass Plan	Price	Alerting	Heartbeats	99.9% SLA
Free	$0 (up to 3 full users)	Basic	No	No
Standard	$8/user/mo	Yes (150+ integrations)	No	No
Premium	$25/user/mo	Advanced	Yes	Yes

Migration Friction

Real complaints from the Atlassian Community:

Price increases — JSM Premium is widely reported as more expensive than standalone Opsgenie Standard
Feature parity gaps — some users need JSM and Compass together to match Opsgenie's alert processing depth
120-day forced cutover — Opsgenie auto-shuts-down 120 days after migration begins; Atlassian has declined requests to extend the window
Split paths confusion — some features only exist in JSM, others only in Compass, forcing customers to choose or buy both

One user put it bluntly: "Switching to Compass seems like buying a new car just to listen to the radio."

Why Teams Are Evaluating Alternatives Instead of Migrating

The forced migration has created a rare evaluation moment. Teams that adopted Opsgenie in 2018 are re-evaluating the entire category with three shifts in mind:

AI-native incident management has arrived. Products like Aurora, incident.io AI SRE, Rootly AI, and PagerDuty Advance didn't exist when most Opsgenie contracts were signed. Per Gartner (October 2025), 54% of I&O leaders are now adopting AI in operations.
On-call burnout is a hiring and retention problem. The Catchpoint SRE Report 2025 found that roughly 70% of SREs cite on-call stress as a direct cause of burnout, and toil rose to 30% of SRE work.
Downtime costs have climbed. PagerDuty's 2024 research put the average cost of a major incident at $794,000, or $4,537 per minute. ITIC's 2024 survey found 97% of large enterprises say an hour of downtime costs them over $100,000.

Against this backdrop, "like-for-like Opsgenie replacement" is no longer the only question — many teams are asking whether the replacement should also do autonomous investigation, not just alerting.

"By 2030, 75% of IT work will be human plus AI, 25% will be AI-only, and zero percent will be human-only." — Gartner CIO survey of 700+ CIOs, 2025

Top Opsgenie Alternatives in 2026

Verified pricing and capabilities from each vendor's official site. Last checked April 2026.

Product	Starting price	Free plan	Open source	AI-native	Best for
Aurora by Arvo AI	$0 self-hosted	Yes (OSS)	Apache 2.0	Yes (agentic)	OSS teams wanting alerting + autonomous RCA in one stack
PagerDuty	$21/user/mo	14-day trial	No	Yes (PagerDuty Advance, $415+/mo)	Enterprises wanting the incumbent with AI add-ons
ilert	Up to €49/user/mo Scale	Yes (5 responders)	Partial (MCP server)	Yes	EU-based teams requiring GDPR data residency
Squadcast	$9/user/mo Pro	Yes (5 users)	No	Yes	Small SRE teams on tight budgets
Rootly OnCall	From $20/user/mo	Trial	Partial (MCP, Agents JSON)	Yes (AI SRE standalone)	Teams wanting modular IR + on-call + AI SRE
incident.io On-call	$19 base + $10 add-on	Trial	No	Yes (AI SRE)	Slack-native incident coordination with AI
FireHydrant Signals	Usage-based	Trial	No	Yes (AI Copilot)	Teams preferring pay-per-alert over per-seat
xMatters	$39/user/mo base	Yes (10 users)	No	Partial	Everbridge customers needing codeless workflows
Grafana OnCall OSS	Free	Yes	AGPLv3 (archived)	No	Not recommended — archived March 24, 2026

Product Notes

PagerDuty — Most mature alerting product. PagerDuty Advance adds AI agents (SRE, Scribe, Shift) but requires a paid base plan and a separate $415+/mo Advance subscription. AIOps features require a $699+/mo add-on.

ilert — EU-hosted with a clear GDPR and data-sovereignty story; the AI SRE opts out of LLM training on customer data. Free tier includes 5 responders.

Squadcast — Acquired by SolarWinds on March 3, 2025. Roadmap now driven by SolarWinds.

Rootly — Rootly AI Labs launched February 20, 2026; Rootly MCP GA April 2, 2026. Rootly sells IR, On-Call, and AI SRE as standalone products.

incident.io — $62M Series B funded the launch of AI SRE — an always-on agent that investigates alerts, drafts PRs, and can autoresolve incidents.

FireHydrant — Acquisition by Freshworks expected to close Q1 2026; FireHydrant will become the incident layer inside Freshservice.

Grafana OnCall — Entered maintenance mode March 11, 2025 and archived March 24, 2026. Do not start new deployments. Grafana is consolidating on a unified Cloud IRM app.

Splunk On-Call (VictorOps) — Pricing not publicly listed. Cisco completed its $28B Splunk acquisition in March 2024; no official EOL announcement as of April 2026, but the product has seen minimal public investment since.

How Aurora Integrates with Opsgenie and JSM Operations

Aurora is open-source agentic incident management that works alongside Opsgenie (and the JSM Operations successor). Most AI incident tools have already deprecated Opsgenie support ahead of the 2027 shutdown — Aurora supports both so teams can run their migration on their own timeline. The integration is fully documented in Aurora's docs.

What Aurora does with Opsgenie alerts:

Bidirectional authentication — Accepts either a native Opsgenie GenieKey (US or EU region) or a JSM Operations Atlassian API token. Credentials are encrypted in HashiCorp Vault.
Webhook ingestion — Receives Create, Acknowledge, Close, and custom alert actions. Only Create triggers an investigation, preventing duplicates from acknowledgement webhooks.
Alert correlation — Aurora's AlertCorrelator groups incoming alerts with existing incidents by service, title, and time proximity. Correlated alerts attach to the parent incident instead of spawning a new one.
Priority mapping — Opsgenie priorities map deterministically: P1 → critical, P2 → high, P3 → medium, P4/P5 → low.
Service extraction — Aurora reads alerts for a service:xxx tag first, then falls back to the source and entity fields.
Autonomous RCA — On alert creation, Aurora creates an incident record, generates an AI summary, and launches a LangGraph-orchestrated agent that queries your cloud infrastructure to find the root cause.
Bidirectional JSM commenting — For JSM Operations users, Aurora posts an "RCA in progress" comment back onto the linked Jira incident and updates it with findings.
Chatbot query surface — Engineers can ask Aurora in natural language: "Who is on-call right now?", "Show me P1 alerts from the last 24 hours", "Get details for alert ABC-123". Aurora queries 8 Opsgenie resource types (alerts, alert details, incidents, incident details, services, on-call, schedules, teams) via parallel API calls.

"Most AI investigation tools only work with PagerDuty. We built Aurora to meet SRE teams where they already live — including Opsgenie and JSM — so AI-powered RCA isn't gated on migrating your alerting stack first." — Noah Casarotto-Dinning, CEO at Arvo AI

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init && make prod-prebuilt

How to Migrate Off Opsgenie Before April 5, 2027

Prerequisites: administrator access to your Opsgenie account, access to your monitoring stack, and a target destination decided (JSM Operations, Compass, or a third-party alternative).

If You Are Staying with Atlassian

Inventory your Opsgenie config. Document integrations, escalation policies, routing rules, heartbeats, on-call schedules, and custom roles.
Choose JSM Operations vs Compass. Pick JSM if you need ITSM workflows (change, problem, incident); pick Compass if you want alerting tied to a service catalog.
Verify feature parity. Review the Atlassian shifting guide for features that do not migrate.
Export historical data. Alert data in JSM auto-deletes after a retention window — export anything needed for audit or compliance first.
Run the in-product migration tool. Atlassian provides a guided migration that copies your data to JSM or Compass.
Re-authenticate chat integrations. Re-authorize Slack and Microsoft Teams — OAuth grants do not transfer.
Update API endpoints. Every consumer of the legacy Opsgenie REST API must be repointed to the new JSM Operations endpoints before April 5, 2027.
Replan the mobile rollout. The standalone Opsgenie mobile app stops working — responders move to the Jira mobile app.
Close Opsgenie within 120 days. After migration, Opsgenie runs in parallel for up to 120 days, then auto-shuts down.

If You Are Evaluating Alternatives

Shortlist two or three alternatives using the comparison table above.
Run a 90-day parallel trial alongside Opsgenie — most vendors offer free trials.
Validate the integrations that matter — especially monitoring tool webhooks and your chat platform.
Measure MTTR and on-call satisfaction against your Opsgenie baseline.
Decide before Atlassian's 120-day cutover window closes on any migration you start with JSM or Compass.

Frequently Asked Questions

When is Opsgenie being shut down?
Atlassian will shut down Opsgenie permanently on April 5, 2027. End of sale was June 4, 2025 — no new signups, upgrades, or downgrades are allowed. On April 5, 2027 the service will be disabled and any data that has not been migrated to Jira Service Management or Compass will be permanently deleted.

Can I still buy Opsgenie in 2026?
No. Atlassian closed new Opsgenie sales on June 4, 2025. Existing customers can continue using their current Opsgenie subscription until April 5, 2027 but cannot upgrade, downgrade, or add new users beyond their existing plan limits.

What are the official Opsgenie migration paths?
Atlassian offers two paths: Jira Service Management (JSM) Operations for ITSM teams needing change, problem, and incident workflows, and Compass for DevOps/SRE teams wanting alerting paired with a service catalog. Both share the same Operations engine, so schedules, alerts, and policies sync if you use both.

Will my Opsgenie data be preserved after migration?
Only data you explicitly migrate through Atlassian's in-product migration tool is preserved. Unlike legacy Opsgenie Enterprise, JSM automatically deletes alert data after a retention window — so you must export anything needed for compliance or audit before migration. Some features like alert creation rules and custom roles do not carry over at all.

How much does Opsgenie cost in 2026?
Existing standalone customers pay $9.45/user/month annual or $11.55/user/month monthly on Essentials at 100 users. Standard and Enterprise add full routing, SSO, heartbeats, and advanced reporting. Incoming call routing is billed separately at $0.10/minute (US/Canada) and $0.35/minute (international). New signups are no longer accepted.

What are the best Opsgenie alternatives?
The strongest 2026 alternatives are PagerDuty (incumbent with AI add-ons), incident.io (Slack-native with AI SRE), ilert (EU-hosted, GDPR-focused), Squadcast (budget-friendly, SolarWinds-owned), Rootly (modular IR + on-call + AI SRE), and Aurora by Arvo AI (open-source agentic RCA with Opsgenie and JSM support). Grafana OnCall OSS was archived in March 2026.

Does Opsgenie support AI-powered root cause analysis?
Standalone Opsgenie is an alerting and on-call product — it does not perform root cause analysis. Atlassian is adding AIOps features (alert grouping, automated resolutions) to JSM and Compass. Teams wanting autonomous multi-step RCA typically pair Opsgenie with a dedicated tool like Aurora, which ingests Opsgenie webhooks and investigates incidents automatically.

What happens to my Opsgenie integrations after migration?
Monitoring integrations (Datadog, New Relic, Prometheus) migrate automatically via Atlassian's in-product tool. Chat integrations (Slack, Microsoft Teams) must be re-authorized manually because the OAuth grants do not transfer. Custom webhooks calling the legacy Opsgenie REST API must be repointed to the new JSM Operations endpoints before April 5, 2027.

Can Aurora connect to Opsgenie and JSM?
Yes. Aurora supports both standalone Opsgenie (GenieKey authentication, US and EU regions) and JSM Operations (Atlassian API token). Aurora ingests alert webhooks, runs AI-powered alert correlation to group related alerts into incidents, and autonomously investigates the root cause. For JSM users, Aurora posts findings back as comments on the linked Jira incident.

Is Jira Service Management cheaper than Opsgenie?
No. JSM Premium is widely reported by Atlassian Community users as more expensive than standalone Opsgenie Standard. Real-time outbound webhooks require JSM Premium, and Incident Command Center requires JSM Enterprise. Many Opsgenie customers see a net price increase after migration, which is why teams use the forced migration to evaluate alternatives.

All Opsgenie, JSM, Compass, and alternative-vendor claims verified from official sources in April 2026. Last updated: April 21, 2026.

Originally published on arvoai.ca/blog.

By Team at Arvo AI.

Top 10 AIOps Platforms Offering Free Root Cause Analysis in 2026

Siddharth Singh — Fri, 10 Apr 2026 17:06:02 +0000

Key Takeaway: AIOps platforms now compete on the quality of AI-driven root cause analysis and the accessibility of free or open source entry points. Whether you need a full enterprise observability suite or a focused open source investigation tool, there's a platform with a free starting point for your team.

AIOps — Artificial Intelligence for IT Operations — combines AI/ML algorithms with big data analytics to automate IT operations and incident response across cloud and hybrid environments. In 2026, the landscape has matured significantly: platforms now offer autonomous investigation, deterministic AI, and agentic workflows that go far beyond basic alert correlation.

This guide covers the 10 best AIOps platforms that offer free root cause analysis capabilities — either through free tiers, open source licenses, or trial access.

Quick Comparison

Platform / Type / Free Access / RCA Approach / Best For

Aurora by Arvo AI — Open source (Apache 2.0) — Free forever (self-hosted) — Alert correlation + AI summarization + agentic autonomous investigation — SRE teams needing the full AIOps workflow in one free tool
Dynatrace — Enterprise SaaS — 15-day trial — Deterministic AI (Davis AI) — Large enterprises with complex microservice architectures
Datadog — SaaS — Free tier (5 hosts) — Watchdog anomaly detection — Teams wanting unified observability with easy onboarding
New Relic — SaaS — Free tier (100 GB/month) — Applied Intelligence — Organizations seeking usage-based pricing flexibility
OpenObserve — Open source (AGPL-3.0) — Free forever (self-hosted) — Log/metric/trace analytics — Cost-conscious teams needing petabyte-scale observability
Splunk ITSI — Enterprise SaaS — Trial available — Predictive ML analytics — Enterprises with heavy log volumes and existing Splunk investment
Grafana Cloud — SaaS + Open source — Free tier (10k metrics) — ML-powered Sift diagnostics — Teams already using the Grafana/Prometheus stack
Metoro — SaaS — Free tier (1 cluster) — AI SRE for Kubernetes — Kubernetes-native teams wanting automated deployment verification
BigPanda — Enterprise SaaS — Demo only — Open Box ML correlation — Large IT ops teams drowning in alert noise
PagerDuty — SaaS — Free tier (5 users) — AIOps add-on (paid) — Teams needing on-call + incident coordination

1. Aurora by Arvo AI

Aurora covers the full AIOps investigation workflow — from alert correlation and incident summarization all the way to autonomous multi-step root cause analysis. When alerts fire, Aurora's AlertCorrelator groups related alerts into incidents, generates AI summaries, and then triggers autonomous agents that query your cloud infrastructure directly.

How Aurora does RCA:

Alert correlation — groups related alerts into incidents by service and time proximity (AlertCorrelator service)
AI incident summarization — generates structured summaries with context and suggested next steps
Autonomous multi-step investigation — LangGraph-orchestrated agents dynamically select from 30+ tools per investigation
Executes kubectl, aws, az, gcloud commands in sandboxed Kubernetes pods (non-root, read-only filesystem, seccomp enforced)
Queries cloud APIs directly — AWS (STS AssumeRole), Azure (Service Principal), GCP (OAuth), OVH, Scaleway
Traverses Memgraph infrastructure dependency graph for blast radius analysis
Searches Weaviate knowledge base via vector search over runbooks and past postmortems
Generates structured RCA with timeline, evidence citations, and remediation steps
Suggests code fixes with diff preview — human approves and creates PR
Auto-generates postmortems exportable to Confluence and Jira

Free access: Completely free. Apache 2.0 open source, self-hosted via Docker Compose or Helm chart. No per-seat pricing, no usage limits. Use any LLM provider including Ollama for local models.

Integrations: 25+ verified — PagerDuty, Datadog, Grafana, New Relic, Dynatrace, Splunk, BigPanda, Kubernetes, Terraform, GitHub, Confluence, Slack, and more.

Best for: SRE teams that need a single free platform covering alert correlation, AI summarization, AND deep autonomous cloud investigation — without paying for three separate tools.

"We built Aurora to cover the full investigation workflow. It correlates alerts, summarizes incidents, then actually queries your AWS accounts, checks your Kubernetes pods, and traces the dependency chain — all autonomously." — Noah Casarotto-Dinning, CEO at Arvo AI

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init && make prod-prebuilt

2. Dynatrace

Dynatrace is an enterprise observability leader powered by its Davis AI engine, which uses deterministic AI for precise root cause identification.

RCA approach: Deterministic AI that consistently produces the same result for the same input — as opposed to probabilistic models that may vary. Davis AI continuously auto-discovers your infrastructure and maps dependencies across microservice architectures.

Free access: 15-day free trial plus a public sandbox environment. No permanent free tier.

Pricing: Usage-based. Infrastructure monitoring starts at $7/month per host (Foundation), $29/month (Infrastructure Monitoring), $58/month (Full-Stack).

Strengths: Deep auto-discovery, topology mapping, precise deterministic RCA.
Limitations: Enterprise-oriented pricing, complex configuration for advanced features.

Best for: Large enterprises with complex microservice architectures needing precise, repeatable RCA.

3. Datadog

Datadog provides a comprehensive observability ecosystem with a generous free tier for experimentation.

RCA approach: Watchdog — an AI engine that continuously analyzes billions of data points for automatic anomaly detection, root cause analysis, and contextual insights across metrics, logs, traces, and security data.

Free access: $0 free tier for Infrastructure Monitoring — up to 5 hosts with 1-day metric retention.

Pricing: Pro starts at $15/host/month (billed annually). Modular pricing across 20+ products.

Strengths: Unified platform, easy onboarding, broad integration ecosystem.
Limitations: Costs can scale quickly with multiple products and high cardinality.

Best for: Teams wanting unified cloud monitoring with AI-assisted incident detection and easy experimentation via the free tier.

4. New Relic

New Relic offers telemetry-centric observability with built-in AI for incident analysis.

RCA approach: Applied Intelligence — an AI module that deduplicates alerts, correlates incidents, and pinpoints root causes across cloud-native infrastructure using ML.

Free access: Free tier includes 100 GB/month data ingest, 1 full platform user, and 50+ capabilities. Usage-based pricing allows low-risk adoption.

Pricing: Usage-based — pay for data ingested and number of users.

Strengths: Flexible pricing, full-stack observability, large integration library.
Limitations: Advanced AI features may require higher tiers.

Best for: Organizations seeking flexible, usage-based pricing with built-in AI for alert correlation and incident analysis.

5. OpenObserve

OpenObserve is an open source observability platform built in Rust for high-performance log, metric, and trace analytics.

RCA approach: Analytics-driven observability — fast search and correlation across logs, metrics, and traces. Not agentic AI, but provides the data foundation for manual or scripted RCA.

Free access: Fully open source under AGPL-3.0. Self-hosted is free forever with unlimited users. Cloud plan also offers a free tier. Self-hosted Enterprise is free up to 200 GB/day ingestion.

Strengths: Claims 140x lower storage cost vs Elasticsearch. Petabyte-scale. Written in Rust for performance.
Limitations: Observability platform, not a dedicated AIOps/RCA tool. Requires engineering effort for investigation workflows.

Best for: Cost-conscious engineering teams needing high-performance observability as a foundation for RCA.

6. Splunk ITSI

Splunk ITSI (IT Service Intelligence) is an enterprise AIOps platform for organizations with heavy log volumes.

RCA approach: ML-powered predictive analytics — uses machine learning and historical data to detect future service degradations. Includes automated event aggregation with out-of-the-box ML policies and alert correlation.

Free access: Trial available. No permanent free tier.

Pricing: Not publicly listed. ITSI is a premium add-on requiring a base Splunk Enterprise or Cloud license. Widely considered one of the most expensive options in the AIOps space — costs scale significantly with data volume.

Strengths: Predictive alerting, deep service-level insights, mature ML capabilities.
Limitations: Significant cost at scale, proprietary query language (SPL), complex implementation.

Best for: Mid-to-large enterprises with existing Splunk investment and heavy log volumes.

7. Grafana Cloud

Grafana Cloud extends the popular open source Grafana ecosystem with cloud-hosted observability and ML-powered diagnostics.

RCA approach: ML-powered Sift for automated diagnostics, plus Correlations features that create interactive links between data sources. Application Observability auto-correlates metrics, logs, and traces.

Free access: Permanent free tier — 10,000 active metric series/month, 50 GB logs/traces/profiles, 3 active users, 14-day retention. No credit card required.

Strengths: Strong community, extensible with thousands of dashboards and plugins, works with Prometheus/Loki/Tempo natively.
Limitations: Operational tuning may be required for effective RCA at scale. ML features are newer additions.

Best for: Teams already using the Grafana/Prometheus stack who want cloud-hosted ML-powered diagnostics.

8. Metoro

Metoro is a developer/SRE-focused AIOps platform built specifically for Kubernetes environments.

RCA approach: AI SRE for Kubernetes — autonomous deployment verification, AI issue detection, root cause analysis, and remediation suggestions. Uses eBPF for telemetry collection.

Free access: Hobby plan — free forever, includes 1 cluster, 1 user, 2 nodes, 200 GB ingested/month.

Strengths: Kubernetes-native, automated deployment verification, APM + log management + infrastructure monitoring in one.
Limitations: Focused on Kubernetes — less suitable for non-containerized environments.

Best for: Kubernetes-native teams wanting an AI SRE that automates deployment verification and incident investigation.

9. BigPanda

BigPanda specializes in transparent, explainable ML-based event correlation for large IT operations teams.

RCA approach: Open Box Machine Learning (OBML) — transparent ML where users can examine automation logic in plain English, edit it, and preview before deploying. Correlates alerts across time, topology, context, and alert type. Claims 95%+ IT noise reduction.

Free access: No free tier or self-serve trial. Access through demo requests and sales engagement.

Strengths: Transparent/explainable AI (not black box), massive noise reduction, customizable correlation rules.
Limitations: Enterprise-only, no self-serve access, requires sales engagement.

Best for: Large IT ops teams drowning in alert noise who need transparent, customizable AI correlation.

10. PagerDuty

PagerDuty is the industry standard for incident response and on-call coordination, with AIOps capabilities available as add-ons.

RCA approach: AIOps add-on provides alert noise reduction (claims 91% reduction), intelligent correlation, and "Probable Origin" for root cause suggestions. Note: RCA features are not included in the free tier — they require the AIOps add-on ($699+/month) on top of a paid plan.

Free access: Free tier includes up to 5 users, 1 on-call schedule, basic incident management, and 700+ integrations. Basic alerting and response only — no RCA.

Pricing: Professional from $21/user/month (annual). AIOps add-on from $699/month. PagerDuty Advance (GenAI) from $415/month.

Strengths: Industry-standard on-call, 700+ integrations, robust mobile app, strong ecosystem.
Limitations: RCA requires expensive add-ons, not included in base plans.

Best for: Teams that already use PagerDuty for on-call and want to add AI-powered correlation and noise reduction.

How to Choose the Right Platform

When evaluating free AIOps RCA tools, prioritize these criteria:

RCA approach — Deterministic AI (Dynatrace), probabilistic ML (BigPanda), or agentic investigation (Aurora)?
Telemetry breadth — Does it cover logs, metrics, traces, and infrastructure state?
Cloud integration — Does it work with your cloud providers and existing monitoring stack?
Free tier limitations — What's actually included? Some "free" plans exclude RCA entirely (PagerDuty).
Self-hosted vs SaaS — Do you need data sovereignty? Only Aurora and OpenObserve offer full self-hosted deployment.
Investigation depth — Does it correlate alerts, or does it actually query your infrastructure?

Start with a free tier or open source instance to validate whether automated RCA reduces your MTTR before scaling to paid plans.

Key Features to Look For

AI/ML approach — Deterministic vs probabilistic vs agentic
Telemetry support — Logs, metrics, traces, and infrastructure state
Cloud provider integration — Native connectors for AWS, Azure, GCP, Kubernetes
Remediation guidance — Does it just identify the cause, or suggest fixes?
Postmortem automation — Auto-generated incident documentation
Knowledge base — Search over runbooks and past incidents
Compliance — SOC 2, HIPAA, GDPR if required

Mean Time to Repair (MTTR) — the average time to detect, diagnose, and resolve an incident — is the key metric. Research shows that AIOps root cause automation can cut MTTR by up to 50%.

Learn more about automated RCA in our Root Cause Analysis: The Complete Guide for SREs and explore how agentic investigation works in What is Agentic Incident Management?. For open source options, see Open Source Incident Management: Why It Matters.

All platform claims verified from official vendor websites. Last verified: April 2026.

incident.io Alternative: Open Source AI Incident Management

Siddharth Singh — Mon, 06 Apr 2026 22:18:30 +0000

Key Takeaway: incident.io is one of the strongest incident management platforms available — used by Netflix, Airbnb, and Etsy with a free Basic tier. But it's closed-source SaaS with no self-hosted option and undisclosed AI. Aurora is an open source (Apache 2.0) alternative focused on autonomous AI investigation with full infrastructure access — free, self-hosted, and works with any LLM.

What is incident.io?

incident.io describes itself as "the all-in-one AI platform for on-call, incident response, and status pages — built for fast-moving teams." It's one of the most well-regarded tools in the space, with customers including Netflix, Airbnb, Etsy, Intercom, and Vanta.

incident.io offers four core products:

Incident Response — Slack-native workflows, catalog, post-mortems
On-Call — Schedules, escalation, alerting with 40+ alert sources
AI SRE — Autonomous investigation, code fix PRs, context search
Status Pages — Public, internal, and customer-specific pages

As Airbnb's Director of SRE Nils Pommerien said: "If I could point to the single most impactful thing we did to change the culture at Airbnb, it would be rolling out incident.io."

What is Aurora?

Aurora is an open source (Apache 2.0) AI agent for automated incident investigation and root cause analysis. Aurora's LangGraph-orchestrated agents autonomously query infrastructure across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes — delivering structured RCA with remediation recommendations.

Aurora is free, self-hosted, and works with any LLM provider including local models via Ollama.

How They Compare

AI Investigation

incident.io AI SRE (incident.io/ai-sre):

Triages and investigates alerts, analyzes root cause
Connects code changes, alerts, and past incidents to uncover what went wrong
@incident chat in Slack — ask questions, get answers within seconds
Spots failing pull requests behind incidents
Searches through thousands of resources for relevant answers
Pulls metrics from monitoring dashboards directly into Slack
Scans public Slack channels for related discussions
Drafts code fixes and opens pull requests directly from Slack
Suggests next steps based on past incidents
AI-native post-mortems
MCP server (Beta) for IDE integration

Aurora AI:

Multi-agent architecture via LangGraph with dynamic tool selection (30+ tools)
Correlates alerts across services and dependencies
Constructs investigation timelines linking deployments, infra events, and telemetry
Generates structured RCA with evidence citations and remediation steps
Human-in-the-loop for write/destructive actions — read-only commands run automatically
Executes kubectl, aws, az, gcloud commands in sandboxed Kubernetes pods
Queries cloud APIs directly — AWS (STS AssumeRole), Azure (Service Principal), GCP (OAuth), OVH, Scaleway
Traverses Memgraph infrastructure dependency graph for blast radius
Searches Weaviate knowledge base (vector search over runbooks and past incidents)
Suggests code fixes with diff preview — human approves and creates PR
Exports postmortems to Confluence and Jira
Works with any LLM provider — choose your model

Key Difference

incident.io's AI SRE correlates data from monitoring tools, source control, and past incidents within Slack. Aurora's agents go deeper — they directly query cloud provider APIs and execute CLI commands in sandboxed pods to gather live infrastructure data during investigation.

On-Call & Alerting

incident.io has a full on-call product:

40+ alert sources ready to go
Schedules: simple, shadow rotations, follow-the-sun
99.99% delivery reliability claimed
AI alert intelligence (noise reduction)
Cover requests and easy overrides
Holiday feeds, compensation calculator
Migration tools from PagerDuty and Opsgenie
Mobile app

Aurora has no on-call capabilities. For on-call, use incident.io, PagerDuty, Grafana OnCall, or Opsgenie alongside Aurora.

Incident Coordination

incident.io excels here:

Slack-native incident response with workflows
Catalog for service ownership and context
Post-mortems with AI drafts
Status pages (public, internal, customer-specific)
Insights and analytics
~69 integrations across monitoring, ticketing, communication, HR

Aurora creates Slack incident channels, tracks action items with Jira sync, and generates postmortems. No status pages, no service catalog, no mobile app.

Feature Comparison

incident.io has, Aurora doesn't:

On-call scheduling, escalation, alerting (40+ sources)
Microsoft Teams support
Status pages (public, internal, customer-specific)
Service catalog
Insights and analytics
Mobile app
MCP server for IDEs (Beta)
AI that searches Slack channels for context
Metrics dashboard pulling from Slack
HR system integrations (BambooHR, Rippling, etc.)
~69 integrations
SOC 2, HIPAA compliance
Netflix, Airbnb, Etsy as customers

Aurora has, incident.io doesn't:

Direct cloud infrastructure querying (AWS, Azure, GCP, OVH, Scaleway APIs)
CLI execution in sandboxed Kubernetes pods
Native vector search knowledge base (Weaviate RAG)
Infrastructure dependency graph (Memgraph)
Terraform/IaC state analysis
Open source (Apache 2.0) — full codebase auditable
Self-hosted deployment (Docker Compose, Helm)
LLM provider flexibility (OpenAI, Anthropic, Google, Ollama for air-gapped)
Free — no per-user pricing

Both have:

AI-powered root cause analysis
AI-suggested code fixes and PR generation
Slack incident channel management
Automated postmortem generation
GitHub and GitLab integration
Datadog, Grafana integration
Action item tracking
RBAC and security controls
Human-in-the-loop for destructive actions

Pricing

incident.io (incident.io/pricing):

Basic: Free forever (1 custom field, 1 workflow, 2 integrations)
Team: $15/user/month (annual) — add on-call for +$10/user/month
Pro: $25/user/month — add on-call for +$20/user/month, AI post-mortems included
Enterprise: Custom pricing — unlimited everything, HIPAA, SCIM, custom RBAC
Standalone On-Call: $20/user/month

Aurora:

Free — Apache 2.0, self-hosted
Costs: infrastructure + LLM API usage
$0 LLM cost with Ollama local models

Example: 20-person team on incident.io Pro + On-Call:
$25 + $20 = $45/user/month x 20 = $900/month

Aurora: $0 + infrastructure + LLM API.

Open Source vs SaaS

incident.io is closed-source SaaS. You cannot self-host, audit the AI's reasoning, or choose your LLM provider.

Aurora is fully open source under Apache 2.0:

Read every line of code the AI uses to investigate
Self-host with zero data leaving your environment
Use any LLM provider or run local models via Ollama
Modify workflows, add custom tools, fork for your needs

When to Choose incident.io

You want the best all-in-one SaaS platform — incident.io is widely regarded as the best UX in the category
Slack-native AI chat matters — @incident in Slack is deeply integrated
You need on-call + response + status pages in one tool
Enterprise customers are important — Netflix, Airbnb, Etsy validation
Free tier works for you — Basic plan is genuinely free forever
Compliance is critical — SOC 2, HIPAA available

When to Choose Aurora

Investigation is your bottleneck — you need AI that directly queries your cloud infrastructure, not just correlates monitoring data
Open source is required — full transparency into how AI investigates your production systems
Self-hosted is required — compliance, data sovereignty, or air-gapped environments
Multi-cloud breadth — you need OVH or Scaleway alongside AWS, Azure, GCP
LLM flexibility — choose your own provider or run local models
Budget is limited — Aurora is free; incident.io Pro + On-Call is $900+/month for 20 users
You want a custom integration — the Arvo AI team builds custom integrations at no cost. Reach out and they'll build it with you.

Using incident.io + Aurora Together

They complement each other well:

Alert fires → incident.io creates channel, pages on-call, updates status page
Same alert → Aurora receives webhook, starts AI investigation
incident.io coordinates response (roles, workflows, comms)
Aurora investigates in background (queries cloud, checks K8s, searches knowledge base)
On-call SRE finds Aurora's RCA in the incident channel
Aurora generates postmortem → exports to Confluence
incident.io tracks follow-up actions

Limitations of Aurora

Aurora focuses on investigation, not full incident lifecycle management:

No on-call scheduling — use incident.io, PagerDuty, or Grafana OnCall alongside Aurora
No status pages — incident.io includes these on all tiers
Slack only — no Microsoft Teams support currently
No mobile app — incident.io has a polished mobile experience
Fewer integrations — Aurora has 25+ vs incident.io's ~69
SOC 2 Type II in progress — not yet certified
No Slack-native AI chat — Aurora's AI works through its web dashboard, not @mentions in Slack channels like incident.io

"incident.io has the best UX in the category — we respect that. Aurora's strength is different: deep cloud infrastructure investigation. If your SRE team is spending hours querying AWS, kubectl, and Grafana manually after getting paged, that's the problem Aurora solves." — Noah Casarotto-Dinning, CEO at Arvo AI

Getting Started with Aurora

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init
make prod-prebuilt

Configure your monitoring webhooks, add cloud credentials, and investigations start automatically. See the full documentation.

Learn more at arvoai.ca. For other comparisons, see Aurora vs Traditional Tools, PagerDuty Alternative, and Rootly Alternative.

All claims sourced from official websites. incident.io data from incident.io. Aurora data from github.com/Arvo-AI/aurora. Last verified: April 2026.

FireHydrant Alternative: Open Source AI Incident Management

Siddharth Singh — Mon, 06 Apr 2026 22:05:16 +0000

Key Takeaway: FireHydrant is a solid incident management platform — but it was acquired by Freshworks in December 2025, AI features are locked to the Enterprise tier, and there's no autonomous investigation. Aurora is an open source (Apache 2.0) alternative with AI agents that autonomously investigate root causes across your cloud infrastructure — completely free and self-hosted.

What is FireHydrant?

FireHydrant is an all-in-one incident management platform that helps teams plan, respond to, and learn from incidents. Their tagline: "Fight Fires Faster." They claim teams resolve incidents up to 90% faster with their platform.

In December 2025, FireHydrant was acquired by Freshworks (NASDAQ: FRSH). The platform will become the incident management and reliability layer inside Freshservice, Freshworks' ITSM product.

Notable customers: Backblaze (91% faster mitigation), Bluecore (saving 30-90 minutes per incident), Snyk, LaunchDarkly, AuditBoard, Qlik, Avalara.

What is Aurora?

Aurora is an open source (Apache 2.0) AI agent for automated incident investigation and root cause analysis. When an alert fires, Aurora's LangGraph-orchestrated agents autonomously query your infrastructure across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes — delivering a structured RCA with remediation recommendations.

Aurora is free, self-hosted, and works with any LLM provider.

How They Compare

AI Capabilities

FireHydrant AI (Enterprise tier only):

AI-generated incident summaries from Slack messages
Automated event timelines
Real-time call transcription (Zoom, Google Meet) with key point summarization
AI-drafted retrospectives with contributing factors and suggested action items
Stakeholder update generation

FireHydrant's AI is documentation-focused — it summarizes what happened, transcribes calls, and drafts retrospectives. It does not autonomously investigate root causes or query infrastructure.

Aurora AI:

Multi-agent architecture via LangGraph with dynamic tool selection (30+ tools)
Correlates alerts across services and dependencies
Constructs investigation timelines linking deployments, infra events, and telemetry
Generates structured RCA with evidence citations and remediation steps
Human-in-the-loop for write/destructive actions — read-only commands run automatically
Executes kubectl, aws, az, gcloud commands in sandboxed Kubernetes pods
Queries cloud APIs directly — AWS, Azure, GCP, OVH, Scaleway
Traverses Memgraph infrastructure dependency graph for blast radius
Searches Weaviate knowledge base (vector search over runbooks and past incidents)
Suggests code fixes with diff preview — human approves and creates PR
Works with any LLM provider including local models via Ollama

Incident Response & Coordination

FireHydrant is strong at incident coordination:

Slack and Microsoft Teams chatbot
Automated runbooks (triggered by severity, service, or custom fields)
Incident roles and assignments
Service catalog with dependency mapping and deployment tracking
38+ integrations
MTTx analytics (MTTD, MTTA, MTTR, MTTM)
Mobile notifications (iOS, Android)

Aurora creates and manages Slack incident channels, tracks action items with Jira sync, and sends investigation notifications. Aurora does not have Microsoft Teams support, incident roles, service catalog, or mobile app.

On-Call & Alerting

FireHydrant (branded "Signals"):

Team-based on-call schedules with unlimited escalation policies
SMS, voice, push, Slack, Teams, email, WhatsApp notifications
Alert routing via Common Expression Language (CEL)
Consumption-based alert pricing (not per-seat)
Alert grouping (Enterprise only)

Aurora has no on-call capabilities. For on-call, use PagerDuty, Grafana OnCall, or Opsgenie alongside Aurora.

Feature Comparison

FireHydrant has, Aurora doesn't:

Microsoft Teams support
Incident roles and assignments
Service catalog with dependency mapping
Status pages (public and private)
MTTx analytics dashboards
Mobile notifications (iOS, Android)
Deployment tracking
Call transcription (Zoom, Google Meet)
SOC 2 compliance
38+ integrations
Consumption-based alerting

Aurora has, FireHydrant doesn't:

Autonomous AI investigation (FireHydrant AI is documentation-focused only)
Direct cloud infrastructure querying (AWS, Azure, GCP, OVH, Scaleway)
CLI execution in sandboxed Kubernetes pods
Native vector search knowledge base (Weaviate RAG)
Infrastructure dependency graph (Memgraph)
Terraform/IaC state analysis
AI-suggested code fixes with diff preview
Open source (Apache 2.0) — full codebase auditable
Self-hosted deployment (Docker Compose, Helm)
LLM provider flexibility (OpenAI, Anthropic, Google, Ollama)
Free — no licensing costs

Both have:

Slack incident channel management
Automated postmortem/retrospective generation
Action item tracking with Jira sync
On-call integrations (PagerDuty, Opsgenie)
Datadog, Grafana, New Relic monitoring integrations
GitHub integration
Runbook/workflow automation
RBAC and security controls

Pricing

FireHydrant (firehydrant.com/pricing):

Free trial: 2 weeks, up to 10 responders
Platform Pro: $9,600/year (flat, up to 20 responders)
Enterprise: Custom pricing (required for AI features)
Alerting is consumption-based (separate from platform fee)

Aurora:

Free — Apache 2.0, self-hosted
Costs: infrastructure + LLM API usage
$0 LLM cost with Ollama local models

Note: FireHydrant AI features (summaries, transcripts, triage, retrospectives) are only available on the Enterprise tier. Pro users do not get AI capabilities.

The Freshworks Acquisition Factor

FireHydrant was acquired by Freshworks in December 2025. What this means:

The platform will be integrated into Freshservice (Freshworks' ITSM product)
Current accounts, pricing, and support stay the same during transition
Long-term product direction is now under Freshworks' roadmap
Some teams may want to evaluate alternatives before deeper Freshworks lock-in

Aurora is independently maintained open source — no acquisition risk, no vendor roadmap dependency.

When to Choose FireHydrant

You need full incident coordination — roles, runbooks, status pages, service catalog, analytics
Call transcription matters — real-time Zoom/Google Meet transcription with AI summaries
Microsoft Teams is required — Aurora is Slack-only
You want managed SaaS — no infrastructure to maintain
You're already in the Freshworks ecosystem — Freshservice integration will be seamless

When to Choose Aurora

Investigation is your bottleneck — you need AI that actually investigates, not just summarizes
You need direct cloud querying — AI agents that run commands on AWS, Azure, GCP, K8s
Open source is required — audit how AI investigates your infrastructure
Self-hosted is required — compliance, data sovereignty, or air-gapped environments
Budget is limited — FireHydrant Enterprise (required for AI) is custom pricing; Aurora is free
LLM flexibility — choose your provider or run local models
You're concerned about the acquisition — Aurora has no vendor lock-in risk
You want a custom integration — the Arvo AI team builds custom integrations at no cost. Reach out and they'll build it with you.

Limitations of Aurora

Aurora is powerful for investigation but doesn't replace a full incident coordination platform:

No on-call scheduling — use PagerDuty, Grafana OnCall, or Opsgenie alongside Aurora
No status pages — use Atlassian Statuspage, incident.io, or Instatus
Slack only — no Microsoft Teams support currently
No mobile app — investigation results are accessed via web dashboard
SOC 2 Type II in progress — not yet certified (FireHydrant has SOC 2)
Self-hosted requires infrastructure — you maintain the Docker/K8s deployment

"We built Aurora for one job — investigating why incidents happen. We deliberately didn't build on-call or status pages because tools like PagerDuty and FireHydrant already do those well. Aurora is the investigation layer that plugs into your existing stack." — Noah Casarotto-Dinning, CEO at Arvo AI

Getting Started with Aurora

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init
make prod-prebuilt

Configure your monitoring webhooks, add cloud credentials, and investigations start automatically. See the full documentation.

Learn more at arvoai.ca. For other comparisons, see Aurora vs Traditional Tools, PagerDuty Alternative, and Rootly Alternative.

All claims sourced from official websites. FireHydrant data from firehydrant.com. Aurora data from github.com/Arvo-AI/aurora. Last verified: April 2026.

Resolve.ai Alternative: Open Source AI for Incident Investigation

Siddharth Singh — Thu, 02 Apr 2026 21:44:19 +0000

Key Takeaway: Resolve.ai is a $1B-valued AI SRE platform used by Coinbase, DoorDash, and Salesforce — but pricing requires contacting sales with no public pricing page. Aurora is an open source (Apache 2.0) alternative that delivers autonomous AI investigation with sandboxed cloud execution, infrastructure graphs, and knowledge base search — completely free and self-hosted.

What is Resolve.ai?

Resolve.ai is an AI-powered autonomous SRE platform founded in 2024 by Spiros Xanthos (former SVP at Splunk, co-creator of OpenTelemetry) and Mayank Agarwal. It raised $125M in Series A at a reported $1 billion valuation, backed by Lightspeed and Greylock with angels including Fei-Fei Li and Jeff Dean.

Resolve.ai positions as "machines on call for humans" — a multi-agent AI system that autonomously investigates production incidents across code, infrastructure, and telemetry.

Notable customers: Coinbase (73% faster time to root cause), DoorDash (87% faster investigations), Salesforce, MongoDB, Zscaler, Toast, Pinecone.

What is Aurora?

Aurora is an open source (Apache 2.0) AI agent for automated incident investigation and root cause analysis. When an alert fires, Aurora's LangGraph-orchestrated agents autonomously query your infrastructure across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes — correlating data from 25+ tools and delivering a structured RCA with remediation recommendations.

Aurora is free, self-hosted, and works with any LLM provider including local models via Ollama.

How They Compare

AI Investigation Approach

Resolve.ai:

Multi-agent architecture with parallel hypothesis testing
Formulates multiple theories per incident, deploys sub-agents to investigate each simultaneously
Correlates alerts across services and dependencies
Constructs causal timelines linking code changes, infra events, and telemetry
Generates root cause analysis with confidence scores
Human-in-the-loop approval gates before automated actions
Per-customer fine-tuned models

Aurora:

Multi-agent architecture via LangGraph with dynamic tool selection (30+ tools)
Correlates alerts across services and dependencies (AlertCorrelator + Memgraph graph)
Constructs investigation timelines linking deployments, infra events, and telemetry
Generates structured RCA with evidence citations and remediation steps
Human-in-the-loop for write/destructive actions — read-only commands run automatically
Executes kubectl, aws, az, gcloud commands in sandboxed Kubernetes pods (non-root, read-only filesystem, capabilities dropped, seccomp enforced)
Queries cloud APIs directly — AWS (STS AssumeRole), Azure (Service Principal), GCP (OAuth), OVH, Scaleway
Traverses Memgraph infrastructure dependency graph for blast radius
Searches Weaviate knowledge base (vector search over runbooks and past incidents)
Works with any LLM provider — choose your own model

Cloud & Infrastructure

Resolve.ai:

AWS and GCP confirmed
Azure is not listed on their integrations page
Kubernetes support confirmed
Deploys an on-premise "satellite" agent as a secure gateway — core platform runs in Resolve's cloud

Aurora:

AWS, Azure, GCP, OVH, Scaleway — all five with native authentication
Deep Kubernetes integration via outbound WebSocket kubectl-agent
Fully self-hosted — Docker Compose or Helm chart
No data leaves your environment

Integrations

Resolve.ai (resolve.ai/integrations):

Monitoring: Grafana, Datadog, Splunk, Prometheus, Dynatrace, Elastic, Chronosphere, Kloudfuse, OpenSearch
Infrastructure: Kubernetes, AWS, GCP
Code: GitHub
Chat: Slack
Knowledge: Notion
Custom: MCP, APIs, Webhooks
Total: ~17+ confirmed

Aurora (github.com/Arvo-AI/aurora):

Monitoring: PagerDuty, Datadog, Grafana, New Relic, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, Splunk
Cloud: AWS, Azure, GCP, OVH, Scaleway
Infrastructure: Kubernetes, Terraform, Docker
CI/CD: GitHub, Bitbucket, Jenkins, CloudBees, Spinnaker
Docs: Confluence, Jira, SharePoint
Network: Cloudflare, Tailscale
Communication: Slack
Total: 25+ confirmed

Knowledge & Learning

Resolve.ai:

Learns from runbooks, wikis, chats, and historical incidents
Builds a knowledge graph of infrastructure components
Captures tribal knowledge from production systems
Per-customer fine-tuned models that improve from feedback (thumbs up/down)

Aurora:

Built-in Weaviate vector store for semantic search over runbooks, postmortems, and documentation
Memgraph infrastructure dependency graph maps relationships across all cloud providers
Learns from past investigations stored in the knowledge base

Code Fixes & Remediation

Resolve.ai: Generates remediation PRs via GitHub with supporting context. Suggests kubectl commands and scripts. All actions require human approval before execution.

Aurora: Suggests code fixes with diff preview — human reviews and creates PR with one click via GitHub and Bitbucket. Executes read-only CLI commands in sandboxed pods. Generates postmortems exportable to Confluence and Jira.

Feature Comparison

Resolve.ai has, Aurora doesn't:

Automatic JIRA ticket updates during investigation
Enterprise support with SLAs
Available on AWS Marketplace

Aurora has, Resolve.ai doesn't:

Azure, OVH, and Scaleway cloud support
Open source (Apache 2.0) — full codebase auditable
Self-hosted deployment (Docker Compose, Helm)
LLM provider flexibility (OpenAI, Anthropic, Google, Ollama for air-gapped)
Slack incident channel creation and management
PagerDuty, New Relic, BigPanda, ThousandEyes, Coroot integrations
Terraform/IaC state analysis
Bitbucket, Jenkins, CloudBees, Spinnaker integrations
Confluence and SharePoint integration
Network integrations (Cloudflare, Tailscale)
Free — no licensing costs whatsoever

Both have:

Autonomous AI incident investigation
Multi-agent architecture
Root cause analysis with evidence
AI-suggested code fixes (human-approved PRs)
Infrastructure dependency/knowledge graph
Knowledge base search (runbooks, wikis, past incidents)
Kubernetes investigation
AWS and GCP support
Datadog, Grafana, Splunk, Dynatrace integrations
Slack integration
RBAC and security controls
AI that learns from user feedback
Causal timeline construction with dependency chain mapping
Human-in-the-loop for destructive actions
Per-customer tuning (Resolve.ai via fine-tuned models; Aurora via open source customization)
SOC 2 Type II compliance (Resolve.ai: certified; Aurora: in progress)

Pricing

Resolve.ai:

No public pricing page
Custom enterprise pricing (contact sales)
No free tier or self-service signup
Target: large enterprise SRE teams

Aurora:

Free — Apache 2.0, self-hosted
Costs: infrastructure (VM or K8s cluster) + LLM API usage
$0 LLM cost with Ollama local models
No contracts, no sales calls, no per-user pricing

The price difference is the core story. Resolve.ai delivers enterprise AI investigation for enterprise budgets. Aurora delivers open source AI investigation for everyone else.

Open Source vs Enterprise SaaS

Resolve.ai is a closed-source, cloud-hosted enterprise platform. You cannot audit the AI's reasoning, choose your own LLM, or self-host. Your incident data flows through Resolve's infrastructure (they state they don't persist raw data or train across customers).

Aurora is fully open source. You can:

Read every line of code the AI uses to investigate your infrastructure
Self-host with zero data leaving your environment
Use any LLM provider — or run local models for fully air-gapped operation
Modify investigation workflows, add custom tools, fork for your needs
Contribute back to the project

When to Choose Resolve.ai

You're a large enterprise company with budget for enterprise AI tooling
Managed fine-tuned models — you want the vendor to handle per-customer model training rather than customizing open source yourself
You need certified compliance today — SOC 2 Type II, HIPAA, GDPR already certified (Aurora's SOC 2 is in progress)
Managed service preferred — you don't want to maintain AI infrastructure

When to Choose Aurora

Budget matters — you can't justify custom enterprise pricing for AI investigation
Open source is required — you need full transparency into how AI investigates your production systems
Self-hosted is required — compliance, data sovereignty, or air-gapped environments
Multi-cloud breadth — you need Azure, OVH, or Scaleway alongside AWS and GCP
LLM flexibility — you want to choose your own provider or run models locally
You're a startup or mid-market — Resolve.ai has no mid-market pricing
You want a custom integration — the Arvo AI team actively builds custom integrations for companies at no cost. If there's a feature gap, reach out and they'll build it with you.

Getting Started with Aurora

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init
make prod-prebuilt

Configure your monitoring webhooks (PagerDuty, Datadog, Grafana), add cloud provider credentials, and investigations start automatically. See the full documentation for deployment guides.

Rootly Alternative: Open Source AI Incident Management

Siddharth Singh — Thu, 02 Apr 2026 21:28:21 +0000

Key Takeaway: Rootly is an AI-native incident management platform with on-call, workflows, and AI SRE agents — starting at $20/user/month with AI SRE priced separately. Aurora is an open source (Apache 2.0) AI agent focused purely on autonomous incident investigation and root cause analysis. Rootly orchestrates your entire incident lifecycle. Aurora automates the hardest part — figuring out why something broke.

What is Rootly?

Rootly describes itself as an "AI-native incident management platform" — an all-in-one tool for detecting, managing, learning from, and resolving incidents. Founded in 2021, it's used by teams at Replit, NVIDIA, LinkedIn, Figma, and hundreds more, with a 4.8/5 rating on G2.

Rootly offers three products:

Incident Response — Slack/Teams-native workflows, playbooks, roles, status pages, retrospectives
On-Call — Schedules, escalation policies, alert routing, live call routing, mobile app
AI SRE — Autonomous AI agents for root cause analysis, remediation, and alert triage

What is Aurora?

Aurora is an open source AI agent that automates incident investigation. When a monitoring tool fires an alert, Aurora's LangGraph-orchestrated agents autonomously query your infrastructure across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes — correlating data from 25+ tools and delivering a structured root cause analysis with remediation recommendations.

Aurora doesn't manage your incident lifecycle. It investigates the root cause.

How They Compare

Incident Response & Coordination

Rootly is a full incident lifecycle platform:

Slack and Microsoft Teams native incident channels
Automated workflows (create channels, page responders, update status)
Incident roles (commander, communications lead, etc.)
Playbooks and runbooks
Status pages (internal and external)
Action item tracking with Jira sync
DORA metrics and advanced analytics
Mobile app (iOS and Android)

Aurora is not a full incident coordination platform — no roles or status pages. However, Aurora does create and manage Slack incident channels, tracks action items with Jira sync, sends investigation notifications, and supports @aurora mentions in any channel for conversational investigation.

On-Call Management

Rootly has a full on-call product:

Schedules with shadow rotations, holiday calendars, PTO overrides
Escalation policies with gap detection
SMS, voice, push notifications (bypass Do Not Disturb)
Live call routing
On-call pay calculator
99.99% uptime claim

Aurora has no on-call capabilities. No schedules, no paging, no escalation. For on-call, use Rootly, PagerDuty, Grafana OnCall, or Opsgenie alongside Aurora.

AI Investigation

This is where the tools diverge most.

Rootly AI SRE (rootly.com/ai-sre):

Correlates alerts with code changes, deploys, and config changes
Generates root cause analysis with confidence scores
Surfaces similar past incidents and proven solutions
Drafts remediation steps and PRs with suggested fixes
AI Meeting Bot that transcribes incident bridges in real time
@rootly AI chat in Slack/Teams for summaries and task assignment
MCP server for IDEs (Cursor, Windsurf, Claude Code)
Chain-of-thought visibility ("see why a root cause is flagged")
Whether it directly queries cloud infrastructure APIs is unverified

Aurora AI Investigation:

Autonomous multi-step investigation using LangGraph-orchestrated agents
Dynamically selects from 30+ tools per investigation
Executes kubectl, aws, az, gcloud commands in sandboxed Kubernetes pods (non-root, read-only filesystem, capabilities dropped, seccomp enforced)
Queries cloud APIs directly — AWS (STS AssumeRole), Azure (Service Principal), GCP (OAuth), OVH, Scaleway
Traverses Memgraph infrastructure dependency graph for blast radius
Searches Weaviate knowledge base (vector search over runbooks, past postmortems)
Generates structured RCA with timeline, evidence citations, and remediation
Generates code fix pull requests via GitHub and Bitbucket
Exports postmortems to Confluence and Jira

Knowledge Base

Rootly: Surfaces similar past incidents during investigations. Integrates with Glean for broader knowledge search. No native vector search product.

Aurora: Built-in Weaviate-powered vector store. Upload runbooks, past postmortems, and documentation — the AI agent searches them using semantic similarity during every investigation.

Postmortems

Rootly: AI-generated retrospectives with context, timelines, and custom templates. Collaborative editing. Jira sync for action items.

Aurora: AI-generated postmortems with timeline, root cause, impact assessment, and remediation steps. One-click export to Confluence and Jira.

Feature Comparison

Rootly has, Aurora doesn't:

On-call scheduling, escalation policies, paging (SMS/voice/push)
Microsoft Teams support (Aurora is Slack-only)
Automated incident workflows (create channels, page responders, update status)
Status pages (internal and external)
Incident roles
DORA metrics and analytics
Mobile app (iOS, Android)
MCP server for IDEs
AI Meeting Bot for incident bridges
SOC 2 Type II, HIPAA, GDPR, CCPA compliance
70+ integrations

Aurora has, Rootly doesn't (or is unverified):

Direct cloud infrastructure querying (AWS, Azure, GCP, OVH, Scaleway APIs)
CLI command execution in sandboxed Kubernetes pods
Native vector search knowledge base (Weaviate RAG)
Infrastructure dependency graph (Memgraph)
Terraform/IaC state analysis
Open source (Apache 2.0) — full codebase auditable
Self-hosted deployment (Docker Compose, Helm)
LLM provider flexibility including local models (Ollama for air-gapped)
Free — no per-user or per-incident pricing

Both have:

AI-powered root cause analysis
Code fix PR generation
Automated postmortem generation
PagerDuty, Datadog, Grafana integrations
GitHub integration
Confluence integration
HashiCorp Vault integration
BYOK for LLM providers
Slack incident channels
Action item tracking with Jira sync

Pricing

Rootly (rootly.com/pricing):

Incident Response Essentials: $20/user/month
On-Call Essentials: $20/user/month
AI SRE: Contact sales (no published price)
Enterprise tiers: Contact sales
Bundle discounts available for IR + On-Call + AI SRE
Startup discount: up to 50% off (<100 employees, <$50M raised)
Free 14-day trial

Aurora:

Free — Apache 2.0, self-hosted
Costs: infrastructure (VM or K8s cluster) + LLM API usage
$0 LLM cost possible with Ollama local models

Example: 20-person SRE team

For Rootly IR + On-Call: $20 + $20 = $40/user/month x 20 = $800/month (before AI SRE add-on, which is priced separately via sales).

For Aurora: $0 + infrastructure + LLM API.

Note: Rootly pricing from rootly.com/pricing. AI SRE pricing is not publicly listed.

Open Source vs SaaS

Rootly is SaaS-only. The core platform is proprietary. They have open source tooling on GitHub (Terraform provider with 400,000+ downloads, Backstage plugin, CLI, SDKs) but not the platform itself.

Aurora is fully open source under Apache 2.0. The entire codebase — backend, frontend, agent orchestration — is on GitHub. You can:

Audit exactly what the AI does on your infrastructure
Modify investigation workflows and add custom tools
Fork and customize for your organization
Run fully air-gapped with local LLMs via Ollama
Keep all incident data in your own environment

When to Choose Rootly

Rootly is the better choice when:

You need a full incident lifecycle platform — on-call, workflows, status pages, roles, retrospectives, DORA metrics in one tool
Slack/Teams-native workflows matter — Rootly's incident channels and AI chat are deeply embedded in collaboration tools
Compliance requirements — SOC 2 Type II, HIPAA, GDPR out of the box
You want managed SaaS — no infrastructure to maintain
You need a mobile app — iOS and Android for on-call
Enterprise support — dedicated support, SLAs, BAA for HIPAA

When to Choose Aurora

Aurora is the better choice when:

Investigation is your bottleneck — your team spends hours diagnosing incidents manually
You need deep cloud investigation — AI agents that directly query AWS, Azure, GCP, and Kubernetes
You want open source — full transparency into how AI investigates your infrastructure
Self-hosted is required — compliance, data sovereignty, or air-gapped environments
Budget is limited — free forever, no per-user pricing
LLM flexibility matters — bring any provider, including local models
You already have on-call — PagerDuty, Grafana OnCall, or Opsgenie handles paging; you need the investigation layer
You want a custom integration — Aurora is open source and the Arvo AI team actively builds custom integrations for companies that need them — at no cost. If there's a feature gap, reach out and they'll build it with you.

Using Rootly + Aurora Together

They're not mutually exclusive. Rootly manages your incident lifecycle; Aurora investigates the root cause:

Alert fires → Rootly creates incident channel, pages on-call
Same alert → Aurora receives webhook, starts AI investigation
Rootly coordinates the response (roles, comms, status page)
Aurora investigates in the background (queries cloud, checks K8s, searches knowledge base)
On-call SRE finds Aurora's completed RCA with root cause and remediation
Aurora generates postmortem → exports to Confluence
Rootly tracks action items → syncs to Jira

Getting Started with Aurora

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init
make prod-prebuilt

Configure your monitoring webhooks (PagerDuty, Datadog, Grafana), add cloud provider credentials, and investigations start automatically. See the full documentation for deployment guides.

PagerDuty Alternative for Root Cause Analysis: Why SRE Teams Are Adding AI Investigation

Siddharth Singh — Wed, 01 Apr 2026 21:36:15 +0000

Key Takeaway: PagerDuty is the industry standard for alerting and on-call management — but it doesn't investigate why incidents happen. Aurora is an open source AI agent that plugs into PagerDuty via webhooks and autonomously investigates root causes across AWS, Azure, GCP, and Kubernetes. They're complementary tools, but for teams spending hours on manual RCA, Aurora fills the gap PagerDuty doesn't cover.

PagerDuty has over 30,000 customers and dominates on-call management. It's excellent at what it does: detecting alerts, routing them to the right person, coordinating incident response, and tracking SLAs.

But here's the problem: PagerDuty pages you. Then you're on your own.

The actual investigation — SSHing into servers, querying CloudWatch, checking Kubernetes pod logs, correlating deployments with error spikes — is still manual. According to the VOID (Verica Open Incident Database), the median incident involves 3.5 contributing factors, and the investigation phase consumes the majority of mean time to resolve (MTTR).

This is the gap Aurora fills.

PagerDuty vs Aurora: Different Tools, Different Jobs

This isn't a "which is better" comparison. PagerDuty and Aurora solve different problems:

	PagerDuty	Aurora
Primary job	Alert routing, on-call, coordination	Root cause investigation
Answers the question	"Who needs to know and how do we coordinate?"	"Why did this happen and what should we fix?"
Trigger	Monitoring tool fires alert	PagerDuty webhook (or Datadog, Grafana, etc.)
Output	Engineer gets paged, war room opens	Structured RCA with timeline, root cause, remediation

They work together. Aurora ingests PagerDuty incident.triggered webhooks. When PagerDuty pages your SRE, Aurora is already investigating in the background.

What PagerDuty Does Well

PagerDuty's strengths are real and well-established:

On-call scheduling — Flexible rotations, escalation policies, shift overrides
Alert routing — 700+ integrations for ingesting alerts from every monitoring tool
Multi-channel paging — SMS, phone, push notifications, email
Incident coordination — War rooms, stakeholder communications, status pages
SLA tracking — Urgency-based alerting and escalation
AI noise reduction — AIOps add-on claims 91% alert noise reduction via intelligent correlation and deduplication

PagerDuty has also added AI features through PagerDuty Advance, including:

AI incident summaries ("catch me up" in Slack)
AI-generated status updates
AI postmortem drafts (Beta)
SRE Agent for triage and approved remediation actions
Probable Origin for pattern-based root cause suggestions

Where PagerDuty Stops

Despite the AI additions, PagerDuty's investigation capabilities have limits:

No autonomous multi-step investigation. PagerDuty's SRE Agent surfaces past incidents and patterns, but it doesn't autonomously query your AWS accounts, check Kubernetes pod status, correlate Terraform changes, or trace dependency graphs. The investigation itself is still on the engineer.

No native cloud infrastructure querying. PagerDuty receives alerts from CloudWatch, Azure Monitor, etc. — it doesn't query them directly. It can't run kubectl get pods or aws cloudwatch get-metric-data on your behalf during an investigation.

No knowledge base with vector search. PagerDuty's RAG capability is partial — it requires configuring Amazon Q Business as an external integration. There's no native vector search over your runbooks and past postmortems.

No code fix suggestions. PagerDuty can surface recent code changes that may be related to an incident, but it doesn't generate remediation code or create pull requests.

AI features are paid add-ons. AIOps starts at $699/month. PagerDuty Advance starts at $415/month. These are on top of per-user pricing ($21-$41+/user/month depending on tier).

What Aurora Does Differently

Aurora is an open source (Apache 2.0) AI agent that automates the investigation phase — the part that happens after you get paged.

Autonomous Investigation

When Aurora receives an alert webhook, its LangGraph-orchestrated AI agents:

Analyze the alert context (severity, service, timing)
Dynamically select from 30+ tools to investigate
Execute kubectl, aws, az, gcloud commands in sandboxed Kubernetes pods
Query logs, metrics, and recent deployments across cloud providers
Search the knowledge base for relevant runbooks and past incidents
Traverse the infrastructure dependency graph for blast radius
Synthesize everything into a structured root cause analysis

No human in the loop during investigation. The SRE gets paged by PagerDuty and finds a completed RCA waiting in Aurora.

Multi-Cloud Native

Aurora connects directly to your cloud infrastructure:

Provider	Authentication
AWS	STS AssumeRole (temporary credentials)
Azure	Service Principal
GCP	OAuth
OVH	API key
Scaleway	API token
Kubernetes	Kubeconfig via outbound WebSocket agent

25+ Verified Integrations

Category	Tools
Monitoring	PagerDuty, Datadog, Grafana, New Relic, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, Splunk
Cloud	AWS, Azure, GCP, OVH, Scaleway
Infrastructure	Kubernetes, Terraform, Docker
CI/CD	GitHub, Bitbucket, Jenkins, CloudBees, Spinnaker
Docs & Knowledge	Confluence, Jira, SharePoint
Network	Cloudflare, Tailscale
Communication	Slack

Knowledge Base with RAG

Aurora includes a built-in Weaviate-powered vector store. Upload your runbooks, past postmortems, and documentation — the AI agent searches them during every investigation using semantic similarity, not just keyword matching.

AI Code Fix Suggestions

Aurora can generate pull requests with remediation code via its GitHub and Bitbucket integrations. It doesn't just tell you what's wrong — it suggests how to fix it with actual code.

Automated Postmortems

Structured postmortem documents generated automatically with:

Incident timeline with timestamps
Root cause identification with evidence and citations
Impact assessment
Remediation steps (taken and recommended)
One-click export to Confluence or Jira

Feature Comparison

Feature	PagerDuty	Aurora
On-call scheduling	Yes (core)	No
Alert routing & escalation	Yes (core)	No
SMS/phone/push paging	Yes (core)	No
Status pages	Yes (add-on, from $89/mo)	No
SLA/SLO tracking	Yes	No
Autonomous AI investigation	Partial (SRE Agent for triage)	Yes (full multi-step)
Native cloud querying	No (receives alerts)	Yes (AWS, Azure, GCP, OVH, Scaleway)
CLI execution on infra	Via Runbook Automation add-on	Yes (sandboxed K8s pods)
Knowledge base (RAG)	Via Amazon Q Business integration	Yes (native Weaviate)
Infrastructure graph	No	Yes (Memgraph)
AI postmortems	Beta (via Jeli)	Yes (with Confluence export)
AI code fix PRs	No	Yes (GitHub, Bitbucket)
Open source	No (Rundeck only)	Yes (Apache 2.0)
Self-hosted	No (SaaS only)	Yes (Docker, Helm)
LLM provider choice	No (undisclosed, fixed)	Yes (OpenAI, Anthropic, Google, Ollama)
Integrations	700+	25+
Pricing	From $21/user/mo + AI add-ons ($415-$699/mo)	Free (self-hosted)

Cost Comparison

For a team of 20 SREs on PagerDuty Business with AI features:

Line Item	PagerDuty	Aurora
Base platform	$41/user/mo x 20 = $820/mo	$0
AIOps	$699/mo	Included
PagerDuty Advance (GenAI)	$415/mo	Included
Status pages	$89/mo	Not included
Total	~$2,023/mo	$0 + infra + LLM API

Aurora's costs are infrastructure (a VM or K8s cluster) and LLM API usage. With Ollama running local models, the LLM cost is also $0.

Note: PagerDuty pricing verified from pagerduty.com/pricing as of March 2026. Aurora is free under Apache 2.0.

When to Use PagerDuty + Aurora Together

The strongest setup is running both:

PagerDuty receives alerts from your monitoring tools (Datadog, CloudWatch, Grafana)
PagerDuty pages the right on-call engineer via SMS/phone
Aurora receives the same alert via PagerDuty webhook (incident.triggered)
Aurora's AI agents investigate autonomously in the background
The on-call SRE opens Aurora and finds a completed RCA with root cause, timeline, and remediation
Aurora generates the postmortem and exports it to Confluence

PagerDuty handles the who and when. Aurora handles the why and how to fix it.

When Aurora Alone Might Be Enough

For smaller teams or budget-conscious organizations:

You don't need enterprise on-call — Your team is small enough that a simple rotation works
You already have alerting — Datadog, Grafana, or CloudWatch can send webhooks directly to Aurora
Investigation is your bottleneck — You're spending more time diagnosing than coordinating
You need self-hosted — Compliance or security requires keeping incident data on-premise
Budget is limited — PagerDuty + AI add-ons at $2,000+/mo isn't feasible

Aurora can ingest webhooks directly from any monitoring tool — PagerDuty is not required.

Getting Started

git clone https://github.com/Arvo-AI/aurora.git
cd aurora
make init
make prod-prebuilt

Configure your PagerDuty webhook to point at Aurora, add your cloud provider credentials, and investigations start automatically.