<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Causely</title>
    <description>The latest articles on Forem by Causely (@causely).</description>
    <link>https://forem.com/causely</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F8814%2Fbe76ae78-1f52-4c97-92aa-cdfec5be4fdd.png</url>
      <title>Forem: Causely</title>
      <link>https://forem.com/causely</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/causely"/>
    <language>en</language>
    <item>
      <title>How to Turn Slow Queries into Actionable Reliability Metrics with OpenTelemetry</title>
      <dc:creator>Severin Neumann</dc:creator>
      <pubDate>Wed, 04 Feb 2026 10:11:00 +0000</pubDate>
      <link>https://forem.com/causely/how-to-turn-slow-queries-into-actionable-reliability-metrics-with-opentelemetry-4blj</link>
      <guid>https://forem.com/causely/how-to-turn-slow-queries-into-actionable-reliability-metrics-with-opentelemetry-4blj</guid>
      <description>&lt;p&gt;Slow SQL queries degrade user experience, cause cascading failures, and turn simple operations into production incidents. The traditional fix? Collect more telemetry. But more telemetry means more things to look at, not necessarily more understanding.&lt;/p&gt;

&lt;p&gt;Instead of treating traces as a data stream we might analyze someday, we should be opinionated about what matters at the moment of decision. As we argued in &lt;a href="https://www.causely.ai/blog/the-signal-in-the-storm?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;em&gt;The Signal in the Storm&lt;/em&gt;&lt;/a&gt;, raw telemetry only becomes useful when we extract meaningful patterns.&lt;/p&gt;

&lt;p&gt;In this guide, you’ll build a repeatable workflow that turns OpenTelemetry database spans into span-derived metrics you can dashboard and alert on—so you can identify what’s slow, what matters most, and what just regressed.&lt;/p&gt;

&lt;p&gt;We’ll make this concrete with slow SQL queries, serving two use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Optimization&lt;/strong&gt;: Which queries yield the most value if made faster, weighted by traffic?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident response&lt;/strong&gt;: Which queries are behaving abnormally &lt;em&gt;right now&lt;/em&gt;?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll build a &lt;a href="https://github.com/causely-oss/slow-query-lab?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;lab&lt;/a&gt; where your app emits &lt;a href="https://opentelemetry.io/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; traces, and we distill those into actionable metrics, starting with simple slow query detection, then adding traffic-weighted impact, and finally anomaly detection.&lt;/p&gt;

&lt;p&gt;Want to skip the theory? Jump to the Lab. But the context helps you understand what you’re building.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes a Query Slow?
&lt;/h2&gt;

&lt;p&gt;“Slow” isn’t a single problem. It’s a symptom with fundamentally different causes. A 50ms query might be fine for a reporting dashboard but catastrophic for checkout. As &lt;a href="https://www.oreilly.com/library/view/high-performance-mysql/9781449332471/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;em&gt;High Performance MySQL&lt;/em&gt;&lt;/a&gt; emphasizes, understanding why a query is slow determines how to fix it. Here are the most common problems that may cause slow queries:&lt;/p&gt;

&lt;h3&gt;
  
  
  Excessive Work
&lt;/h3&gt;

&lt;p&gt;The database does more than necessary—typically full table scans due to missing or unusable indexes. Without an index on &lt;code&gt;customer_id&lt;/code&gt;, a simple &lt;code&gt;SELECT * FROM orders WHERE customer_id = $1&lt;/code&gt; grows from 20ms at 10K rows to minutes at 10M rows. The query didn’t change; the data volume did. See &lt;a href="https://use-the-index-luke.com/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;em&gt;Use The Index, Luke!&lt;/em&gt;&lt;/a&gt; for the fundamentals.&lt;/p&gt;

&lt;p&gt;Aggregations and joins compound this. Even indexed queries can explode when the planner misjudges cardinality and chooses the wrong join strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resource Contention
&lt;/h3&gt;

&lt;p&gt;Perfectly optimized queries can be slow when waiting for resources. Lock contention blocks queries until other transactions release rows. Connection pool exhaustion adds latency before the query even starts. A query spending 95% of its time waiting for locks won’t be fixed by query optimization—it needs transaction redesign.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environmental Pressure
&lt;/h3&gt;

&lt;p&gt;CPU saturation, I/O bottlenecks, and memory pressure can slow any query. The same SQL with the same plan performs completely differently under resource contention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Plan Regressions
&lt;/h3&gt;

&lt;p&gt;Performance degrades when execution plans change—even with identical queries and data. Parameter-sensitive plans optimize for one set of values but fail for others. Stale statistics after bulk loads cause the planner to choose terrible strategies. The &lt;a href="https://www.postgresql.org/docs/current/performance-tips.html?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;PostgreSQL Performance Tips&lt;/a&gt; documentation covers how to catch these regressions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pathological Patterns
&lt;/h3&gt;

&lt;p&gt;Some slowness doesn’t appear in slow query logs. The N+1 problem executes 100 fast queries (2ms each) sequentially, adding 200ms latency plus network overhead. No individual query is “slow,” but the pattern is catastrophic.&lt;/p&gt;
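&lt;p&gt;The pattern is easy to reproduce. A minimal sketch (SQLite again for portability; the albums table is illustrative) counts round trips rather than timing them:&lt;/p&gt;

```python
import sqlite3

# Sketch of the N+1 pattern: 100 individually fast queries vs. one batched query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE albums (id INTEGER PRIMARY KEY, artist_id INTEGER)")
conn.executemany("INSERT INTO albums (artist_id) VALUES (?)", [(i,) for i in range(100)])

artist_ids = list(range(100))

# N+1: one round trip per artist; each query is "fast", the sum is not
n_plus_one_round_trips = 0
for artist_id in artist_ids:
    conn.execute("SELECT id FROM albums WHERE artist_id = ?", (artist_id,))
    n_plus_one_round_trips += 1

# Batched: a single round trip with an IN list
placeholders = ",".join("?" * len(artist_ids))
conn.execute(f"SELECT id FROM albums WHERE artist_id IN ({placeholders})", artist_ids)

print(n_plus_one_round_trips)  # 100 round trips (~200ms at 2ms each) vs. 1
```

&lt;p&gt;Traces make this pattern visible where slow query logs cannot: the parent span shows 100 sequential child spans, even though no single child crosses any threshold.&lt;/p&gt;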

&lt;h2&gt;
  
  
  The Classic Workflow: DB-Native Tooling + Manual Triage
&lt;/h2&gt;

&lt;p&gt;Databases ship with excellent diagnostic tools: slow query logs, query stores like PostgreSQL’s &lt;a href="https://www.postgresql.org/docs/current/pgstatstatements.html?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;code&gt;pg_stat_statements&lt;/code&gt;&lt;/a&gt;, and plan inspection with &lt;code&gt;EXPLAIN&lt;/code&gt;. These tell you what’s expensive inside the database.&lt;/p&gt;
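&lt;p&gt;For example, &lt;code&gt;pg_stat_statements&lt;/code&gt; can rank normalized statements by aggregate cost. A typical triage query (column names as of PostgreSQL 13) looks like:&lt;/p&gt;

```sql
-- Top statements by total execution time across all callers.
SELECT query,
       calls,
       mean_exec_time,
       total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```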

&lt;p&gt;What they don’t provide is context. Which service triggered the slow query? Is it user-facing or background work? Does it correlate with the latency spike you’re investigating? You’re left with a list of slow queries and no signal about which ones matter most.&lt;/p&gt;

&lt;p&gt;Typically, someone bridges this gap manually: a developer notices a slow endpoint, brings the query to a DBA, and they optimize it together. This works, but that manual linking is exactly what we can automate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bringing Context to Slow Queries
&lt;/h2&gt;

&lt;p&gt;Database tools tell you what is slow, but not why it matters. A slow query pulled from the logs arrives stripped of exactly the context you need to judge its importance: the service, the endpoint, and the request that triggered it.&lt;/p&gt;

&lt;p&gt;Distributed traces provide this context. Each database span is embedded in a request context—it knows which service, endpoint, and user triggered it.&lt;/p&gt;

&lt;p&gt;Instead of correlating database logs and traces after the fact, we analyze slow queries directly from traces with all the application context built in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Building Blocks
&lt;/h2&gt;

&lt;p&gt;Now that we understand the philosophy and the value of context-rich traces, let’s look at the building blocks we’ll use to implement slow query analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Observability Stack
&lt;/h3&gt;

&lt;p&gt;For our lab, we use the &lt;a href="https://opentelemetry.io/docs/collector/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;OpenTelemetry Collector&lt;/a&gt; paired with &lt;a href="https://github.com/grafana/docker-otel-lgtm?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;docker-otel-lgtm&lt;/a&gt;—a pre-packaged stack from Grafana that bundles Loki, Grafana, Tempo, and Mimir in a single container. This gives us a complete observability environment with minimal setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Application
&lt;/h3&gt;

&lt;p&gt;Our sample application is a simple Go-based “Album API” that serves music album data from PostgreSQL. It’s intentionally designed to produce the kind of intermittent slow queries that are common in production. The service uses &lt;a href="https://github.com/XSAM/otelsql?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;otelsql&lt;/a&gt; to instrument database calls, emitting spans with the &lt;a href="https://opentelemetry.io/docs/specs/semconv/database/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;stable OpenTelemetry database semantic conventions&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Dashboards
&lt;/h3&gt;

&lt;p&gt;We’ll build three dashboards, each adding a layer of insight:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A simple view of the queries by duration&lt;/li&gt;
&lt;li&gt;Queries weighted by traffic to surface optimization opportunities&lt;/li&gt;
&lt;li&gt;Anomaly detection to identify queries deviating from their normal behavior&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Lab Setup
&lt;/h2&gt;

&lt;p&gt;Let’s put the theory into practice. We’ll clone a &lt;a href="https://github.com/causely-oss/slow-query-lab?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;sample application&lt;/a&gt;, start the observability stack, and explore three progressively more sophisticated approaches to slow query analysis. All you need is &lt;a href="https://www.docker.com/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; installed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clone and Run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/causely-oss/slow-query-lab
cd slow-query-lab
docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once running, open Grafana at &lt;a href="http://localhost:3001/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;http://localhost:3001&lt;/a&gt;—that’s where we’ll explore our dashboards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Queries by Duration
&lt;/h2&gt;

&lt;p&gt;The first dashboard takes the most direct approach: query Tempo for database spans and aggregate them to find queries that take the longest time. This is what you’d naturally build when you first start exploring traces for slow query analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  What It Shows
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Slow SQL - By Duration&lt;/strong&gt; dashboard queries traces directly using TraceQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ span.db.system != "" } | select(span.db.query.text, span.db.statement)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This finds all spans with database attributes, then uses Grafana transformations to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Group by&lt;/strong&gt; root operation (API endpoint) and SQL statement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregate&lt;/strong&gt; duration into mean, max, and count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sort by&lt;/strong&gt; average duration (slowest first)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is a table showing your slowest queries, which endpoints triggered them, and how often they occur.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17d2ukqw1j0kmt4vz8ox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17d2ukqw1j0kmt4vz8ox.png" alt="How to Turn Slow Queries into Actionable Reliability Metrics with OpenTelemetry" width="800" height="214"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Slowest queries by root operation&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What’s Good About This
&lt;/h3&gt;

&lt;p&gt;This approach gives you immediate visibility into queries with full application context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can see exactly which SQL statements are taking the most time&lt;/li&gt;
&lt;li&gt;You know which API endpoints trigger them&lt;/li&gt;
&lt;li&gt;You have the count to understand frequency&lt;/li&gt;
&lt;li&gt;You can click through to individual traces for debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s a real improvement over raw database logs, because you already see the application context that makes slow queries actionable.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Limitation
&lt;/h3&gt;

&lt;p&gt;Here’s the problem: sorting by average duration doesn’t tell you which queries matter most.&lt;/p&gt;

&lt;p&gt;Consider two queries:&lt;/p&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Avg Duration&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Complex report&lt;/td&gt;
&lt;td&gt;2.3s&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search&lt;/td&gt;
&lt;td&gt;150ms&lt;/td&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;The complex report is “slower” by average duration, so it appears first. But the search query, despite being faster on average, runs 2,000 times more often. Its aggregate impact on your users is far greater.&lt;/p&gt;

&lt;p&gt;This dashboard tells you what’s slow, but not what’s &lt;em&gt;impactful&lt;/em&gt;. For that, we need to consider traffic volume.&lt;/p&gt;
&lt;h2&gt;
  
  
  Traffic-Weighted Impact Analysis
&lt;/h2&gt;

&lt;p&gt;The second dashboard addresses this limitation by introducing an &lt;strong&gt;impact score&lt;/strong&gt;: the product of average duration and call count.&lt;/p&gt;
&lt;h3&gt;
  
  
  What It Shows
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Slow SQL - Traffic Weighted&lt;/strong&gt; dashboard uses the same TraceQL query but adds a calculated field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Impact = Avg Duration × Count
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple formula captures a key insight: a moderately slow query that runs thousands of times has more total impact than a very slow query that runs rarely. The dashboard sorts by impact score, surfacing the queries that matter most to your users.&lt;/p&gt;
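&lt;p&gt;Applied to the two queries from the earlier comparison, the ranking flips. A quick sketch:&lt;/p&gt;

```python
# Impact = avg duration x count, for the two queries compared earlier.
queries = [
    {"name": "Complex report", "avg_s": 2.3, "count": 5},
    {"name": "Search", "avg_s": 0.150, "count": 10_000},
]
for q in queries:
    q["impact_s"] = q["avg_s"] * q["count"]

ranked = sorted(queries, key=lambda q: q["impact_s"], reverse=True)
for q in ranked:
    print(f"{q['name']}: {q['impact_s']:.1f}s aggregate latency")
# The search query accumulates ~1500s of user-facing wait versus ~11.5s
# for the report, so it now sorts first.
```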

&lt;p&gt;The dashboard also adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service breakdown&lt;/strong&gt;: See which service triggered each query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency distribution&lt;/strong&gt;: Visualize duration over time, not just averages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top queries by impact&lt;/strong&gt;: A quick view of where to focus optimization efforts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnt2mkr8fmwq1msgf8d7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnt2mkr8fmwq1msgf8d7y.png" alt="How to Turn Slow Queries into Actionable Reliability Metrics with OpenTelemetry" width="800" height="212"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Highest impact queries by root operation&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What’s Good About This
&lt;/h3&gt;

&lt;p&gt;Traffic-weighted impact gives you a much better prioritization signal for optimization work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-volume, moderately-slow queries surface above rare-but-slow ones&lt;/li&gt;
&lt;li&gt;You can justify optimization work with concrete impact numbers&lt;/li&gt;
&lt;li&gt;The service and endpoint context helps you route issues to the right team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When someone asks “which slow queries should we optimize first?”, this dashboard gives you a defensible answer. It’s exactly what you need for planning performance improvements.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Limitation
&lt;/h3&gt;

&lt;p&gt;But this dashboard is for optimization, not incident response. Even with traffic-weighted impact, it can’t answer a critical question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“What has changed?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Suppose your search query has an impact score of 150,000. Is that normal? Is it higher than yesterday? Higher than last week? The dashboard shows you a snapshot of current state, but it has no concept of baseline.&lt;/p&gt;

&lt;p&gt;This matters enormously during incidents. When latency spikes, you don’t just want to know “search queries are slow”—you want to know “search queries are slower than normal”. You need to distinguish between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A query that’s always been slow (known behavior, maybe acceptable)&lt;/li&gt;
&lt;li&gt;A query that just became slow (new problem, needs investigation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without a baseline, every slow query looks the same. You’re left manually comparing current values to your memory of what’s “normal,” or digging through historical data to establish context.&lt;/p&gt;

&lt;p&gt;This is the gap that the third dashboard addresses.&lt;/p&gt;
&lt;h2&gt;
  
  
  Symptom Detection with Anomaly Baselines
&lt;/h2&gt;

&lt;p&gt;Because of these limitations, the third dashboard changes our approach: instead of just querying traces, we distill metrics from spans and then apply anomaly detection to identify deviations from normal behavior.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;p&gt;For this dashboard, we add the &lt;code&gt;spanmetrics&lt;/code&gt; connector to the OpenTelemetry Collector. Here’s the relevant part of the collector configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;connectors:
  spanmetrics:
    dimensions:
      - name: db.system
        default: "unknown"
      - name: db.query.text
      - name: db.statement
      - name: db.name
        default: "unknown"
    exemplars:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform, batch]
      exporters: [spanmetrics, otlphttp/lgtm]

    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [otlphttp/lgtm]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;spanmetrics&lt;/code&gt; connector examines every database span and generates histogram metrics for query latency, labeled by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;service_name&lt;/code&gt;: Which service made the query&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;db_system&lt;/code&gt;: Database type (postgresql)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;db_query_text&lt;/code&gt; or &lt;code&gt;db_statement&lt;/code&gt;: The SQL query&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;db_name&lt;/code&gt;: Database name&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics are stored in Mimir (the Prometheus-compatible backend in docker-otel-lgtm), where we can apply PromQL-based anomaly detection.&lt;/p&gt;
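&lt;p&gt;With the histograms in Mimir, standard PromQL applies. As a sketch, a per-query p95 looks roughly like this (the exact metric name is an assumption; it depends on the connector version and configured namespace):&lt;/p&gt;

```promql
# p95 latency per SQL statement over 5-minute windows.
# Metric name assumed; recent spanmetrics defaults expose something like
# traces_span_metrics_duration_milliseconds_bucket.
histogram_quantile(
  0.95,
  sum by (le, db_query_text) (
    rate(traces_span_metrics_duration_milliseconds_bucket{db_system="postgresql"}[5m])
  )
)
```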

&lt;h3&gt;
  
  
  Anomaly Detection with Adaptive Baselines
&lt;/h3&gt;

&lt;p&gt;The sample app includes Prometheus recording rules from Grafana’s &lt;a href="https://github.com/grafana/promql-anomaly-detection?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;PromQL Anomaly Detection&lt;/a&gt; framework. These rules calculate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Baseline&lt;/strong&gt;: A smoothed average of historical values (what’s “normal”)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upper band&lt;/strong&gt;: Baseline + N standard deviations (upper threshold)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower band&lt;/strong&gt;: Baseline - N standard deviations (lower threshold)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When current values exceed the bands, we have an anomaly—a clear signal that something has changed.&lt;/p&gt;
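&lt;p&gt;The band logic itself is small. A minimal sketch of the calculation (the framework’s actual rules compute baselines incrementally in PromQL over long windows; the window and N here are illustrative):&lt;/p&gt;

```python
from statistics import mean, stdev

# Band-based anomaly check: baseline = mean of a trailing window,
# bands = baseline +/- n_sigma standard deviations.
def is_anomalous(history: list[float], current: float, n_sigma: float = 3.0) -> bool:
    baseline = mean(history)
    band = n_sigma * stdev(history)
    return abs(current - baseline) > band

# Steady ~50ms latencies: 50ms stays inside the bands, 200ms does not.
history = [48, 52, 50, 49, 51, 50, 47, 53]
print(is_anomalous(history, 50))   # False
print(is_anomalous(history, 200))  # True
```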

&lt;h3&gt;
  
  
  What It Shows
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Slow SQL - Anomaly Detection&lt;/strong&gt; dashboard displays:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Current latency&lt;/strong&gt; plotted against the adaptive baseline bands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly indicators&lt;/strong&gt; when latency exceeds normal bounds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-query breakdown&lt;/strong&gt; so you can see which specific queries are anomalous&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight is the visual comparison: instead of just showing “p95 latency is 450ms”, it shows “p95 latency is 450ms, which is above the expected range of 200-350ms.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd3ya7iegcnsb9pmqwow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd3ya7iegcnsb9pmqwow.png" alt="How to Turn Slow Queries into Actionable Reliability Metrics with OpenTelemetry" width="800" height="256"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Query latency with anomaly bands&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Is Better
&lt;/h3&gt;

&lt;p&gt;This dashboard answers the question the previous one couldn’t: “What has changed?”&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A query that’s always slow (450ms baseline) won’t trigger anomalies when it runs at 450ms&lt;/li&gt;
&lt;li&gt;A query that’s normally fast (50ms baseline) will trigger anomalies if it suddenly runs at 200ms&lt;/li&gt;
&lt;li&gt;You get automatic context for what’s “normal” without maintaining manual thresholds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The anomaly detection acts as a symptom detector. It tells you: “This query is behaving differently than it usually does.” That’s a high-signal insight you can act on immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Metrics to Symptoms
&lt;/h3&gt;

&lt;p&gt;Notice what we’ve achieved with this architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Raw telemetry&lt;/strong&gt; (traces) flows from the application&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distillation&lt;/strong&gt; (spanmetrics connector) extracts metrics from those traces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly detection&lt;/strong&gt; (Prometheus rules) identifies deviations from baseline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symptoms&lt;/strong&gt; (anomalous queries) surface for investigation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We went from thousands of trace spans to a handful of anomaly signals that tell you exactly where to look.&lt;/p&gt;

&lt;h2&gt;
  
  
  Taking This to Production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Metric Cardinality
&lt;/h3&gt;

&lt;p&gt;Raw SQL in metric labels will explode your metrics backend—&lt;code&gt;SELECT * FROM orders WHERE customer_id = 12345&lt;/code&gt; becomes a separate series per customer. Use prepared statements (so instrumentation captures templates, not literals), normalize query text, or use &lt;code&gt;aggregation_cardinality_limit&lt;/code&gt; in the spanmetrics connector.&lt;/p&gt;
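&lt;p&gt;If prepared statements aren’t an option, query text can be normalized before it becomes a label, either in the application or in a Collector processor. A rough sketch of the idea (a production normalizer should use a SQL tokenizer rather than regexes):&lt;/p&gt;

```python
import re

# Collapse literals so each query template maps to one metric series.
def normalize(sql: str) -> str:
    sql = re.sub(r"'[^']*'", "?", sql)  # quoted string literals
    sql = re.sub(r"\b\d+\b", "?", sql)  # numeric literals
    return sql

a = normalize("SELECT * FROM orders WHERE customer_id = 12345")
b = normalize("SELECT * FROM orders WHERE customer_id = 67890")
print(a)       # SELECT * FROM orders WHERE customer_id = ?
print(a == b)  # True: both collapse into a single series
```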

&lt;h3&gt;
  
  
  Privacy
&lt;/h3&gt;

&lt;p&gt;SQL may contain sensitive data. The Collector is the ideal place to redact: drop or transform sensitive attributes before shipping downstream. This aligns with distillation: sanitize at the edge, not centrally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anomaly Detection Baseline
&lt;/h3&gt;

&lt;p&gt;Adaptive rules need 24-48 hours of data to establish baselines. Start with wider bands and tighten as confidence grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Remaining Gap: From Symptoms to Root Causes
&lt;/h2&gt;

&lt;p&gt;Even with anomaly detection, you’re still looking at symptoms. In real-world incident scenarios, especially in large environments, slow queries are just one of many symptoms that pop up at once. You’re not only trying to understand the cause of this one; you’re triaging a flood of alerts and correlating many symptoms to find the real root cause.&lt;/p&gt;

&lt;p&gt;When the dashboard shows “search query latency spiked,” you know something changed. But you don’t know &lt;em&gt;why&lt;/em&gt; it changed. The root cause might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A missing index after a schema migration&lt;/li&gt;
&lt;li&gt;Query plan regression due to stale statistics&lt;/li&gt;
&lt;li&gt;Lock contention from a concurrent batch job&lt;/li&gt;
&lt;li&gt;Resource pressure from a noisy neighbor on the database host&lt;/li&gt;
&lt;li&gt;Upstream service degradation causing retry storms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connecting the symptom (“search query is slow”) to the root cause (“index was dropped during last night’s migration”) requires causal reasoning—understanding the relationships between system components and tracing the chain of causation from effect back to cause.&lt;/p&gt;

&lt;p&gt;You can absolutely do this reasoning yourself. Look at deployment timestamps, check for schema changes, investigate resource metrics, correlate with other symptoms. Good engineers do this every day.&lt;/p&gt;

&lt;p&gt;But it’s manual, time-consuming, and doesn’t scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Going Beyond Symptoms with Causely
&lt;/h3&gt;

&lt;p&gt;This is where Causely comes in: Causely extracts slow queries (and other symptoms) as distilled insights out of the box—the same pattern we implemented manually. But it goes further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.causely.ai/getting-started/how-causely-works/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;strong&gt;Causal model&lt;/strong&gt;&lt;/a&gt;: Slow queries are connected into a model of your system’s dependencies. You can see what they &lt;em&gt;impact&lt;/em&gt; (which endpoints, which users) and what &lt;em&gt;causes&lt;/em&gt; them (resource constraints, upstream failures, configuration changes).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.causely.ai/in-action/root-causes/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;strong&gt;Root cause identification&lt;/strong&gt;&lt;/a&gt;: Instead of showing you a list of symptoms to investigate, Causely traces causation chains to identify the underlying root cause. “Search queries are slow &lt;em&gt;because&lt;/em&gt; the index was dropped.”&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.causely.ai/in-action/ask-causely/?ref=causely-blog.ghost.io#analyzing-slow-sql-queries" rel="noopener noreferrer"&gt;&lt;strong&gt;Actionable recommendations&lt;/strong&gt;&lt;/a&gt;: AskCausely helps you get to “what should we change?”—whether that’s adding an index, reverting a deployment, or addressing the upstream pressure that made the query slow in the first place.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern we built in this post—distill, detect anomalies, surface symptoms—is the foundation. Causely is the natural next step: turning symptoms into root causes at scale.&lt;/p&gt;

&lt;p&gt;Want to see how Causely connects your slow queries to their root causes? &lt;a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;Try it yourself&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudatbb35lga41agzbhmf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudatbb35lga41agzbhmf.png" alt="How to Turn Slow Queries into Actionable Reliability Metrics with OpenTelemetry" width="800" height="911"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Ask Causely about slow queries&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opentelemetry</category>
      <category>sql</category>
      <category>postgres</category>
    </item>
    <item>
      <title>When Asynchronous Systems Fail Quietly, Reliability Teams Pay the Price</title>
      <dc:creator>Severin Neumann</dc:creator>
      <pubDate>Wed, 28 Jan 2026 20:19:59 +0000</pubDate>
      <link>https://forem.com/causely/when-asynchronous-systems-fail-quietly-reliability-teams-pay-the-price-2h1c</link>
      <guid>https://forem.com/causely/when-asynchronous-systems-fail-quietly-reliability-teams-pay-the-price-2h1c</guid>
      <description>&lt;p&gt;In our previous post, &lt;a href="https://www.causely.ai/blog/queue-growth-dead-letter-queues-and-why-asynchronous-failures-are-easy-to-misread?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;u&gt;Queue Growth, Dead Letter Queues, and Why Asynchronous Failures Are Easy to Misread&lt;/u&gt;&lt;/a&gt;, we described a failure pattern that plays out repeatedly in modern systems built on asynchronous messaging. &lt;/p&gt;

&lt;p&gt;A queue starts to grow slowly. Nothing looks obviously broken at first. Publish calls are succeeding and consumers are still running, just not quite keeping up. Over time, messages begin to age out, and dead-letter queues start accumulating entries. Downstream services that depend on those messages begin to behave unpredictably. The result is partial data, delayed processing, and subtle customer-facing issues that are hard to tie back to a single event. By the time the impact is visible in latency or error rates elsewhere in the system, the original cause is buried several layers upstream and hours in the past.&lt;/p&gt;

&lt;p&gt;Teams do not miss these failures because they lack data. They miss them because the signals do not point clearly to the cause. &lt;/p&gt;

&lt;p&gt;Over the past several weeks, we’ve expanded Causely’s asynchronous and messaging queue capabilities to make these failures explicit, explainable, and actionable. This includes:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An expanded causal model for &lt;a href="https://docs.causely.ai/changelog/v1.0.108/?ref=causely-blog.ghost.io#expanded-messaging-queue-causal-model" rel="noopener noreferrer"&gt;&lt;u&gt;Amazon SNS, SQS, and RabbitMQ&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A new &lt;a href="https://docs.causely.ai/reference/root-causes/applications/?ref=causely-blog.ghost.io#producer-publish-rate-spike" rel="noopener noreferrer"&gt;&lt;u&gt;Producer Publish Rate Spike&lt;/u&gt;&lt;/a&gt; root cause&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.causely.ai/changelog/v1.0.109/?ref=causely-blog.ghost.io#expanded-causal-model-for-asynchronous-communications" rel="noopener noreferrer"&gt;&lt;u&gt;Queue Size Growth and Dead-Letter Queue&lt;/u&gt;&lt;/a&gt; added as first-class symptoms in our model&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Reliability Blind Spot in Messaging-Driven Architectures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Asynchronous communication is foundational to how modern systems scale. The same advantages that systems like Kafka and RabbitMQ provide, such as decoupling services and absorbing traffic spikes, also introduce new reliability challenges.&lt;/p&gt;

&lt;p&gt;The core issue is not that these systems fail quietly, but that cause and effect are separated. A producer can overload the system without returning errors. A broker can continue accepting traffic while consumers fall behind. By the time downstream symptoms appear, the triggering behavior has often already passed. &lt;/p&gt;

&lt;p&gt;For engineering managers, and for those on the front line of the on-call Slack channel, this creates a familiar and frustrating dynamic. Reliability degrades without a clear trigger. Incident response turns into a debate about whether the producer or consumer is responsible. Teams chase anomalies across dashboards while backlogs continue to grow. By the time a decisive action is taken, the customer impact is already real. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Traditional Observability Falls Short&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Metrics, logs, and traces are excellent at answering local questions. They tell you what a service is doing, how long an operation took, or how many messages are currently sitting in a queue. &lt;/p&gt;

&lt;p&gt;What they do not provide is causal understanding across asynchronous boundaries. &lt;/p&gt;

&lt;p&gt;In messaging-driven systems, cause and effect are separated in time and space. A spike in publish rate from one service may not create visible impact until hours later, in a different service, owned by a different team. A slow consumer may be the result of downstream backpressure rather than a defect in the consumer itself. Dead-letter queues tell you that messages failed, but not why the system reached that state. &lt;/p&gt;

&lt;p&gt;Without a causal model of how producers, exchanges, queues, and consumers interact, teams are forced to infer failures indirectly. That inference is slow, fragile, and heavily dependent on tribal knowledge. Under pressure, it leads to overcorrection, unnecessary rollbacks, and missed root causes. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Expanding the Causal Model for Messaging Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To close this gap, we have significantly expanded Causely’s causal model for asynchronous messaging systems. &lt;/p&gt;

&lt;p&gt;Rather than treating queues as opaque buffers, Causely now models messaging infrastructure the way it actually operates in production. Producers, exchanges, queues, and consumers are represented as distinct entities with explicit relationships and data flows. This applies across common technologies, including Amazon SQS, Amazon SNS, and RabbitMQ, whether used in simple queue mode or exchange-based pub/sub patterns. &lt;/p&gt;

&lt;p&gt;By modeling the topology directly, Causely can reason about how work enters the system, how it is routed, where it accumulates, and how pressure propagates across services. This makes it possible to explain failures that previously required intuition and guesswork. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7idzjqnyjqzk0d62avnv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7idzjqnyjqzk0d62avnv.png" alt="When Asynchronous Systems Fail Quietly, Reliability Teams Pay the Price" width="800" height="488"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Causely's Dataflow Map makes it easy for engineers to understand how data moves between the services, exchanges, and queues that make up Amazon SQS, Amazon SNS, and RabbitMQ deployments&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Making Queue Growth and Dead-Letter Failures First-Class Signals&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We have also expanded the causal model to treat queue size growth and dead-letter queue activity as first-class symptoms, not secondary indicators. &lt;/p&gt;

&lt;p&gt;This changes how asynchronous failures are diagnosed. Instead of surfacing queue metrics as passive signals, Causely reasons about them causally, linking backlog growth and dead-letter events directly to the producers, consumers, and operations involved. &lt;/p&gt;

&lt;p&gt;As a result, queue-related failures are no longer inferred indirectly from downstream latency or error spikes. The failure mode is explicit, explainable, and traceable to the point where intervention is most effective.  &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A New Root Cause: Producer Publish Rate Spike&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One of the most common and least understood asynchronous failure modes is a sudden change in publish behavior. Causely now includes a dedicated root cause for this pattern: Producer Publish Rate Spike. &lt;/p&gt;

&lt;p&gt;This occurs when a service, HTTP path, or RPC method begins publishing messages at a significantly higher rate than normal. The increase may be triggered by a code change, a configuration update, or an unexpected shift in traffic patterns. Downstream queues absorb the initial surge, but consumers cannot keep up indefinitely. Queue depth grows, message age increases, and backpressure begins to affect the rest of the system. &lt;/p&gt;

&lt;p&gt;What makes this failure particularly dangerous is that the producer often looks healthy. Publish requests succeed, error rates remain low, and nothing appears obviously wrong at the source. Without causal reasoning, teams frequently blame consumers or infrastructure capacity, missing the true trigger entirely. &lt;/p&gt;

&lt;p&gt;Causely now detects this condition explicitly. It ties unexpected increases in publish rate to queue growth, consumer pressure, and downstream service degradation, making the failure both visible and explainable. &lt;/p&gt;
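&lt;p&gt;Conceptually, detecting a publish rate spike means comparing the current rate to a learned baseline. Here is a minimal sketch of that idea in Python; the &lt;code&gt;window&lt;/code&gt; and &lt;code&gt;factor&lt;/code&gt; knobs are invented for illustration, and Causely's actual detection is more sophisticated than a rolling mean:&lt;/p&gt;

```python
from collections import deque

def spike_detector(window=5, factor=3.0):
    # Flag samples that far exceed a rolling mean of recent rates.
    # Toy illustration: a real detector must handle seasonality, noise,
    # and burst tolerance; 'window' and 'factor' are invented knobs.
    history = deque(maxlen=window)
    def check(rate):
        baseline = sum(history) / len(history) if history else rate
        history.append(rate)
        return len(history) > 1 and rate > factor * baseline
    return check

check = spike_detector()
samples = [100, 105, 98, 102, 100, 450]  # publishes/sec; last one spikes
flags = [check(r) for r in samples]      # only the final sample is flagged
```

&lt;p&gt;The point of the sketch is the failure mode it mirrors: every individual publish succeeds, so nothing errors at the source, yet the rate change is clearly anomalous relative to recent history.&lt;/p&gt;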

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjh5yrkwhbcbc0gn2ab3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjh5yrkwhbcbc0gn2ab3.png" alt="When Asynchronous Systems Fail Quietly, Reliability Teams Pay the Price" width="800" height="382"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Understanding the cause of increased queue depths and the resulting performance degradation&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What This Changes for Reliability Teams&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For teams responsible for revenue-critical services, these capabilities change how asynchronous failures are handled in practice. &lt;/p&gt;

&lt;p&gt;Instead of reacting after queues are saturated and customers are impacted, teams can see which producer initiated the failure, how pressure propagated through the messaging system, and where intervention will have the greatest effect. Slow consumers, misconfigured routing, and unexpected publish spikes are distinguished clearly rather than conflated into a single “queue issue.” &lt;/p&gt;

&lt;p&gt;This shortens incident response, reduces unnecessary mitigation, and eliminates the finger-pointing that often arises when failures span multiple teams. More importantly, it enables a proactive reliability posture in systems that are constantly changing. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Asynchronous Reliability Without Guesswork&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Asynchronous architectures are essential for scale, but they demand a different approach to reliability than synchronous request paths. &lt;/p&gt;

&lt;p&gt;With its expanded messaging and asynchronous causal model, Causely provides deterministic, explainable reasoning over how data flows through your system. Teams do not need to stitch together dashboards to reconstruct timelines after the fact. They do not need to trust black-box AI summaries that cannot explain their conclusions. They no longer have to exhaustively eliminate possibilities to arrive at a root cause. &lt;/p&gt;

&lt;p&gt;Instead, they get clear answers to the questions that matter most: what is breaking, why it is breaking, and where to act first to protect reliability and revenue.&lt;/p&gt;

</description>
      <category>causely</category>
      <category>async</category>
      <category>sre</category>
    </item>
    <item>
      <title>Alerts Aren’t the Investigation</title>
      <dc:creator>Severin Neumann</dc:creator>
      <pubDate>Thu, 22 Jan 2026 16:59:39 +0000</pubDate>
      <link>https://forem.com/causely/alerts-arent-the-investigation-161j</link>
      <guid>https://forem.com/causely/alerts-arent-the-investigation-161j</guid>
      <description>&lt;p&gt;PagerDuty fires: CheckoutAPI burn rate (2m/1h). Grafana shows p99 going from ~120ms to ~900ms. Retries doubled. DB CPU is flat, but checkout pods are throttling and a downstream dependency’s error budget is evaporating. Ten minutes in, you’ve collected artifacts, not understanding.&lt;/p&gt;

&lt;p&gt;If you’ve been on call, you’ve seen this movie.&lt;/p&gt;

&lt;p&gt;This is also why plenty of “AI-powered observability” rollouts still don’t change the lived experience of on-call. Leadership expects response times to improve. On-call gets richer dashboards, smarter summaries, and more plausible explanations. To be fair, those do help with faster lookup and briefing in the room, but the social reality stays the same: alerts get silenced in PagerDuty, rules get tagged as “flappy,” and old pages keep firing long after anyone can justify what they were meant to protect. The problem isn’t effort. It’s that the page still doesn’t reliably collapse into shared understanding. Call it the Page-to-Understanding Gap: the time and coordination cost of turning a threshold into a system story.&lt;/p&gt;

&lt;p&gt;Alerts are supposed to start an investigation. Too often, they start translation: &lt;em&gt;what is the system doing right now?&lt;/em&gt; That translation slows containment, splinters context, and stretches customer impact. &lt;/p&gt;

&lt;p&gt;That decoding work is what incident response depends on, yet it’s rarely made explicit. It’s why MTTR gains often plateau even after teams invest heavily in monitoring and dashboards. &lt;/p&gt;

&lt;h2&gt;
  
  
  Alerts are a paging interface, not a language of explanation
&lt;/h2&gt;

&lt;p&gt;Alerting is optimized for one job: interrupt a human at the right moment. &lt;/p&gt;

&lt;p&gt;So alerts are built from what’s easiest to express at scale: thresholds, proxy signals, and rules that “usually work.” They encode operational history, not system truth. &lt;/p&gt;

&lt;p&gt;That’s not a failure of alerting. It’s what alerting is for, but it also makes alerts a shaky foundation for understanding. They’re a simplified label over messy reality. When alerts aren’t grounded in clear definitions of “healthy” behavior, the signal loses meaning and teams stop trusting it. &lt;/p&gt;

&lt;h2&gt;
  
  
  One alert name can mean multiple different realities
&lt;/h2&gt;

&lt;p&gt;Take a familiar page: “latency high,” or the modern equivalent: an SLO burn rate page. &lt;/p&gt;

&lt;p&gt;Burn rate fires on checkout. You assume checkout is slow, start in the service dashboard, see p99 up, then notice retries doubled. Meanwhile, Slack is already split: “DB” vs. “checkout.” Fifteen minutes later you realize a downstream dependency is brownouting and checkout is just drowning in retries. The giveaway is usually the shape: one hop shows rising timeouts and retry storms while upstream looks "healthy" until it saturates. The alert didn’t lie—it just didn’t tell you what you needed first. &lt;/p&gt;

&lt;p&gt;The page looks identical. The mechanism isn’t. &lt;/p&gt;

&lt;p&gt;So teams build muscle memory around “what usually causes this,” and it works—until the system changes just enough that it stops working. Scale and change are exactly what modern organizations optimize for, so the failure mode is guaranteed. When that happens, the alert doesn’t just wake you up. It points you in the wrong direction. &lt;/p&gt;
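&lt;p&gt;For readers less familiar with burn-rate paging, the arithmetic behind a multi-window rule like the one above is simple. A minimal sketch, with an illustrative SLO target and thresholds (not a recommendation):&lt;/p&gt;

```python
def burn_rate(bad, total, slo=0.999):
    # Burn rate 1.0 means spending the error budget exactly on pace;
    # a sustained 14x burn would exhaust a 30-day budget in about two days.
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo  # allowed failure fraction
    return (bad / total) / error_budget

# A '2m/1h' rule pages only when BOTH windows burn fast, which filters
# out short blips. All numbers below are made up for illustration.
fast = burn_rate(bad=120, total=10_000)     # last 2 minutes: ~12x
slow = burn_rate(bad=2_400, total=200_000)  # last hour: ~12x
page = fast > 10 and slow > 10              # hypothetical threshold
```

&lt;p&gt;Note what the rule measures: budget consumption, not mechanism. A retry storm from a browning-out dependency and a genuinely slow checkout service produce the same page.&lt;/p&gt;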

&lt;h2&gt;
  
  
  Many alerts can describe the same underlying behavior
&lt;/h2&gt;

&lt;p&gt;The reverse problem happens just as often. A single degradation creates a cascade of pages across services: latency, errors, saturation, queueing, burn rate alarms. Each is technically “true,” but treating them as separate problems creates thrash. &lt;/p&gt;

&lt;p&gt;People split into parallel investigations, duplicate context gathering, and argue about which page is “the real one.” Context fragments across Slack threads and war rooms, ownership ping‑pongs, and escalations get noisy. By the time you agree which page is “primary,” you’ve already created a coordination incident. The outcome isn’t just wasted engineer time—it’s that nobody has one shared narrative everyone can repeat while impact is unfolding. The incident becomes less about understanding and more about sorting competing signals. This is how alert fatigue turns into incident fatigue. &lt;/p&gt;

&lt;h2&gt;
  
  
  The silent gap: important behavior you don’t alert on
&lt;/h2&gt;

&lt;p&gt;The costliest failures are the slow degradations that ship impact before they page. &lt;/p&gt;

&lt;p&gt;A dependency gets a little slower. Retries creep up. One critical route starts timing out for a slice of customers. Averages look fine. Thresholds don’t trip. You don’t notice until the page fires, or until customers do. &lt;/p&gt;

&lt;p&gt;It’s not that teams don’t care. It’s that these behaviors don’t fit neatly into alert rules: risk builds up, dependencies decay, partial impact hides in aggregates, and propagation only makes sense once you’ve traced it end to end. &lt;/p&gt;

&lt;p&gt;Most alerts only get defined after you’ve understood the behavior in the middle of an incident. The post-mortem produces a new rule and a brief feeling of closure. Then traffic shifts, dependencies evolve, and the next incident arrives with a different shape. You’re never done. &lt;/p&gt;

&lt;p&gt;So teams find these late, after impact is already underway, when time is most expensive. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why teams don’t switch investigation entry points
&lt;/h2&gt;

&lt;p&gt;When teams adopt a new investigation system, they often ask a simple question: “Does it match our alerts?” &lt;/p&gt;

&lt;p&gt;What they’re really asking is: “Can I trust this in the first 90 seconds?” Because in the first minute, the primary goal isn’t elegance; it’s not making things worse.&lt;/p&gt;

&lt;p&gt;A system that generates more hypotheses doesn’t help if it can’t connect the page to what the system is doing in a way the on-call trusts. &lt;/p&gt;

&lt;p&gt;If the system describes an incident in a different language than the alert model engineers rely on, mismatches get interpreted as duplication, contradiction, or risk. The result is predictable: people consult it late, after they’ve already committed to a direction. &lt;/p&gt;

&lt;p&gt;Even correct insights arrive too late to change behavior. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why this problem is getting worse
&lt;/h2&gt;

&lt;p&gt;Systems are becoming more dynamic: more dependencies, faster deploys, and more integration points. Deploy frequency keeps climbing—and AI-assisted coding is only accelerating it—so the number of failure paths keeps growing. Meanwhile, alert fatigue is already high, and teams are hesitant to change workflows mid-incident. &lt;/p&gt;

&lt;p&gt;Better tooling can speed up lookup and correlation, but it can’t compensate for an alerting model that no longer maps cleanly to real system behavior. &lt;/p&gt;

&lt;p&gt;So the interpretation workload keeps rising. Every page demands more interpretation, more cross-checking, more manual stitching of symptoms into a coherent story. &lt;/p&gt;

&lt;h2&gt;
  
  
  What’s actually broken
&lt;/h2&gt;

&lt;p&gt;Most organizations are operating with two different languages: the language of paging and the language of understanding. The persistent MTTR plateau is the Page-to-Understanding Gap between them.&lt;/p&gt;

&lt;p&gt;Incidents start with the first, but the work happens in the second. &lt;/p&gt;

&lt;h2&gt;
  
  
  A better way to think about alerts
&lt;/h2&gt;

&lt;p&gt;Alerts are not the investigation. They’re a notification that something is going sideways. &lt;/p&gt;

&lt;p&gt;The goal is not to tune thresholds until the noise feels tolerable. It’s to shorten the time from page to shared understanding: what behavior is emerging, what changed, what’s being impacted—and whether it matters to the business. &lt;/p&gt;

&lt;p&gt;Treating that translation as unwritten know‑how is not a workflow quirk. It’s a structural weakness. If your incident response starts with decoding alerts, you’re spending your best engineers on interpretation instead of containment.&lt;/p&gt;

</description>
      <category>alerts</category>
      <category>causely</category>
      <category>investigation</category>
    </item>
    <item>
      <title>Queue Growth, Dead-Letter Queues, and Why Asynchronous Failures Are Easy to Misread</title>
      <dc:creator>Severin Neumann</dc:creator>
      <pubDate>Tue, 20 Jan 2026 19:18:45 +0000</pubDate>
      <link>https://forem.com/causely/queue-growth-dead-letter-queues-and-why-asynchronous-failures-are-easy-to-misread-ng2</link>
      <guid>https://forem.com/causely/queue-growth-dead-letter-queues-and-why-asynchronous-failures-are-easy-to-misread-ng2</guid>
      <description>&lt;p&gt;Asynchronous pipelines sit at the core of most modern systems. Message brokers accept traffic, consumers process it in the background, and downstream services depend on the results.&lt;/p&gt;

&lt;p&gt;When these systems fail, the failure rarely shows up where it starts.&lt;/p&gt;

&lt;p&gt;Teams often notice stale data, degraded behavior, or latency spikes elsewhere in the system. By the time those symptoms appear, the underlying problem has usually been present for some time.&lt;/p&gt;

&lt;p&gt;In many real-world failures, two signals appear earlier: &lt;strong&gt;queue growth&lt;/strong&gt; and &lt;strong&gt;dead-letter queues&lt;/strong&gt;. They are widely monitored, but they are still widely misunderstood.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The common misunderstanding&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Queues are often treated as infrastructure components rather than behavioral signals. When a queue grows, it is attributed to load. When messages land in a DLQ, it is treated as a retry policy doing its job. Investigation tends to focus downstream, where symptoms are visible.&lt;/p&gt;

&lt;p&gt;This framing obscures where asynchronous systems actually break. In many failures, message brokers continue to accept traffic normally. Producers succeed. Nothing looks obviously down. The problem is that consumers are no longer able to keep up reliably or consistently. That distinction matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Queue growth is not just volume&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Queue growth occurs when messages arrive faster than they can be processed successfully over time. This does not require a traffic spike. It can result from: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumers slowing due to code changes or resource pressure &lt;/li&gt;
&lt;li&gt;Dependencies becoming latent or unreliable &lt;/li&gt;
&lt;li&gt;Retry rates increasing &lt;/li&gt;
&lt;li&gt;Backpressure failing to engage &lt;/li&gt;
&lt;li&gt;Partition skew concentrating work unevenly &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these cases, the broker behaves correctly. Messages are accepted. The queue grows quietly. What is accumulating is &lt;strong&gt;lag&lt;/strong&gt;. A sustained backlog means work is no longer flowing through the system at the intended rate, even if no explicit failures are visible yet.  &lt;/p&gt;
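&lt;p&gt;The arithmetic of quiet backlog growth is worth making concrete. A toy fluid model, with invented rates, shows how a modest consumer slowdown compounds into a large backlog with no traffic spike at all:&lt;/p&gt;

```python
def backlog_over_time(arrival_rate, service_rate, minutes):
    # Queue depth when arrivals outpace successful processing.
    # Rates are messages/minute; this ignores burstiness and retries.
    depth, series = 0.0, []
    for _ in range(minutes):
        depth = max(0.0, depth + arrival_rate - service_rate)
        series.append(depth)
    return series

steady = backlog_over_time(1000, 1000, 60)  # keeps up: depth stays at 0
slowed = backlog_over_time(1000, 950, 60)   # 5% slower: 3,000 queued in 1h
```

&lt;p&gt;A 5% slowdown is invisible in success-rate dashboards, yet the backlog grows linearly for as long as the imbalance persists.&lt;/p&gt;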

&lt;h2&gt;
  
  
  &lt;strong&gt;Why this matters before anything looks broken&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Asynchronous systems are designed to absorb instability. Queues buffer mismatches. Retries smooth over failures. Backlogs delay visible impact. This is useful, but it also postpones feedback. As queues grow: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing time increases &lt;/li&gt;
&lt;li&gt;Derived state falls behind &lt;/li&gt;
&lt;li&gt;Downstream services operate on increasingly stale or incomplete data &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The transition from “degraded” to “broken” often appears sudden because the system has been accumulating lag for some time before any external threshold is crossed. &lt;/p&gt;
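&lt;p&gt;One way to quantify “falling behind” is Little's law: under steady FIFO draining, the wait for a newly enqueued message is roughly queue depth divided by throughput. A sketch with invented numbers:&lt;/p&gt;

```python
def message_age_minutes(queue_depth, throughput_per_min):
    # Approximate wait for a newly enqueued message (Little's law),
    # assuming FIFO order and a steady drain rate.
    return queue_depth / throughput_per_min

# A 30,000-message backlog draining at 950 msgs/min means new work
# waits over half an hour before it is even picked up.
age = message_age_minutes(30_000, 950)  # roughly 31.6 minutes
```

&lt;p&gt;This is why message age is often a better early signal than queue depth alone: the same depth is benign at high throughput and an incident at low throughput.&lt;/p&gt;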

&lt;h2&gt;
  
  
  &lt;strong&gt;Dead-letter queues signal a different failure&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Dead-letter queues exist to capture messages that cannot be processed successfully. Messages land there after repeated failures, timeouts, or deterministic errors. DLQs prevent infinite retries and protect the main pipeline. What they represent is not transient instability, but &lt;strong&gt;persistent processing failure under current system behavior&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;A non-empty DLQ means some class of messages cannot be handled as the system is currently operating. That incompatibility can come from: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broken contracts between producers and consumers &lt;/li&gt;
&lt;li&gt;Partial or skewed deployments &lt;/li&gt;
&lt;li&gt;Schema drift &lt;/li&gt;
&lt;li&gt;Unhandled edge cases &lt;/li&gt;
&lt;li&gt;Dependencies that fail consistently rather than intermittently &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DLQs often grow alongside backlogs, but they can also appear independently. &lt;/p&gt;
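&lt;p&gt;The mechanics that land a message in a DLQ can be sketched in a few lines. This is a toy in-process loop; real brokers track delivery counts and redelivery for you, and the handler and &lt;code&gt;max_retries&lt;/code&gt; value here are invented. It shows why a deterministic contract break always ends up dead-lettered, no matter how many retries are allowed:&lt;/p&gt;

```python
def consume(messages, handler, max_retries=3):
    # Retry each message up to max_retries, then dead-letter it.
    processed, dlq = [], []
    for msg in messages:
        for _attempt in range(max_retries):
            try:
                processed.append(handler(msg))
                break
            except Exception:
                continue  # a deterministic failure will never succeed
        else:
            dlq.append(msg)  # retries exhausted: persistent failure
    return processed, dlq

def handler(msg):
    return msg["id"]  # raises KeyError if the producer contract is broken

ok, dead = consume([{"id": 1}, {"legacy_id": 2}, {"id": 3}], handler)
```

&lt;p&gt;Retries only help transient failures; for schema drift or broken contracts, they just delay the inevitable trip to the DLQ.&lt;/p&gt;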

&lt;h2&gt;
  
  
  &lt;strong&gt;Why these problems are so common&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In real systems, producers and consumers evolve independently. Load shifts. Dependencies degrade. Retry behavior changes system dynamics in non-obvious ways. It is common for: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Brokers to continue accepting traffic &lt;/li&gt;
&lt;li&gt;Queues to grow steadily &lt;/li&gt;
&lt;li&gt;Consumers to fail intermittently or slow down &lt;/li&gt;
&lt;li&gt;Processing failures to accumulate quietly &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Operationally, queues and DLQs sit between services. They rarely have clear ownership. They are easy to monitor superficially and hard to reason about in context. As a result, many teams only notice these issues once downstream behavior degrades.  &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Queue growth and DLQs are related, but distinct&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Queue growth and DLQs are often discussed together, but they answer different questions. &lt;/p&gt;

&lt;p&gt;Queue growth asks: "Are messages flowing through the system fast enough?"&lt;/p&gt;

&lt;p&gt;DLQs ask:  "Are some messages failing to be processed at all?"&lt;/p&gt;

&lt;p&gt;In many incidents, sustained queue growth precedes DLQs. Consumers slow down, retries increase, and retry limits are eventually exceeded. In others, DLQs appear immediately due to deterministic processing failures, even while queue depth looks healthy. Treating one as a proxy for the other creates blind spots. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The deeper diagnostic challenge&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most teams diagnose asynchronous failures indirectly. They look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency spikes &lt;/li&gt;
&lt;li&gt;Error rates &lt;/li&gt;
&lt;li&gt;Timeouts &lt;/li&gt;
&lt;li&gt;User-visible symptoms &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those signals matter, but they are downstream effects. Earlier and more precise signals exist inside the message pipeline itself: where messages are accepted, where they slow down, and where they fail to be processed reliably. When those signals are ignored or misinterpreted, teams spend time chasing symptoms rather than isolating where the workflow is actually breaking.  &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A better way to think about queues&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Queues are not just buffers. Queue growth is not harmless backlog. Dead-letter queues are not operational exhaust. They are indicators of whether asynchronous workflows are functioning as intended or quietly degrading under real conditions. &lt;/p&gt;

&lt;p&gt;The goal is not to watch queue depth. It is to continuously understand flow: where work accumulates, why it accumulates, and which downstream interactions it impacts. &lt;/p&gt;

&lt;p&gt;Understanding them is not an optimization. It is foundational to operating reliable, event-driven systems.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>microservices</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Slight Reliability EP 113: AI Use-cases for SRE with Shmuel Kliger</title>
      <dc:creator>Severin Neumann</dc:creator>
      <pubDate>Mon, 12 Jan 2026 19:31:00 +0000</pubDate>
      <link>https://forem.com/causely/slight-reliability-ep-113-ai-use-cases-for-sre-with-shmuel-kliger-38ge</link>
      <guid>https://forem.com/causely/slight-reliability-ep-113-ai-use-cases-for-sre-with-shmuel-kliger-38ge</guid>
      <description>&lt;p&gt;From the day we invented computers we've been struggling to keep applications running and delivering services to the business. Is this latest wave of AI helping or hurting us?  &lt;/p&gt;

&lt;p&gt;This week I'm joined by Causely founder Shmuel Kliger to dive into...  &lt;/p&gt;

&lt;p&gt;🌊 The three waves of AI hype over the decades (the history of AI)&lt;br&gt;&lt;br&gt;
☠️ The dangers of over-promising and under-delivering what AI can do&lt;br&gt;&lt;br&gt;
🧠 What is causal reasoning?&lt;br&gt;&lt;br&gt;
😱 Is AI replacing SREs?&lt;br&gt;&lt;br&gt;
🔮 AI as a way to allow humans to solve higher level problems&lt;/p&gt;

&lt;p&gt;Find the full conversation on YouTube: &lt;a href="https://www.youtube.com/watch?v=e1L9YE7igz4" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=e1L9YE7igz4&lt;/a&gt;&lt;/p&gt;

</description>
      <category>causality</category>
      <category>devopssre</category>
      <category>podcast</category>
    </item>
    <item>
      <title>Causely Expands Datadog Integration to Deliver Causal Intelligence Across Hybrid Environments</title>
      <dc:creator>Severin Neumann</dc:creator>
      <pubDate>Mon, 22 Dec 2025 14:43:25 +0000</pubDate>
      <link>https://forem.com/causely/causely-expands-datadog-integration-to-deliver-causal-intelligence-across-hybrid-environments-1mn8</link>
      <guid>https://forem.com/causely/causely-expands-datadog-integration-to-deliver-causal-intelligence-across-hybrid-environments-1mn8</guid>
      <description>&lt;p&gt;Causely is expanding its &lt;a href="https://www.datadoghq.com/?ref=dev.to"&gt;Datadog&lt;/a&gt; integration to address a problem every senior engineering team eventually runs into: observability data keeps growing, but confidence during incidents does not. Even with Datadog APM, infrastructure metrics, and monitors deployed everywhere, engineers are still forced to interpret symptoms and argue about which change or dependency actually caused an outage. The issue is not missing telemetry. It is the lack of a system-level understanding of cause and effect. &lt;/p&gt;

&lt;p&gt;This limitation becomes especially visible in modern, hybrid architectures. Services span Kubernetes clusters, standalone EC2 instances, ECS tasks, and legacy infrastructure, all connected through real production traffic. Datadog can surface signals across these environments, but understanding how failures propagate across those boundaries remains a manual, error-prone exercise. The result is slower recovery, repeated incidents, and reduced confidence in change. &lt;/p&gt;

&lt;p&gt;With this &lt;a href="https://docs.causely.ai/telemetry-sources/datadog/?ref=dev.to"&gt;expanded Datadog integration&lt;/a&gt;, Causely gives teams a unified, causal model of their entire application across Kubernetes and non-Kubernetes environments. This model explains &lt;em&gt;why&lt;/em&gt; services are impacted, not just &lt;em&gt;where&lt;/em&gt; symptoms appear. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;From Observability Signals to System Understanding&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Many teams already rely on Datadog APM, infrastructure metrics, and monitors as the backbone of their observability stack. With Causely’s expanded support, those same Datadog signals can now be used to build a complete and accurate causal model of the system without changing existing instrumentation. &lt;/p&gt;

&lt;p&gt;Causely supports Datadog APM dual shipping, which allows trace data to be sent directly from the Datadog collector into Causely’s mediator. Teams continue using Datadog exactly as they do today, while Causely consumes the same traces for causal reasoning. This approach avoids additional agents, avoids data duplication, and does not introduce new egress costs. &lt;/p&gt;
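&lt;p&gt;For reference, the Datadog Agent's dual-shipping mechanism is configured via &lt;code&gt;additional_endpoints&lt;/code&gt; under &lt;code&gt;apm_config&lt;/code&gt; in &lt;code&gt;datadog.yaml&lt;/code&gt;. The endpoint URL and key below are placeholders, not real Causely values; consult the Causely and Datadog documentation for the exact settings:&lt;/p&gt;

```yaml
# datadog.yaml (sketch): keep shipping APM traces to Datadog as usual,
# and additionally ship a copy of the same traces to a second endpoint.
apm_config:
  additional_endpoints:
    "https://mediator.example.internal":  # placeholder mediator URL
      - "placeholder-api-key"
```

&lt;p&gt;Because the Agent fans the data out itself, no second tracing library or sidecar is involved and existing instrumentation is untouched.&lt;/p&gt;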

&lt;p&gt;Just as importantly, Causely now supports services running outside Kubernetes and keeps causality intact across the hybrid boundary. By tagging Datadog APM traces with host identity metadata, Causely can stitch together services running on EC2 with those running inside Kubernetes clusters. What previously broke at environment boundaries becomes a single, end-to-end behavioral model of how the application actually runs in production. &lt;/p&gt;

&lt;p&gt;Datadog monitors can also be ingested directly into Causely and treated as symptoms rather than conclusions. Instead of reacting to alerts in isolation, Causely uses them as signals that inform its understanding of what is happening in the system and why. That’s how you get faster convergence, fewer false leads, and higher confidence in the fix. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A Real-World Hybrid Application Scenario&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Consider a typical production application. Customer-facing APIs and frontend services run in a Kubernetes cluster. Background workers, billing services, or legacy processing jobs run on standalone EC2 instances. The application depends on shared infrastructure such as Postgres, Redis, and external APIs. Datadog is already deployed across all of it. &lt;/p&gt;

&lt;p&gt;Under normal conditions, everything appears healthy. Then, during a traffic spike, latency starts creeping up in one of the Kubernetes services. Shortly after, Datadog monitors begin firing for elevated error rates in downstream components. Engineers open dashboards, inspect traces, and try to correlate timelines across environments. The symptoms are visible, but the cause is not obvious. &lt;/p&gt;

&lt;p&gt;This is where Causely changes the workflow. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flel7ctz3v4lt5c090wpj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flel7ctz3v4lt5c090wpj.png" alt="Causely Expands Datadog Integration to Deliver Causal Intelligence Across Hybrid Environments" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Causely leverages Datadog APM dual shipping as input to build its own model of service dependencies, infrastructure, and data flows.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sqhz4c7t7hk4tmglwoo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sqhz4c7t7hk4tmglwoo.png" alt="Causely Expands Datadog Integration to Deliver Causal Intelligence Across Hybrid Environments" width="800" height="784"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Causely leverages Datadog APM dual shipping as input to build its own model of service dependencies, infrastructure, and data flows.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbv0gu4csgbftjdpvmr1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbv0gu4csgbftjdpvmr1.png" alt="Causely Expands Datadog Integration to Deliver Causal Intelligence Across Hybrid Environments" width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Causely leverages Datadog APM dual shipping as input to build its own model of service dependencies, infrastructure, and data flows.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Automatically Pinpointing the True Cause&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Using Datadog traces, infrastructure metadata, and monitor events, Causely continuously reconstructs end-to-end request paths and dependencies across Kubernetes and EC2. That continuity holds across environment boundaries, so you don’t have to manually stitch together “what talks to what” in the middle of an incident. Instead of reacting to individual alerts, Causely builds and maintains a behavioral model of the entire system. This model captures how services, infrastructure, and data flows interact, and how specific failure modes produce observable symptoms. &lt;/p&gt;

&lt;p&gt;Datadog APM traces provide the raw evidence of system behavior, including service interactions, request paths, and downstream dependencies. Datadog monitors are ingested and mapped as symptoms within Causely’s knowledge base. Together, these signals allow Causely to maintain an up-to-date causal model that explicitly links observed symptoms to the conditions and changes that produced them. &lt;/p&gt;
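&lt;p&gt;For readers wondering how the same traces reach both systems: Datadog’s dual shipping is configured in the Agent, which can forward APM traffic to additional endpoints alongside Datadog’s own intake. A sketch of the relevant &lt;code&gt;datadog.yaml&lt;/code&gt; fragment follows; the receiver URL and key are placeholders, so check the Datadog dual-shipping documentation and Causely’s integration guide for the exact values.&lt;/p&gt;

```yaml
# datadog.yaml (fragment) -- endpoint and key are hypothetical placeholders.
# additional_endpoints fans APM traffic out to a second receiver
# in addition to the default Datadog intake.
apm_config:
  additional_endpoints:
    "https://causely-gateway.example.internal":   # placeholder receiver URL
      - "<API_KEY>"                               # key expected by that receiver
```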

&lt;p&gt;Because this causal model is updated continuously, Causely can explain not just what is failing, but what changed first, how the impact propagated, and why specific services or endpoints are affected. The result is a precise, system-level explanation of performance degradation that teams can act on immediately. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzegws0ssmfcnybt5s41z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzegws0ssmfcnybt5s41z.png" alt="Causely Expands Datadog Integration to Deliver Causal Intelligence Across Hybrid Environments" width="800" height="256"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Using the system-level understanding, Causely applies its knowledge base of known failure patterns to infer the exact root cause driving service degradation.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;From Incident Response to Reliability Assurance&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This expanded Datadog integration is not just about faster root cause analysis during incidents. By continuously modeling system behavior, Causely enables teams to validate reliability before changes reach production, monitor how reliability evolves over time, and detect drift caused by infrastructure or configuration changes. &lt;/p&gt;

&lt;p&gt;Modern systems are hybrid by default, and reliability problems do not respect environment boundaries. To operate confidently at scale, teams need more than visibility. They need to understand how their systems behave and why failures occur. &lt;/p&gt;

&lt;p&gt;With expanded Datadog support across Kubernetes and EC2, Causely helps teams move from alert-driven firefighting to causal reliability engineering. The result is fewer war rooms, faster resolution, and the confidence to ship changes without fear. &lt;/p&gt;

&lt;p&gt;To learn more about using Causely with Datadog, &lt;a href="https://docs.causely.ai/telemetry-sources/datadog/?ref=dev.to"&gt;explore the integration guide&lt;/a&gt; or &lt;a href="https://www.causely.ai/try?ref=dev.to"&gt;reach out to see a unified service graph in action&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>causely</category>
      <category>datadog</category>
      <category>integration</category>
    </item>
    <item>
      <title>Thank You, FluxCD: How it helps us, and how you can use it too!</title>
      <dc:creator>Severin Neumann</dc:creator>
      <pubDate>Tue, 16 Dec 2025 17:47:11 +0000</pubDate>
      <link>https://forem.com/causely/thank-you-fluxcd-how-it-helps-us-and-how-you-can-use-it-too-1o81</link>
      <guid>https://forem.com/causely/thank-you-fluxcd-how-it-helps-us-and-how-you-can-use-it-too-1o81</guid>
      <description>&lt;p&gt;The second post in our “thank you” series, just in time for the end of the year.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.causely.ai/blog/thank-you-grafana-beyla-how-to?ref=dev.to"&gt;In the first one, we said thanks to Grafana for donating Beyla&lt;/a&gt; and making it easier for teams to get to usable telemetry quickly. This time we want to zoom out to something that quietly runs under the hood at Causely every day: GitOps with &lt;a href="https://fluxcd.io/?ref=dev.to"&gt;FluxCD&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Causely is a member of the &lt;a href="https://www.cncf.io/?ref=dev.to"&gt;Cloud Native Computing Foundation (CNCF)&lt;/a&gt;. That’s not just a logo on the website for us: our entire product and our own operations lean heavily on CNCF projects. We build on &lt;a href="https://opentelemetry.io/?ref=dev.to"&gt;OpenTelemetry&lt;/a&gt;, we run on &lt;a href="https://kubernetes.io/?ref=dev.to"&gt;Kubernetes&lt;/a&gt;, and we &lt;a href="https://www.causely.ai/blog/eating-our-own-dog-food-causelys-journey-with-opentelemetry-causal-ai?ref=dev.to"&gt;dogfood our own reliability engine&lt;/a&gt; against that stack.&lt;/p&gt;

&lt;p&gt;Another key piece in that puzzle is FluxCD. It is what takes “the desired state in git” and makes it true in our clusters, repeatedly. It’s the heartbeat behind our weekly releases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Flux Helps
&lt;/h2&gt;

&lt;p&gt;If you’re operating modern Kubernetes environments, you’ve probably felt this tension. On one hand, you want velocity: teams push changes constantly, services multiply, and configurations evolve all the time. On the other hand, every manual &lt;code&gt;kubectl apply&lt;/code&gt; is a potential one-off change that no one can fully reconstruct later. Over time, clusters drift away from whatever was last written down as “how things should be,” and you are left relying on muscle memory and shell history.&lt;/p&gt;

&lt;p&gt;Flux solves exactly that problem by turning git into the control surface and reacting to changes within seconds of a merge.&lt;/p&gt;
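&lt;p&gt;In Flux terms, that control surface is a pair of resources: a source pointing at the repository and a reconciler pointing at a path inside it. A minimal sketch, with hypothetical names and URL:&lt;/p&gt;

```yaml
# Flux polls the repository every minute and applies whatever is
# under ./clusters/prod, pruning resources that were removed from git.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/example/platform-gitops   # hypothetical repo
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: prod
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: platform
  path: ./clusters/prod
  prune: true
```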

&lt;p&gt;For us, that has very practical consequences.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Releases are commits, not hand-crafted ceremonies.&lt;/li&gt;
&lt;li&gt;A new Causely version is rolled out by changing a tag or a value in git; Flux notices the change within seconds, reconciles the cluster, and either converges to the new state or loudly tells us why it couldn’t.&lt;/li&gt;
&lt;li&gt;Environments stay in sync because the same manifests back our test clusters, staging, and production. The differences between them are intentional and visible in overlays, not hidden in one-off fixes on a live production cluster.&lt;/li&gt;
&lt;li&gt;And drift becomes a signal, not a mystery: when the cluster does not match git, Flux shows it, which turns the familiar “what changed?” question into a quick investigation rather than a full-blown incident archaeology.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The net effect is that we can move quickly without giving up control. Our reliability engine depends on a stable substrate; Flux helps us keep it that way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices We’ve Learned Using Flux
&lt;/h2&gt;

&lt;p&gt;We didn’t get there on day one. It took a set of habits to turn FluxCD from a cool project into a core platform primitive.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Treat manifests like product code&lt;/li&gt;
&lt;li&gt;Keep Kustomize overlays boring&lt;/li&gt;
&lt;li&gt;Watch Flux like any other production controller&lt;/li&gt;
&lt;li&gt;Make promotion a path, not an event&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Treat manifests like product code
&lt;/h3&gt;

&lt;p&gt;All of our Kubernetes manifests live in git. Not most of them, not “the important ones” – all of them. That sounds obvious, but it changes behavior in subtle ways.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reviews happen before things break, because a change to a HelmRelease or a Kustomize overlay goes through the same review process as a feature change.&lt;/li&gt;
&lt;li&gt;The commit history becomes an operational log: when we see a strange spike in errors, we can line it up against recent git changes, including configuration tweaks that would otherwise live only in someone’s bash history.&lt;/li&gt;
&lt;li&gt;Our issue tracker also stays connected to reality, because we reference issue numbers in commit messages, so the question “why did we change this setting?” always has a direct link back to the discussion that justified it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the early days, we still made the occasional manual change in production, usually in the name of speed. Those changes always came back to haunt us as confusing states that no one could fully explain. Once we committed to git as the only source of truth and forced ourselves to route every change through it, the platform became much more predictable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Keep Kustomize overlays boring
&lt;/h3&gt;

&lt;p&gt;We use &lt;a href="https://kustomize.io/?ref=dev.to"&gt;Kustomize&lt;/a&gt; to manage environment-specific differences across test clusters, staging, production, and chaos environments. The rule we eventually settled on is simple: overlays describe differences, not alternative universes.&lt;/p&gt;

&lt;p&gt;In practice, that means we maintain a clean base with shared resources such as namespaces, common &lt;a href="https://fluxcd.io/flux/components/helm/helmreleases/?ref=dev.to"&gt;HelmReleases&lt;/a&gt;, and shared configuration. On top of that base, we keep the environment overlays as thin as possible. They patch what truly needs to change, such as cluster names, resource limits, or a particular feature flag, rather than redefining whole stacks.&lt;/p&gt;
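&lt;p&gt;Concretely, a thin overlay is mostly references plus a couple of targeted patches. A hypothetical staging overlay might look like this (release name and value are illustrative):&lt;/p&gt;

```yaml
# clusters/staging/kustomization.yaml (hypothetical layout):
# the overlay inherits everything from the shared base and only
# patches what genuinely differs in staging.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: HelmRelease
      name: my-service        # hypothetical release name
    patch: |
      - op: replace
        path: /spec/values/replicaCount
        value: 2              # staging runs fewer replicas than prod
```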

&lt;p&gt;Whenever we tried to be clever with external references or overlays that diverged heavily from one another, troubleshooting became harder. Keeping overlays compact and predictable means we can scan a diff and understand at a glance what will change in a given cluster. Before committing, we render Kustomize configs locally as a quick sanity check that catches typos and misaligned paths before Flux has to complain about them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Watch Flux like any other production controller
&lt;/h3&gt;

&lt;p&gt;GitOps is not “set and forget.” Flux is a control loop running in production and, when it is unhappy, your platform will slowly drift.&lt;/p&gt;

&lt;p&gt;We treat Flux like a critical controller. We watch reconciliation health and consider a stuck HelmRelease or Kustomization as important as any failing deployment. When Flux cannot talk to git, or when an apply keeps failing, that is something we alert on rather than something we notice days later in a dashboard. And when “nothing seems to be changing” in a cluster despite recent commits, Flux logs are one of the first places we look.&lt;/p&gt;
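&lt;p&gt;Flux’s notification controller makes this kind of alerting straightforward: reconciliation failures can be routed to chat the same way you alert on failing deployments. A sketch, with hypothetical provider, channel, and secret names:&lt;/p&gt;

```yaml
# Alert on every error-level event from any Kustomization or
# HelmRelease, delivered via a Slack webhook stored in a Secret.
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: platform-alerts        # hypothetical channel
  secretRef:
    name: slack-webhook           # Secret holding the webhook address
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: reconciliation-failures
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: HelmRelease
      name: '*'
```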

&lt;p&gt;This mindset becomes even more important when GitOps extends beyond just core applications and starts to manage your observability stack, gateways, and even Causely itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Make promotion a path, not an event
&lt;/h3&gt;

&lt;p&gt;Flux really shines when you treat deployments as a series of git-based promotions instead of isolated production pushes. A typical Causely release starts with a change landing in a test environment: we use clusters like test1 and test2 for this. We verify that the change behaves as expected there, including how it interacts with telemetry and Causely’s own reasoning about incidents. Once we are happy, we promote the same change to staging by updating the relevant overlay or values. Only after staging behaves as expected do we roll the change into production.&lt;/p&gt;
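&lt;p&gt;In git terms, a promotion like this is often a one-line change per environment, for example bumping a pinned image tag in the target overlay (image name and tags here are hypothetical):&lt;/p&gt;

```yaml
# clusters/staging/kustomization.yaml (fragment): promoting the
# candidate from test to staging is a reviewed one-line diff.
images:
  - name: ghcr.io/example/my-service   # hypothetical image
    newTag: v1.4.0                     # bumped from v1.3.2 after test sign-off
```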

&lt;p&gt;Alongside this path, we maintain dedicated chaos clusters, chaos1 and chaos2, where we deliberately break things to see how the system responds. Because everything flows through git, we can rehearse failure modes without fear of leaving behind strange manual fixes that only exist on one cluster. Keeping cluster-specific configuration isolated and well documented is what allows us to run realistic experiments in those chaos clusters without letting that complexity bleed into production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try Flux On Your Own
&lt;/h2&gt;

&lt;p&gt;To really understand Flux, it helps to feel git driving your cluster. The smallest useful experiment is a git repository, a local Kubernetes cluster, and Flux bootstrapped from that repository. The nice part is that &lt;code&gt;flux bootstrap&lt;/code&gt; already does most of the heavy lifting: it creates the repository, installs the controllers, and wires everything together for you.&lt;/p&gt;

&lt;p&gt;You can run the following guide on your laptop with &lt;a href="https://kind.sigs.k8s.io/?ref=dev.to"&gt;kind&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Start by installing the Flux CLI. The easiest way is via the official install script; if you prefer Homebrew, apt, or other package managers, the Flux documentation lists those options as well in the &lt;a href="https://fluxcd.io/flux/installation/?ref=dev.to"&gt;Flux installation guide&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -s https://fluxcd.io/install.sh | sudo bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, export your GitHub credentials so Flux can authenticate and create the repository for you. If you are logged in with the GitHub CLI (&lt;code&gt;gh auth login&lt;/code&gt;), you can derive both the user name and token directly from it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export GITHUB_USER="$(gh api user --jq '.login')"
export GITHUB_TOKEN="$(gh auth token)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, create a local Kubernetes cluster and verify that Flux can run there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind create cluster --name flux-playground
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When your cluster is ready, bootstrap Flux into it. This command will create a repository called &lt;code&gt;flux-playground-gitops&lt;/code&gt; under your GitHub account, install Flux into the &lt;code&gt;flux-system&lt;/code&gt; namespace, and configure it to track the &lt;code&gt;./clusters/flux-playground&lt;/code&gt; path in that repo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flux bootstrap github \
  --owner=$GITHUB_USER \
  --repository=flux-playground-gitops \
  --branch=main \
  --path=./clusters/flux-playground \
  --personal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clone the newly created repository to your machine and change into it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gh repo clone flux-playground-gitops
cd flux-playground-gitops
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You are now ready to define the &lt;a href="https://opentelemetry.io/docs/demo/?ref=dev.to"&gt;OpenTelemetry demo&lt;/a&gt; as a git-managed workload by adding a manifest for the demo under &lt;code&gt;clusters/flux-playground&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt; clusters/flux-playground/oteldemo.yaml &amp;lt;&amp;lt;'EOF'
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: open-telemetry
  namespace: flux-system
spec:
  interval: 1m
  url: https://open-telemetry.github.io/opentelemetry-helm-charts
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: otel-demo
  namespace: flux-system
spec:
  interval: 1m
  chart:
    spec:
      chart: opentelemetry-demo
      sourceRef:
        kind: HelmRepository
        name: open-telemetry
        namespace: flux-system
  targetNamespace: otel-demo
  install:
    createNamespace: true
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point your git repository fully describes both Flux itself and the OpenTelemetry demo that Flux will deploy.&lt;/p&gt;

&lt;p&gt;Commit and push these changes so Flux can reconcile the new state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git add .
git commit -m "Add OpenTelemetry demo via Flux"
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Within seconds of the push, Flux will see the new revision, apply the changes, and start rolling out the OpenTelemetry demo. You can watch the pods come up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;watch kubectl get pods -n otel-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a few minutes you will see the OpenTelemetry demo microservices starting in the &lt;code&gt;otel-demo&lt;/code&gt; namespace.&lt;/p&gt;

&lt;p&gt;You now have a real application being managed by Flux from git: the desired state lives in a repository, Flux reconciles it into the cluster within seconds of your merge, and you never had to run &lt;code&gt;kubectl apply&lt;/code&gt; for the actual deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Flux + Causely: GitOps All the Way Down
&lt;/h2&gt;

&lt;p&gt;If you followed the small lab above, you already have a local cluster, Flux installed, and the OpenTelemetry demo running under GitOps control. From there, adding &lt;a href="https://www.causely.ai/?ref=dev.to"&gt;Causely&lt;/a&gt; is just one more git-driven change. &lt;/p&gt;

&lt;p&gt;Conceptually, the flow is simple. You obtain a Causely access token, store it as a Kubernetes Secret, and add the Causely FluxCD manifests to the same git repository that Flux already manages. Git drives the rollout, Flux reconciles it into the cluster, the OpenTelemetry demo generates realistic behavior, and Causely explains what is going on. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.causely.ai/installation/flux/?ref=dev.to"&gt;Our documentation&lt;/a&gt; has a full FluxCD installation guide with more background and variations. The example below is meant to be something you can copy and adapt directly from your existing &lt;code&gt;flux-playground-gitops&lt;/code&gt; setup. &lt;/p&gt;

&lt;p&gt;First, retrieve your Causely &lt;a href="https://portal.causely.app/?ref=dev.to"&gt;access token from Causely&lt;/a&gt; and keep it handy. Then, create a namespace for Causely and a Kubernetes Secret with your token. The Causely FluxCD manifests expect a secret named &lt;code&gt;causely-secrets&lt;/code&gt; and use Flux's native post-build substitution to inject &lt;code&gt;CAUSELY_TOKEN&lt;/code&gt; into the &lt;code&gt;HelmRelease&lt;/code&gt;, so the token never has to be committed to git:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create namespace causely 
kubectl create secret generic causely-secrets \ 
  --from-literal=CAUSELY_TOKEN=your-actual-gateway-token-here \ 
  -n causely 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
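&lt;p&gt;For context, post-build substitution is configured on the Flux Kustomization that applies the manifests: it references the Secret, and &lt;code&gt;${CAUSELY_TOKEN}&lt;/code&gt; placeholders in the rendered manifests are filled in at apply time. A generic sketch follows; the resource names are hypothetical, and the copied Causely manifests define the real ones, so follow the Causely docs for the exact layout.&lt;/p&gt;

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: causely               # hypothetical name
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./clusters/flux-playground/causely
  prune: true
  postBuild:
    substituteFrom:
      - kind: Secret          # every ${CAUSELY_TOKEN} placeholder is
        name: causely-secrets # replaced with the value from this Secret
```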



&lt;p&gt;Next, from inside your GitOps repository, clone the public &lt;a href="https://github.com/causely-oss/causely-deploy?ref=dev.to"&gt;causely-deploy repository&lt;/a&gt; and copy the FluxCD manifests into your own cluster configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd flux-playground-gitops 
git clone https://github.com/causely-oss/causely-deploy.git  
mkdir -p clusters/flux-playground/causely 
cp causely-deploy/kubernetes/fluxcd/causely/*.yaml clusters/flux-playground/causely/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this setup, your git repository now contains everything Flux needs to deploy Causely, and you can commit and push these changes so Flux can reconcile the new state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git add clusters/flux-playground 
git commit -m "Add Causely via Flux" 
git push origin main 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the push, Flux notices the change, applies the new manifests, and starts deploying Causely into your cluster. You can watch it the same way you watched the OpenTelemetry demo roll out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flux get kustomizations -A 
kubectl get pods -n causely
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the Causely pods are healthy, you can return to the Causely portal, where the cluster you just configured will appear. Over time, topology fills in and, as issues arise, you will see root cause views associated with the services in your demo. &lt;/p&gt;

&lt;p&gt;At that point, you have a complete loop on a single laptop: Git drives change, Flux applies it, the OpenTelemetry demo generates behavior, and Causely explains what happens when things go wrong. If you want more variations, production-grade knobs, or to run this across multiple clusters, &lt;a href="https://docs.causely.ai/installation/flux/?ref=dev.to"&gt;the FluxCD installation guide&lt;/a&gt; in our docs walks through additional options in detail. &lt;/p&gt;

&lt;h2&gt;
  
  
  Closing: Thank You, Flux
&lt;/h2&gt;

&lt;p&gt;FluxCD is a great example of the kind of infrastructure we love in the CNCF ecosystem. It nudges teams toward good habits, turns the question “how did this get here?” into something with a clear, auditable answer, and helps keep complex Kubernetes estates boring and predictable.&lt;/p&gt;

&lt;p&gt;As a CNCF member building on OpenTelemetry, Kubernetes, and the wider cloud-native stack, we are genuinely grateful for projects like Flux that quietly raise the floor for everyone.&lt;/p&gt;

&lt;p&gt;So: thank you to the Flux maintainers and community for building and maintaining an engine that lets us practice what we preach about control, desired state, and autonomous reliability. If you are running Kubernetes and still relying on manual deploys or one-off scripts, Flux is worth a serious look.&lt;/p&gt;

&lt;p&gt;And if you want to see what happens when you combine GitOps with causal reasoning, we are always happy to show you how Causely fits into that picture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.causely.ai/try?ref=dev.to"&gt;Book a demo today -&amp;gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>fluxcd</category>
      <category>opentelemetry</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Causely Named a Gartner Cool Vendor in AI for IT Operations 2025</title>
      <dc:creator>Severin Neumann</dc:creator>
      <pubDate>Tue, 09 Dec 2025 20:25:00 +0000</pubDate>
      <link>https://forem.com/causely/causely-named-a-gartner-cool-vendor-in-ai-for-it-operations-2025-18g8</link>
      <guid>https://forem.com/causely/causely-named-a-gartner-cool-vendor-in-ai-for-it-operations-2025-18g8</guid>
      <description>&lt;p&gt;We are excited to share that Gartner has named Causely a &lt;a href="https://www.gartner.com/en/documents/7233330?ref=dev.to"&gt;&lt;u&gt;Cool Vendor for AI in IT Operations for 2025&lt;/u&gt;&lt;/a&gt;. For us, this recognition reflects what we’re seeing across engineering and operations teams everywhere. Systems are changing faster than traditional tools can keep up, and teams need a more reliable way to understand how their applications behave as they evolve. &lt;/p&gt;

&lt;p&gt;At the pace modern cloud-native systems move, reacting after symptoms appear is not enough. Teams need a reliability operating system that works continuously alongside their applications, maintains an up-to-date understanding of how everything fits together, and provides the context required for safe automation and proactive reliability. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why this recognition matters&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Modern cloud-native applications evolve constantly. Every deploy, configuration change, traffic surge, and infrastructure adjustment can reshape system behavior. The pace is so fast that even the best teams struggle to reason about what is happening and why. &lt;/p&gt;

&lt;p&gt;Traditional observability dashboards can show what happened after symptoms appear, and AI copilots can help speed up triage, but reacting after the fact isn’t good enough for business-critical applications. &lt;/p&gt;

&lt;p&gt;Teams need something that runs continuously alongside their systems, understands how everything fits together, and helps them see how changes will affect performance before they land in production. That’s the foundation for proactive reliability, not just faster incident response. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What Gartner recognized about Causely&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Gartner recognized Causely for taking a fundamentally different approach to reliability in modern systems. Instead of reacting to symptoms after they spread, Causely maintains a &lt;a href="https://www.causely.ai/blog/causal-reasoning-the-missing-piece-to-service-reliability?ref=dev.to"&gt;&lt;u&gt;live causality graph&lt;/u&gt;&lt;/a&gt; that reflects how services, dependencies, and performance constraints relate to one another as the environment evolves. By continuously analyzing telemetry, Causely identifies the underlying driver behind emerging changes in golden signals, including the code change, configuration update, or operational event that first introduced risk, even when failures cascade across multiple services. &lt;/p&gt;

&lt;p&gt;This continuous causal inference is what enables proactive reliability. Causely provides clear direction on where to focus and what action is most likely to reduce performance risk, long before issues escalate. The same causal model supports both pre-production and production, helping teams understand how behavior will shift during testing, rollout, and real-world load. &lt;/p&gt;

&lt;p&gt;Causely runs locally as a lightweight, containerized system and &lt;a href="https://www.causely.ai/blog/demystifying-automatic-instrumentation?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;u&gt;processes telemetry without exporting raw data&lt;/u&gt;&lt;/a&gt;. This eliminates the need for central pipelines, avoids sampling or data volume constraints, and gives teams high-fidelity insight that integrates directly into their engineering workflows through APIs, webhooks, and an MCP server. This structured context supports both human decisions and safe automation. &lt;/p&gt;

&lt;p&gt;For us, this recognition validates the direction we have been building toward. The future of reliability is proactive, predictive, and grounded in causal understanding. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Who should care?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Causely is designed for teams responsible for building and operating modern distributed systems. It gives engineering and operations organizations a clearer understanding of how system behavior changes over time and how those changes affect reliability. Whether preparing a release, managing growth, or diagnosing unexpected behavior, teams need deeper clarity to make confident decisions. Continuous causal inference provides that clarity. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Looking ahead&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We’re grateful to Gartner for this recognition and excited about what it represents. Reliability is entering a new chapter. Systems are more dynamic, &lt;a href="https://www.linkedin.com/pulse/when-ai-overwhelms-your-architecture-machine-load-new-shergilashvili-zhw5f/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;u&gt;AI workloads are growing&lt;/u&gt;&lt;/a&gt;, and teams need deeper clarity to keep everything running smoothly. &lt;/p&gt;

&lt;p&gt;Causely’s mission is to provide that reliability. Continuous causal inference helps teams prevent issues before they escalate, support high-velocity engineering without sacrificing reliability, and give both humans and automation the context they need to act safely. &lt;/p&gt;

&lt;p&gt;We’re excited for what’s ahead and proud to help shape the future of reliable, intelligent, and resilient systems. &lt;/p&gt;




&lt;p&gt;Want to learn what this could look like for your organization? &lt;a href="https://www.causely.ai/try?ref=dev.to"&gt;&lt;u&gt;Get started with Causely today.&lt;/u&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Gartner subscribers can &lt;a href="https://www.gartner.com/en/documents/7233330?ref=dev.to"&gt;&lt;u&gt;view the full report&lt;/u&gt;&lt;/a&gt; for more information.&lt;/p&gt;

</description>
      <category>causely</category>
      <category>ai</category>
      <category>gartner</category>
    </item>
    <item>
      <title>Announcing Reliability Delta: Clear, Objective Insight into Whether Your Release Made Your System Better or Worse</title>
      <dc:creator>Severin Neumann</dc:creator>
      <pubDate>Thu, 04 Dec 2025 21:23:59 +0000</pubDate>
      <link>https://forem.com/causely/announcing-reliability-delta-clear-objective-insight-into-whether-your-release-made-your-system-e</link>
      <guid>https://forem.com/causely/announcing-reliability-delta-clear-objective-insight-into-whether-your-release-made-your-system-e</guid>
      <description>&lt;p&gt;Your team has been grinding for days, tuning a critical service to improve performance without lighting your cloud bill on fire. It’s the kind of systemic change you can’t hand off to an AI coding agent. After countless reviews, experiments, and late nights, the update is finally in production. &lt;/p&gt;

&lt;p&gt;You take a breath. Maybe even consider sleeping. &lt;/p&gt;

&lt;p&gt;Then Slack lights up: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Did it work?” — CTO&lt;/strong&gt;  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You stare at dashboards. Nothing’s red. But you still don’t actually know: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did reliability improve, or quietly regress? &lt;/li&gt;
&lt;li&gt;Did the change shift bottlenecks or introduce new stress points? &lt;/li&gt;
&lt;li&gt;Are you now closer to the edge under peak load? &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a 50 to 100+ microservice environment with dense service-to-service dependencies, even small regressions can cascade silently. And slowing down isn’t an option. Leadership needs faster delivery and fewer incidents. &lt;/p&gt;

&lt;p&gt;This is exactly why we built Reliability Delta. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Reliability Delta: A Deterministic Answer to “Did This Change Make Things Better or Worse?”&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Reliability Delta turns subjective guesswork (like manual diffing of dashboards, correlation hunts, “nobody is complaining” anecdotes) into clear, evidence-based reliability signals. &lt;/p&gt;

&lt;p&gt;It’s powered by Causely’s continuously updated understanding of your environment: &lt;/p&gt;

&lt;h3&gt;
  
  
  Causality Mapping
&lt;/h3&gt;

&lt;p&gt;Causely builds a Bayesian network that models how issues propagate across services, enabling true cause-and-effect visibility. &lt;/p&gt;

&lt;h3&gt;
  
  
  Attribute Dependency Graph
&lt;/h3&gt;

&lt;p&gt;A DAG of functional dependencies generated from live topology and Causely’s attribute models, highlighting how attributes influence one another. &lt;/p&gt;

&lt;p&gt;These models allow Causely to compare two snapshots—two releases, two load tests, or two moments in time—and determine whether system behavior: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improved &lt;/li&gt;
&lt;li&gt;Regressed &lt;/li&gt;
&lt;li&gt;Or shifted in ways you need to investigate &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: deterministic signals engineers can trust.&lt;/p&gt;
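&lt;p&gt;As a rough illustration of the snapshot comparison described above, the sketch below classifies each service's change between two snapshots as improved, regressed, or stable. The function name, the single p99-latency metric, and the 5% tolerance are illustrative assumptions, not Causely's actual API or algorithm.&lt;/p&gt;

```python
# Hypothetical sketch of a before/after snapshot comparison.
# compare_snapshots, the p99-latency metric, and the 5% tolerance are
# illustrative assumptions, not Causely's API or algorithm.

def compare_snapshots(before, after, tolerance=0.05):
    """Classify each service's p99 latency change between two snapshots."""
    verdicts = {}
    for service, baseline in before.items():
        current = after.get(service)
        if current is None:
            continue  # service absent from the newer snapshot
        change = (current - baseline) / baseline
        if change > tolerance:
            verdicts[service] = "regressed"
        elif -change > tolerance:
            verdicts[service] = "improved"
        else:
            verdicts[service] = "stable"
    return verdicts
```

&lt;p&gt;A real comparison would weigh many attributes through the causal models above rather than a single metric, but the verdict shape is the same.&lt;/p&gt;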

&lt;h2&gt;
  
  
  &lt;strong&gt;Use Cases for Reliability Delta&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;1. Validate Every Release Instantly&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Know immediately whether your change introduced risk. &lt;/p&gt;

&lt;p&gt;Feature flags and canaries help, but they don’t guarantee safety. What matters is whether the system is behaving normally. &lt;/p&gt;

&lt;p&gt;Reliability Delta automatically surfaces: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Behavior changes isolated to a specific flag, tenant, or traffic segment &lt;/li&gt;
&lt;li&gt;Downstream effects in pipelines, async jobs, and data flows &lt;/li&gt;
&lt;li&gt;Subtle regressions that don’t trip alerts but violate known patterns of normal &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t “no alerts fired = good.” &lt;/p&gt;

&lt;p&gt;This is evidence-based release confidence.  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;2. Understand Load Test Results Beyond Pass/Fail&lt;/em&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“With Causely’s Reliability Delta, we can quantify how each release behaves under identical load. It surfaces changes in bottlenecks, stress patterns, and causal relationships that traditional load tests miss. At our scale, having that level of confidence before shipping is critical.”&lt;/em&gt; - &lt;strong&gt;Cade Moore, Performance Engineering Lead at Hard Rock Digital&lt;/strong&gt;  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Did this release push you closer to the breaking point? &lt;/p&gt;

&lt;p&gt;A load test passing doesn’t mean you’re safe. &lt;/p&gt;

&lt;p&gt;Reliability Delta shows: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How bottlenecks shifted compared to last time &lt;/li&gt;
&lt;li&gt;Whether the same load now produces more stress &lt;/li&gt;
&lt;li&gt;Early signs of fragility or shrinking performance margins &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It answers the question load tests never answer: &lt;/p&gt;

&lt;p&gt;“Are we drifting toward failure or away from it?” &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;3. Detect Reliability Drift Over Time&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Systems naturally drift through config changes, dependency updates, scaling events, and organic load shifts. &lt;/p&gt;

&lt;p&gt;By capturing snapshots periodically, you can: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spot slow-building risk &lt;/li&gt;
&lt;li&gt;Track reliability trends &lt;/li&gt;
&lt;li&gt;Validate that ongoing changes are improving SLO posture, not eroding it &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This moves teams from reactive firefighting to proactive reliability assurance. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;4. Validate Experiments with Confidence&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Know immediately whether your experiment improved or degraded system behavior. &lt;/p&gt;

&lt;p&gt;Teams frequently adjust timeouts, concurrency, sampling, queue behavior, or other system parameters, but these changes rarely trigger alerts, and standard dashboards make it hard to see their true impact. &lt;/p&gt;

&lt;p&gt;Reliability Delta lets you validate experiments with clear before-and-after evidence by automatically highlighting: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shifts in bottlenecks or stress patterns across services &lt;/li&gt;
&lt;li&gt;Degradations hidden behind “passing” performance metrics &lt;/li&gt;
&lt;li&gt;Unexpected side effects in downstream dependencies &lt;/li&gt;
&lt;li&gt;Whether the experiment made the system more resilient or more fragile &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t trial-and-error tuning. It is evidence-based experiment validation. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Reliability Delta Matters for Modern Engineering Teams&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you’re accountable for revenue-critical systems—measured by 99.9%+ SLOs, delivery pace, and incident reduction—you need more than observability dashboards. You need a deterministic framework for evaluating how change affects system behavior.  &lt;/p&gt;

&lt;p&gt;Reliability Delta gives you: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Objective, repeatable comparisons between versions &lt;/li&gt;
&lt;li&gt;Root-cause-aware analysis using causal models &lt;/li&gt;
&lt;li&gt;Clear guardrails leadership can trust &lt;/li&gt;
&lt;li&gt;Confidence to ship fast without risking SLOs or customer experience &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It transforms subjective judgment into trusted, actionable reliability signals—so every release, load test, and system change is safer, faster, and more predictable.  &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Ship Faster with Confidence&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Reliability isn’t something you can eyeball anymore. With Reliability Delta, engineering leaders get the missing layer between observability and automation: clear causal evidence of how changes affect system behavior. It ensures your team can move fast, protect SLOs, and deliver with the confidence that every release is safer than the last. &lt;/p&gt;

&lt;p&gt;To learn more, see our docs: &lt;a href="https://docs.causely.ai/in-action/reliability-delta/?ref=dev.to"&gt;&lt;u&gt;https://docs.causely.ai/in-action/reliability-delta/&lt;/u&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>causely</category>
      <category>devops</category>
    </item>
    <item>
      <title>eAfterWork EP 9: What Every Leader Needs to Know with Severin Neumann</title>
      <dc:creator>Severin Neumann</dc:creator>
      <pubDate>Wed, 03 Dec 2025 11:36:00 +0000</pubDate>
      <link>https://forem.com/causely/eafterwork-ep-9-what-every-leader-needs-to-know-with-severin-neumann-80n</link>
      <guid>https://forem.com/causely/eafterwork-ep-9-what-every-leader-needs-to-know-with-severin-neumann-80n</guid>
      <description>&lt;p&gt;In this episode of eAfterWork, we’re going straight to the source: Severin, OpenTelemetry maintainer, member of the OpenTelemetry Governance Committee, and one of the people who writes and maintains the official OpenTelemetry documentation.  &lt;/p&gt;

&lt;p&gt;He’ll help us understand why OpenTelemetry matters for both technical and non-technical leaders, how it’s shaping the future of observability, and what you really need to know to make the right decisions.&lt;/p&gt;

&lt;p&gt;Watch the full conversation on YouTube: &lt;a href="https://www.youtube.com/watch?v=LkaytYEAJnQ" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=LkaytYEAJnQ&lt;/a&gt;&lt;/p&gt;

</description>
      <category>podcast</category>
    </item>
    <item>
      <title>Purposeful OpenTelemetry</title>
      <dc:creator>Severin Neumann</dc:creator>
      <pubDate>Wed, 26 Nov 2025 16:27:02 +0000</pubDate>
      <link>https://forem.com/causely/purposeful-opentelemetry-25oa</link>
      <guid>https://forem.com/causely/purposeful-opentelemetry-25oa</guid>
      <description>&lt;p&gt;Organizations collect far more telemetry than they use. The result? Exponential costs, PII risks, and still no root cause when incidents happen.  &lt;/p&gt;

&lt;p&gt;In this technical session, Yuri Oliveira (OllyGarden) and Severin Neumann (Causely) demonstrate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to identify instrumentation problems at the source&lt;/li&gt;
&lt;li&gt;Why traces and semantic conventions enable purposeful telemetry&lt;/li&gt;
&lt;li&gt;How quality telemetry enables causal reasoning at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Watch the full conversation on YouTube: &lt;a href="https://www.youtube.com/watch?v=sTOa0MxAm78" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=sTOa0MxAm78&lt;/a&gt;&lt;/p&gt;

</description>
      <category>podcast</category>
      <category>devopssre</category>
      <category>webinar</category>
    </item>
    <item>
      <title>How Causely and Google Gemini Are Powering Autonomous Reliability</title>
      <dc:creator>Severin Neumann</dc:creator>
      <pubDate>Tue, 28 Oct 2025 23:51:02 +0000</pubDate>
      <link>https://forem.com/causely/how-causely-and-google-gemini-are-powering-autonomous-reliability-1oah</link>
      <guid>https://forem.com/causely/how-causely-and-google-gemini-are-powering-autonomous-reliability-1oah</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jfr2e4krjkw591vf7eq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jfr2e4krjkw591vf7eq.png" alt="How Causely and Google Gemini Are Powering Autonomous Reliability" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As systems scale and interactions multiply, reliability can’t be assured through dashboards and alerts. When hundreds of interdependent services rely on managed components, asynchronous communication, and shared databases, engineers spend valuable hours chasing symptoms because they lack a system that infers causality across dependencies. &lt;/p&gt;

&lt;p&gt;Causely addresses this gap through its &lt;a href="https://www.causely.ai/product?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;u&gt;Causal Reasoning Engine&lt;/u&gt;&lt;/a&gt;, which models how dependencies interact in real time and accurately determines the cause of observed service latency and errors. By inferring the cause of performance degradation and understanding the affected dependencies, Causely enables automated actions to assure performance.  &lt;/p&gt;

&lt;p&gt;Now, through a new collaboration with Google Gemini, engineering teams can act on those insights faster and more intuitively. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why We Started with Gemini
&lt;/h2&gt;

&lt;p&gt;Reliability engineering depends on both accuracy and trust. &lt;a href="https://www.infoq.com/articles/causal-reasoning-observability/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;u&gt;LLMs excel&lt;/u&gt;&lt;/a&gt; at interpreting vast, unstructured data, but without a principled understanding of cause and effect, their outputs are prone to hallucination.  &lt;/p&gt;

&lt;p&gt;Causely provides that missing foundation. Its Causal Reasoning Engine models how services, dependencies, and resources interact. These causal models provide deterministic truth about the causes of performance anomalies, and their blast radius. LLMs build on this foundation by translating the results of this causal inference into natural language explanations and action plans that help teams act with confidence. The result is a real-time, closed loop between insight and action. &lt;/p&gt;

&lt;p&gt;We chose Gemini because of its contextual reasoning, enterprise-grade security, and deep integration with Google Cloud workloads. Gemini’s ability to interpret natural language, generate structured code, and summarize technical context complements Causely’s deterministic causal inference engine, turning complex telemetry into clear and reliable insights. &lt;/p&gt;

&lt;h2&gt;
  
  
  How Causely Uses Gemini to Enhance Autonomous Reliability
&lt;/h2&gt;

&lt;p&gt;While Causely remains interoperable with any LLM that a customer wishes to use, we’ve integrated Gemini into Causely for two new features that make interacting with our Causal Reasoning Engine more intuitive and powerful. &lt;/p&gt;

&lt;h3&gt;
  
  
  Ask Causely
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.causely.ai/blog/causely-feature-demo-using-ask-causely-to-transform-incident-response?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;u&gt;Ask Causely&lt;/u&gt;&lt;/a&gt; empowers users to ask complex questions about their environment, check service health, and identify both existing root causes and potential failure points. Ask Causely leverages Gemini’s natural language understanding and generation capabilities to deliver a conversational and seamless experience. It uses multiple Gemini models and takes advantage of Gemini’s generative features to provide a white-glove reliability experience. &lt;/p&gt;

&lt;p&gt;Gemini is integrated into several stages of the Ask Causely pipeline, offering a high degree of flexibility, control, and integration within Causely’s autonomous reliability framework. To ensure timely and accurate results, Causely uses low-latency Gemini models to support data-intensive operations such as log summarization, entity extraction, and contextual signal analysis across diverse telemetry sources.  &lt;/p&gt;

&lt;p&gt;Key aspects of the integration include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Model Selection:&lt;/strong&gt; Causely strategically deploys low-latency Gemini models to ensure quick responses while using higher-capability reasoning models to convert Causely’s causal diagnoses into clear, actionable remediations. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounded search for reliable knowledge:&lt;/strong&gt; Ask Causely uses Gemini’s grounded search capability to deliver accurate, context-aware remediations based on trusted external sources such as vendor documentation, Stack Overflow, and GitHub. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling and code generation for live system intelligence:&lt;/strong&gt; Ask Causely uses Gemini’s tool calling and code generation to query live services, interpret telemetry, and surface insights from Causely’s causal engine on identified symptoms and root causes. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code generation for automation:&lt;/strong&gt; Gemini’s code generation enables Causely’s &lt;a href="https://docs.causely.ai/installation/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;Code Agents&lt;/a&gt; to analyze time series and topology data, generate diagnostic workflows, automate remediations, and perform dynamic analysis during active incidents. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity recognition:&lt;/strong&gt; Gemini’s strong entity recognition helps Causely rapidly locate and correlate critical services, nodes, and components within complex environments. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings for enterprise grounding:&lt;/strong&gt; Gemini embeddings enable Causely to integrate internal documentation and historical incidents to deliver organization-aware, contextually grounded insights. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interpretable causal insights:&lt;/strong&gt; Gemini translates Causely’s causal signal extraction from unstructured telemetry into clear, human-readable explanations and actionable remediations.
&lt;/li&gt;
&lt;/ul&gt;
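
&lt;p&gt;The adaptive model selection described above can be sketched as a simple routing rule: latency-sensitive, data-intensive tasks go to a fast model, and causal explanation goes to a stronger reasoning model. The model names, task labels, and route() helper here are illustrative assumptions, not actual Gemini model identifiers or Causely internals.&lt;/p&gt;

```python
# Hypothetical sketch of adaptive model selection. Model names, task
# labels, and route() are illustrative assumptions, not Causely's or
# Gemini's actual identifiers or API.

FAST_MODEL = "fast-model"            # low latency: summarization, extraction
REASONING_MODEL = "reasoning-model"  # higher capability: remediation plans

# Data-intensive, latency-sensitive work routed to the fast tier.
LIGHTWEIGHT_TASKS = {"log_summarization", "entity_extraction", "signal_analysis"}

def route(task):
    """Pick a model tier based on the kind of work requested."""
    if task in LIGHTWEIGHT_TASKS:
        return FAST_MODEL
    return REASONING_MODEL
```

&lt;p&gt;The point of the split is that quick, high-volume operations stay responsive while the harder step of turning a causal diagnosis into a remediation plan gets the more capable model.&lt;/p&gt;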

&lt;h3&gt;
  
  
  &lt;strong&gt;Causal Explanation and Remediation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Causely uses Gemini to generate SLO-aware, application-specific explanations and actionable remediations grounded in verified causal data and analysis. Causely infers the precise cause of observed anomalies and gathers the most relevant logs and events from across the environment to enable automated action. Gemini then contextualizes this evidence with Causely’s live &lt;a href="https://docs.causely.ai/reference/terminology/?ref=causely-blog.ghost.io#causality-graph-cg" rel="noopener noreferrer"&gt;&lt;u&gt;causal graph&lt;/u&gt;&lt;/a&gt; to produce coherent, human- and machine-readable descriptions that reflect the underlying issue, its operational impact, and suggested remediation steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supporting features include:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log and event contextualization:&lt;/strong&gt; Gemini interprets logs and events selected by Causely’s reasoning engine, connecting raw telemetry to observed symptoms and their SLO implications. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Causal-grounded remediation actions:&lt;/strong&gt; Recommendations are based on observed symptoms and tailored to the affected application or service context. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated post-incident summaries:&lt;/strong&gt; Gemini compiles structured summaries that capture causal explanations, operational impact, and applied remediations, ensuring consistency and traceability across incidents.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Open and Flexible
&lt;/h2&gt;

&lt;p&gt;We started with Gemini as an initial and foundational proof point. But our platform was built to be multi-cloud and model-agnostic. Causely runs across public clouds or on-prem environments, and we’ll continue to develop integrations with other large language models, giving customers flexibility without lock-in. If you have a particular model and use case in mind, please contact us! &lt;/p&gt;

&lt;h2&gt;
  
  
  The Future: From Reactive to Autonomous Reliability
&lt;/h2&gt;

&lt;p&gt;Causely and Google Gemini together mark a step toward &lt;a href="https://www.causely.ai/blog/capabilities-causal-analysis?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;u&gt;autonomous service reliability&lt;/u&gt;&lt;/a&gt;, where systems can understand, explain, and prevent issues before users are affected. This shift moves reliability from reactive firefighting to proactive, explainable prevention.  &lt;/p&gt;

&lt;h2&gt;
  
  
  See It in Action
&lt;/h2&gt;

&lt;p&gt;Explore the new Gemini-powered capabilities in Causely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;View Causely on &lt;a href="https://console.cloud.google.com/marketplace/product/causely-public/crp-cna-gcp?hl=en&amp;amp;project=dspopup-austin&amp;amp;ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;u&gt;Google Cloud Marketplace &lt;/u&gt;➜&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.causely.ai/try?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;&lt;u&gt;Request a Demo&lt;/u&gt; ➜&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>causely</category>
      <category>gemini</category>
      <category>reliability</category>
    </item>
  </channel>
</rss>
