<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Karina Babcock</title>
    <description>The latest articles on Forem by Karina Babcock (@karinababcock).</description>
    <link>https://forem.com/karinababcock</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1405780%2Ffcceeb06-5f28-4fa9-b618-c8034fc57036.jpeg</url>
      <title>Forem: Karina Babcock</title>
      <link>https://forem.com/karinababcock</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/karinababcock"/>
    <language>en</language>
    <item>
      <title>When Everything Is Instrumented, and You Still Don't Know What's Broken</title>
      <dc:creator>Karina Babcock</dc:creator>
      <pubDate>Mon, 11 Aug 2025 15:28:18 +0000</pubDate>
      <link>https://forem.com/causely/when-everything-is-instrumented-and-you-still-dont-know-whats-broken-13j8</link>
      <guid>https://forem.com/causely/when-everything-is-instrumented-and-you-still-dont-know-whats-broken-13j8</guid>
      <description>&lt;h2&gt;
  
  
  &lt;em&gt;Why microservices need causal reasoning, not just observability&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;You did the right things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your microservices are fully instrumented.&lt;/li&gt;
&lt;li&gt;You’ve got distributed tracing and a modern observability stack.&lt;/li&gt;
&lt;li&gt;You even built custom dashboards.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But during a major production incident, it still took hours to figure out what was wrong.&lt;/p&gt;

&lt;p&gt;Sound familiar?&lt;/p&gt;

&lt;p&gt;In our recent webinar, &lt;em&gt;'Rethinking Reliability for Distributed Systems,'&lt;/em&gt; Causely co-founder &lt;a href="https://www.linkedin.com/in/endresara" rel="noopener noreferrer"&gt;Endre Sara&lt;/a&gt; shared a story we hear far too often: a large-scale customer, running mature microservices in Kubernetes with full observability coverage, still struggles to understand what’s broken during a high-stakes business event.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why microservices break differently
&lt;/h2&gt;

&lt;p&gt;Distributed systems aren't just complex; they're dynamic. Services spin up and down. Async data flows hide causal relationships. Teams own different pieces. Dashboards fill up with symptoms, not answers.&lt;/p&gt;

&lt;p&gt;That's what happened to a large enterprise team Endre worked with. They had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mature Kubernetes operations&lt;/li&gt;
&lt;li&gt;Kafka for async comms&lt;/li&gt;
&lt;li&gt;Comprehensive tracing + telemetry&lt;/li&gt;
&lt;li&gt;A seasoned SRE team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And still, they couldn't find the root cause of a high-stakes incident.&lt;/p&gt;

&lt;p&gt;The problem wasn't a lack of data. It was a lack of &lt;a href="https://www.causely.ai/blog/causal-reasoning-the-missing-piece-to-service-reliability?ref=dev.to"&gt;causal reasoning&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch the recording: Rethinking Reliability for Distributed Systems
&lt;/h2&gt;

&lt;p&gt;In this session, Endre walks through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why microservices environments &lt;a href="https://www.causely.ai/blog/be-smarter-about-observability-data?ref=dev.to"&gt;overwhelm traditional observability&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;How causal reasoning changes incident response&lt;/li&gt;
&lt;li&gt;What teams can do to move from firefighting to foresight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you’re drowning in alerts or struggling to explain why something broke, this talk offers a clear new perspective and a path forward.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/iuroPvbvDk8"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>reliability</category>
      <category>observability</category>
      <category>microservices</category>
      <category>rca</category>
    </item>
    <item>
      <title>Launching our new integration with OpenTelemetry</title>
      <dc:creator>Karina Babcock</dc:creator>
      <pubDate>Wed, 05 Mar 2025 16:07:13 +0000</pubDate>
      <link>https://forem.com/causely/launching-our-new-integration-with-opentelemetry-5hi8</link>
      <guid>https://forem.com/causely/launching-our-new-integration-with-opentelemetry-5hi8</guid>
      <description>&lt;h2&gt;
  
  
  Bridging the gap between observability data and actionable insight
&lt;/h2&gt;

&lt;p&gt;Observability has become a cornerstone of application reliability and performance. As systems grow more complex—spanning microservices, third-party APIs, and asynchronous messaging patterns—the ability to monitor and debug these systems is both a necessity and a challenge. &lt;/p&gt;

&lt;p&gt;OpenTelemetry (OTEL) has emerged as a powerful, open source framework that standardizes the collection of telemetry data across distributed systems. It promises unprecedented visibility into logs, metrics, and traces, empowering engineers to identify issues and optimize performance across multiple languages, technologies and cloud environments. &lt;/p&gt;

&lt;p&gt;But with great visibility comes a hidden cost. While OTEL democratizes observability, it also exacerbates the “big data problem” of modern DevOps. &lt;/p&gt;

&lt;p&gt;This is where Causely comes in—today, &lt;a href="https://www.causely.ai/blog/causely-launches-new-integration-with-opentelemetry-cutting-through-the-observability-noise-and-pinpointing-what-matters?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=otel_integration"&gt;we announced a new integration with OTEL&lt;/a&gt; that bridges the gap between OTEL's data deluge and actionable insights. In this post, we’ll explore the strengths and limitations of OpenTelemetry, the challenges it introduces, and how Causely transforms raw telemetry into precise, cost-effective analytics. &lt;/p&gt;

&lt;h2&gt;
  
  
  The OpenTelemetry opportunity
&lt;/h2&gt;

&lt;p&gt;Microservices form a tangled web of interdependent services that communicate over REST or gRPC. Asynchronous systems like Kafka shuttle messages between loosely coupled services. Infrastructure dynamically scales resources to meet demand. Observability has become the glue that holds these systems together, enabling engineers to monitor performance, troubleshoot issues, and ensure reliability. &lt;/p&gt;

&lt;p&gt;At the heart of the observability revolution is OpenTelemetry (OTEL), an open-source standard that unifies the instrumentation and collection of telemetry data across logs, metrics, and traces. Its modular architecture, community-driven development, and broad compatibility with existing observability tools have made OTEL the de facto choice for modern DevOps teams. &lt;/p&gt;

&lt;h3&gt;
  
  
  What does OpenTelemetry do?
&lt;/h3&gt;

&lt;p&gt;OpenTelemetry provides APIs, SDKs, and tools to capture three primary types of telemetry data: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs: Detailed, timestamped records of system events (e.g., errors, warnings, and custom events). &lt;/li&gt;
&lt;li&gt;Metrics: Quantitative measurements of system health and performance (e.g., CPU usage, request latency, error rates). &lt;/li&gt;
&lt;li&gt;Traces: End-to-end views of requests flowing through distributed systems, mapping dependencies and execution paths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With OTEL, engineers can instrument their code to emit these telemetry signals, use an OpenTelemetry Collector to aggregate and process the data, and export it to observability backends like Prometheus, Tempo, or Elasticsearch. &lt;/p&gt;
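
&lt;p&gt;As a minimal sketch of that wiring in Go (the OTLP gRPC exporter and a Collector on its default localhost:4317 endpoint are assumptions, not part of the original post): &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
package telemetry

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// InitTracer wires the OTEL SDK to an OTLP endpoint, e.g. an
// OpenTelemetry Collector listening on localhost:4317 (the default).
func InitTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx)
    if err != nil {
        return nil, err
    }
    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
    otel.SetTracerProvider(tp)
    return tp, nil
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;From there, the Collector can fan the same stream out to whichever backend the team already runs. &lt;/p&gt;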

&lt;h3&gt;
  
  
  Why OpenTelemetry is a game-changer
&lt;/h3&gt;

&lt;p&gt;OpenTelemetry addresses a critical pain point in observability: fragmentation. Historically, different tools and platforms required unique instrumentation libraries, making it difficult to standardize observability across an organization. OTEL simplifies this by providing: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vendor-Agnostic Instrumentation: A single API to instrument applications regardless of the backend. &lt;/li&gt;
&lt;li&gt;Centralized Data Collection: The OpenTelemetry Collector serves as a pluggable data pipeline, consolidating telemetry from various sources. &lt;/li&gt;
&lt;li&gt;Interoperability: Native support for popular backends like Prometheus, Tempo, and other vendors, allowing teams to integrate OTEL into their existing observability stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Technical example: Debugging latency issues
&lt;/h3&gt;

&lt;p&gt;Consider a microservices-based e-commerce application experiencing high latency during checkout. OTEL traces capture a wealth of information about the performance of this service, but it is still hard to tell what is responsible for the latency. For example: &lt;a href="https://github.com/esara/robot-shop/blob/instrumentation/dispatch/main.go#L172" rel="noopener noreferrer"&gt;https://github.com/esara/robot-shop/blob/instrumentation/dispatch/main.go#L172&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
func processOrder(headers map[string]interface{}, order []byte) { 
    start := time.Now() 
    log.Printf("processing order %s\n", order) 
    tracer := otel.Tracer("dispatch") 

    // headers is map[string]interface{} 
    // carrier is map[string]string 
    carrier := make(propagation.MapCarrier) 
    // convert by copying k, v 
    for k, v := range headers { 
       carrier[k] = v.(string) 
    } 

    ctx := otel.GetTextMapPropagator().Extract(context.Background(), carrier) 

    opts := []oteltrace.SpanStartOption{ 
       oteltrace.WithSpanKind(oteltrace.SpanKindConsumer), 
    } 
    ctx, span := tracer.Start(ctx, "processOrder", opts...) 
    defer span.End() 

    span.SetAttributes( 
       semconv.MessagingOperationReceive, 
       semconv.MessagingDestinationName("orders"), 
       semconv.MessagingRabbitmqDestinationRoutingKey("orders"), 
       semconv.MessagingSystem("rabbitmq"), 
       semconv.NetAppProtocolName("AMQP"), 
    ) 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By exporting these traces to a backend like Tempo, engineers can visualize the request flow and identify bottlenecks, such as consuming messages from RabbitMQ in the dispatch service and inserting an order in a MongoDB database.  &lt;/p&gt;

&lt;h2&gt;
  
  
  The Big Data problem of observability
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry’s ability to capture detailed telemetry data is a double-edged sword. While it empowers engineers with unprecedented visibility into their systems, it also introduces challenges that can hinder the very goals observability aims to achieve. The sheer volume of data collected—logs, metrics, and traces from thousands of microservices—can overwhelm infrastructure, slow down workflows, inflate costs, and, most importantly, drown engineers in data. This “big data problem” of observability is a natural consequence of OpenTelemetry’s strengths but must be addressed to make the most of its potential. &lt;/p&gt;

&lt;h3&gt;
  
  
  OpenTelemetry collects a lot of data
&lt;/h3&gt;

&lt;p&gt;At its core, OpenTelemetry is designed to be exhaustive. This design ensures engineers can instrument their systems to capture every possible detail. For example: &lt;/p&gt;

&lt;p&gt;A high-traffic e-commerce site might generate logs for every HTTP request, metrics for CPU and memory usage, and traces for each request spanning multiple services. &lt;/p&gt;

&lt;p&gt;OpenTelemetry auto-instrumentation libraries are an easy way to instrument HTTP, gRPC, messaging, database, and caching libraries in all languages, but they generate metrics and traces for every call between every microservice, managed service, database, and third-party API. &lt;/p&gt;
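
&lt;p&gt;For illustration, a minimal Go sketch using the contrib otelhttp middleware (the handler and port are hypothetical): wrapping a handler is often the only change needed to emit telemetry for every inbound request. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
package main

import (
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func hello(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("hello"))
}

func main() {
    // otelhttp emits a span for every inbound request, with no
    // further changes to the handler code itself.
    handler := otelhttp.NewHandler(http.HandlerFunc(hello), "hello")
    log.Fatal(http.ListenAndServe(":8080", handler))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;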

&lt;p&gt;Consider a production environment running thousands of microservices, each processing hundreds of requests per second. Using OpenTelemetry: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs: A single request might generate dozens of log entries, resulting in millions of logs per minute. &lt;/li&gt;
&lt;li&gt;Metrics: Resource utilization metrics are emitted periodically, adding continuous streams of quantitative data. &lt;/li&gt;
&lt;li&gt;Traces: Distributed traces can contain hundreds of spans, each adding its own metadata. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While this level of detail is invaluable for debugging and optimization, it quickly scales beyond what many teams are prepared to manage. The amount of data makes it difficult to troubleshoot problems, manage escalations, be proactive about deploying new code, and plan for future investments. &lt;/p&gt;

&lt;h3&gt;
  
  
  The cost of data
&lt;/h3&gt;

&lt;p&gt;The problem with this massive volume of telemetry data isn’t just about storage; it’s also about processing and time-to-insight. Let’s break it down: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Networking Costs: Transmitting telemetry data from distributed systems, microservices, or edge devices to central storage or processing locations incurs significant bandwidth usage. This can result in substantial networking costs, especially for real-time telemetry pipelines or when dealing with geographically dispersed infrastructure. &lt;/li&gt;
&lt;li&gt;Storage Costs: Logs, metrics, and traces consume vast amounts of storage, often requiring specialized solutions like Elasticsearch, Amazon S3, or Prometheus’s TSDB. These systems must scale horizontally, adding significant operational overhead. &lt;/li&gt;
&lt;li&gt;Compute Costs: Telemetry data needs to be parsed, indexed, queried, and analyzed. Complex queries, such as joining multiple traces to identify bottlenecks, can place a heavy burden on compute resources. &lt;/li&gt;
&lt;li&gt;Time Costs: During a high-severity incident, every second counts. Pinpointing the root cause is like looking for a needle in a haystack. With OpenTelemetry, the haystack is much bigger, making the task harder and longer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Time-to-insight delays
&lt;/h3&gt;

&lt;p&gt;Imagine a scenario where an outage occurs in a distributed system. An engineer might start by querying logs for errors, then switch to metrics to identify anomalies, and finally inspect traces to pinpoint the failing service. Each query takes time, and engineers often waste effort chasing irrelevant leads. This delay increases Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR), directly impacting uptime and user satisfaction. &lt;/p&gt;

&lt;h3&gt;
  
  
  Noise vs. signal
&lt;/h3&gt;

&lt;p&gt;Another challenge is separating the signal (useful insights) from the noise (redundant or irrelevant data). With OTEL: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs can be overly verbose, capturing routine events that clutter debugging efforts. &lt;/li&gt;
&lt;li&gt;Metrics might lack the context needed to tie resource anomalies back to specific root causes. &lt;/li&gt;
&lt;li&gt;Traces can become overwhelming in high-traffic systems, with thousands of spans providing more detail than is actionable. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While OTEL excels at capturing data, it doesn’t inherently prioritize it. This creates a bottleneck for engineers who need actionable insights quickly. &lt;/p&gt;

&lt;h2&gt;
  
  
  The need for top-down analytics
&lt;/h2&gt;

&lt;p&gt;Along with the benefits of modern observability tooling come challenges that need to be addressed. OpenTelemetry (OTEL) may unify telemetry data collection, but its bottom-up approach leaves teams drowning in redundant metrics, irrelevant logs, and sprawling traces. Without a clear purpose, teams end up collecting everything “just in case,” overwhelming engineers with noise and diluting the actionable insights needed to keep systems running. &lt;/p&gt;

&lt;p&gt;A top-down approach to observability flips the script. Instead of starting with what data is available, it begins with defining the goals: root cause analysis, SLO compliance, or performance optimization. By focusing on purpose, teams can build the analytics required to achieve those goals and then collect only the data necessary to power those insights.  &lt;/p&gt;

&lt;p&gt;For example: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the goal is root cause analysis, focus on traces that map dependencies across microservices, rather than capturing every granular log. &lt;/li&gt;
&lt;li&gt;If the goal is performance optimization, prioritize metrics that highlight latency bottlenecks over exhaustive resource utilization data. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This shift reduces noise, minimizes data storage and processing costs, and accelerates time-to-insight. &lt;/p&gt;
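
&lt;p&gt;One concrete knob for this, sketched with the OTEL Go SDK: head-based sampling keeps only a fixed fraction of traces when exhaustive capture isn’t the goal (the 10% ratio below is illustrative, not a recommendation): &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
package telemetry

import (
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// NewSampledProvider keeps roughly 10% of new traces and respects
// the parent span's sampling decision for the rest.
func NewSampledProvider(exporter sdktrace.SpanExporter) *sdktrace.TracerProvider {
    return sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
    )
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;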

&lt;h3&gt;
  
  
  The cost of ignoring purpose
&lt;/h3&gt;

&lt;p&gt;The current approach to observability is plagued by fragmentation. Point tools like APMs, native Kubernetes instrumentation, and cloud-specific monitors operate in silos, each with its own data model and semantics. This forces engineers to manually correlate information across dashboards, increasing time to resolution and undermining efficiency. Over time, the storage, compute, and human costs of managing fragmented data become unsustainable. &lt;/p&gt;

&lt;p&gt;Ask yourself: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How much of your telemetry data is redundant or irrelevant? &lt;br&gt;
Are your engineers spending more time troubleshooting tools than resolving incidents? &lt;br&gt;
Is your observability stack delivering insights or merely adding complexity?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Without a unified purpose and targeted analytics, observability becomes another “big data problem,” and your total cost of ownership (TCO) spirals out of control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Causely can help
&lt;/h2&gt;

&lt;p&gt;Causely transforms OpenTelemetry’s raw telemetry data into actionable insights by applying a top-down, purpose-driven approach. Instead of drowning in logs, metrics, and traces, Causely’s platform leverages built-in causal models and advanced analytics to automatically pinpoint root causes, prioritize issues based on service impact, and predict potential failures before they occur. This turns observability from a reactive big data challenge into a system that continuously assures application reliability and performance. &lt;/p&gt;

&lt;h3&gt;
  
  
  How Causely brings focus
&lt;/h3&gt;

&lt;p&gt;Causely’s platform addresses these challenges head-on. Its causal reasoning starts with defining what matters: actionable insights to keep systems performing reliably and efficiently. Using built-in causal models and top-down analytics, Causely automatically pinpoints root causes and eliminates noise. By integrating with OTEL and other telemetry sources, Causely ensures that only the most critical data is collected, processed, and presented in real time. &lt;/p&gt;

&lt;p&gt;For example: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In a microservices architecture, Causely maps dependencies and pinpoints the root cause of cascading failures, reducing MTTR. &lt;/li&gt;
&lt;li&gt;With async messaging systems like Kafka, Causely pinpoints the bottlenecks that cause consumer lag or delivery failures with actionable context, ensuring faster resolution. &lt;/li&gt;
&lt;li&gt;When third-party software is the root cause of an issue, Causely pinpoints it by analyzing service impact. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach not only reduces the TCO of observability but also ensures teams can focus on delivering value rather than managing data. &lt;/p&gt;

&lt;h3&gt;
  
  
  How Causely works with OpenTelemetry
&lt;/h3&gt;

&lt;p&gt;The Causely Reasoning Platform is a model-driven, purpose-built Agentic AI system that delivers multiple AI workers built on a common data model. &lt;/p&gt;

&lt;p&gt;Causely integrates seamlessly with OpenTelemetry, using its telemetry streams as input while applying context and intelligence to deliver precise, actionable outputs. Here’s how Causely solves common observability challenges: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated topology discovery: Causely automatically builds a dependency map of your entire environment, identifying how applications, services, and infrastructure components interact. OpenTelemetry’s traces provide raw data, but Causely’s topology discovery transforms it into a visual graph that highlights critical paths and dependencies.
&lt;/li&gt;
&lt;li&gt;Root cause analysis in real time: Using causal models, Causely automatically maps all potential root causes to the observable symptoms they may cause. Causely uses this mapping in real time to automatically pinpoint the root causes based on the observed symptoms, prioritizing those that directly impact SLOs. For instance, when request latency spikes are detected across multiple services, Causely pinpoints whether the spikes stem from a database query (and which database), a messaging queue (and which queue), or an external API (and which one), reducing MTTD and MTTR. &lt;/li&gt;
&lt;li&gt;Proactive prevention: Beyond solving problems, Causely helps prevent them. Its analytics can simulate “what-if” scenarios to predict the impact of configuration changes, workload spikes, or infrastructure upgrades. For example, Causely can warn you if scaling down a Kubernetes node pool might lead to resource contention under expected load.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example 1: Causely, OTEL, and microservices
&lt;/h3&gt;

&lt;p&gt;In a distributed e-commerce platform, a checkout service experiences intermittent failures. OpenTelemetry traces capture the flow of requests, but the data alone doesn’t explain the root cause. Causely’s causal models analyze the traces and identify that a dependent payment service is timing out due to a slow database query. This insight allows the team to address the issue without wasting time on manual debugging. &lt;/p&gt;

&lt;h3&gt;
  
  
  Example 2: Causely, OTEL, and third-party software
&lt;/h3&gt;

&lt;p&gt;A team using a third-party CRM API notices degraded response times during peak hours. OpenTelemetry provides metrics showing increased latency, but engineers are left guessing whether the issue lies with their application or the external service. Causely reasons about the API latency and third-party requests and identifies that the CRM is rate-limiting requests, prompting the team to implement retry logic. &lt;/p&gt;
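
&lt;p&gt;A minimal sketch of such retry logic in Go (the endpoint, status handling, and retry budget are hypothetical, not taken from the team’s code): &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
package crm

import (
    "errors"
    "net/http"
    "time"
)

// CallCRM retries a rate-limited third-party API with exponential backoff.
func CallCRM(client *http.Client, url string) (*http.Response, error) {
    backoff := 250 * time.Millisecond
    for attempt := 0; attempt &lt; 5; attempt++ {
        resp, err := client.Get(url)
        if err != nil {
            return nil, err
        }
        if resp.StatusCode != http.StatusTooManyRequests {
            return resp, nil
        }
        resp.Body.Close()
        time.Sleep(backoff)
        backoff *= 2 // double the wait after each 429
    }
    return nil, errors.New("CRM still rate-limiting after retries")
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;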

&lt;h3&gt;
  
  
  Example 3: Causely, OTEL, and async messaging with Kafka
&lt;/h3&gt;

&lt;p&gt;A Kafka-based event pipeline shows sporadic delays in message processing. While OpenTelemetry traces highlight lagging consumers, they don’t explain why. Causely, reasoning about the behavior of the consumer microservices, identifies the root cause in the application’s mutex locking, which is causing the slow consumption. The engineering team can focus on improving the locking of the data structure, without the messaging infrastructure team having to scale up resources and waste time debugging Kafka. &lt;/p&gt;
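
&lt;p&gt;To make the failure mode concrete, here is a minimal sketch (not the actual application code) of the kind of application-level locking that shows up externally as consumer lag: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
package consumer

import "sync"

// Every message handler serializes on a single global mutex, so
// consumption throughput collapses under load even though the
// Kafka brokers themselves are healthy.
var (
    mu    sync.Mutex
    index = make(map[string][]byte)
)

func handleMessage(key string, value []byte) {
    mu.Lock() // handlers for all partitions contend on this one lock
    defer mu.Unlock()
    index[key] = value // the critical section dominates processing time
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;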

&lt;h3&gt;
  
  
  Reducing the big data burden
&lt;/h3&gt;

&lt;p&gt;Causely’s approach minimizes the data burden by focusing on relevance. Unlike traditional observability stacks that collect and store massive volumes of telemetry data, Causely processes raw metrics and traces locally, pushing only relevant context (e.g., topology and symptoms) to its backend analytics. This reduces storage and compute costs while ensuring engineers get the insights they need without delay. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Transforming observability with Causely
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry has redefined observability by standardizing how telemetry data is collected and processed, but its bottom-up approach leaves teams overwhelmed by the sheer volume of logs, metrics, and traces. Observability shouldn’t be about how much data you collect—it’s about how much insight you can gain to keep your systems running efficiently. Without clear prioritization and contextual insights, the observability stack can quickly become a costly burden—both in terms of infrastructure and engineering time. &lt;/p&gt;

&lt;p&gt;Causely integrates seamlessly with OpenTelemetry and helps bring order to the chaos, empowering teams to make smarter, faster decisions that directly impact reliability and user experience. Causely uses causal models, automated topology discovery and real-time analytics to pinpoint root causes, prevent incidents, and optimize performance. This reduces noise, eliminates unnecessary data collection, and allows teams to focus on delivering reliable systems rather than managing observability overhead. &lt;/p&gt;

&lt;p&gt;Ready to move beyond data overload and transform your observability strategy? &lt;a href="https://www.causely.ai/try?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=otel_integration"&gt;Book a demo&lt;/a&gt; or &lt;a href="https://auth.causely.app/oauth/account/sign-up" rel="noopener noreferrer"&gt;start your free trial&lt;/a&gt; to see how Causely can help you take control of your telemetry data and build more reliable cloud-native applications. &lt;/p&gt;

</description>
      <category>observability</category>
      <category>opentelemetry</category>
      <category>devops</category>
    </item>
    <item>
      <title>In 2025, I resolve to be proactive about reliability</title>
      <dc:creator>Karina Babcock</dc:creator>
      <pubDate>Tue, 21 Jan 2025 20:07:35 +0000</pubDate>
      <link>https://forem.com/causely/in-2025-i-resolve-to-be-proactive-about-reliability-5eka</link>
      <guid>https://forem.com/causely/in-2025-i-resolve-to-be-proactive-about-reliability-5eka</guid>
      <description>&lt;h2&gt;
  
  
  What developers can do in 2025 to be proactive and prevent incidents before they happen, without sacrificing development time
&lt;/h2&gt;

&lt;p&gt;Making changes to production environments is one of the riskiest parts of managing complex systems. Even a small, seemingly harmless tweak—a configuration update, a database schema adjustment, or a scaling decision—can have unintended consequences. These changes ripple across interconnected services, and without the right tools, it’s nearly impossible to predict their impact. &lt;/p&gt;

&lt;p&gt;The business needs speed and agility in introducing new capabilities and features, but without sacrificing reliability, performance, and predictability. Hence, engineers need to know how a new feature, or even a minor change, will affect performance, reliability, and SLO compliance before it goes live. But existing observability tools lack the capabilities required to provide useful insights that enable engineers to safely deploy new features. The result? Reduced productivity, slowdowns in feature development, and reactive firefighting when changes go wrong, all leading to downtime, stress, and diminished user trust. &lt;/p&gt;

&lt;p&gt;Recently, we worked with a customer who sought to shift their reactive posture to one that more aggressively seeks out problems before they happen. In this customer’s environment, a bug in one of their microservices (the data producer) caused the producer to stop updating the Kafka topic with new events. This created a backlog of events for all of the topic’s consumers. As a result, their customers were looking at stale data. Problems like this lead to poor customer experience and revenue loss, which is why many organizations need to adopt a preventative mode of operations.&lt;/p&gt;

&lt;p&gt;This post explores how this trend can be disrupted by providing the analytics and the reasoning capabilities to transform how changes are made, empowering teams to anticipate risks, validate decisions, and protect system stability—all before the first line of code is deployed. &lt;/p&gt;

&lt;h2&gt;
  
  
  Being proactive is easier said than done
&lt;/h2&gt;

&lt;p&gt;Change management is a process designed to ensure changes are effective, resolve existing issues, and maintain system stability without introducing new problems. At its core, this process requires a deep understanding of both the system’s dependencies and its state before and after a change.  &lt;/p&gt;

&lt;p&gt;Production changes are risky because engineers typically lack sufficient visibility into how changes will impact the behavior of entire systems. What seems like an innocuous change to an environment file or an API endpoint could have far-reaching ramifications that aren’t always obvious to the developer. &lt;/p&gt;

&lt;p&gt;While observability tools have come a long way in helping teams monitor systems, they lack the analytics required to understand, analyze, and predict the reliability and performance behavior of cloud-native systems. As a result, engineers are left to “deploy and hope for the best” ... and get a 3AM call when things don’t work as expected. And, while we live in a veritable renaissance of developer tooling, most of these tools focus on developer productivity, not on developers’ understanding of whole systems and the consequences of changes made to one component or service. &lt;/p&gt;

&lt;h3&gt;
  
  
  It’s hard to predict the impact of code changes
&lt;/h3&gt;

&lt;p&gt;When planning a change, the priority besides adding new functionality is to confirm whether it addresses the specific service degradations or issues it was designed to resolve. Equally important is ensuring that the change does not introduce new regressions or service degradations. &lt;/p&gt;

&lt;p&gt;Achieving these goals requires a comprehensive understanding of the system’s architecture, particularly the north-south (layering) dependencies and the east-west (service-to-service) interactions. Beyond mapping the topology, it is crucial to understand the data flow within the system—how data is processed, transmitted, and consumed—because these flows often reveal hidden interdependencies and potential impact areas.  &lt;/p&gt;

&lt;p&gt;Even minor configuration changes can create cascading failures in distributed systems. For instance, adjusting the scaling parameters of an application might inadvertently overload a backend database, causing performance degradation across services. Engineers often rely on experience, intuition, or manual testing, but these methods can’t account for the full complexity of modern environments. &lt;/p&gt;

&lt;h3&gt;
  
  
  Unpredictable performance behavior of microservices
&lt;/h3&gt;

&lt;p&gt;As we discussed in our &lt;a href="https://www.causely.ai/blog/eliminate-escalations-and-finger-pointing?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=NYNY_proactive_reliability"&gt;previous post&lt;/a&gt;, loosely coupled microservices communicate with each other and share resources. But which services depend on which? And what resources are shared by which services? These dependencies are continuously changing and, in many cases, unpredictable.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpl4pqc2h2ts2c3ggef2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpl4pqc2h2ts2c3ggef2.png" alt="Microservices architectures are complex. Source: https://www.slideshare.net/slideshow/microservices-the-right-way/51115560#12" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A congested database may cause performance degradations of some services that are accessing the database. But which one will be degraded? Hard to know. Depends. Which services are accessing which tables through what APIs? Are all tables or APIs impacted by the bottleneck? Which other services depend on the services that are degraded? Are all of them going to be degraded?  &lt;/p&gt;

&lt;p&gt;These are very difficult questions to answer. As a result, analyzing, predicting, and even just understanding the performance behavior of each service is very difficult. Furthermore, using existing brittle observability tools to diagnose how a bottleneck cascades across services is practically impossible.  &lt;/p&gt;

&lt;h3&gt;
  
  
  There’s a lack of “what-if” analysis tools for testing resilience
&lt;/h3&gt;

&lt;p&gt;Even though it’s important to simulate and test the impact of changes before deployment, the tools currently available are sorely lacking. Chaos engineering tools like Gremlin and Chaos Monkey simulate failures, but don’t evaluate the impact of configuration changes. Tools like Honeycomb provide event-driven observability, but don’t help much with simulating what will happen with new builds. Inherently, if the tools can’t analyze the performance behavior of the services, they can’t support any “what-if” analysis. &lt;/p&gt;

&lt;p&gt;Developer tools are inherently focused on the “build and deploy” phases of the software lifecycle, meaning they prioritize pre-deployment validation over predictive insights. They don’t provide answers to critical questions like: “How will this change impact my service’s reliability or my system’s SLOs?” or “Will this deployment create new bottlenecks?” &lt;/p&gt;

&lt;p&gt;Predictive insights require correlating historical data, real-time metrics, dependency graphs, and most importantly deep understanding of the microservices performance behaviors. Developer tools simply aren’t built to ingest or analyze this kind of data at the system level. &lt;/p&gt;

&lt;h3&gt;
  
  
  Developer and operations tools today are both insufficient
&lt;/h3&gt;

&lt;p&gt;Developer tools are essential for building functional, secure, and deployable code, but they are fundamentally designed for a different domain than observability. Developer tools focus on ensuring “what” is built and deployed correctly, while observability tools aim to identify “when” something is happening in production. The two domains overlap but rarely address the full picture. &lt;/p&gt;

&lt;p&gt;Bridging this gap often involves integrating developer workflows—such as CI/CD pipelines—with observability systems. While this integration can surface useful metrics and automate parts of the release process, it still leaves a critical blind spot: understanding “why” something is happening. Neither traditional developer tools nor current observability platforms are designed to address the complexity of dynamic, real-world systems. &lt;/p&gt;

&lt;p&gt;To answer the “why,” you need a purpose-built system to unravel the interactions, dependencies, and behaviors that drive modern production environments. &lt;/p&gt;

&lt;h2&gt;
  
  
  Building for reliability
&lt;/h2&gt;

&lt;p&gt;Building reliable, performant applications was never easy, but it has become much harder. As David Shergilashvili correctly states in his recent post &lt;a href="https://www.linkedin.com/pulse/microservices-bottlenecks-david-shergilashvili-zsujf?utm_source=share&amp;amp;utm_medium=member_ios&amp;amp;utm_campaign=share_via" rel="noopener noreferrer"&gt;Microservices Bottlenecks&lt;/a&gt;, “In modern distributed systems, microservices architecture introduces complex performance dynamics that require deep understanding. Due to their distributed nature, service independence, and complex interaction patterns, microservices systems' performance characteristics differ fundamentally from monolithic applications.” &lt;/p&gt;

&lt;p&gt;Continuing to collect data and presenting it to developers in pretty dashboards with very little or no built-in analytics to provide meaningful insights won’t get us to build reliable distributed microservices applications. &lt;/p&gt;

&lt;p&gt;To accurately assess the impact of a change, the state of the system must be assessed both before and after the change is implemented. This involves monitoring key indicators such as system health, performance trends, anomaly patterns, threshold violations, and service-level degradations. These metrics provide a baseline for evaluating whether the change resolves known issues and whether it introduces new ones. However, the ultimate goal goes beyond metrics; it is to confirm that the known root causes of issues are addressed and that no new root causes emerge post-change.  &lt;/p&gt;

&lt;p&gt;We need to build systems that enable engineers to introduce new features quickly, efficiently, and most importantly safely, i.e., without risking the reliability and performance of their applications. Reasoning platforms with built-in analytics need to provide actionable insights that anticipate implications and prevent issues. &lt;/p&gt;

&lt;p&gt;To learn about the required capabilities of these systems, read the rest of the article on &lt;a href="https://www.causely.ai/blog/be-proactive-about-reliability?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=NYNY_proactive_reliability"&gt;causely.ai&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>cloudnative</category>
      <category>microservices</category>
      <category>devops</category>
    </item>
    <item>
      <title>In 2025, I resolve to eliminate escalations and finger pointing</title>
      <dc:creator>Karina Babcock</dc:creator>
      <pubDate>Thu, 16 Jan 2025 21:47:43 +0000</pubDate>
      <link>https://forem.com/causely/in-2025-i-resolve-to-eliminate-escalations-and-finger-pointing-j2g</link>
      <guid>https://forem.com/causely/in-2025-i-resolve-to-eliminate-escalations-and-finger-pointing-j2g</guid>
      <description>&lt;p&gt;Originally posted to &lt;a href="https://www.causely.ai/blog/eliminate-escalations-and-finger-pointing?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=NYNY_escalations"&gt;causely.ai&lt;/a&gt; by &lt;a href="https://www.linkedin.com/in/steffengeissinger/" rel="noopener noreferrer"&gt;Steffen Geissinger&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Make escalations less about blame and more about progress
&lt;/h2&gt;

&lt;p&gt;Microservices architectures introduce complex, dynamic dependencies between loosely coupled components. In turn, these dependencies lead to complex, hard-to-predict interactions. In these environments, any resource bottleneck, or any service bottleneck or malfunction, will cascade and affect multiple services, crossing team boundaries. As a result, the response often spirals into a chaotic mix of war rooms, heated Slack threads, and finger-pointing. The problem isn’t just technical—it’s structural. Without a clear understanding of dependencies and ownership, every team spends more time defending their work than solving the issue. It’s a waste of effort that undermines collaboration and prolongs downtime. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://causely.ai/blog/spend-less-time-troubleshooting?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=NYNY_escalations"&gt;Yesterday, we resolved to spend less time troubleshooting in 2025.&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Troubleshooting and escalation are closely intertwined. A single unresolved bottleneck can ripple outward, forcing multiple teams into reactive mode as they struggle to isolate the true root cause. This dynamic creates inefficiencies and delays, with teams often focusing on band-aiding symptoms instead of remediating and solving the root causes. To eliminate this friction, we need systems that do more than detect anomalies—they must provide a seamless view of dependencies, understand and analyze the performance behaviors of the microservices, assign ownership intelligently, and guide engineers toward resolution with precision and context. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwak4xkn508nau1bb9edh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwak4xkn508nau1bb9edh.jpg" alt="The complexity of escalations in SRE and DevOps orgs, according to ChatGPT" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take, for example, an application developer who notices high request duration for users who are trying to interact with their application. This application communicates with many different services, and it happens to run within a container environment on public cloud infrastructure.  There are more than 50 possible root causes that might be causing the high request duration issue.  That developer would need to investigate garbage collection issues, disk congestion, app-locking problems, and node congestion among many other potential root causes until accurately determining that a congested database is the source of their problem.  The only proper way to determine root cause is by considering all the cause-and-effect relationships between all the possible root causes and the symptoms they may cause. This process can often take hours or days before the correct root cause is pinpointed, resulting in a variety of &lt;a href="https://causely.ai/blog/fools-gold-or-future-fixer-can-ai-powered-causality-crack-the-rca-code-for-cloud-native-applications?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=NYNY_escalations"&gt;business consequences&lt;/a&gt; (unhappy users, missed SLOs, SLA violations, etc.). &lt;/p&gt;
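
&lt;p&gt;A toy sketch of that cause-and-effect mapping (the root causes and symptoms below are made up for illustration; a real causal model is far richer): &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
package main

import "fmt"

// codebook maps each candidate root cause to the symptoms it would produce.
var codebook = map[string][]string{
    "congested database": {"high request duration", "slow queries", "db cpu high"},
    "gc pressure":        {"high request duration", "long gc pauses"},
    "node congestion":    {"high request duration", "pod cpu throttled"},
}

// bestExplanation returns the root cause whose predicted symptoms
// overlap most with the symptoms actually observed.
func bestExplanation(observed []string) string {
    seen := make(map[string]bool)
    for _, s := range observed {
        seen[s] = true
    }
    best, bestScore := "unknown", 0
    for cause, symptoms := range codebook {
        score := 0
        for _, s := range symptoms {
            if seen[s] {
                score++
            }
        }
        if score &gt; bestScore {
            best, bestScore = cause, score
        }
    }
    return best
}

func main() {
    // Prints "congested database": it explains all three observations.
    fmt.Println(bestExplanation([]string{"high request duration", "slow queries", "db cpu high"}))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;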

&lt;p&gt;In this post, we’ll explore the challenges of multi-team escalations, and the capabilities needed to address them. From automated dependency mapping to explainable triage workflows, we’ll show how observability can be transformed from chaos into clarity, making escalations less contentious and far more productive. &lt;/p&gt;

&lt;h2&gt;
  
  
  Escalations can cripple teams
&lt;/h2&gt;

&lt;p&gt;Escalations create inefficiencies that extend downtime, frustrate teams, and waste resources. These inefficiencies stem from a combination of structural and technical gaps in how dependencies are understood, root causes are isolated, and ownership is assigned. Here are some of the key challenges that make escalations so painful today: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is a lack of cross-team visibility into dependencies &lt;/li&gt;
&lt;li&gt;It can be hard to predict or analyze the performance behaviors of loosely coupled dependent microservices
&lt;/li&gt;
&lt;li&gt;It can be difficult to isolate the root cause among all affected services &lt;/li&gt;
&lt;li&gt;Legacy observability tools must be stitched together to provide even partial visibility into issues &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Lack of cross-team visibility
&lt;/h3&gt;

&lt;p&gt;Microservices architectures are complex and full of deeply interconnected components. An issue in one can cascade into others. Without clear visibility into these dependencies, teams are left guessing which components are impacted and which team should take ownership. &lt;/p&gt;

&lt;p&gt;Your favorite observability tools help you visualize dependencies, but they lack real-time accuracy. These maps can quickly become outdated in environments with frequent changes. Some of them are great for aggregating logs, but don’t offer much insight into service relationships. Engineers are often left to piece together dependencies manually. &lt;/p&gt;

&lt;h3&gt;
  
  
  Unpredictable performance behavior of microservices
&lt;/h3&gt;

&lt;p&gt;Loosely coupled microservices communicate with each other and share resources. But which services depend on which? And what resources are shared by which services? These dependencies are continuously changing and, in many cases, unpredictable.  &lt;/p&gt;

&lt;p&gt;A congested database may cause performance degradations of some services that are accessing the database. But which one will be degraded? Hard to know. Depends. Which services are accessing which tables through what APIs? Are all tables or APIs impacted by the bottleneck? Which other services depend on the services that are degraded? Are all of them going to be degraded? These are very difficult questions to answer.  &lt;/p&gt;

&lt;p&gt;As a result, predicting, understanding and analyzing the performance behavior of each service is very difficult. Using existing brittle observability tools to diagnose how a bottleneck cascades across services is practically impossible.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Difficulty identifying root causes among all affected services
&lt;/h3&gt;

&lt;p&gt;Determining what’s a cause and what’s a symptom can be an incredibly time-consuming aspect of troubleshooting and escalations. Further, the person or team identifying a problem may well be looking at only their &lt;a href="https://www.cuemath.com/calculus/local-maximum-and-minimum" rel="noopener noreferrer"&gt;local maxima&lt;/a&gt;: the part of the system they work on or are directly affected by. They often don’t see the full picture of all intertwined systems. Identifying the root cause among all affected services can be inordinately difficult. &lt;/p&gt;

&lt;p&gt;Even if you have tools that are excellent for visualizing time-series data, you must still rely on engineers to manually correlate metrics. APM tools can help you examine application performance but require significant manual effort to link symptoms to underlying causes, especially in microservices-based, cloud-native applications. &lt;/p&gt;

&lt;h3&gt;
  
  
  Legacy observability tooling only gives you partial functionality
&lt;/h3&gt;

&lt;p&gt;While both established and up-and-coming tools offer valuable capabilities, they often address only one part of the problem, leaving critical gaps. Dependency visibility, performance analysis and root cause isolation need to be integrated seamlessly to reduce the chaos of escalations. Today’s tools, however, are fragmented, requiring engineers to bridge the gaps manually, costing valuable time and effort during incidents. Solving these problems demands a holistic approach that ties all these elements together in real time. &lt;/p&gt;

&lt;h2&gt;
  
  
  How escalations should be handled
&lt;/h2&gt;

&lt;p&gt;Escalations have negative consequences for organizations of all sizes. Let’s work together to build systems that render escalations less about blame and more about opportunities to foster trust and collaboration. &lt;/p&gt;

&lt;p&gt;These systems will require certain capabilities, which are explained further in the full article &lt;a href="https://www.causely.ai/blog/eliminate-escalations-and-finger-pointing?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=NYNY_escalations"&gt;here&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>microservices</category>
    </item>
    <item>
      <title>In 2025, I resolve to spend less time troubleshooting</title>
      <dc:creator>Karina Babcock</dc:creator>
      <pubDate>Mon, 13 Jan 2025 16:31:37 +0000</pubDate>
      <link>https://forem.com/causely/in-2025-i-resolve-to-spend-less-time-troubleshooting-4ijl</link>
      <guid>https://forem.com/causely/in-2025-i-resolve-to-spend-less-time-troubleshooting-4ijl</guid>
      <description>&lt;h2&gt;
  
  
  What SREs and developers can do in 2025 to make troubleshooting more manageable
&lt;/h2&gt;

&lt;p&gt;Troubleshooting is an unavoidable part of life for SREs and developers alike, and it often feels like an endless grind. The moment a failure occurs, the clock starts ticking. If the failure impacts a mission critical application, every second counts. Outages can cost hours of wasted productivity, to say nothing of lost revenue and pricing concessions when you’ve violated an SLO. Pinpointing the root cause requires sifting through piles of logs, metrics that blur together, and false positives. Troubleshooting becomes a search for a needle in a haystack, and to make things even more complex, the needle may not even be in the haystack. Furthermore, when the failure originates in your scope of control, the pressure intensifies—you’re expected to resolve it quickly, minimize downtime, and restore service without disrupting the rest of your work. It’s a reactive process, and it’s draining.  &lt;/p&gt;

&lt;p&gt;But it doesn’t have to be this way. By adopting systems that solve the root cause analysis problem and automate troubleshooting, you can shift troubleshooting from a time-consuming, heavy-lifting chore to a streamlined task. Automated root cause analysis cuts through the noise and pinpoints the issue in no time.   &lt;/p&gt;

&lt;p&gt;With the right approach, troubleshooting becomes a quick, manageable part of your day, freeing you to focus on building systems that don’t just react better but fail less often. &lt;/p&gt;

&lt;h2&gt;
  
  
  What do we mean by troubleshooting?
&lt;/h2&gt;

&lt;p&gt;In a distributed microservices environment, troubleshooting often begins with an alert from the monitoring system or user feedback about degraded performance, such as increased latency or error rates. Typically, these issues are first observed in the service exposed to end users, such as an API gateway or frontend service. However, the root cause often lies deeper within the service architecture, making initial diagnosis challenging. The development team must begin by confirming the scope of the issue, correlating the alert with specific user-reported problems to identify whether it is isolated or systemic. &lt;/p&gt;

&lt;p&gt;The next step involves tracing the source of the alert within the service ecosystem. Using distributed tracing tools like OpenTelemetry, the team tracks requests as they propagate through various microservices, identifying where bottlenecks or failures occur. Concurrently, a service dependency map, often visualized through monitoring platforms, provides a bird’s-eye view of interactions between services, databases, caches, and other dependencies, helping to pinpoint potential hotspots in the architecture. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ui11n24sfs3la2vmfvx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ui11n24sfs3la2vmfvx.gif" alt="Example service dependency map. Source: Grafana" width="600" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the potential hotspots are identified, developers turn to metrics and logs for further insights. Resource utilization metrics, such as CPU, memory, and disk I/O, are analyzed to detect bottlenecks, while logs reveal specific errors or anomalies like timeouts or failed database queries. This analysis attempts to correlate symptoms with the timeline of the issue, offering clues to its origin. Often, the team experiments with quick fixes, such as scaling up CPU, memory, or storage for the affected services or infrastructure. While these adjustments might temporarily relieve symptoms, they rarely address the root cause and must be rolled back if ineffective.  &lt;/p&gt;

&lt;p&gt;When resource adjustments fail, a deeper dive into the affected components is necessary. Distributed traces provide detailed insights into slow transactions or failures, highlighting which services or calls are problematic. Developers then use continuous profiling tools to examine runtime data for each service, identifying resource-intensive methods, excessive memory allocations, or inefficient call paths. This granular analysis helps uncover inefficiencies or regressions in code performance. &lt;/p&gt;

&lt;p&gt;If the issue involves a database, further investigation focuses on query performance. Database profiling tools are used to analyze query execution times, frequency, and data volume. Developers assess whether queries are taking longer than usual, retrieving excessive data, or being executed too frequently. This step often reveals issues such as missing indexes, inefficient joins, or unoptimized queries, which could be contributing to overall service degradation. By iteratively analyzing and addressing these factors, the root cause of the problem is eventually resolved, restoring system stability and performance. &lt;/p&gt;
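
&lt;p&gt;A small sketch of what that first step can look like in Go (the 100ms threshold is arbitrary, chosen only for illustration): &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
package storage

import (
    "context"
    "database/sql"
    "log"
    "time"
)

// TimedQuery logs any query slower than a threshold, a cheap way to
// surface candidates for deeper database profiling.
func TimedQuery(ctx context.Context, db *sql.DB, q string, args ...interface{}) (*sql.Rows, error) {
    start := time.Now()
    rows, err := db.QueryContext(ctx, q, args...)
    if d := time.Since(start); d &gt; 100*time.Millisecond {
        log.Printf("slow query (%v): %s", d, q)
    }
    return rows, err
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;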

&lt;p&gt;Troubleshooting is reactive, time-consuming, and exhausting.  Developers should be focusing their time and energy (and their company’s investment) on innovation, yet troubleshooting forces them to turn their attention elsewhere. &lt;/p&gt;

&lt;p&gt;Troubleshooting doesn’t have to dominate your role; with the right systems, it can become efficient and manageable. &lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting is hard
&lt;/h2&gt;

&lt;p&gt;When trying to find the root cause of service or application outages or degradations, developers face numerous challenges: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s hard to pinpoint which service is the source of the degradations amid a flood of alerts &lt;/li&gt;
&lt;li&gt;It’s hard to diagnose and remediate the root cause &lt;/li&gt;
&lt;li&gt;It’s hard to see the forest for the trees &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  It’s hard to pinpoint which service is the source of the degradations amid a flood of alerts
&lt;/h3&gt;

&lt;p&gt;Failures propagate and amplify through the environment. A congested database or a congested resource will cause application starvation and service degradation that cascades throughout the system. Even if you deploy an observability tool to monitor the database or the resource, you may observe nothing on the database or the resource. And if you deploy an observability tool to monitor the applications and services, you will be flooded with alerts about application starvation and service degradations. Given the flood of alerts, pinpointing the bottleneck is very complex. As described above, it entails a time-consuming, heavy-lifting, manual process under pressure. &lt;/p&gt;

&lt;p&gt;The more observability tools you deploy, the more data you collect and the harder the problem gets. More alerts, more noise, more data you need to sift through. This is a journey to nowhere, a trajectory you want to reverse. &lt;/p&gt;

&lt;h3&gt;
  
  
  Diagnosing and remediating the root cause of a problem is hard
&lt;/h3&gt;

&lt;p&gt;Pinpointing the congested service, database or resource is hard and inefficient. Even if you know &lt;em&gt;where&lt;/em&gt; the root cause is, you may not know &lt;em&gt;what&lt;/em&gt; the root cause is. Without knowing what the root cause is, you can’t know what to remediate nor how to remediate.  &lt;/p&gt;

&lt;p&gt;Whether pinpointing &lt;em&gt;where&lt;/em&gt; the bottleneck is or pinpointing &lt;em&gt;what&lt;/em&gt; the root cause is, engineers rely on manual workflows to sift through logs, metrics, and traces. While new observability tools have emerged over the past decade focusing on cloud-native application infrastructure, and the traditional old guards have expanded their coverage to monitor the new technology landscape, &lt;a href="https://www.youtube.com/watch?v=rs-5SYlCj80" rel="noopener noreferrer"&gt;neither has solved the problem&lt;/a&gt;. Some may do a better job than others in correlating anomalies or slicing and dicing the information for you, but they leave it to you to diagnose and pinpoint the root cause, leaving the hardest part unsolved. Furthermore, most of them require time consuming setup and configuration, deep expertise to operate, and deep domain knowledge to realize their benefits. &lt;/p&gt;

&lt;p&gt;In practice, this means engineers are still performing most of the diagnostic work manually. The tools may be powerful, even elegant, but they don’t address the core challenge: &lt;a href="https://causely.ai/blog/the-rising-cost-of-digital-incidents-understanding-and-mitigating-outage-impact?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=NYNY_troubleshooting"&gt;diagnosing and remediating root causes remains a slow, resource-intensive process&lt;/a&gt;, particularly when time is of the essence during an incident. These gaps prolong resolution times, increase stress, and reduce time for proactive system improvements. &lt;/p&gt;

&lt;h3&gt;
  
  
  It’s hard to see the forest for the trees
&lt;/h3&gt;

&lt;p&gt;Once engineers turn to dashboards to investigate further, they are staring at views created by bottom-up tools. These tools collect a lot of data (often at great cost) and present this data in their dashboards without regard to the purpose of the information and the problem that needs to be solved. Engineers sift through metrics, logs, and time-series data, trying to understand context, composition, and dependencies so they can manually piece together patterns and correlations. This is highly labor-intensive and drives the engineer to get lost in the weeds without understanding the big picture of how the business or the service is impacted. Are service level objectives (SLOs) being violated? Are SLOs at risk? &lt;/p&gt;

&lt;p&gt;Take your favorite observability tool. It probably excels at visualizing time-series data. However, it requires engineers to manually connect trends across dashboards and services, which can be especially challenging in distributed systems. Similarly, application performance management (APM) tools provide rich metrics and infrastructure insights, but the sheer volume of data presented in their dashboards can overwhelm users, making it difficult to focus on the most relevant information.  &lt;/p&gt;

&lt;p&gt;These tools, while powerful, often fall short in helping engineers see the forest for the trees. Instead of guiding engineers toward the right priorities and actionable insights about the broader system or the root cause, or, better still, automatically pinpointing the root cause and remediating it, they frequently amplify the noise. Irrelevant data, ambiguous relationships, and false positives force engineers to wade through excessive detail, wasting time and delaying resolution. The lack of a top-down perspective makes it harder to understand how symptoms connect to underlying problems, leaving engineers stuck in the weeds. &lt;/p&gt;

&lt;h2&gt;
  
  
  The negative consequences of troubleshooting today
&lt;/h2&gt;

&lt;p&gt;The way troubleshooting is done today has serious ramifications for organizations, teams, and individuals. It affects business outcomes and quality of life. &lt;/p&gt;

&lt;h3&gt;
  
  
  Failing to meet the SLAs
&lt;/h3&gt;

&lt;p&gt;Whether the goal is 5-nines, 4-nines, or even only 3-nines, if we continue to manually troubleshoot, we will never meet these SLAs. The table below illustrates how many minutes in a month the given SLA allows for downtime.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxgoiym4lg49pn9ufc5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxgoiym4lg49pn9ufc5g.png" alt="Source: https://en.wikipedia.org/wiki/High_availability#Percentage_calculation " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3-nines means 99.9% uptime—in other words, all services are performing reliably at least 99.9% of the time. So, if any of the services is degraded for more than 43.2 minutes in a month, the 3-nines SLA is not met. Because of the length of time manual troubleshooting entails, a single incident in the month will cause us to miss delivering on a 3-nines SLA. And 3-nines is not even so great! &lt;/p&gt;
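&lt;p&gt;For reference, here’s the quick back-of-the-envelope arithmetic behind that 43.2-minute figure (assuming a 30-day month):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minutes in a 30-day month: 30 * 24 * 60 = 43,200
# Downtime budget at 99.9% uptime: 43,200 * (1 - 0.999) = 43.2 minutes
echo "30*24*60*(1-0.999)" | bc
# 43.200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;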

&lt;h3&gt;
  
  
  High Mean Time to Detect and Resolve (MTTD/MTTR)
&lt;/h3&gt;

&lt;p&gt;The longer it takes to detect and resolve an issue, &lt;a href="https://causely.ai/blog/real-time-data-modern-uxs-the-power-and-the-peril-when-things-go-wrong?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=NYNY_troubleshooting"&gt;the greater the impact&lt;/a&gt; on customers and the business. Traditional troubleshooting workflows, which often rely on reactive and manual processes, are inherently slow. Engineers are forced to navigate through an overwhelming volume of alerts, sift through logs, and correlate metrics without clear guidance. This delay can lead to: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prolonged outages that damage user trust and satisfaction. &lt;/li&gt;
&lt;li&gt;Breaches of service level objectives (SLOs), which can result in financial penalties for organizations with stringent service level agreements (SLAs) &lt;/li&gt;
&lt;li&gt;Snowballing effects, where unresolved issues trigger secondary failures, compounding the problem and making resolution even more challenging. &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Individual stress and burnout from constant reactive tasks
&lt;/h3&gt;

&lt;p&gt;The reactive nature of troubleshooting takes a significant toll on individual engineers. When every incident feels like a race against the clock, the pressure to resolve issues quickly can become overwhelming. Engineers often work under constant stress, juggling: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interruptions to their regular work, leading to disrupted schedules and decreased productivity. &lt;/li&gt;
&lt;li&gt;Escalations where they are expected to step in as subject matter experts, often during nights or weekends. &lt;/li&gt;
&lt;li&gt;Repeated exposure to alert noise, which can cause decision fatigue and desensitization to critical alerts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This relentless pace contributes to burnout. All it takes is a few hours of perusing the &lt;a href="https://www.reddit.com/r/sre/?ref=causely-blog.ghost.io" rel="noopener noreferrer"&gt;/r/sre subreddit&lt;/a&gt; to see that burnout is a very common issue among SREs and developers tasked with maintaining system reliability. Burnout not only affects individuals but also leads to higher attrition rates, disrupting team continuity and increasing hiring and training costs. &lt;/p&gt;

&lt;h3&gt;
  
  
  Reduced time for proactive reliability engineering
&lt;/h3&gt;

&lt;p&gt;Troubleshooting dominates the time and energy of engineering teams, leaving little room for proactive reliability initiatives. As we will see later this week, proactive reliability engineering has extraordinary promise for the entire company: product/engineering, operations, business leaders. But instead of focusing on preventing incidents, engineers are stuck in a reactive loop. This trade-off results in: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delayed implementation of improvements that could enhance system stability and scalability. &lt;/li&gt;
&lt;li&gt;Accumulation of technical debt. &lt;/li&gt;
&lt;li&gt;A vicious cycle where the lack of proactive work increases the likelihood of future incidents, perpetuating the troubleshooting burden.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By constantly reacting to problems rather than proactively addressing underlying issues, teams lose the ability to innovate and build resilient systems. This dynamic not only affects engineering morale but also has broader implications for an organization’s ability to compete and adapt in fast-paced markets. &lt;/p&gt;

&lt;h2&gt;
  
  
  How troubleshooting should look
&lt;/h2&gt;

&lt;p&gt;If we all recognize that the current state of troubleshooting is awful, let’s work together to imagine a future where troubleshooting is routine and fast: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Systems automatically pinpoint root causes within your domain quickly and accurately.&lt;/strong&gt; Modern troubleshooting workflows must prioritize speed and precision. Systems should go beyond flagging symptoms and directly &lt;a href="https://causely.ai/blog/bridging-the-gap-between-observability-and-automation-with-causal-reasoning/?ref=causely-blog.ghost.io&amp;amp;utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=NYNY_troubleshooting"&gt;pinpoint the underlying cause&lt;/a&gt; within your domain. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actionable information provides necessary context upfront.&lt;/strong&gt; Systems need to focus on identifying the remediation actions and, ideally, automating whatever can be automated. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Troubleshooting workflows are streamlined.&lt;/strong&gt; Workflows should be intuitive and efficient, designed to minimize context switching and maximize focus with unified dashboards that integrate with your operational workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These systems must have certain capabilities to be effective: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Causality.&lt;/strong&gt; The ability to capture, represent, understand and analyze cause and effect relations. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning.&lt;/strong&gt; Generic analytics that can reason about causality and automatically pinpoint root causes based on observed symptoms. &lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automatic topology discovery.&lt;/strong&gt; The ability to automatically discover the environment, the entities, and the relationships between them.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these systems, proper troubleshooting can drive positive business outcomes, such as: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delivering on SLOs and meeting SLAs.&lt;/strong&gt; Reduce the number of incidents. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Faster issue resolution, minimizing downtime.&lt;/strong&gt; Reduce mean time to detect (MTTD) and mean time to resolve or recover (&lt;a href="https://causely.ai/blog/mttr-meaning?ref=causely-blog.ghost.io&amp;amp;utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=NYNY_troubleshooting"&gt;MTTR&lt;/a&gt;), keeping systems operational and minimizing the impact on users. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improved productivity by reducing time spent on reactive tasks.&lt;/strong&gt; Enable engineers to focus on high-value innovation. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Causely automates troubleshooting
&lt;/h2&gt;

&lt;p&gt;Our &lt;a href="https://causely.ai/product?ref=causely-blog.ghost.io&amp;amp;utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=NYNY_troubleshooting"&gt;Causal Reasoning Platform&lt;/a&gt; is a model-driven, purpose-built AI system delivering multiple analytics built on a common data model. It is designed to make troubleshooting much simpler and more effective by providing:  &lt;/p&gt;

&lt;h3&gt;
  
  
  Out-of-the-box Causal Models
&lt;/h3&gt;

&lt;p&gt;Causely is delivered with built-in causality knowledge capturing the common root causes that can occur in cloud-native environments. This causality knowledge enables Causely to automatically pinpoint root causes out-of-the-box as soon as it is deployed in an environment. There are at least a few important details to share about this causality knowledge:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It captures potential root causes in a broad range of entities including applications, databases, caches, messaging, load balancers, DNS, compute, storage, and more. &lt;/li&gt;
&lt;li&gt;It describes how the root causes will propagate across the entire environment and what symptoms may be observed when each of the root causes occurs.
&lt;/li&gt;
&lt;li&gt;It is completely independent from any specific environment and is applicable to any cloud-native application environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Automatic topology discovery
&lt;/h3&gt;

&lt;p&gt;Cloud-native environments are a tangled web of applications and services layered over complex and dynamic infrastructure. Causely automatically discovers all the entities in the environment including the applications, services, databases, caches, messaging, load balancers, compute, storage, etc., as well as how they all relate to each other. For each discovered entity, Causely automatically discovers its:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connectivity&lt;/strong&gt; - the entities it is connected to and the entities it is communicating with horizontally
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layering&lt;/strong&gt; - the entities it is vertically layered over or underlying &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composition&lt;/strong&gt; - what the entity itself is composed of&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Causely automatically stitches all of these relationships together to generate a Topology Graph, which is a clear dependency map of the entire environment. This Topology Graph updates continuously in real time, accurately representing the current state of the environment at all times. &lt;/p&gt;

&lt;h3&gt;
  
  
  Root cause analysis
&lt;/h3&gt;

&lt;p&gt;Using the out-of-the-box Causal Models and the Topology Graph as described above, Causely automatically generates a causal mapping between all the possible root causes and the symptoms each of them may cause, along with the probability that each symptom would be observed when the root cause occurs. Causely uses this causal mapping to automatically pinpoint root causes based on observed symptoms in real time. No configuration is required for Causely to immediately pinpoint a broad set of root causes (100+), ranging from application malfunctions to service congestion to infrastructure bottlenecks.  &lt;/p&gt;
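&lt;p&gt;To make the idea concrete, here is a purely conceptual sketch of what one entry in such a causal mapping might look like. This is an illustration of the technique, not Causely’s actual model format; the entity names, symptom names, and probabilities are all invented for the example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Conceptual sketch only -- not Causely's actual data model.
# One root cause, the symptoms it may produce, and the probability
# that each symptom is observed when the root cause occurs.
rootCause: DatabaseCongestion
occursOn: Database
propagation:
  - entity: DependentService   # dependents come from the Topology Graph
    symptoms:
      - name: HighRequestLatency
        probability: 0.9
      - name: ElevatedErrorRate
        probability: 0.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;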

&lt;p&gt;In any given environment, there can be tens of thousands of different root causes that may cause hundreds of thousands of symptoms. Causely prevents SLO violations by untangling this mess, pinpointing the root cause that’s putting your SLOs at risk, and driving remediation actions before SLOs are violated. For example, Causely proactively pinpoints when a software update changes performance behavior for dependent services, before those services are impacted. &lt;/p&gt;

&lt;h3&gt;
  
  
  Service impact analysis
&lt;/h3&gt;

&lt;p&gt;Causely automatically analyzes the impact of the root causes on SLOs, prioritizing the root causes based on the violated SLOs and the ones that are at risk. Causely automatically defines standard SLOs (based on latency and error rate) and uses machine learning to improve its anomaly detection over time. However, environments that already have SLO definitions in another system can easily be incorporated in place of Causely’s default settings. &lt;/p&gt;

&lt;h3&gt;
  
  
  Contextual presentation
&lt;/h3&gt;

&lt;p&gt;The results are intuitively presented in the Causely UI, enabling users to see the root causes, related symptoms, and service impacts, and to initiate remedial actions. The results can also be sent to external systems to alert the teams responsible for remediating root-cause problems, to notify teams whose services are impacted, and to initiate incident response workflows. &lt;/p&gt;

&lt;h3&gt;
  
  
  Prevention analysis
&lt;/h3&gt;

&lt;p&gt;Teams can also ask "what if" questions to understand the impact that potential problems would have if they were to occur, supporting the planning of service and architecture changes, maintenance activities, and improvements to service resilience.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Postmortem analysis
&lt;/h3&gt;

&lt;p&gt;Teams can also review prior incidents and see clear explanations of why they occurred and what their effects were, simplifying postmortems and enabling actions to be taken to avoid recurrences.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Troubleshooting doesn’t have to be a developer’s or SRE’s nightmare when the right systems are in place. Empower yourself with the only system that solves the root cause analysis problem to make troubleshooting a small, manageable part of your job. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.causely.ai/try?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=NYNY_troubleshooting"&gt;Book a meeting with the Causely team&lt;/a&gt; and let us show you how to stop troubleshooting and consistently meet your reliability expectations in cloud-native environments. &lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>microservices</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Tackling CPU Throttling in Kubernetes for Better Application Performance</title>
      <dc:creator>Karina Babcock</dc:creator>
      <pubDate>Mon, 02 Dec 2024 19:26:16 +0000</pubDate>
      <link>https://forem.com/causely/tackling-cpu-throttling-in-kubernetes-for-better-application-performance-1dko</link>
      <guid>https://forem.com/causely/tackling-cpu-throttling-in-kubernetes-for-better-application-performance-1dko</guid>
      <description>&lt;p&gt;CPU throttling is a frequent challenge in containerized environments, particularly for resource-intensive applications. It happens when a container surpasses its allocated CPU limits, prompting the scheduler to restrict CPU usage. While this mechanism ensures fair resource sharing, it can significantly impact performance if not properly managed. CPU throttling can be a major obstacle for applications like web APIs, video streaming platforms, and gaming servers. Addressing this issue involves two key steps: identifying throttling and implementing effective solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is CPU Throttling?
&lt;/h2&gt;

&lt;p&gt;CPU throttling in containers occurs due to resource constraints set by control groups (cgroups). Kubernetes and other container orchestrators rely on cgroups to enforce resource limits. When a container attempts to use more CPU than its assigned quota, it gets throttled, delaying execution of tasks. (When containers have CPU limits defined, they will be converted to a cgroup CPU quota.)&lt;/p&gt;
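&lt;p&gt;Roughly speaking, the conversion multiplies the CPU limit (in cores) by the scheduling period, which defaults to 100ms:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# How a CPU limit becomes a CFS quota (default period is 100000 us):
#   limits.cpu: "500m"  yields  cpu.cfs_quota_us = 50000
#   limits.cpu: "1"     yields  cpu.cfs_quota_us = 100000
#   limits.cpu: "2"     yields  cpu.cfs_quota_us = 200000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;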

&lt;p&gt;Working on the vendor side for over a decade, I have seen the impact CPU throttling can have on different services across many industries.  Here are three top-of-mind examples, both from my days at &lt;a href="https://www.ibm.com/products/turbonomic" rel="noopener noreferrer"&gt;Turbonomic&lt;/a&gt; and from recent conversations with customers at &lt;a href="https://causely.ai/?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=cpu_throttling"&gt;Causely&lt;/a&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  Financial Systems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Example: A stock trading platform uses containers to handle real-time market data feeds and execute trades. Throttling during peak trading hours delays data processing, potentially causing missed opportunities or incorrect order placements.&lt;/li&gt;
&lt;li&gt;Impact: Missed deadlines for transaction processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Gaming Servers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Example: Online multiplayer games hosted in containers experience throttling, leading to delayed responses (lag) during gameplay. Players may experience slow rendering of in-game actions or disconnections during high traffic.&lt;/li&gt;
&lt;li&gt;Impact: Latency and poor user experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Video Streaming Platforms
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Example: A video-on-demand service runs encoding jobs in containers to transcode videos. Throttling increases encoding times, leading to delayed content availability or poor streaming quality for users.&lt;/li&gt;
&lt;li&gt;Impact: Degraded video quality and buffering issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Identify CPU Throttling
&lt;/h2&gt;

&lt;p&gt;It’s often difficult to catch CPU throttling because it can happen even when the host CPU usage is low. It’s critical to have the right level of monitoring set up in order to see CPU throttling when it happens, or even better, before it becomes a problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor Your cgroup Metrics
&lt;/h3&gt;

&lt;p&gt;Linux cgroups provide detailed metrics about CPU usage and throttling. Look for the &lt;code&gt;cpu.stat&lt;/code&gt; file within the container’s cgroup directory (usually under &lt;code&gt;/sys/fs/cgroup&lt;/code&gt;).&lt;br&gt;
Within the &lt;code&gt;cpu.stat&lt;/code&gt; file there are three key metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nr_throttled&lt;/code&gt;: Number of times the container was throttled.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;throttled_time&lt;/code&gt;: Total time spent throttled, in nanoseconds&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nr_periods&lt;/code&gt;: Total CPU allocation periods.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;cat /sys/fs/cgroup/cpu/cpu.stat&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nr_periods 12345
nr_throttled 543
throttled_time 987654321
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;nr_throttled&lt;/code&gt; is high relative to &lt;code&gt;nr_periods&lt;/code&gt;, or &lt;code&gt;throttled_time&lt;/code&gt; keeps growing, then your container is being CPU throttled.&lt;/p&gt;
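&lt;p&gt;As a quick sanity check, you can turn those counters into a percentage with a one-liner along these lines (the path assumes the cgroup v1 layout used above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# With the sample output above: 543 / 12345 periods = ~4.4%
awk '/^nr_periods/ {p=$2} /^nr_throttled/ {t=$2} END {printf "%.1f%% of periods throttled\n", t*100/p}' /sys/fs/cgroup/cpu/cpu.stat
# 4.4% of periods throttled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;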

&lt;h3&gt;
  
  
  Monitor Container Orchestration Metrics
&lt;/h3&gt;

&lt;p&gt;If you’re running Kubernetes, you can use the &lt;a href="https://kubernetes.io/docs/reference/kubectl/generated/kubectl_top/kubectl_top_pod/" rel="noopener noreferrer"&gt;kubectl top pod&lt;/a&gt; command to get metric data on the highest utilized pods.  Try the command below to get metrics for a pod and all the associated containers:&lt;br&gt;
&lt;code&gt;kubectl top pod --containers&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is a very manual process, and you will need to compare the CPU usage against the limits defined in the pod’s resource config.  If you run a describe command on the pod, it will show you this information.  This also means you need to know which pod is having the issue.  Usually when issues arise in an application, it takes some time to drill down to the component that might be performing poorly.  Note that the Kubernetes Metrics Server must be installed to run commands like &lt;code&gt;kubectl top&lt;/code&gt;.  For throttling specifically, cAdvisor metrics like &lt;code&gt;container_cpu_cfs_throttled_periods_total&lt;/code&gt; and &lt;code&gt;container_cpu_cfs_periods_total&lt;/code&gt; offer valuable insight into CPU usage and throttling.&lt;/p&gt;
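&lt;p&gt;If those cAdvisor metrics are scraped by Prometheus (a common setup, though your labels may differ), a query along these lines shows the fraction of CFS periods in which each container was throttled:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fraction of CPU periods throttled per container over 5 minutes
sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total[5m]))
/
sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;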
&lt;h3&gt;
  
  
  Application Performance Metrics
&lt;/h3&gt;

&lt;p&gt;Although they can be expensive, &lt;a href="https://www.g2.com/categories/application-performance-monitoring-apm" rel="noopener noreferrer"&gt;application performance monitoring (APM) tools&lt;/a&gt; provide invaluable insights into CPU throttling, offering detailed visibility that can help uncover the issue. These tools can often track throttling over time, identify exactly when it first occurred, and, in some cases, even predict future throttling trends based on usage patterns. Many organizations use a combination of monitoring tools to get a comprehensive view of their systems. APM tools also highlight the symptoms of CPU throttling, which may manifest as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prolonged request durations, leading to slower application response times.&lt;/li&gt;
&lt;li&gt;Decreased throughput, resulting in fewer transactions or tasks processed within a given timeframe.&lt;/li&gt;
&lt;li&gt;Irregular CPU usage patterns, which can signal performance instability or inefficiencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By combining the capabilities of APM tools with metrics collected from Kubernetes, teams can proactively address CPU throttling and ensure optimal application performance.&lt;/p&gt;
&lt;h2&gt;
  
  
  Best Practices to Manage CPU Throttling in Kubernetes
&lt;/h2&gt;

&lt;p&gt;There are many ways to fix CPU throttling and even a few ways you can prevent it. Most root causes of CPU throttling are overcommitted nodes or misconfigured CPU limits.  Below are some ways to fix CPU throttling when it occurs and some best practices to avoid it in the future.&lt;/p&gt;
&lt;h3&gt;
  
  
  Adjust CPU Limits
&lt;/h3&gt;

&lt;p&gt;Update resource limits in your container or pod configuration; Kubernetes resource specs can be updated as in the example below.  Among the customers I have worked with, most set the limit just above the peak usage of the last 30, 60, or even 90 days.  For non-critical workloads, I have seen a few companies set this limit to 80% of max usage, and a few use more advanced techniques like calculating percentiles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "1000m"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Increase the &lt;code&gt;limits.cpu&lt;/code&gt; value to reduce throttling frequency.&lt;br&gt;
Set &lt;code&gt;requests.cpu&lt;/code&gt; to ensure better performance during contention.  Note that if you do not set the request, Kubernetes will automatically set the request to the limit.&lt;/p&gt;
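&lt;p&gt;If you’d rather not edit the manifest by hand, &lt;code&gt;kubectl set resources&lt;/code&gt; can apply the same change; the deployment and container names below are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Raise the CPU request/limit on one container of a deployment
kubectl set resources deployment my-app -c my-container --requests=cpu=500m --limits=cpu=1000m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;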
&lt;h3&gt;
  
  
  Use Autoscaling like Horizontal Pod Autoscaler
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" rel="noopener noreferrer"&gt;Horizontal Pod Autoscaler&lt;/a&gt; (HPA) in Kubernetes helps address CPU throttling by dynamically adjusting the number of pods in a deployment based on real-time resource usage.  Resources like CPU and memory are monitored, and when certain thresholds are met, HPA kicks in to provision more pods.  In more idle periods it will also scale down the number of pods to help you run more efficiently.  By distributing the workload across more pods, HPA reduces the CPU demands on individual pods, thereby mitigating CPU throttling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, if average CPU utilization across all pods exceeds 80%, the HPA adds pods as necessary within the bounds of 2 to 10 (&lt;code&gt;minReplicas&lt;/code&gt; and &lt;code&gt;maxReplicas&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Analyze Node Resource Allocation
&lt;/h3&gt;

&lt;p&gt;Check overall CPU availability on nodes using a describe command:&lt;br&gt;
&lt;code&gt;kubectl describe node &amp;lt;node-name&amp;gt;&lt;/code&gt;&lt;br&gt;
Ensure nodes aren’t overcommitted. Use taints and tolerations to control scheduling and ensure high-priority workloads run on dedicated nodes (see the example below).  Overcommitted nodes run the risk of not having CPU available.  If containers’ requests exceed the node’s CPU capacity, you are going to run into scheduling problems; and if limits are set too high relative to the node’s CPU capacity and the workload suddenly increases, you are going to have contention.&lt;/p&gt;
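&lt;p&gt;As a sketch, dedicating nodes with a taint and a matching toleration might look like this; the &lt;code&gt;dedicated=high-priority&lt;/code&gt; key/value and node name are invented for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Taint a node so only pods that tolerate the taint can schedule there
kubectl taint nodes my-node dedicated=high-priority:NoSchedule

# Matching toleration in the pod spec
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "high-priority"
    effect: "NoSchedule"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;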

&lt;h3&gt;
  
  
  Tweak CPU CFS Settings
&lt;/h3&gt;

&lt;p&gt;Containers use the &lt;a href="https://docs.kernel.org/scheduler/sched-design-CFS.html" rel="noopener noreferrer"&gt;Completely Fair Scheduler&lt;/a&gt; (CFS) by default. The CFS in Kubernetes is a mechanism inherited from the Linux kernel that enforces CPU usage limits on containers. It works by using two key parameters from Linux cgroups: &lt;code&gt;cpu.cfs_quota_us&lt;/code&gt; and &lt;code&gt;cpu.cfs_period_us&lt;/code&gt;. These parameters allow Kubernetes to control the amount of CPU time a container can use over a specific period:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cpu.cfs_quota_us&lt;/code&gt;: Maximum microseconds of CPU time allowed per period.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cpu.cfs_period_us&lt;/code&gt;: Length of a scheduling period in microseconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To prevent CPU throttling: Increase &lt;code&gt;cpu.cfs_quota_us&lt;/code&gt; to provide more CPU time:&lt;br&gt;
&lt;code&gt;echo 200000 &amp;gt; /sys/fs/cgroup/cpu/cpu.cfs_quota_us&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I have seen this create issues before, though, so be careful with this adjustment, as it can lead to overcommitment.  In other words, if too many tasks are scheduled and you increase the amount of time a container can use the CPU, you will create delays and throttling.  Start by playing around with this in Dev or Test clusters before you make any changes to prod… duh. &lt;/p&gt;

&lt;h3&gt;
  
  
  Use CPU Pinning
&lt;/h3&gt;

&lt;p&gt;This is more of an edge case, but instead of using CPU shares and limits, you can pin containers to specific CPUs for predictable performance.  The Kubernetes CPU Manager controls how CPUs are allocated to containers.  To enable CPU pinning, the &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy" rel="noopener noreferrer"&gt;static CPU manager policy&lt;/a&gt; must be used, which provides exclusive CPU allocation.  Just enable the policy in the Kubelet configuration file:&lt;br&gt;
&lt;code&gt;cpuManagerPolicy: static&lt;/code&gt;&lt;br&gt;
With the policy set to “static,” containers in Guaranteed QoS pods that request whole CPUs are allocated exclusive cores on the node; Kubernetes assigns the container specific CPUs, and it runs only on those cores.  The big challenges with CPU pinning are overhead and scalability: managing pinned workloads requires detailed planning to avoid fragmentation and underutilization.  CPU pinning is good for workloads that are sensitive to CPU throttling but not ideal for volatile and dynamic workloads.&lt;/p&gt;
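&lt;p&gt;For a container to actually receive exclusive cores under the static policy, its pod must be in the Guaranteed QoS class and request a whole number of CPUs, for example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Requests equal limits (Guaranteed QoS) and cpu is an integer,
# so the static CPU Manager grants this container exclusive cores
resources:
  requests:
    cpu: "2"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "1Gi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;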

&lt;h2&gt;
  
  
  CPU Throttling is a Double-Edged Sword in Kubernetes
&lt;/h2&gt;

&lt;p&gt;While CPU throttling plays a crucial role in resource management and stability, it can also hinder application performance if not managed correctly. By understanding how CPU throttling works and implementing best practices, you can optimize your Kubernetes environment, ensuring efficient resource use and enhanced application performance. As Kubernetes continues to grow and evolve, keeping a close eye on resource management will be key to maintaining robust and responsive applications.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloudnative</category>
      <category>containers</category>
      <category>containerapps</category>
    </item>
    <item>
      <title>Observability talks sure to make waves at KubeCon</title>
      <dc:creator>Karina Babcock</dc:creator>
      <pubDate>Fri, 01 Nov 2024 15:15:22 +0000</pubDate>
      <link>https://forem.com/causely/observability-talks-sure-to-make-waves-at-kubecon-29i5</link>
      <guid>https://forem.com/causely/observability-talks-sure-to-make-waves-at-kubecon-29i5</guid>
      <description>&lt;p&gt;Author: &lt;a class="mentioned-user" href="https://dev.to/enlin_xu_3050bde9796a6fe9"&gt;@enlin_xu_3050bde9796a6fe9&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/" rel="noopener noreferrer"&gt;KubeCon North America 2024&lt;/a&gt; is around the corner! This year I’m especially excited, as it’s my first KubeCon since we launched &lt;a href="https://www.causely.io/?utm_source=dev.to&amp;amp;utm_medium=causely_organization&amp;amp;utm_campaign=kubecon"&gt;Causely&lt;/a&gt;. The energy at KubeCon is unmatched, and it’s a great opportunity to catch up with familiar faces and make new connections in the community.&lt;/p&gt;

&lt;p&gt;Cloud-native and open source technologies are foundational to everything we’re building at Causely, and I’m looking forward to diving into the latest developments with observability tools that will shape the reliability of countless modern applications. We’re excited about how observability tools, such as Prometheus, OpenTelemetry, Grafana Beyla, and Odigos, are improving how systems are monitored and understood. These tools will underpin the reliability engineering strategy of countless modern application environments.&lt;/p&gt;

&lt;p&gt;In this post, I’ll unpack some of the exciting innovations happening in the open source observability space, and highlight specific talks at KubeCon 2024 I’m looking forward to attending. Add them to your schedule if this is an area you’re following too, and I’ll see you there!&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s New in the Observability Landscape
&lt;/h2&gt;

&lt;p&gt;The observability space is evolving rapidly. Several key developments with OpenTelemetry and eBPF are reshaping how we approach monitoring and tracing in distributed systems. This helps make metric collection and auto-instrumentation easier and more flexible than ever.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus and OpenTelemetry are even better together
&lt;/h3&gt;

&lt;p&gt;The collaboration between Prometheus and OpenTelemetry is delivering a more cohesive experience for capturing system metrics. Here’s what’s new:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Using the OTel Collector’s Prometheus Receiver:&lt;/strong&gt; OpenTelemetry now allows for streamlined ingestion of Prometheus metrics, creating a more unified data pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploring OTel-Native Metric Collection Options:&lt;/strong&gt; For those deeply involved in Kubernetes monitoring, tools like Kubernetes Cluster Receiver and the Kubeletstats Receiver provide robust options for flexible metric collection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These integrations create a powerful foundation for system insights that can fuel everything from troubleshooting to strategic infrastructure optimizations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Odigos makes migration to OpenTelemetry easy
&lt;/h3&gt;

&lt;p&gt;Transitioning from proprietary observability tools can be challenging.  If you’re looking to make that transition to OpenTelemetry, Odigos is a great enabler. Here’s why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplified Migration:&lt;/strong&gt; Odigos takes the complexity out of moving to OpenTelemetry, helping reduce the friction and downtime that can accompany large-scale tooling changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access to Enhanced Capabilities:&lt;/strong&gt; With OpenTelemetry, you gain access to a broader ecosystem, opening doors to new integrations and data visualizations that enrich your observability setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Streamline eBPF-based auto-instrumentation
&lt;/h3&gt;

&lt;p&gt;Combining Grafana Beyla with Grafana Alloy makes eBPF-powered observability easy and minimally intrusive. Whether you’re working with standalone systems or Kubernetes clusters, this integration provides high-precision monitoring capabilities without heavy overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yivdi1mofnkq3mrh037.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yivdi1mofnkq3mrh037.png" alt="KubeCon NA 2024" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  KubeCon 2024 Deep Dives and Hands-On Sessions
&lt;/h2&gt;

&lt;p&gt;The schedule at KubeCon North America 2024 will feature several sessions dedicated to observability. Here are some highlights I’m looking forward to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sched.co/1iW8e" rel="noopener noreferrer"&gt;OpenTelemetry: The OpenTelemetry Hero’s Journey – Working with Open Source Observability&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Takeaway:&lt;/em&gt; Dive into the current capabilities of OpenTelemetry, including correlated metrics, traces, and logs for a complete observability picture. The talk also addresses gaps in today’s open source observability tools.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sched.co/1iW8h" rel="noopener noreferrer"&gt;Inspektor Gadget: eBPF for Observability, Made Easy and Approachable&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Takeaway:&lt;/em&gt; This project lightning talk will explain how Inspektor Gadget simplifies eBPF distribution and deployment, allowing you to build fast, efficient data collection pipelines that plug into popular observability tools.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sched.co/1i7lA" rel="noopener noreferrer"&gt;Optimizing LLM Performance in Kubernetes with OpenTelemetry&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Takeaway:&lt;/em&gt; Speakers Ashok Chandrasekar (Google) and Liudmila Molkova (Microsoft) will help you gain practical insights into observing and optimizing Large Language Model (LLM) deployments on Kubernetes. This session covers everything from client and server tracing with OpenTelemetry to advanced autoscaling strategies.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sched.co/1i7lJ" rel="noopener noreferrer"&gt;Unifying Observability: Correlating Metrics, Traces, and Logs with Exemplars and OpenTelemetry&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Takeaway:&lt;/em&gt; Speakers Kruthika Prasanna Simha and Charlie Le (Apple) will help attendees learn how to correlate data across metrics, traces, and logs, with exemplars. This session demonstrates practical visualization techniques in Grafana, making it easy to move from high-level metrics down to detailed traces.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sched.co/1i7pF" rel="noopener noreferrer"&gt;Now You See Me: Tame MTTR with Real-Time Anomaly Detection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Takeaway:&lt;/em&gt; Speakers Kruthika Prasanna Simha and Raj Bhensadadia (Apple) will dive into the latest in real-time anomaly detection, with insights into applying machine learning to time series data in cloud-native environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Find Causely at KubeCon 2024!
&lt;/h2&gt;

&lt;p&gt;If the above topics are also interesting to you, we’d love to meet up. At Causely, we’re passionate about helping organizations unlock the full potential of their observability data to continuously assure service reliability. We believe that flying with overwhelming volumes of observability data is just as bad as flying blind.&lt;/p&gt;

&lt;p&gt;Here’s where you can find us at KubeCon:&lt;/p&gt;

&lt;p&gt;🥂 &lt;strong&gt;Come to our happy hour!&lt;/strong&gt; We’re co-hosting a &lt;a href="https://lu.ma/d8dz3bal" rel="noopener noreferrer"&gt;happy hour&lt;/a&gt; with friends from NVIDIA, Alma Security, Edera, and 645 Ventures on November 12th.&lt;br&gt;
🦈 &lt;strong&gt;Find the shark!&lt;/strong&gt; Look for a shark at the show on the afternoon of November 13th. Here’s a hint: &lt;em&gt;she’ll be holding something sweet.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The use of eBPF – in Netflix, GPU infrastructure, Windows programs and more</title>
      <dc:creator>Karina Babcock</dc:creator>
      <pubDate>Mon, 30 Sep 2024 15:43:42 +0000</pubDate>
      <link>https://forem.com/causely/the-use-of-ebpf-in-netflix-gpu-infrastructure-windows-programs-and-more-b69</link>
      <guid>https://forem.com/causely/the-use-of-ebpf-in-netflix-gpu-infrastructure-windows-programs-and-more-b69</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally posted to &lt;a href="https://www.causely.io/blog/the-use-of-ebpf-in-netflix-gpu-infrastructure-windows-programs-and-more/?utm_source=dev.to&amp;amp;utm_medium=org_profile&amp;amp;utm_campaign=ebpf"&gt;causely.io&lt;/a&gt; by Will Searle&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways from eBPF Summit 2024
&lt;/h2&gt;

&lt;p&gt;How are organizations applying eBPF to solve real problems in observability, security, profiling, and networking? It’s a question I’ve found myself asking as I work in and around the observability space – and I was pleasantly surprised when Isovalent’s recent eBPF Summit provided some answers.&lt;/p&gt;

&lt;p&gt;For those new to eBPF, it’s an open source technology that empowers observability practices. Many organizations and vendors have adopted it as a data source (including &lt;a href="https://www.causely.io/?utm_source=dev.to&amp;amp;utm_medium=org_profile&amp;amp;utm_campaign=ebpf"&gt;Causely&lt;/a&gt;, where we use it to enhance our instrumentation for Kubernetes).&lt;/p&gt;

&lt;p&gt;Many of the eBPF sessions highlighted real challenges companies faced and how they used eBPF to overcome them. In the spirit of helping others, my cliff notes and key takeaways from eBPF Summit are below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Organizations like Netflix and Datadog are using eBPF in new, creative ways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The use of eBPF in Netflix
&lt;/h3&gt;

&lt;p&gt;One of the &lt;a href="https://youtu.be/Pkz65BJHN2M?si=8sV-B_E8WgHbm4-o" rel="noopener noreferrer"&gt;Keynote presentations&lt;/a&gt; was delivered by Shweta Saraf, who described specific problems Netflix overcame using eBPF, such as noisy neighbors. This is a common problem faced by many companies with cloud-native environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://youtu.be/Pkz65BJHN2M?si=wST1Qc8Z7XcfBcZX" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15qjoh4d1n6wxvfvqum6.png" alt="Shweta Saraf described Netflix’s use cases for eBPF" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Netflix uses eBPF to measure how long processes spend in the CPU scheduled state.  When processes are taking too long, it usually indicates a performance bottleneck on CPU resources — like CPU throttling or over-allocation.  (Netflix’s compute and performance team released a &lt;a href="https://netflixtechblog.com/noisy-neighbor-detection-with-ebpf-64b1f4b3bbdd" rel="noopener noreferrer"&gt;blog&lt;/a&gt; recently with much more detail on the subject.)  In solving the noisy neighbor problem, the Netflix team also created a tool called bpftop which is designed to measure the CPU usage of the eBPF code they instrumented.&lt;/p&gt;

&lt;p&gt;The Netflix team released &lt;a href="https://github.com/Netflix/bpftop" rel="noopener noreferrer"&gt;bpftop&lt;/a&gt; for the community to use, and it will ultimately help organizations implement efficient eBPF programs.  This is especially useful if an eBPF program is hung, allowing teams to quickly identify any overhead that an eBPF program has.  We have come full circle: &lt;strong&gt;&lt;em&gt;monitoring our monitoring programs&lt;/em&gt;&lt;/strong&gt; 😁.&lt;/p&gt;

&lt;h3&gt;
  
  
  The use of eBPF in Datadog
&lt;/h3&gt;

&lt;p&gt;Another use case for eBPF – and one that can be easily overlooked – is in chaos engineering.  Scott Gerring, a technical advocate at Datadog, shared his experience on the matter.  This quote resonated with me: “with eBPF… we have this universal language of destruction” – controlled destruction that is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://youtu.be/_5Zabryx0nE?si=n3vlWdW-wBygE_Fr" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxdki1n4uzmtvpzgjmbd.png" alt="Scott Gerring discussed eBPF’s use in Datadog" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The benefit of eBPF is that we can inject failures into cloud-native systems without having to re-write the code of an application.  Interestingly, there are open source projects out there for chaos engineering that already use eBPF, such as &lt;a href="https://github.com/chaos-mesh/chaos-mesh" rel="noopener noreferrer"&gt;ChaosMesh&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Scott listed a few examples, like kernel probes attached to the &lt;code&gt;openat&lt;/code&gt; system call that cause access-denied errors for 50% of calls made by processes a user can select or define, or using the traffic control subsystem to drop packets for sockets on processes you want to mark for failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  eBPF will underpin AI development
&lt;/h2&gt;

&lt;p&gt;Isovalent Co-founder and CTO Thomas Graf presented the eBPF roadmap and what he is most excited about.  Notably: eBPF will deliver value in enabling the GPU and DPU infrastructure wave fueled by AI.  AI is undoubtedly one of the hottest topics in tech right now.  Many companies are using GPUs and DPUs to accelerate AI and ML (machine learning) tasks, because CPUs cannot deliver the processing power demanded by today’s AI models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://youtu.be/oVoW5BUBRJk?si=MbFhH8fbtRMd-eQk" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fui5dgi2oanogxldloy1m.png" alt="Thomas Graf talked about the value of eBPF in enabling GPU and DPU infrastructures" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As Tom mentioned, whether the AI wave produces anything meaningful is up for debate, but companies will undoubtedly try, and they will make significant investments in GPUs and DPUs along the way.  The capabilities of eBPF will be applied to this new wave of infrastructure in the same manner they did for CPUs.&lt;/p&gt;

&lt;p&gt;GPUs and DPUs are expensive, so companies do not want to waste processing power on programs that will drive up utilization. The efficiency of eBPF programs can help maximize the performance of costly GPUs. For example, eBPF can be used for GPU profiling by hooking into GPU events such as memory, sync, and kernel launches.  Unlocking this type of data can be used to understand which kernels are used most frequently, improving efficiencies of AI development.&lt;/p&gt;

&lt;h2&gt;
  
  
  eBPF support for Windows is growing
&lt;/h2&gt;

&lt;p&gt;Another interesting milestone in eBPF’s journey is the support for Windows.  In fact, there is a growing Git Repository for eBPF programs on Windows that exists today: &lt;a href="https://github.com/microsoft/ebpf-for-windows" rel="noopener noreferrer"&gt;https://github.com/microsoft/ebpf-for-windows&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project supports Windows 10 or later and Windows Server 2019 or later, and while there is not yet feature parity with Linux, there is a lot of development in this space.  The community is hard at work porting over the tooling that exists for eBPF on Linux, but it is a challenging endeavor, as eBPF components (like Just-In-Time compilation or eBPF bytecode signing) and their hook points differ on Windows.&lt;/p&gt;

&lt;p&gt;It will be exciting to watch the same networking, security, and observability eBPF capabilities on Linux become available for Windows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The need for better observability is fueling eBPF ecosystem growth
&lt;/h2&gt;

&lt;p&gt;eBPF tools have been created by the community for both application and infrastructure use cases.  There are nine major application-focused projects and over 30 exciting emerging ones.  Notably, while there are a few production-ready runtimes and tools within the infrastructure ecosystem (like Linux and the LLVM compiler), there are many emerging projects, such as eBPF for Windows.&lt;/p&gt;

&lt;p&gt;With a user base across Meta, Apple, Capital One, LinkedIn, and Walmart (just to name a few), we can expect the number of eBPF projects to grow considerably in the coming years.  The overall number of projects is forecast to reach triple digits by the end of 2025.&lt;/p&gt;

&lt;p&gt;One of the top catalysts for growth? The urgent need for better observability.  Of all the topics at last year’s KubeCon in Chicago, observability ranked the highest, beating competing topics like cost and automation.  As with any other tool, eBPF can help organizations gather a lot of data, but the “why” is important. Are you using that data to create more noise and more alerts, or can you apply it to get to the root cause of problems as they surface, or to other applications?&lt;/p&gt;

&lt;p&gt;It is exciting to watch the eBPF community develop and implement creative new ways to use eBPF, and the 2024 eBPF Summit was (and still is) an excellent source of real-world eBPF use cases and community-generated tooling.&lt;/p&gt;

</description>
      <category>networking</category>
      <category>opensource</category>
      <category>observability</category>
      <category>ebpf</category>
    </item>
    <item>
      <title>Preventing Out-of-Memory (OOM) Kills in Kubernetes: Tips for Optimizing Container Memory Management</title>
      <dc:creator>Karina Babcock</dc:creator>
      <pubDate>Mon, 23 Sep 2024 15:34:26 +0000</pubDate>
      <link>https://forem.com/causely/preventing-out-of-memory-oom-kills-in-kubernetes-tips-for-optimizing-container-memory-management-1pi5</link>
      <guid>https://forem.com/causely/preventing-out-of-memory-oom-kills-in-kubernetes-tips-for-optimizing-container-memory-management-1pi5</guid>
      <description>&lt;p&gt;Running containerized applications at scale with Kubernetes demands careful resource management. One very complicated but common challenge is preventing Out-of-Memory (OOM) kills, which occur when a container’s memory consumption surpasses its allocated limit. This brutal termination by the Kubernetes kernel’s OOM killer disrupts application stability and can affect application availability and the health of your overall environment.&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore the reasons that OOM kills can occur and provide tactics to combat and prevent them.&lt;/p&gt;

&lt;p&gt;Before diving in, it’s worth noting that OOM kills represent one symptom that can have a variety of root causes. It’s important for organizations to implement a system that solves the root cause analysis problem with speed and accuracy, allowing reliability engineering teams to respond rapidly, and to potentially prevent these occurrences in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep dive into an OOM kill
&lt;/h2&gt;

&lt;p&gt;An Out-Of-Memory (OOM) kill in &lt;a href="https://www.causely.io/resources/glossary-cloud-native-technologies/?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=oom_kill"&gt;Kubernetes&lt;/a&gt; occurs when a container exceeds its memory limit, causing the Linux kernel’s OOM killer to terminate the container. This impacts application stability and requires immediate attention.&lt;/p&gt;

&lt;p&gt;Several factors can trigger OOM kills in your Kubernetes environment, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory limits exceeded:&lt;/strong&gt; This is the most common culprit. If a container consistently pushes past its designated memory ceiling, the OOM killer steps in to prevent a system-wide meltdown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory leaks:&lt;/strong&gt; Applications can develop memory leaks over time, where they allocate memory but fail to release it properly. This hidden, unexpected growth eventually leads to OOM kills.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource overcommitment:&lt;/strong&gt; Co-locating too many resource-hungry pods onto a single node can deplete available memory. When the combined memory usage exceeds capacity, the OOM killer springs into action.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bursting workloads:&lt;/strong&gt; Applications with spiky workloads can experience sudden memory surges that breach their limits, triggering OOM kills.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As an example, a web server that experiences a memory leak code bug may gradually consume more and more memory until the OOM killer intervenes to prevent a crash.&lt;/p&gt;

&lt;p&gt;Another case could be when a Kubernetes cluster over-commits resources by scheduling too many pods on a single node. The OOM killer may need to step in to free up memory and ensure system stability.&lt;/p&gt;
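&lt;p&gt;You can usually confirm an OOM kill from the container’s last state in the pod description; the pod name here is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe pod my-app-pod | grep -A 3 "Last State"
# Last State:  Terminated
#   Reason:    OOMKilled
#   Exit Code: 137
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;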

&lt;h2&gt;
  
  
  The devastating effects of OOM kills: Why they matter
&lt;/h2&gt;

&lt;p&gt;OOM kills aren’t normal, everyday events. They can trigger a cascade of negative consequences for your applications and the overall health of the cluster, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Application downtime:&lt;/strong&gt; When a container is OOM-killed, it abruptly terminates, causing immediate application downtime. Users may experience service disruptions and outages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data loss:&lt;/strong&gt; Applications that rely on in-memory data or stateful sessions risk losing critical information during an OOM kill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance degradation:&lt;/strong&gt; Frequent OOM kills force containers to restart repeatedly. This constant churn degrades overall application performance and user experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service disruption:&lt;/strong&gt; Applications often interact with each other. An OOM kill in one container can disrupt inter-service communication, causing cascading failures and broader service outages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a container running a critical database service experiences an OOM kill, it could result in data loss and corruption. This leads to service disruptions for other containers that rely on the database for information, causing cascading failures across the entire application ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Combating OOM kills
&lt;/h2&gt;

&lt;p&gt;There are a few different tactics to combat OOM kills in attempt to operate a memory-efficient Kubernetes environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Set appropriate resource requests and limits
&lt;/h3&gt;

&lt;p&gt;For example, you can set a memory request of 200Mi and a memory limit of 300Mi for a particular container in your Kubernetes deployment. Requests ensure the container gets at least 200Mi of memory, while limits cap it at 300Mi to prevent excessive consumption.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:

  requests:

    memory: "200Mi"

  limits:

    memory: "300Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this may mitigate potential memory use issues, it is a very manual process and does not deal at all with the dynamic nature of what we can achieve with Kubernetes. It also doesn’t solve the source issue, which may be a code-level problem triggering memory leaks or failed GC processes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transition to autoscaling
&lt;/h3&gt;

&lt;p&gt;Leveraging &lt;a href="https://www.causely.io/resources/glossary-cloud-native-technologies/?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=oom_kill"&gt;autoscaling&lt;/a&gt; capabilities is a core dynamic option for resource allocation. There are two autoscaling methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vertical Pod Autoscaling (VPA):&lt;/strong&gt; &lt;a href="https://kubernetes.io/docs/concepts/workloads/autoscaling/" rel="noopener noreferrer"&gt;VPA&lt;/a&gt; dynamically adjusts resource limits based on real-time memory usage patterns. This ensures containers have enough memory to function but avoids over-provisioning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Pod Autoscaling (HPA):&lt;/strong&gt; &lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" rel="noopener noreferrer"&gt;HPA&lt;/a&gt; scales the number of pods running your application up or down based on memory utilization. This distributes memory usage across multiple pods, preventing any single pod from exceeding its limit. The following HPA configuration shows an example of scaling based on memory usage:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2beta2

kind: HorizontalPodAutoscaler

metadata:

  name: my-app-hpa

spec:

  scaleTargetRef:

    apiVersion: apps/v1

    kind: Deployment

    name: my-app

  minReplicas: 2

  maxReplicas: 10

  metrics:

    - type: Resource

      resource:

        name: memory

        target:

          type: Utilization

          averageUtilization: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitor memory usage
&lt;/h3&gt;

&lt;p&gt;Proactive monitoring is key. For instance, you can configure &lt;a href="https://www.causely.io/resources/glossary-cloud-native-technologies/?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=oom_kill"&gt;Prometheus&lt;/a&gt; to scrape memory metrics from your Kubernetes pods every 15 seconds and set up &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; dashboards to visualize memory usage trends over time. Additionally, you can create alerts in Prometheus to trigger notifications when memory usage exceeds a certain threshold.&lt;/p&gt;
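&lt;p&gt;As a sketch of such an alert, the rule below fires when a container’s working set stays above 90% of its memory limit for five minutes. It assumes cAdvisor metrics are being scraped; the group name, threshold, and labels are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: memory-alerts
    rules:
      - alert: ContainerMemoryNearLimit
        # Working set vs. limit; the "&amp;gt; 0" filter skips containers with no limit
        expr: |
          container_memory_working_set_bytes{container!=""}
            / (container_spec_memory_limit_bytes{container!=""} &amp;gt; 0) &amp;gt; 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container memory usage is above 90% of its limit"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;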

&lt;h3&gt;
  
  
  Optimize application memory usage
&lt;/h3&gt;

&lt;p&gt;Don’t underestimate the power of code optimization. Address memory leaks within your applications and implement memory-efficient data structures to minimize memory consumption.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pod disruption budgets (PDB)
&lt;/h3&gt;

&lt;p&gt;When deploying updates, &lt;a href="https://www.causely.io/resources/glossary-cloud-native-technologies/?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=oom_kill"&gt;PDBs&lt;/a&gt; ensure a minimum number of pods remain available, even during rollouts. This mitigates the risk of widespread OOM kills during deployments. Here is a PDB configuration example that helps ensure minimum pod availability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 80%   # at least 80% of matching pods must remain available during voluntary disruptions
  selector:
    matchLabels:
      app: my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Manage node resources
&lt;/h3&gt;

&lt;p&gt;You can apply a node selector that matches a label you have assigned to nodes with at least 8GB of memory, ensuring that a memory-intensive pod is only scheduled there. Additionally, you can use taints and tolerations to dedicate high-memory nodes to memory-hungry applications, preventing OOM kills due to resource constraints. The label, key, and value below are illustrative; substitute your own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodeSelector:
  memory-tier: high   # illustrative label applied to nodes with 8GB+ of memory
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "high-memory"
    effect: "NoSchedule"   # tolerates the taint reserving these nodes for memory-hungry workloads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use QoS classes
&lt;/h3&gt;

&lt;p&gt;Kubernetes offers Quality of Service (&lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/" rel="noopener noreferrer"&gt;QoS&lt;/a&gt;) classes that prioritize resource allocation for critical applications. Assign the highest class, Guaranteed, to applications that can least tolerate OOM kills; a pod earns it when every container’s requests equal its limits. Here is a sample resource configuration that yields the Guaranteed QoS class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "500m"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are a few potential strategies to help prevent OOM kills. The challenge lies in how frequently OOM kills can occur and the risk they pose to your applications when they do.&lt;/p&gt;

&lt;p&gt;As you can imagine, it is not feasible to manually manage resource utilization at this scale while guaranteeing the stability and performance of the containerized applications in your Kubernetes environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manual thresholds = Rigidity and risk
&lt;/h2&gt;

&lt;p&gt;These techniques can help reduce the risk of OOM kills, but they don’t solve the problem entirely. By setting manual thresholds and limits, you give up many of the dynamic advantages of Kubernetes.&lt;/p&gt;

&lt;p&gt;A better way to solve the OOM kill problem is adaptive, dynamic resource allocation. Even if you get resource allocation right at initial deployment, many changing factors affect how your application consumes resources over time. There is added risk because application and resource issues don’t stay confined to one pod or one container; they can reach every part of the cluster and degrade the other running applications and services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which strategy works best to prevent OOM kills?
&lt;/h2&gt;

&lt;p&gt;Vertical Pod Autoscaling (VPA) and Horizontal Pod Autoscaling (HPA) are common strategies for managing resource allocation in Kubernetes. VPA adjusts resource requests and limits based on observed usage patterns, while HPA scales the number of pods based on memory utilization.&lt;/p&gt;

&lt;p&gt;Monitoring with tools like Prometheus can help you troubleshoot memory usage trends. Optimizing application memory usage is no easy feat, because it is especially challenging to identify whether infrastructure or code is causing the problem.&lt;/p&gt;

&lt;p&gt;Pod Disruption Budgets (PDB) may help ensure a minimum number of pods remain available during deployments, while node resources can be managed using node selectors and taints. Quality of Service (QoS) classes prioritize resource allocation for critical applications.&lt;/p&gt;

&lt;p&gt;One thing is certain: OOM kills are a common and costly challenge to manage using traditional monitoring tools and methods.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://www.causely.io?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=oom_kill"&gt;Causely&lt;/a&gt;, we’re focused on applying causal reasoning software to help organizations keep applications healthy and resilient. By automating root cause analysis, issues like OOM kills can be resolved in seconds, and unintended consequences of new releases or application changes can be avoided.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>container</category>
      <category>containerapps</category>
      <category>microservices</category>
    </item>
    <item>
      <title>The “R” in MTTR: Repair or Recover? What’s the difference?</title>
      <dc:creator>Karina Babcock</dc:creator>
      <pubDate>Wed, 18 Sep 2024 13:28:50 +0000</pubDate>
      <link>https://forem.com/causely/the-r-in-mttr-repair-or-recover-whats-the-difference-1ooi</link>
      <guid>https://forem.com/causely/the-r-in-mttr-repair-or-recover-whats-the-difference-1ooi</guid>
      <description>&lt;p&gt;&lt;em&gt;Author: &lt;a href="https://www.linkedin.com/in/will-searle-209a9222/" rel="noopener noreferrer"&gt;Will Searle&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding meaning in a world of acronyms
&lt;/h2&gt;

&lt;p&gt;There are so many ways to measure application reliability today, with hundreds of key performance indicators (KPIs) to measure availability, error rates, user experiences, and quality of service (QoS). Yet every organization I speak with struggles to effectively use these metrics.  Some applications and services require custom metrics around reliability while others can be measured with just uptime vs. downtime.&lt;/p&gt;

&lt;p&gt;In my role at &lt;a href="https://www.causely.io?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=mttr"&gt;Causely&lt;/a&gt;, I work with companies every day who are trying to improve the reliability, resiliency, and agility of their applications. One method of measuring reliability that I keep a close eye on is MTT(XYZ).  Yes, I made that up, but it’s meant to capture all the different variations of mean time to “X” out there.  We have MTTR, MTTI, MTTF, MTTA, MTBF, MTTD, and the list keeps going.  In fact, some of these acronyms have multiple definitions.  The one whose meaning I want to discuss today is MTTR.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, what’s the meaning of MTTR anyway?
&lt;/h2&gt;

&lt;p&gt;Before cloud-native applications, MTTR meant one thing – Mean Time to Repair. It’s a metric focused on how quickly an organization can respond to and fix problems that cause downtime or performance degradation.  It’s simple to calculate too:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fupif7c432b5vlzkvkq1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fupif7c432b5vlzkvkq1e.png" alt="MTTR meaning: How to calculate" width="480" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Total time spent on repairs is the length of time IT spends fixing issues, and number of repairs is the number of times a fix has been implemented.  Some organizations look at this over a week or a month in production. It’s a great metric to understand how resilient your system is and how quickly the team can fix a known issue.  Unfortunately, data suggests that most IT organizations’ &lt;a href="https://www.cncf.io/blog/2024/04/18/the-challenges-of-rising-mttr-and-what-to-do/" rel="noopener noreferrer"&gt;MTTR is increasing every year&lt;/a&gt;, despite massive investments in the observability stack.&lt;/p&gt;
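
&lt;p&gt;With illustrative numbers, the calculation looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MTTR = total time spent on repairs / number of repairs
     = 10 hours of repair work / 5 incidents in the month
     = 2 hours mean time to repair
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;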

&lt;p&gt;For monolithic applications, MTTR has historically been an excellent measurement; as soon as a fix is applied, the entire application is usually back online and performing well.  Now that IT is moving toward serverless and cloud-native applications, it is a much different story.  When a failure occurs in Kubernetes – where many different containers, services, and applications are all communicating in real time – the entire system can take much longer to &lt;strong&gt;recover&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The new MTTR: Mean Time to Recover
&lt;/h2&gt;

&lt;p&gt;I am seeing more and more organizations redefine the meaning of MTTR from “mean time to repair” to “mean time to &lt;strong&gt;recover&lt;/strong&gt;.”  Recover means that not only is everything back online, but the system is performing well and satisfying any QoS or SLAs &lt;strong&gt;AND&lt;/strong&gt; a preventative approach has been implemented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqab3pyjv0vz503w9tu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqab3pyjv0vz503w9tu9.png" alt="MTTx" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, take a common problem within Kubernetes: a pod enters a CrashLoopBackoff state.  There are many reasons why a pod might continuously restart, including deployment errors, resource constraints, DNS resolution errors, missing K8s dependencies, etc.  But let’s say you completed your investigation and found that your pod did not have sufficient memory and was therefore crashing and restarting.  So you increased the limit on the container or the deployment, and the pod(s) seem to be running fine for a bit… but wait, one just got evicted.&lt;/p&gt;

&lt;p&gt;The node now has increased memory usage, and pods are being evicted.  Or what if we have now created noisy neighbors, and that pod is “stealing” resources like memory from others on the same node?  This is why organizations are moving away from &lt;strong&gt;repair&lt;/strong&gt;: a fix that brings everything back online doesn’t mean the system is healthy. “Repaired” can be a subjective term. Furthermore, sometimes the fix is merely a band-aid, and the problem returns hours, days, or weeks later.&lt;/p&gt;

&lt;p&gt;Measuring from the failure event until the entire application system is healthy again, with a preventative measure in place, gives us better insight into reliability.  After all, just because something is online does not mean it is performing well.  The tricky issue here is: how do you measure “healthy”?  In other words, how do we know the entire system is healthy and that our preventative patch is truly preventing problems?  There are good QoS benchmarks like response time or transactions per second, but defining those thresholds is usually difficult.  An improvement in MTBF (mean time between failures) is another good benchmark for testing whether your preventative approach is working.&lt;/p&gt;

&lt;h2&gt;
  
  
  How can we improve Mean Time to Recover?
&lt;/h2&gt;

&lt;p&gt;There are many ways to improve system recovery, and ultimately the best way to improve MTTR is to improve all the MTT(XYZ) that come before it on incident management timelines.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automation:&lt;/strong&gt; Automating tasks like ticket creation, assigning incidents to appropriate teams, and, probably most importantly, automating the fix can all reduce the time from problem identification to recovery.  But the more an organization scrutinizes every single change and configuration, the longer it takes to implement a fix; streamlining that scrutiny drives faster results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Well-defined Performance Benchmarks:&lt;/strong&gt; Many customers I speak with track a couple of KPIs, but the more specific, the better.  For example, instead of making a blanket statement that every application needs a response time of 200ms or less, set these targets on an app-by-app basis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Engineering:&lt;/strong&gt; This is an often-overlooked &lt;a href="https://www.splunk.com/en_us/blog/learn/chaos-engineering.html#:~:text=Improve%20failure%20recovery%20%2D%20Since%20chaos%20tests,engineering%20enhances%20failure%20recovery%20and%20reduces%20downtime" rel="noopener noreferrer"&gt;methodology to improve recovery rate&lt;/a&gt;.  Practicing and simulating failures helps improve how quickly we can react, troubleshoot, and apply a fix.  It does take a lot of time though, so it is not an easy strategy to adhere to.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster Alerting Mechanisms:&lt;/strong&gt; This is simple: The faster we get notified of a problem, the quicker we can fix it.  We need to not just identify the symptoms but also quickly find the root cause.  I see many companies try to set up proactive alerts, but they often get more smoke than fire.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge Base:&lt;/strong&gt; This was so helpful for me in a previous role. Building a KB in a system like Atlassian, SharePoint, or JIRA can help immensely in the troubleshooting process.  The KB needs to be searchable and always changing as the environment evolves.  Being able to search for a specific string in an error message within a KB can immediately highlight not just a root cause but also a fix.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To summarize, MTTR is a metric that needs to capture the state of a system from the moment of failure until the entire system is healthy again.  This is a much more accurate representation of how fast we recover from a problem, and how resilient the application architecture is.  MTTR is a principle that extends beyond the world of IT; its applications exist in security, mechanics, even healthcare.  Just remember, a good surgeon is not only measured by how fast they can repair a broken bone, but by how fast the patient recovers.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Improving application resilience and reliability is something we spend a lot of time thinking about at Causely. We’d love to hear how you’re handling this today, and what metric you’ve found most useful toward this goal. Comment here or &lt;a href="https://www.causely.io/?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=mttr#contact"&gt;contact us&lt;/a&gt; with your thoughts!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloudnative</category>
      <category>sre</category>
    </item>
    <item>
      <title>Understanding the Kubernetes Readiness Probe: A Tool for Application Health</title>
      <dc:creator>Karina Babcock</dc:creator>
      <pubDate>Tue, 13 Aug 2024 20:51:36 +0000</pubDate>
      <link>https://forem.com/causely/understanding-the-kubernetes-readiness-probe-a-tool-for-application-health-3jfc</link>
      <guid>https://forem.com/causely/understanding-the-kubernetes-readiness-probe-a-tool-for-application-health-3jfc</guid>
      <description>&lt;p&gt;Application reliability is a dynamic challenge, especially in cloud-native environments. Ensuring that your applications are running smoothly is make-or-break when it comes to user experience. One essential tool for this is the &lt;a href="https://www.causely.io/resources/glossary-cloud-native-technologies/?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=K8s_readiness_probe"&gt;Kubernetes readiness probe&lt;/a&gt;. This post will explore the concept of a readiness probe, explaining how it works and why it’s a key component for managing your Kubernetes clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Kubernetes Readiness Probe?
&lt;/h2&gt;

&lt;p&gt;A readiness probe is essentially a check that Kubernetes performs on a container to ensure that it is ready to serve traffic. This check is needed to prevent traffic from being directed to containers that aren’t fully operational or are still in the process of starting up.&lt;/p&gt;

&lt;p&gt;By using readiness probes, Kubernetes can manage the flow of traffic to only those containers that are fully prepared to handle requests, thereby improving the overall stability and performance of the application.&lt;/p&gt;

&lt;p&gt;Readiness probes also help in preventing unnecessary disruptions and downtime by only including healthy containers in the load balancing process. This is an essential part of a comprehensive SRE operational practice for maintaining the health and efficiency of your Kubernetes clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Readiness Probes Work
&lt;/h2&gt;

&lt;p&gt;Readiness probes are configured in the pod specification and can be of three types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Probes:&lt;/strong&gt; These probes send an HTTP request to a specified endpoint. If the response is successful, the container is considered ready.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TCP Probes:&lt;/strong&gt; These probes attempt to open a TCP connection to a specified port. If the connection is successful, the container is considered ready.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Command Probes:&lt;/strong&gt; These probes execute a command inside the container. If the command returns a zero exit status, the container is considered ready.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Below is an example demonstrating how to configure a readiness probe in a Kubernetes deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: readiness-example
spec:
  containers:
  - name: readiness-container
    image: your-image
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This YAML file defines the Kubernetes pod with a readiness probe configured based on the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;apiVersion: v1&lt;/strong&gt; – Specifies the API version used for the configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kind: Pod&lt;/strong&gt; – Indicates that this configuration is for a Pod.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;metadata.name: readiness-example&lt;/strong&gt; – Sets the name of the Pod to “readiness-example.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;spec&lt;/strong&gt; – Describes the desired state of the Pod.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;containers.name: readiness-container&lt;/strong&gt; – Names the container within the Pod as “readiness-container.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;image: your-image&lt;/strong&gt; – Specifies the container image to use, named “your-image.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;readinessProbe&lt;/strong&gt; – Configures a readiness probe to check if the container is ready to receive traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;httpGet.path: /healthz&lt;/strong&gt; – Sends an HTTP GET request to the /healthz path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;httpGet.port: 8080&lt;/strong&gt; – Targets port 8080 for the HTTP GET request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;initialDelaySeconds: 5&lt;/strong&gt; – Waits 5 seconds before performing the first probe after the container starts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;periodSeconds: 10&lt;/strong&gt; – Repeats the probe every 10 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This relatively simple configuration creates a Pod named “readiness-example” with a single container running “your-image.” It includes a readiness probe that checks the /healthz endpoint on port 8080, starting 5 seconds after the container launches and repeating every 10 seconds to determine if the container is ready to accept traffic.&lt;/p&gt;
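
&lt;p&gt;For comparison, the TCP and command probe types are configured in the same place. The port, command, and timings below are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# TCP probe: ready once the port accepts connections
readinessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

# Command probe: ready when the command exits with status 0
readinessProbe:
  exec:
    command:
    - cat
    - /tmp/ready
  initialDelaySeconds: 5
  periodSeconds: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;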

&lt;h2&gt;
  
  
  Importance of Readiness Probes
&lt;/h2&gt;

&lt;p&gt;The goal is to make sure you can prevent traffic from being directed to a container that is still starting up or experiencing issues. This helps maintain the overall stability and reliability of your application by only sending traffic to containers that are ready to handle it.&lt;/p&gt;

&lt;p&gt;Readiness probes can be used in conjunction with &lt;a href="https://www.causely.io/resources/glossary-cloud-native-technologies/?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=K8s_readiness_probe"&gt;liveness probes&lt;/a&gt; to further enhance the health checking capabilities of your containers.&lt;/p&gt;

&lt;p&gt;Readiness probes are important for a few reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prevent traffic to unready pods:&lt;/strong&gt; They ensure that only ready pods receive traffic, preventing downtime and errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Facilitate smooth rolling updates:&lt;/strong&gt; By making sure new pods are ready before sending traffic to them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced application stability:&lt;/strong&gt; They can help with the overall stability and reliability of your application by managing traffic flow based on pod readiness.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember that readiness probes only check for availability; they don’t understand why a container is unavailable. A readiness probe failure is a symptom that can manifest from many root causes. It’s important to know their purpose and limitations before you rely too heavily on them for overall application health.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices for Configuring Readiness Probes
&lt;/h2&gt;

&lt;p&gt;To make the most of Kubernetes readiness probes, consider the following practices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define Clear Health Endpoints:&lt;/strong&gt; Ensure your application exposes a clear and reliable health endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set Appropriate Timing:&lt;/strong&gt; Configure initialDelaySeconds and periodSeconds based on your application’s startup and response time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor and Adjust:&lt;/strong&gt; Continuously monitor the performance and adjust the probe configurations as needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, if your application requires a database connection to be fully established before it can serve requests, you can set up a readiness probe that checks for the availability of the database connection.&lt;/p&gt;

&lt;p&gt;By configuring the &lt;code&gt;initialDelaySeconds&lt;/code&gt; and &lt;code&gt;periodSeconds&lt;/code&gt; appropriately, you can ensure that your application is only considered ready once the database connection is fully established. This will help prevent any potential issues or errors that may occur if the application is not fully prepared to handle incoming requests.&lt;/p&gt;
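
&lt;p&gt;As a sketch, a command probe along these lines could gate readiness on the database. The &lt;code&gt;pg_isready&lt;/code&gt; utility, the &lt;code&gt;DB_HOST&lt;/code&gt; variable, and the timings are assumptions for illustration; they presume a PostgreSQL database and a container image that ships the client tools:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;readinessProbe:
  exec:
    command:
    - sh
    - -c
    - pg_isready -h "$DB_HOST" -p 5432   # exits 0 only when the database accepts connections
  initialDelaySeconds: 10
  periodSeconds: 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;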

&lt;h2&gt;
  
  
  Limitations of Readiness Probes
&lt;/h2&gt;

&lt;p&gt;Readiness probes are handy, but they only check for the availability of a specific resource and do not take into account the overall health of the application. This means that even if the database connection is established, there could still be other issues within the application that may prevent it from properly serving requests.&lt;/p&gt;

&lt;p&gt;Additionally, readiness probes do not automatically restart the application if it fails the check, so it is important to monitor the results and take appropriate action if necessary. Readiness probes are still a valuable tool for ensuring the stability and reliability of your application in a Kubernetes environment, even with these limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting Kubernetes Readiness Probes: Common Issues and Solutions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Slow Container Start-up
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; If your container’s initialization tasks exceed the initialDelaySeconds of the readiness probe, the probe may fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Increase the initialDelaySeconds to give the container enough time to start and complete its initialization. Additionally, optimize the startup process of your container to reduce the time required to become ready.&lt;/p&gt;
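
&lt;p&gt;For example, raising the delay in the earlier configuration gives a slow-starting container room to initialize; the 30-second value is illustrative. On recent Kubernetes versions, a dedicated startup probe is an alternative that holds off readiness and liveness checks until the application has started.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # raised from 5 to cover slower initialization
  periodSeconds: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;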

&lt;h3&gt;
  
  
  Unready Services or Endpoints
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; If your container relies on external services or dependencies (e.g., a database) that aren’t ready when the readiness probe runs, it can fail. Race conditions may also occur if your application’s initialization depends on external factors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Ensure that external services or dependencies are ready before the container starts. Use tools like Helm Hooks or init containers to coordinate the readiness of these components with your application. Implement synchronization mechanisms in your application to handle race conditions, such as using locks, retry mechanisms, or coordination with external components.&lt;/p&gt;
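
&lt;p&gt;A minimal init container sketch follows; the &lt;code&gt;my-db&lt;/code&gt; service name, port, and busybox image are assumptions for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command:
    - sh
    - -c
    - until nc -z my-db 5432; do echo waiting for db; sleep 2; done   # block pod startup until the database port answers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;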

&lt;h3&gt;
  
  
  Misconfiguration of the Readiness Probe
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Misconfigured readiness probes, such as incorrect paths or ports, can cause probe failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Double-check the readiness probe configuration in your Pod’s YAML file. Ensure the path, port, and other parameters are correctly specified.&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Errors or Bugs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Application bugs or issues, such as unhandled exceptions, misconfigurations, or problems with external dependencies, can prevent it from becoming ready, leading to probe failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Debug and resolve application issues. Review application logs and error messages to identify the problems preventing the application from becoming ready. Fix any bugs or misconfigurations in your application code or deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Insufficient Resources
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; If your container is running with resource constraints (CPU or memory limits), it might not have the resources it needs to become ready, especially under heavy loads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Adjust the resource limits to provide the container with the necessary resources. You may also need to optimize your application to use resources more efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conflicts Between Probes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Misconfigured liveness and readiness probes might interfere with each other, causing unexpected behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Ensure that your probes are configured correctly and serve their intended purposes. Make sure that the settings of both probes do not conflict with each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster-Level Problems
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Kubernetes cluster issues, such as kubelet or networking problems, can result in probe failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Monitor your cluster for any issues or anomalies and address them according to Kubernetes best practices. Ensure that the kubelet and other components are running smoothly.&lt;/p&gt;

&lt;p&gt;These are common issues to keep an eye out for. Watch for problems that the readiness probes are not surfacing or that might be preventing them from acting as expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Ensuring that your applications are healthy and ready to serve traffic is necessary for maximizing uptime. The Kubernetes readiness probe is one helpful tool for managing Kubernetes clusters; it should be a part of a comprehensive Kubernetes operations plan.&lt;/p&gt;

&lt;p&gt;Readiness probes can be configured in pod specifications and can be HTTP, TCP, or command probes. They help prevent disruptions and downtime by ensuring only healthy containers are included in the load-balancing process.&lt;/p&gt;

&lt;p&gt;They also prevent traffic from being sent to unready pods, which enables smooth rolling updates and enhances application stability. Good practice for readiness probes includes defining clear health endpoints, setting appropriate timing, and monitoring and adjusting configurations over time.&lt;/p&gt;

&lt;p&gt;Don’t forget that readiness probes have clear limitations, as they only check for the availability of a specific resource and do not automatically restart the application if it fails the check. A Kubernetes readiness probe failure is merely a symptom that can be attributed to many root causes. To automate root cause analysis across your entire Kubernetes environment, check out &lt;a href="https://www.causely.io/platform/causely-for-cloud-native-applications/?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=k8s-readiness-probe"&gt;Causely for Cloud-Native Applications&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloudnative</category>
      <category>sre</category>
    </item>
    <item>
      <title>Unlocking the Power of Causal AI: A Self-Guided Tour of Our Causal Reasoning Platform</title>
      <dc:creator>Karina Babcock</dc:creator>
      <pubDate>Thu, 01 Aug 2024 20:21:20 +0000</pubDate>
      <link>https://forem.com/causely/unlocking-the-power-of-causal-ai-a-self-guided-tour-of-our-causal-reasoning-platform-id7</link>
      <guid>https://forem.com/causely/unlocking-the-power-of-causal-ai-a-self-guided-tour-of-our-causal-reasoning-platform-id7</guid>
      <description>&lt;p&gt;We spend a lot of time unpacking the pain and implications associated with keeping modern application environments healthy and resilient, and surfacing the potential impact Causal AI for DevOps can have. It's time for a solution.&lt;/p&gt;

&lt;p&gt;At Causely, we've built a Causal Reasoning Platform that assures continuous application reliability, and we're excited to introduce a new interactive tour so you can experience Causely for Cloud-Native Applications yourself.&lt;/p&gt;

&lt;p&gt;In this &lt;a href="https://www.causely.io/resources/experience-causely/?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=tour"&gt;self-guided tour&lt;/a&gt;, you'll explore the key features and capabilities of our platform, witnessing firsthand how it can transform your DevOps strategy and day-to-day work. From automatically identifying root causes of anomalies to future-proofing your application environment and avoiding unintended consequences of new releases, our Causal Reasoning Platform will help you minimize service disruptions and stay focused on meeting business objectives.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.causely.io/resources/experience-causely/?utm_source=dev.to&amp;amp;utm_medium=causely_org&amp;amp;utm_campaign=tour"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6c6orhacbf8w94llleft.png" alt="causal reasoning platform" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take Causely for Cloud-Native Applications for a spin and let us know what you think!&lt;/p&gt;

</description>
      <category>causality</category>
      <category>rootcause</category>
      <category>cloudnative</category>
    </item>
  </channel>
</rss>
