<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Samyukktha</title>
    <description>The latest articles on Forem by Samyukktha (@samyukktha).</description>
    <link>https://forem.com/samyukktha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F905353%2Fc7951360-c21e-4028-a509-4df1a311f60e.jpeg</url>
      <title>Forem: Samyukktha</title>
      <link>https://forem.com/samyukktha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/samyukktha"/>
    <language>en</language>
    <item>
      <title>Inferencing - the AI-led future of Observability?</title>
      <dc:creator>Samyukktha</dc:creator>
      <pubDate>Wed, 02 Aug 2023 07:38:18 +0000</pubDate>
      <link>https://forem.com/samyukktha/inferencing-the-ai-led-future-of-observability-4kl6</link>
      <guid>https://forem.com/samyukktha/inferencing-the-ai-led-future-of-observability-4kl6</guid>
      <description>&lt;p&gt;Over the last 9 months, rapid advancements in Generative AI have impacted nearly every industry. What are the implications for how we do observability and monitoring of our production systems?&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;I posit that the next evolution in this space will be a new class of solutions that do "Inferencing" - i.e., directly root-cause the source of an error for a developer with reasonable accuracy.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;In this article, we examine:&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inferencing as a natural next step to observability&lt;/li&gt;
&lt;li&gt;Learnings from AIOps - why it failed to take off, and implications for inferencing&lt;/li&gt;
&lt;li&gt;Some emergent principles for the class of Inferencing solutions&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Where we are
&lt;/h2&gt;

&lt;p&gt;To understand the next generation of observability, let us start with the underlying goal of all of these tools. It is to keep production systems healthy and running as expected and, if anything goes wrong, to allow us to quickly understand why and resolve the issue.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;If we start there, we see that there are three distinct levels in how tools can support us:&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Level 1: "Alert me when something is off in my system" — monitoring&lt;br&gt;
Level 2: "Tell me why something is off (and how to fix it)" — let's call this inferencing&lt;br&gt;
Level 3: "Fix it yourself and tell me what you did" — auto-remediation.&lt;/p&gt;
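&lt;p&gt;The three levels can be sketched as stages of a toy incident pipeline. Everything below (function names, thresholds, the playbook) is hypothetical, just to make the distinction concrete:&lt;/p&gt;

```python
def level_1_monitoring(error_rate, threshold=0.05):
    # Level 1: "alert me when something is off" - a plain threshold check
    return error_rate > threshold

def level_2_inferencing(alert_context):
    # Level 2: "tell me why" - a stand-in for an AI model that ranks
    # likely root causes; here the ranking is hard-coded
    causes = [("bad deploy", 0.6), ("db saturation", 0.3), ("network blip", 0.1)]
    return max(causes, key=lambda c: c[1])

def level_3_auto_remediation(root_cause):
    # Level 3: "fix it yourself and tell me what you did"
    playbook = {"bad deploy": "rolled back to previous release"}
    return playbook.get(root_cause, "escalated to on-call")

if level_1_monitoring(error_rate=0.12):
    cause, _score = level_2_inferencing({"service": "checkout"})
    print(level_3_auto_remediation(cause))  # rolled back to previous release
```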

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Monitoring tools do Level 1 (detecting issues) reasonably well. The natural next step is Level 2, where a system automatically tells us why something is breaking. We don't yet have solutions that can do this, so we added a category of tools called observability between Level 1 and Level 2, whose goal was to "help us understand why something is breaking". This essentially became "any data that could potentially help us understand what is happening".&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NBJOxirW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xbccakv0z4d41jjym1tu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NBJOxirW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xbccakv0z4d41jjym1tu.png" alt="Observability vs Inferencing" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with observability today
&lt;/h2&gt;

&lt;p&gt;The main problem with observability today is that it is loosely defined and executed. This is because we don't know what data we need beforehand - that depends on the issue. And the nature of production errors is that they are long-tail and unexpected - if they could've been predicted, they would've been resolved. So we collect a broad variety of data types - metrics, logs, distributed traces, stack-traces, error data, K8s events, and continuous profiling - all to ensure we have some visibility when things go wrong.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;The best way to understand observability as it is implemented in practice is - "Everything outside of metrics, plus metrics."&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B083cMdV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0zf7k0yj62rks2opxdtm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B083cMdV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0zf7k0yj62rks2opxdtm.png" alt="Observability in practice" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;An observability platform also has to make choices around how much data to capture and store. This decision is currently pushed to customers, and customers default to collecting as much data as they can afford to. Rapid adoption of cloud &amp;amp; SaaS models has made it possible to collect and store massive amounts of data.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;The underlying human driver for collecting more types &amp;amp; volume of observability data is - "What if something breaks and I don't have enough data to be able to troubleshoot?" This is every engineering team's worst nightmare. As a result, we have a situation today where:&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Observability data types are continuously expanding&lt;/li&gt;
&lt;li&gt;Observability data volumes are exploding&lt;/li&gt;
&lt;li&gt;Tool sprawl is increasing - the average company uses 5+ observability tools&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;With all this, we started running into new problems. It is now increasingly hard for developers to manually navigate tens of data-intensive dashboards to identify an error. Observability has also become prohibitively expensive, so companies have to make complex trade-offs around what data to store, sometimes losing visibility. Between these problems and the continued increase in the complexity of production systems, we are seeing that MTTR for production errors is actually increasing over the years.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
Inferencing — observability plus AI
&lt;/h2&gt;

&lt;p&gt;I'd argue the next step after observability is Inferencing — where a platform can reasonably explain why an error occurred, so we can fix it. This becomes possible now in 2023 with the rapid evolution of AI models in the last few months.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Imagine a solution that:&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatically surfaces just the errors that need immediate developer attention.&lt;/li&gt;
&lt;li&gt;Tells the developer exactly what is causing the issue and where the issue is - this pod, this server, this code path, this type of request.&lt;/li&gt;
&lt;li&gt;Guides the developer on how to fix it.&lt;/li&gt;
&lt;li&gt;Uses the developer's actual actions to continuously improve its recommendations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;There is some early activity in this space (including ZeroK.ai), but we are still extremely early, and we can expect several new companies to emerge here over the next couple of years.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;However, any conversation around AI + observability would be incomplete without a discussion of AIOps, which made the same promise (use AI to automate production ops) and saw a wave of investment between 2015 and 2020, but had limited success and died down. For a comprehensive exploration of why AIOps failed, read this - AIOps Is Dead.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Avoiding the pitfalls of AIOps
&lt;/h2&gt;

&lt;p&gt;The original premise of AIOps was that if we pushed data from the five or six different observability sources into one unifying platform and then applied AI on all this data (metrics, logs, dependencies, traces, configurations, updates, K8s events), we would get to insights on why something broke.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vJrRtQFj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p2qadq54vb19zsbw6x4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vJrRtQFj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p2qadq54vb19zsbw6x4f.png" alt="Architecture of an AIOPs platform" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;While attractive in theory, this argument fell short in a few ways.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;First, tying everything together was impractical and a much harder problem than estimated. Different observability tools collected different data about different systems at different time intervals, and they exposed different subsets of the data they collected to a third-party tool. In addition, each individual customer had their own data collection practices. Enterprises, for example, would run several different observability tools with highly customized, non-standard data collection, and would have to manually provide context about each of them to any new AIOps tool. All this limited the quality of root-causing the ML models could achieve and the value they ended up delivering.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Second, the AIOps model was too broad and not use-case driven. For example, what specific kinds of errors were the models targeting? Infrastructure errors, code errors, hard errors, soft errors? Most AIOps platforms went quite broad here, trying to ingest eight different data types in the hope of finding patterns and anomalies. This made them heavily dependent on how much data, and of what quality, each customer pushed into the platform, and the quality of their output varied accordingly.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Third, the integration process in each company was too complex. A large company has a fragmented footprint of observability tools, with tens of tools being run by different teams. An AIOps platform needed to be integrated into all of those tools to add value. Even for the AIOps module of a large observability platform like Datadog or Dynatrace, the requirement was that the entire observability stack be moved to that single vendor (across infra &amp;amp; app metrics, logs, distributed tracing, error monitoring, and so on) for the AIOps module to be effective.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Fourth, the massive data volumes and the processing power required to handle them made these tools very expensive relative to the questionable value they delivered.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;All this resulted in a degree of disillusionment among many early adopters of AIOps tools, and most companies reverted to a simple “collect data and display on dashboards” mechanism for their needs.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Several lessons from this experience are applicable as we attempt Inferencing.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Principles of Inferencing
&lt;/h2&gt;

&lt;p&gt;Inferencing would have to be different in a few ways to more efficiently achieve the end goal.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Architected from the ground up with AI in mind
&lt;/h3&gt;

&lt;p&gt;An Inferencing solution would have to be architected ground-up for the AI use-case, i.e., data collection, processing, pipelines, storage, and the user interface are all designed around "root-causing issues using AI".&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;What it will probably NOT look like is AI added on top of an existing observability solution or existing observability data, simply because that is what we attempted with AIOps, and it failed. It will have to be a full-stack system in which everything is designed around using AI for the explicit goal of performing root cause analysis.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;What would this look like in practice?&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Full-stack AI-native architecture with data collection, processing, storage, and visualization
&lt;/h3&gt;

&lt;p&gt;An Inferencing solution would have to be vertically integrated — that is, it would have to collect telemetry data, process it, and own both the data pipeline and the interaction layer with users.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;This is important because it would allow the solution to use user interactions to learn and update its data processing and collection mechanisms.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;It would also be critical for the Inferencing solution to directly be able to collect the data it needs, in the format it needs, with the context it needs, from the customer environments — to have control over the effectiveness of the root-causing and the user experience.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Adaptive data collection
&lt;/h3&gt;

&lt;p&gt;An Inferencing solution would have to have intelligence embedded into the data collection itself. That means the agent should be able to look at the data streaming in and decide in the moment whether to capture or discard it.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Today, all instrumentation agents are "dumb" — they just take a snapshot of all data and ship it, and processing occurs later. This makes sense today because we primarily use code-agents, and we want the agents to be as lightweight as possible and add minimal overhead.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;However, with the emergence of technologies like eBPF that allow us to do out-of-band data collection from the kernel with almost zero overhead, it is possible to imagine an intelligent instrumentation agent that actively solves for the data quality that inferencing models would require.&lt;/p&gt;
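&lt;p&gt;A minimal sketch of what such an adaptive agent might look like, assuming a made-up event format and SLO threshold (this is illustrative Python, not eBPF code):&lt;/p&gt;

```python
from collections import deque

class AdaptiveAgent:
    """Hypothetical collection agent that decides per event whether to
    keep or discard telemetry, instead of shipping everything."""

    def __init__(self, window=100, latency_slo_ms=250):
        self.recent = deque(maxlen=window)   # rolling in-memory buffer
        self.latency_slo_ms = latency_slo_ms
        self.shipped = []                    # what actually leaves the host

    def observe(self, event):
        self.recent.append(event)
        # keep full detail only around interesting moments:
        # errors, or latencies breaching the SLO
        if event["error"] or event["latency_ms"] > self.latency_slo_ms:
            self.shipped.append({"trigger": event, "context": list(self.recent)})
        # otherwise the event simply ages out of the buffer

agent = AdaptiveAgent()
agent.observe({"latency_ms": 40, "error": False})   # uninteresting, ages out
agent.observe({"latency_ms": 900, "error": False})  # SLO breach: captured
print(len(agent.shipped))  # 1
```

&lt;p&gt;The point is only that the capture decision happens at the agent, before storage, rather than after everything has already been shipped.&lt;/p&gt;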

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Data processing tuned to specific use-cases
&lt;/h3&gt;

&lt;p&gt;In Inferencing, all data processing techniques would have to be centered around specific use cases — for instance, how do I reliably root-cause error type A? And error type B? And so on.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;All data processing models would follow the principle of reliably root-causing some types of errors first, and gradually expanding the types of errors they can identify as the models evolve. This may lead to different data types, data pipelines, and prompts for each type of error. We may have some Inferencing platforms that are better at root-causing infrastructure errors, some at security issues, some at code errors, and so on, with each type collecting slightly different data sets, with slightly different data pipelines and a different model framework.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;What they will probably not do is "general anomaly detection" and then work backward to see what might have caused the anomaly. This is because the data required to root cause an error is very different from the data required to "identify/spot" an error.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Storage designed for AI
&lt;/h3&gt;

&lt;p&gt;The view of storage in an Inferencing world already looks different from that in a monitoring or observability world. Today we have massive data stores for everything that occurred; in Inferencing, what we need is storage of just the most relevant and recent data the AI models need to root-cause errors reliably. This could involve several practices — storing data in vector DBs for easy clustering and retrieval, storing only a small portion of success cases and discarding the rest, deduplicating errors, and so on.&lt;/p&gt;
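&lt;p&gt;Error deduplication, for instance, could be as simple as fingerprinting errors on their stable attributes and storing one representative per fingerprint. A sketch, with a made-up error shape:&lt;/p&gt;

```python
import hashlib

def fingerprint(error):
    # Dedupe key: hash the stable attributes (type + top stack frame),
    # ignoring volatile ones like timestamps
    key = f'{error["type"]}|{error["frames"][0]}'
    return hashlib.sha256(key.encode()).hexdigest()[:12]

store = {}  # fingerprint -> representative example + count
events = [
    {"type": "TimeoutError", "frames": ["db.query"], "ts": 1},
    {"type": "TimeoutError", "frames": ["db.query"], "ts": 2},
    {"type": "ValueError",   "frames": ["parse"],    "ts": 3},
]
for err in events:
    fp = fingerprint(err)
    if fp in store:
        store[fp]["count"] += 1   # duplicate: bump the counter only
    else:
        store[fp] = {"example": err, "count": 1}

print(len(store))  # 2 unique errors stored instead of 3 raw events
```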

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;As a result, the amount of storage required for Inferencing solutions could actually be less than what traditional monitoring and observability solutions needed.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Simpler, more interactive user interface
&lt;/h3&gt;

&lt;p&gt;An Inferencing interface would probably look less like a bunch of dashboards with data, and more like a conversational++ interface tuned for development teams. Maybe it has a simple table of prioritized errors that need human attention — you click on each error, and it gives a list of the most likely root causes, each with a confidence estimate. It could then use RLHF (Reinforcement Learning from Human Feedback), asking users to confirm whether a root cause was identified correctly, so the models improve with time. The possibilities are broad here.&lt;/p&gt;
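&lt;p&gt;The feedback loop could be sketched, in grossly simplified form, as confirmations re-weighting future rankings (the cause names and weights below are invented, and real RLHF is far more involved):&lt;/p&gt;

```python
class RootCauseRanker:
    """Toy feedback loop: user confirmations shift future rankings.
    A stand-in for the RLHF-style mechanism, not a real training setup."""

    def __init__(self, priors):
        self.scores = dict(priors)  # cause -> unnormalized score

    def rank(self):
        total = sum(self.scores.values())
        ranked = [(cause, score / total) for cause, score in self.scores.items()]
        ranked.sort(key=lambda pair: pair[1], reverse=True)
        return ranked

    def feedback(self, cause, confirmed):
        # confirmed root causes gain weight; rejected ones lose it
        self.scores[cause] *= 1.5 if confirmed else 0.5

ranker = RootCauseRanker({"bad deploy": 1.0, "db saturation": 1.0})
ranker.feedback("db saturation", confirmed=True)
print(ranker.rank()[0][0])  # db saturation
```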

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In summary, there are likely to be sweeping changes in the monitoring and observability space with the recent developments in Gen. AI.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;The first wave of solutions is likely to resemble the AIOps platforms of yore, with a thin Gen. AI wrapper around existing observability data. Incumbents are best positioned to win this wave. This is likely to be akin to GitHub Copilot - you get a good suggestion about ~10% of the time. However, the true leap forward is probably a couple of years out, with a new class of Inferencing solutions that can accurately root-cause errors 80%+ of the time. To do this, they would need to be full-stack and own the data collection, processing, storage, models, and user interactions. We explored some early hypotheses on how Inferencing solutions would differ from existing practices in each part of the stack.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ai</category>
      <category>inferencing</category>
    </item>
    <item>
      <title>Monitoring vs Observability in 2023 - An honest take</title>
      <dc:creator>Samyukktha</dc:creator>
      <pubDate>Mon, 17 Jul 2023 08:48:40 +0000</pubDate>
      <link>https://forem.com/samyukktha/monitoring-vs-observability-in-2023-an-honest-take-3kph</link>
      <guid>https://forem.com/samyukktha/monitoring-vs-observability-in-2023-an-honest-take-3kph</guid>
      <description>&lt;p&gt;If you're running a software system, you need to know what’s happening with it: how it’s performing, whether it’s running as expected, and whether any issues need your attention. And once you spot an issue, you need information so you can troubleshoot. &lt;/p&gt;

&lt;p&gt;A plethora of tools promise to help with this - monitoring, APMs, observability, and everything in between. This has resulted in something of a turf war, where monitoring vendors claim they also do observability, while "observability-first" players disagree and accuse them of observability-washing.&lt;/p&gt;

&lt;p&gt;So let's take an unbiased look at this and answer a few questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How are monitoring and observability different, if at all?&lt;/li&gt;
&lt;li&gt;How effective is each at solving the underlying problem?&lt;/li&gt;
&lt;li&gt;How does AI impact this space now and what comes next?
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is monitoring?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A monitoring solution performs 3 simple actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define some "metrics" in advance&lt;/li&gt;
&lt;li&gt;Deploy agents to collect these metrics&lt;/li&gt;
&lt;li&gt;Display these metrics in dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that a metric here is a simple number that captures a quantifiable characteristic of a system. We can then perform mathematical operations on metrics to get different aggregate views. &lt;/p&gt;
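&lt;p&gt;For example, with some made-up CPU samples, the aggregate views are just arithmetic over the raw numbers:&lt;/p&gt;

```python
# A metric is just a named number sampled over time; aggregate views are
# plain math over many samples (values here are made up)
cpu_samples = [42.0, 55.5, 97.2, 61.3, 48.9]  # percent utilization

avg = sum(cpu_samples) / len(cpu_samples)
peak = max(cpu_samples)
p50 = sorted(cpu_samples)[len(cpu_samples) // 2]  # crude median

print(f"avg={avg:.1f}% peak={peak:.1f}% p50={p50:.1f}%")
# avg=61.0% peak=97.2% p50=55.5%
```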

&lt;p&gt;Monitoring has existed for the past 40 years — since the rise of computing systems — and was originally how operations teams kept track of how their infrastructure was behaving.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Types of monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Originally, monitoring was used most heavily to keep track of infrastructure behavior - this was infrastructure monitoring. Over time, as applications became more numerous and diverse, we wanted to monitor them as well, leading to the emergence of a category called APM (Application Performance Monitoring). In a modern distributed system, we have several components we want to monitor — infrastructure, applications, databases, networks, data streams, and so on, and the metrics we want differ depending on the component. For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure monitoring: uptime, CPU utilization, memory utilization&lt;/li&gt;
&lt;li&gt;Application performance monitoring: throughput, error rate, latency&lt;/li&gt;
&lt;li&gt;Database monitoring: number of connections, query performance, cache hit ratios&lt;/li&gt;
&lt;li&gt;Network monitoring: roundtrip time, TCP retransmits, connection churn&lt;/li&gt;
&lt;li&gt;..and so on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics are measures that are generally agreed upon as relevant for that system, and most monitoring tools come pre-built with agents that know which metric to collect, and what dashboards to display.&lt;/p&gt;

&lt;p&gt;As the number of components in distributed systems multiplied, the volume and variety of metrics grew exponentially. To manage this complexity, a separate suite of tools and processes emerged that expanded upon traditional monitoring tools, with time-series databases, SLO systems, and new visualizations.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Distinguishing monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Through all this, the core functioning of a monitoring system remains the same, and a monitoring system can be clearly distinguished if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It captures predefined data&lt;/li&gt;
&lt;li&gt;The data being collected is a metric (a number)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The goal of monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The goal of a monitoring tool is to alert us when something unexpected is happening in a system.&lt;/p&gt;

&lt;p&gt;This is akin to an annual medical checkup - we measure a bunch of pre-defined values that will give us an overall picture of our body and let us know if any particular sub-system (organ) is behaving unexpectedly.&lt;/p&gt;

&lt;p&gt;And just like annual checkups, a monitoring tool may or may not provide any additional information about why something is unexpected. For that, we’ll likely need deeper, more targeted tests and investigation.&lt;/p&gt;

&lt;p&gt;An experienced physician might still be able to diagnose a condition based on just the overall test, but that is not what the test is designed for. Same with a monitoring solution.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is observability?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Unlike monitoring, observability is much harder to define. This is because the goal of observability is fuzzier. It is to "help us understand why something is behaving unexpectedly".&lt;/p&gt;

&lt;p&gt;Logs are the original observability tool, and we've been using them since the 70s. Until the late 2000s, the model was simple: traditional monitoring systems would alert us when something went wrong, and logs would help us understand why.&lt;/p&gt;

&lt;p&gt;However, in the last 15 years, our architectures have gotten significantly more complex. It became near impossible to manually scour logs to figure out what happened. At the same time, our tolerance for downtime decreased dramatically as businesses became more digital, and we could no longer afford to spend hours understanding and fixing issues.&lt;/p&gt;

&lt;p&gt;We needed more data than we had, so we could troubleshoot issues faster. This led to the rise of the observability industry, whose purpose was to help us understand more easily why our systems were misbehaving. This started with the addition of a new data type called traces, and we said the 3 pillars of observability were metrics, logs, and traces. From there, we kept adding new data types to "improve our observability".&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The problem with observability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The fundamental problem with observability is - we don't know what information we might need beforehand. The data we need depends on the issue. The nature of production errors is that they are unexpected and long-tail: if they could've been foreseen, they’d have been fixed already.&lt;/p&gt;

&lt;p&gt;This is what makes observability fuzzy: there’s no clear scope around what and how much to capture. So observability became "any data that could potentially help us understand what is happening".&lt;/p&gt;

&lt;p&gt;Today, the best way to describe observability as it is implemented is - "Everything outside of metrics, plus metrics."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IkrMKuft--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cm0rxj7ovgtvkd05q2ar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IkrMKuft--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cm0rxj7ovgtvkd05q2ar.png" alt="MOnitoring vs Observability" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A perfectly observable system would record everything that happens in production, with no data gaps. Thankfully, that is impractical and prohibitively expensive, and 99% of the data would be irrelevant anyway, so an average observability platform needs to make complex choices on what and how much telemetry data to capture. Different vendors view this differently, and depending on who you ask, observability seems slightly different.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Commonly cited descriptions of observability are unhelpful&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Common articulations of observability, like "observability is being able to observe the internal states of a system through its external outputs", are vague: they neither give us a clear indication of what it is, nor guide us in deciding whether we have sufficient observability for our needs.&lt;/p&gt;

&lt;p&gt;In addition, most of the commonly cited markers that purport to distinguish observability from monitoring are also vague, if not outright misleading. Let’s look at a few examples:&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. "Monitoring is predefined data; observability is not"&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;In reality, nearly everything we capture in an observability solution today is also predetermined. We define in advance what logs we want to capture, what distributed traces we want to capture (including sampling mechanisms), what context to attach to each distributed trace, and when to capture a stack trace.&lt;/p&gt;

&lt;p&gt;We're yet to enter the era of tools that selectively capture data based on what is actually happening in production.&lt;/p&gt;
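&lt;p&gt;Trace sampling is a good illustration: the keep/drop decision is typically fixed by configuration ahead of time, not made from what is actually happening in production. A sketch of head-based sampling, with invented routes and rates:&lt;/p&gt;

```python
import random

# Head-based sampling: the keep/drop decision is fixed up front by
# configuration, before anything has happened in production. Routes and
# rates here are made up for illustration.
SAMPLING_CONFIG = {
    "/checkout": 1.0,   # always trace the critical path
    "/health":   0.0,   # never trace health checks
    "default":   0.1,   # 10% of everything else
}

def should_capture(route, rng=random.random):
    rate = SAMPLING_CONFIG.get(route, SAMPLING_CONFIG["default"])
    return rate > rng()  # a predetermined rate, not runtime insight

print(should_capture("/checkout"), should_capture("/health"))  # True False
```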

&lt;p&gt; &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. "Monitoring is simple dashboards; observability is more complex analysis and correlation"&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This is another promise that's still unmet in practice.&lt;br&gt;
Most observability platforms today also just have dashboards — it's just that their dashboards show more kinds of data than metrics (for example, strings for logs) or can pull up different charts and views based on user instructions. We don't yet have tools that can do any meaningful correlation by themselves to help us understand problems faster.&lt;/p&gt;

&lt;p&gt;Being able to connect a log and a trace using a unique ID doesn’t qualify as a complex analysis or correlation, even though the effort required for it may be non-trivial.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. "Monitoring is reactive; observability is proactive"&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;All observability data we collect is pre-defined and nearly everything we do in production today (including around observability) is reactive. The proactive part was what we did while testing. In production, if something breaks and/or looks unexpected, we respond and investigate.&lt;/p&gt;

&lt;p&gt;At best, we use SLO systems, which could potentially qualify as proactive. With SLO systems, we predefine an acceptable amount of errors (error budgets) and take action before we surpass them. However, SLO systems are more tightly coupled with monitoring tools, so this is not a particularly useful distinction between monitoring and observability solutions.&lt;/p&gt;
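&lt;p&gt;The error-budget arithmetic behind an SLO system is simple. A sketch with illustrative numbers (not figures from any real system):&lt;/p&gt;

```python
def error_budget_remaining(slo=0.999, total_requests=1_000_000, failed=600):
    # A 99.9% SLO over 1M requests allows 1,000 failures; the numbers
    # here are purely illustrative
    allowed = round(total_requests * (1 - slo))
    return allowed - failed

print(error_budget_remaining())  # 400 failures left before the SLO is breached
```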

&lt;p&gt; &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. "Monitoring focuses on individual components; observability reveals relationships across components"&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This is a distinction created just to make observability synonymous with distributed tracing. Distributed tracing is just one more data type that shows us the relationships across components. Today, distributed tracing needs to be used in conjunction with other data to be really useful (for more on this, read this article that explores the real utility of distributed tracing).&lt;/p&gt;

&lt;p&gt;In summary, we have a poorly defined category with no outer boundaries, and we made up several vague, not-very-helpful markers to distinguish that category from monitoring, which existed before it. This narrative is designed to tell us that there's always some distance to go before we get to "true observability" — and always one more tool to buy. As a result, we're continuously expanding the scope of what we need within observability.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is the impact of this?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Ever increasing list of data types for observability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;All telemetry data is observability because it helps us "observe" the states of our system. Do logs qualify as observability? Yes, because they help us understand what happened in production. Does distributed tracing qualify? Yes. How about error monitoring systems that capture stack traces for exceptions? Yes. How about live debugging systems? Yes. How about continuous profilers? Yes. How about metrics? Also yes, because they also help us understand the state of our systems.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Ever increasing volume of observability data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;How much data to capture is left to the customer to decide, especially outside of monitoring. How much you want to log, how many distributed traces you want to capture, how many events you want to capture and store, at what intervals, for how long — everything is an open question, with limited guidance on how much is "reasonable" and at what point you might be capturing too much. Companies can spend $1M or as much as $65M on observability; it all depends on who builds what business case.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Tool sprawl and spending increase&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;All of the above has led to the amount spent on observability rising rapidly. Most companies today use five or more observability tools, and monitoring &amp;amp; observability is typically the second-largest infrastructure spend in a company after cloud infrastructure itself, with a market size of ~$17B.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Fear and loss-aversion are underlying drivers for observability expansion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The underlying human driver for the adoption of all these tools is fear: "What if something breaks and I don't have enough data to troubleshoot?" This is every engineering team's worst nightmare. It naturally drives teams to capture more and more telemetry data every year so they feel more secure.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Yet MTTR appears to be increasing globally&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One would expect that with the wide adoption of observability and the aggressive capturing and storing of various types of observability data, MTTR would have dropped dramatically globally.&lt;/p&gt;

&lt;p&gt;On the contrary, it appears to be increasing, with 73% of companies taking more than an hour to resolve production issues (vs 47% just 2 years ago).&lt;/p&gt;

&lt;p&gt;Despite all the investment, we seem to be making incremental progress at best.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RPrT9eXd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jlgk9gsqdbvwpn1wlgfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RPrT9eXd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jlgk9gsqdbvwpn1wlgfl.png" alt="Increasing production MTTRs" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where we are now&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So far, we have continued to collect more and more telemetry data in the hope that processing and storage costs would keep dropping to support it. But with exploding data volumes, we ran into a new problem beyond cost: usability. It became impossible for a human to look directly at tens of dashboards and arrive at conclusions quickly enough. So we created different data views and cuts to make it easier for users to test and validate their hypotheses. But these tools have become too complex for the average engineer to use, and we need specially trained "power users" (akin to data scientists) who are well versed in navigating this pool of data to identify an error.&lt;/p&gt;

&lt;p&gt;This is the approach many observability companies are taking today: capture more data, have more analytics, and train power users who are capable of using these tools. But these specialized engineers do not have enough information about all the parts of the system to be able to generate good-enough hypotheses.&lt;/p&gt;

&lt;p&gt;Meanwhile, the average engineer continues to rely largely on logs to debug software issues, and we make no meaningful improvement in MTTR. So all of observability seems like a high-effort, high-spend activity that allows us merely to stay in the same place as our architectures rapidly grow in complexity.&lt;/p&gt;

&lt;p&gt;So what’s next?&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Inferencing - the next stage after Observability?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To truly understand what the next generation would look like, let us start with the underlying goal of all these tools. It is to keep production systems healthy and running as expected and, if anything goes wrong, to allow us to quickly understand why and resolve the issue.&lt;/p&gt;

&lt;p&gt;If we start there, we can see that there are three distinct levels in how tools can support us:&lt;/p&gt;

&lt;p&gt;Level 1: "Tell me when something is off in my system" — monitoring&lt;/p&gt;

&lt;p&gt;Level 2: "Tell me why something is off (and how to fix it)" — let's call this inferencing&lt;/p&gt;

&lt;p&gt;Level 3: "Fix it yourself and tell me what you did" — auto-remediation&lt;/p&gt;

&lt;p&gt;Traditional monitoring tools do Level 1 reasonably well and help us detect issues. We have not yet reached Level 2 where a system can automatically tell us why something is breaking.&lt;/p&gt;

&lt;p&gt;So we introduced a set of tools called observability that sit somewhere between Level 1 and Level 2, to "help us understand why something is breaking" by giving us more data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sMfYjs64--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iuz2oqg2iq7fpkfipd1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sMfYjs64--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iuz2oqg2iq7fpkfipd1n.png" alt="Monitoring, observability and inferencing" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Inferencing — Observability plus AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I'd argue the next step after observability is Inferencing — where a platform can reasonably explain why an error occurred, so we can fix it. This becomes possible now in 2023 with the rapid evolution of AI models over the last few months.&lt;/p&gt;

&lt;p&gt;Imagine a solution that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatically surfaces just the errors that need immediate developer attention.&lt;/li&gt;
&lt;li&gt;Tells the developer exactly what is causing the issue and where the issue is: this pod, this server, this code path, this line of code, for this type of request.&lt;/li&gt;
&lt;li&gt;Guides the developer on how to fix it.&lt;/li&gt;
&lt;li&gt;Uses the developer's actual actions to improve its recommendations continuously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is some early activity in this space, including from companies like ZeroK, but it remains an open space, and we can expect several new companies to emerge here over the next couple of years.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Avoiding the pitfalls of AIOps&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In any conversation around AI + observability, it’s important to remember that this has been attempted before with AIOps, with limited success. To understand why, read "AIOps Is Dead".&lt;/p&gt;

&lt;p&gt;It will be important for inferencing solutions to avoid the pitfalls of AIOps. To do that, they will have to be architected ground-up for the AI use case, i.e., with data collection, processing, storage, and user interface all designed from the start for root-causing issues using AI.&lt;/p&gt;

&lt;p&gt;What it will probably NOT look like is AI added on top of existing observability tools and existing observability data, simply because that is what we attempted and failed at with AIOps.&lt;/p&gt;

&lt;p&gt;See here for a more detailed exploration of what AI-based Inferencing solutions will look like.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We explored monitoring and observability and how they differ. We looked at how observability is poorly defined today with loose boundaries, which results in uncontrolled data, tool, and spend sprawl. Meanwhile, the latest progress in AI could resolve some of the issues we have with observability today with a new class of Inferencing solutions based on AI. Watch this space for more on this topic!&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>observability</category>
      <category>distributedsystems</category>
      <category>aiops</category>
    </item>
    <item>
      <title>Distributed Tracing - Past, Present and Future</title>
      <dc:creator>Samyukktha</dc:creator>
      <pubDate>Wed, 14 Jun 2023 17:15:53 +0000</pubDate>
      <link>https://forem.com/samyukktha/distributed-tracing-past-present-and-future-335c</link>
      <guid>https://forem.com/samyukktha/distributed-tracing-past-present-and-future-335c</guid>
      <description>&lt;p&gt;Distributed Tracing is a divisive topic. Once the &lt;a href="https://www.youtube.com/watch?v=n8mUiLIXkto"&gt;doyen of every KubeCon&lt;/a&gt;, the technology was expected to revolutionize observability.&lt;/p&gt;

&lt;p&gt;Fast forward 5 years, the hype has subsided somewhat, there's a &lt;a href="https://www.youtube.com/watch?v=Q5Vf8bpTDlI"&gt;lot more talk about the pain&lt;/a&gt;, and adoption is moderate. Meanwhile, there continues to be steady activity around expanding and standardizing the technology - Open Telemetry (based on OpenTracing) is the &lt;a href="https://www.cncf.io/blog/2023/01/11/a-look-at-the-2022-velocity-of-cncf-linux-foundation-and-top-30-open-source-projects/"&gt;2nd largest CNCF project&lt;/a&gt; after Kubernetes. So what is the deal with Distributed Tracing? Should one implement it right away or wait and watch? In this article, let's explore Distributed Tracing in depth -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; What is special about Distributed Tracing and why do we need it?&lt;/li&gt;
&lt;li&gt; What are the problems with distributed tracing today?&lt;/li&gt;
&lt;li&gt; What are upcoming developments and how do they address existing challenges?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Introduction - How Distributed Tracing Works
&lt;/h2&gt;

&lt;p&gt;For the uninitiated, Distributed Tracing is a technology that allows us to track a single request as it traverses several different components/ microservices of a distributed environment. Each network call made in the request's path is captured and represented as a span.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TRHK0QVS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/wgho26L.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TRHK0QVS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/wgho26L.png" alt="Why we need distributed tracing" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why we need distributed tracing&lt;/p&gt;

&lt;p&gt;To enable this, distributed tracing tools insert a unique trace context (trace ID) into each request's header and implement mechanisms to ensure that the trace context is propagated throughout the request path.&lt;/p&gt;
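The injection/propagation mechanic described above can be sketched in a few lines of Python. This is a simplified illustration rather than a real tracing library: the header name follows the W3C Trace Context convention, but a real traceparent value also carries a version, a parent span ID, and trace flags, and the function names here are hypothetical.

```python
import uuid

TRACE_HEADER = "traceparent"  # header name from the W3C Trace Context spec

def inject_context(headers):
    """Add a trace ID to outgoing request headers if none is present."""
    if TRACE_HEADER not in headers:
        headers[TRACE_HEADER] = uuid.uuid4().hex
    return headers

def extract_context(headers):
    """Read the trace ID from incoming headers so new spans can join the trace."""
    return headers.get(TRACE_HEADER)

# Service A starts a trace; downstream Service B reads the same ID
# and attaches it to every span it emits.
outgoing = inject_context({})
```

Because every hop re-injects the same ID, all spans produced along the request path can later be stitched back into one trace.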

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LFH3AY9M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/pu2kNvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LFH3AY9M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/pu2kNvx.png" alt="How a distributed trace represents a request path " width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How a distributed trace represents a request path&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we need Distributed Tracing in the first place
&lt;/h2&gt;

&lt;p&gt;Distributed tracing is unique because it focuses on a &lt;em&gt;request&lt;/em&gt; as the unit for observability. In a monitoring/ metrics platform, a &lt;em&gt;component&lt;/em&gt; (e.g., a service, host) is the fundamental unit that is being observed. One can ask these platforms questions about the behavior of this unit as a whole, over time. For example, what is this service's health/ throughput /error rate in a specific timeframe?&lt;/p&gt;

&lt;p&gt;With logs, the fundamental unit being observed is an &lt;em&gt;event&lt;/em&gt; - e.g., whenever an event occurs during code execution, print some information. These "events" are subjectively defined by developers while writing code. The challenge with logs is that they are all disjointed, with each component printing its own form of log messages in isolation, with no easy way to connect them together to make sense.&lt;/p&gt;

&lt;p&gt;In contrast, with distributed tracing what is being observed is a single request as it traverses several components. This allows us to ask questions about the distributed system as a whole and understand what occurred where in a complex, interconnected system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s8K-BqZj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/4F5xRCO.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s8K-BqZj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/4F5xRCO.png" alt="View across metrics, logs and distributed tracing" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;View across metrics, logs, and distributed tracing&lt;/p&gt;

&lt;p&gt;The basic case for distributed tracing lies in the argument that this orientation around requests is the &lt;em&gt;closest to the end user's experience&lt;/em&gt;. And as a result, it is also the most intuitive for how we'd like to examine and troubleshoot distributed architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  The evolution of Distributed Tracing
&lt;/h2&gt;

&lt;p&gt;Distributed Tracing has risen in importance due to the widespread adoption of distributed software architectures in the last decade.&lt;/p&gt;

&lt;p&gt;The modern microservices-based architecture is an evolution from the late 90s internet growth story, when it became common to use &lt;em&gt;request-response&lt;/em&gt; systems.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"With the late 90s and explosive growth of the internet, came the huge proliferation of &lt;em&gt;request-response systems&lt;/em&gt;, such as two-tier websites, with a web server frontend and a database backend... Requests were a new dimension for reasoning about systems, orthogonal to any one machine or process in aggregate."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Distributed Tracing in Practice, O'Reilly Media&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In these microservices architectures, every single request ends up hitting many (tens or even hundreds) of microservices, making several network calls in between. Refer below to Uber's microservices architecture, which has 3000+ services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ppgElaHD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://static.wixstatic.com/media/ae6199_8146a7a13e184f46b5939db45c58dca8%257Emv2.png/v1/fill/w_614%2Ch_523%2Cal_c%2Clg_1%2Cq_85%2Cenc_auto/ae6199_8146a7a13e184f46b5939db45c58dca8%257Emv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ppgElaHD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://static.wixstatic.com/media/ae6199_8146a7a13e184f46b5939db45c58dca8%257Emv2.png/v1/fill/w_614%2Ch_523%2Cal_c%2Clg_1%2Cq_85%2Cenc_auto/ae6199_8146a7a13e184f46b5939db45c58dca8%257Emv2.png" alt="Uber's microservices architecture image from Jaeger" width="614" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Uber's microservices architecture from 2018. Source: &lt;a href="https://www.uber.com/en-IN/blog/microservice-architecture/"&gt;https://www.uber.com/en-IN/blog/microservice-architecture/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In such complex systems, distributed tracing becomes critical for any form of troubleshooting. As a result, Distributed Tracing was pioneered by large companies that were early adopters using large, complex, distributed environments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Google's Dapper paper released in 2010 was the beginning of distributed tracing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the next few years, two more companies open-sourced their own distributed tracing systems (Twitter open-sourced Zipkin in 2012 and Uber open-sourced Jaeger in 2017). Zipkin and Jaeger continue to be among the most popular distributed tracing tools even today&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since 2016, there has been a significant effort to standardize distributed tracing across components through the OpenTracing project. OpenTracing merged with OpenCensus in 2019 to form &lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt;, which is widely popular and has thousands of contributors globally&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now Distributed Tracing is widely regarded as the third "pillar" of observability alongside metrics and logs. Most major monitoring and observability players provide distributed tracing tools as part of their products.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  State of Distributed Tracing: Theory vs Reality
&lt;/h2&gt;

&lt;p&gt;However, despite the promise, excitement, and community effort, the adoption of distributed tracing today is around ~25%. It is not uncommon to find companies on microservices architectures who are making do with logs and metrics, even though they clearly need distributed tracing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5oron_TV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://static.wixstatic.com/media/ae6199_5cabb9f67fb24b69a3ee9ab467b6748e%257Emv2.png/v1/fill/w_1480%2Ch_748%2Cal_c%2Cq_90%2Cusm_0.66_1.00_0.01%2Cenc_auto/ae6199_5cabb9f67fb24b69a3ee9ab467b6748e%257Emv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5oron_TV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://static.wixstatic.com/media/ae6199_5cabb9f67fb24b69a3ee9ab467b6748e%257Emv2.png/v1/fill/w_1480%2Ch_748%2Cal_c%2Cq_90%2Cusm_0.66_1.00_0.01%2Cenc_auto/ae6199_5cabb9f67fb24b69a3ee9ab467b6748e%257Emv2.png" alt="Distributed Tracing adoption" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Distributed Tracing adoption&lt;/p&gt;

&lt;p&gt;At the same time, Mean-Time-To-Resolve production errors are going up in the world today. &lt;a href="https://logz.io/devops-pulse-2023/#increasing_mttr"&gt;73% of companies report it takes over an hour&lt;/a&gt; to resolve production issues today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--voDtF7sU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://static.wixstatic.com/media/ae6199_e6cb3b704d3a4f40a2230c9eb9591ae6%257Emv2.png/v1/fill/w_1480%2Ch_716%2Cal_c%2Cq_90%2Cusm_0.66_1.00_0.01%2Cenc_auto/ae6199_e6cb3b704d3a4f40a2230c9eb9591ae6%257Emv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--voDtF7sU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://static.wixstatic.com/media/ae6199_e6cb3b704d3a4f40a2230c9eb9591ae6%257Emv2.png/v1/fill/w_1480%2Ch_716%2Cal_c%2Cq_90%2Cusm_0.66_1.00_0.01%2Cenc_auto/ae6199_e6cb3b704d3a4f40a2230c9eb9591ae6%257Emv2.png" alt="Increasing production MTTRs" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Increasing production MTTRs&lt;/p&gt;

&lt;p&gt;Ask any developer what the most painful moments in their lives are, and they'll talk about time spent debugging a Sev-1 error in production with what seemed like a few hundred people breathing down their necks.&lt;/p&gt;

&lt;p&gt;It seems, then, that any company that cares about its MTTR (which is nearly every company) should be using distributed tracing, and adoption should have skyrocketed in this environment. But the actual numbers do not support that - so what gives?&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges with Distributed Tracing today
&lt;/h2&gt;

&lt;p&gt;There are several problems with distributed tracing today that companies have to overcome to get value - none of which is discussed widely in the mainstream narrative.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Implementation is hard!
&lt;/h2&gt;

&lt;p&gt;To implement distributed tracing in a service today, we need to make a code change and a release. While making code changes is a common-enough ask for observability, the challenge specifically with distributed tracing is this: &lt;em&gt;every service or component&lt;/em&gt; needs to be instrumented to get a distributed trace, or the trace breaks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z4DoW45Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/SMolra1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z4DoW45Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/SMolra1.png" alt="Distributed Tracing instrumentation" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each service has to be instrumented with an agent&lt;/p&gt;

&lt;p&gt;One cannot just get started with a single service - as one can with monitoring or logging - and realize value. Distributed tracing requires instrumentation across a &lt;em&gt;collective set of services&lt;/em&gt; to generate usable traces.&lt;/p&gt;

&lt;p&gt;This requires coordination across several teams and service owners to make changes in their services. So it becomes an organizational problem - imagine getting hundreds of teams to instrument their services over several months before you can realize value.&lt;/p&gt;

&lt;p&gt;This is the biggest challenge with distributed tracing today.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Need for complex sampling decisions
&lt;/h2&gt;

&lt;p&gt;Next, the volume of trace data generated by Distributed Tracing can be overwhelming. Imagine hundreds of services each emitting a small amount of trace data for &lt;em&gt;every single request&lt;/em&gt;. This can add up to millions of requests per second, making distributed tracing expensive in terms of both storage and network bandwidth.&lt;/p&gt;

&lt;p&gt;While logging also does the same thing (and emits &lt;em&gt;more data&lt;/em&gt; per request, which is then managed by massive log aggregation tools), the difference is that most companies today &lt;em&gt;already have&lt;/em&gt; logging. Introducing one more data type which is going to be almost as voluminous as logging is a daunting task and will likely double the spend.&lt;/p&gt;

&lt;p&gt;To handle this problem of cost, all distributed tracing systems today use &lt;a href="https://www.jaegertracing.io/docs/1.45/sampling/"&gt;sampling&lt;/a&gt; and record only a subset of traces. Common sampling rates in practice today are between 0.1% and 2%. The rationale is that even a 1% sample is sufficient to give a decent aggregate picture of where the performance bottlenecks are.&lt;/p&gt;
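A head-based sampler of the kind described above can be sketched as follows. Hashing the trace ID, rather than drawing a random number per span, is a common trick so that every service makes the same keep/drop decision for a given trace; the function name and hashing scheme here are illustrative, not any specific vendor's implementation.

```python
import zlib

def head_sample(trace_id, rate=0.01):
    """Deterministically keep roughly `rate` of traces by hashing the
    trace ID, so all services agree on the decision for the same trace."""
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < rate * 10_000

# At a 1% rate, roughly 1 in 100 traces is retained.
kept = sum(head_sample("trace-{}".format(i)) for i in range(100_000))
```

The decision is made before anything about the request is known, which is exactly why head-based sampling is cheap but blind to which traces turn out to be interesting.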

&lt;p&gt;Most platforms today let customers choose their sampling strategy and make their own cost-visibility trade-offs. However, this decision process adds to the already complex overhead of instrumenting and managing a distributed tracing system.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. But sampling meaningfully diminishes the value
&lt;/h2&gt;

&lt;p&gt;Let's assume a company goes through the effort of instrumenting every service/ component and then deciding the sampling strategy to ensure they don't break the bank.&lt;/p&gt;

&lt;p&gt;What now - should we expect MTTR to drop dramatically? No: because of sampling, developers can't actually use distributed tracing to troubleshoot issues.&lt;/p&gt;

&lt;p&gt;Imagine a developer's experience - "I can't find the issue &lt;em&gt;I know&lt;/em&gt; is there. I generated the error, but I cannot find the corresponding trace".&lt;/p&gt;

&lt;p&gt;So what happens? Developers stop trusting the quality of distributed tracing data and revert to their regular methods for debugging/ troubleshooting (i.e., using logs).&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Developer usage is low frequency
&lt;/h2&gt;

&lt;p&gt;Given these constraints, today Distributed Tracing is primarily sold as a way to troubleshoot &lt;em&gt;performance problems&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Remember that a basic distributed trace really just tells us who called who and how long each span took. Distributed traces don't tell us &lt;em&gt;what happened within the service&lt;/em&gt; that caused the error/ high latency. For that, developers still have to look at the log message and/ or reproduce the issue locally to debug.&lt;/p&gt;

&lt;p&gt;In a typical company, performance issues are likely &amp;lt;10% of the total. So in reality, distributed tracing is only useful for this small segment of issues.&lt;/p&gt;

&lt;p&gt;The average developer who ships and owns a service is using a distributed tracing tool maybe 2-3 times a year.&lt;/p&gt;

&lt;h2&gt;
  
  
  Impact of all these challenges
&lt;/h2&gt;

&lt;p&gt;In summary -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Distributed tracing is hard to implement&lt;/li&gt;
&lt;li&gt; Distributed Tracing needs extensive sampling to control costs&lt;/li&gt;
&lt;li&gt; But sampling reduces the value considerably&lt;/li&gt;
&lt;li&gt; As a result, developers only use tracing for the odd one-off performance use case&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All this makes the RoI case for distributed tracing quite fuzzy.&lt;/p&gt;

&lt;p&gt;In hype-cycle terms, we are now past the peak of inflated expectations, and disillusionment is beginning to settle in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6wtVMnDW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/ex5GJQU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6wtVMnDW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/ex5GJQU.png" alt="Hype cycle - Distributed Tracing" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hype cycle - Distributed Tracing&lt;/p&gt;

&lt;p&gt;If we think in terms of end-state though, if the future of computing systems is distributed, then distributed tracing is naturally the most fundamental vector for observability. In that world, any company with a distributed architecture uses tracing as the primary mechanism for troubleshooting anything occurring in production - true "observability" - vs the passive monitoring of systems we have today.&lt;/p&gt;

&lt;p&gt;Before we can get to that end-state, though, we will need several improvements over the status quo. The good news is that much of this is already underway. So what can we expect to see in the future? Let's look at each improvement in turn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future of distributed tracing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Instant instrumentation with no code changes
&lt;/h3&gt;

&lt;p&gt;Imagine dropping in an agent and being able to cover an entire distributed system (all services, components) in one go without code changes.&lt;/p&gt;

&lt;p&gt;This looks realistically possible in the next 2-3 years.&lt;/p&gt;

&lt;p&gt;OpenTelemetry's &lt;a href="https://opentelemetry.io/docs/concepts/instrumentation/"&gt;auto-instrumentation libraries&lt;/a&gt; already enable this for some programming languages (though they fall short for compiled languages such as Go). In parallel, &lt;a href="https://www.zerok.ai/post/how-ebpf-is-changing-observability-as-we-know-it"&gt;technologies like eBPF&lt;/a&gt; are evolving to &lt;a href="https://www.zerok.ai/post/ebpf-vs-agents-vs-sidecars"&gt;enable system-wide instrumentation with no code change&lt;/a&gt;. Between the two, we can safely expect the instrumentation problem to be solved in a few years.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sampling gives way to AI-based selection of requests-of-interest
&lt;/h3&gt;

&lt;p&gt;In an LLM world, random sampling begins to look like a relic from the dark ages. Ideally, we should be able to look at 100% of traces, identify anything that looks anomalous, and store that for future examination. No more random sampling.&lt;/p&gt;

&lt;p&gt;If we think about it, we don't really care about the ~95% "happy requests". We only care about the ~5% of anomalous traces - errors, exceptions, high latency, or some form of soft errors. So we just need a way to look at 100% and pick out the interesting 5%.&lt;/p&gt;
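As a toy illustration of "look at 100% and keep the interesting few", here is a batch filter that retains any trace with an error, or with latency above a chosen quantile of the batch. Real systems would operate on streams with far richer anomaly signals than latency alone; the function name and the trace schema (`latency_ms`, `error`) are hypothetical.

```python
import statistics

def interesting_traces(traces, latency_quantile=0.95):
    """Keep only anomalous traces: any trace with an error, or with
    latency above the given quantile of this batch. `traces` is a list
    of dicts with 'latency_ms' and 'error' keys (an illustrative schema)."""
    latencies = [t["latency_ms"] for t in traces]
    cutoff = statistics.quantiles(latencies, n=100)[int(latency_quantile * 100) - 1]
    return [t for t in traces if t["error"] or t["latency_ms"] > cutoff]

# 98 fast, healthy traces; one slow outlier; one error.
batch = [{"latency_ms": 10.0, "error": False} for _ in range(98)]
batch.append({"latency_ms": 500.0, "error": False})
batch.append({"latency_ms": 10.0, "error": True})
```

Only the slow outlier and the failed request survive the filter; the happy majority is dropped without ever being stored long-term.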

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EtDFHQw6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://static.wixstatic.com/media/ae6199_6ae8f516dbce4a7c874000421c190658%257Emv2.png/v1/fill/w_1480%2Ch_924%2Cal_c%2Cq_90%2Cusm_0.66_1.00_0.01%2Cenc_auto/ae6199_6ae8f516dbce4a7c874000421c190658%257Emv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EtDFHQw6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://static.wixstatic.com/media/ae6199_6ae8f516dbce4a7c874000421c190658%257Emv2.png/v1/fill/w_1480%2Ch_924%2Cal_c%2Cq_90%2Cusm_0.66_1.00_0.01%2Cenc_auto/ae6199_6ae8f516dbce4a7c874000421c190658%257Emv2.png" alt="Traces we care about" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Traces we care about&lt;/p&gt;

&lt;p&gt;There are mechanisms like &lt;a href="https://opentelemetry.io/docs/concepts/sampling/"&gt;tail-based sampling&lt;/a&gt; that aim to do this today. In tail-based sampling, the system waits until all the spans in a request have been completed, and then based on the full trace, decides whether it has to be retained.&lt;/p&gt;

&lt;p&gt;The main challenge with tail-based sampling is that you have to store all the spans of a trace until the whole request is completed, and only then decide whether to keep or discard the trace. This means storing every single request, with all its spans, for a certain period (until the request completes) - which requires a separate data architecture, with components for load-balancing, storage &amp;amp; processing, that is highly complex and expensive.&lt;/p&gt;
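The buffering behavior that makes tail-based sampling expensive shows up even in a minimal sketch: every span must be held in memory until its trace completes, and only then is the keep/discard decision made. The class and field names below are illustrative, not OpenTelemetry's actual collector API.

```python
from collections import defaultdict

class TailSampler:
    """Tail-based sampling sketch: buffer all spans of a trace until the
    trace completes, then keep it only if it has an error or a slow span."""

    def __init__(self, latency_threshold_ms=100.0):
        self.buffer = defaultdict(list)   # trace_id -> spans held in memory
        self.kept = []
        self.threshold_ms = latency_threshold_ms

    def on_span(self, trace_id, span):
        # Every span must be stored, even for traces we will later discard.
        self.buffer[trace_id].append(span)

    def on_trace_complete(self, trace_id):
        spans = self.buffer.pop(trace_id, [])
        slow = any(s["duration_ms"] > self.threshold_ms for s in spans)
        failed = any(s["error"] for s in spans)
        if slow or failed:
            self.kept.append((trace_id, spans))
        # Otherwise the buffered spans are dropped wholesale.

sampler = TailSampler()
sampler.on_span("t1", {"duration_ms": 30.0, "error": False})
sampler.on_span("t1", {"duration_ms": 20.0, "error": True})   # an error span
sampler.on_trace_complete("t1")                               # kept
sampler.on_span("t2", {"duration_ms": 10.0, "error": False})
sampler.on_trace_complete("t2")                               # discarded
```

In a real deployment this buffer must hold every in-flight request across the whole system, which is where the load-balancing, storage, and processing complexity comes from.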

&lt;p&gt;OpenTelemetry has a &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md"&gt;tail-based sampling collector&lt;/a&gt;, however, it is not yet mature and has several &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/4758"&gt;scalability challenges&lt;/a&gt; (due to the problem mentioned above). Meanwhile, several companies including &lt;a href="https://www.zerok.ai/"&gt;ZeroK.ai&lt;/a&gt; are working on using AI to make anomaly detection efficient and scalable.&lt;/p&gt;

&lt;p&gt;With the fast pace of development in this space, we can reasonably expect this problem to also be solved in the next 3-5 years.&lt;/p&gt;

&lt;h3&gt;
  
  
  The emergence of "rich" distributed traces that enable all debugging
&lt;/h3&gt;

&lt;p&gt;A true leap into the next generation of tracing will be when tracing evolves from the realm of "performance issues only" to "all issues". That is when the true power of distributed tracing is unleashed.&lt;/p&gt;

&lt;p&gt;For this to be possible, each trace needs to have rich context.&lt;/p&gt;

&lt;p&gt;Imagine a scenario where each span in each trace has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request &amp;amp; response payloads (with PII masking)&lt;/li&gt;
&lt;li&gt;Stack traces for any exceptions&lt;/li&gt;
&lt;li&gt;Logs&lt;/li&gt;
&lt;li&gt;Kubernetes events&lt;/li&gt;
&lt;li&gt;Pod states&lt;/li&gt;
&lt;li&gt;And anything else that occurred along that span&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All in one integrated, seamless flow.&lt;/p&gt;
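One way to picture such a "rich" span is as a single record bundling all of the context listed above. The schema below is purely illustrative; no real tracing system uses these exact field names.

```python
from dataclasses import dataclass, field

@dataclass
class EnrichedSpan:
    """Sketch of a 'rich' span carrying full debugging context.
    Field names are illustrative, not from any real tracing schema."""
    trace_id: str
    service: str
    duration_ms: float
    request_payload: dict = field(default_factory=dict)   # PII-masked
    response_payload: dict = field(default_factory=dict)  # PII-masked
    stack_trace: str = ""                                 # filled on exception
    logs: list = field(default_factory=list)              # log lines in this span
    k8s_events: list = field(default_factory=list)        # Kubernetes events
    pod_state: dict = field(default_factory=dict)         # pod status snapshot

span = EnrichedSpan(trace_id="t1", service="checkout", duration_ms=42.0)
span.logs.append("payment gateway timeout")
```

With everything attached to the span, a developer (or a model) reads one record instead of correlating five separate tools by hand.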

&lt;p&gt;And imagine if the trace one is looking for is super easy to find - there are no sampling-related data gaps, issues are deduped &amp;amp; grouped, and can be filtered across several dimensions.&lt;/p&gt;

&lt;p&gt;This, then, is all a developer needs to debug any software issue. And potentially, all an AI model needs to diagnose and point a developer to what's going wrong.&lt;/p&gt;

&lt;p&gt;In this world, the trace becomes the primary axis for observability, &lt;a href="https://www.zerok.ai/post/can-distributed-tracing-replace-logging"&gt;replacing logging&lt;/a&gt;. That is what the end-state of distributed tracing could look like - while it's not here yet, it is visible from where we are today.&lt;/p&gt;

&lt;p&gt;The main barrier to making this possible is the explosion in data volume that storing all this context will cause. We will need deep innovation in data processing and storage architectures to get there. It is still early days, and we will have to wait and see how this plays out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In summary, distributed tracing is a necessary and intuitive view for observing distributed application architectures in production.&lt;/p&gt;

&lt;p&gt;The first generation of distributed tracing, while promising, has been beset by several challenges that have made it difficult for companies to get value from tracing, which has stymied adoption somewhat.&lt;/p&gt;

&lt;p&gt;However, several exciting developments in this space are expected to make tracing simpler and more powerful than what we have today, making observability more seamless in the future.&lt;/p&gt;

</description>
      <category>distributedtracing</category>
      <category>observability</category>
    </item>
    <item>
      <title>A comparison of eBPF Observability vs Agents and Sidecars</title>
      <dc:creator>Samyukktha</dc:creator>
      <pubDate>Fri, 26 May 2023 21:44:15 +0000</pubDate>
      <link>https://forem.com/samyukktha/a-comparison-of-ebpf-observability-vs-agents-and-sidecars-3ep7</link>
      <guid>https://forem.com/samyukktha/a-comparison-of-ebpf-observability-vs-agents-and-sidecars-3ep7</guid>
      <description>&lt;p&gt;The observability landscape is witnessing a radical transformation today. The central driver of this shift is eBPF (extended Berkeley Packet Filter), a technology that is revolutionizing how we observe and monitor systems. &lt;a href="https://www.zerok.ai/post/how-ebpf-is-changing-observability-as-we-know-it"&gt;In an earlier post&lt;/a&gt;, we took a detailed look at the technology of eBPF and its implications for observability.&lt;/p&gt;

&lt;p&gt;In this article, we will compare eBPF-based instrumentation with other instrumentation methods like code agents and sidecars and see which best suits the needs of observability today.&lt;/p&gt;

&lt;p&gt;Before we dive in, let's briefly revisit eBPF.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bM5h39zI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/TyKjgX3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bM5h39zI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/TyKjgX3.png" alt="eBPF logo" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;eBPF logo&lt;/p&gt;

&lt;h2&gt;
  
  
  What is eBPF?
&lt;/h2&gt;

&lt;p&gt;eBPF is a framework in the Linux kernel that allows us to safely run sandboxed programs in the kernel space, without changing kernel source code or adding kernel modules.&lt;/p&gt;

&lt;p&gt;eBPF programs are highly efficient and secure - they undergo strict verification by the kernel to ensure they don't risk its stability or security, and they run as native machine code in the kernel, making them highly performant.&lt;/p&gt;

&lt;p&gt;eBPF as a technology has existed for 25+ years and is &lt;a href="https://www.infoq.com/articles/ebpf-cloud-native-platforms/"&gt;now becoming mainstream&lt;/a&gt; with the emergence of cloud-native architectures.&lt;/p&gt;

&lt;p&gt;Companies like &lt;a href="https://engineering.linkedin.com/blog/2022/skyfall--ebpf-agent-for-infrastructure-observability"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://twitter.com/brendangregg/status/1402072818108407810?lang=en"&gt;Netflix&lt;/a&gt;, &lt;a href="https://engineering.fb.com/2018/05/22/open-source/open-sourcing-katran-a-scalable-network-load-balancer/"&gt;Facebook&lt;/a&gt; and &lt;a href="https://www.youtube.com/watch?v=7UQ2CU6UEGY"&gt;Adobe&lt;/a&gt; have been using eBPF in production for years. If you are using GKE (Google's Kubernetes offering), you're using eBPF already, as GKE's &lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/bringing-ebpf-and-cilium-to-google-kubernetes-engine"&gt;networking, security, and observability&lt;/a&gt; are powered by eBPF.&lt;/p&gt;

&lt;p&gt;While eBPF has been used widely in networking (high-speed packet processing) and security historically, it is recently gaining traction in observability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to eBPF observability
&lt;/h2&gt;

&lt;p&gt;The premise for eBPF in observability is straightforward.&lt;/p&gt;

&lt;p&gt;eBPF operates at the kernel level, and as everything eventually goes through the kernel, one can technically observe everything from this vantage point.&lt;/p&gt;

&lt;p&gt;So eBPF provides a new way to instrument for telemetry. Other components of a typical observability solution like data processing, storage, and visualization are unaffected by eBPF.&lt;/p&gt;

&lt;p&gt;eBPF opens up new visibility into system and application behavior that was previously difficult to achieve, such as low-overhead profiling, system call tracing, and deep visibility into network traffic. As a result, eBPF is pushing the boundaries of what is possible in observability and providing tools to understand, optimize, and troubleshoot systems like never before.&lt;/p&gt;

&lt;p&gt;So how does eBPF-based instrumentation compare with existing instrumentation mechanisms? For that, let's look at the commonly used instrumentation techniques today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent-based instrumentation: The Old Guard
&lt;/h2&gt;

&lt;p&gt;Agent-based methods have been the mainstay of system monitoring and observability for years. They work by installing an agent (a software SDK/library) on every node, microservice, or infrastructure component that needs monitoring. These agents collect data and send it back to a central location for analysis.&lt;/p&gt;

&lt;p&gt;While this method provides deep visibility, it comes with challenges. Installing agents is time-consuming and requires code changes, and the agents can introduce performance overhead. There are also potential security issues, so agents need to be meticulously screened for safety.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sidecar-based Instrumentation: The Microservices accessory
&lt;/h2&gt;

&lt;p&gt;With the rise of microservices architecture, sidecar-based instrumentation emerged as an alternative mechanism. Sidecars are lightweight, independent processes that run alongside an application or service. In this setup, a sidecar proxy (e.g., &lt;a href="https://www.envoyproxy.io/"&gt;Envoy&lt;/a&gt;) is attached to each service, acting as a shared module for monitoring tasks that can see everything going in and out of the service.&lt;/p&gt;

&lt;p&gt;Sidecars are easier to implement compared to agents (no code changes). However, they can be resource-intensive, leading to increased CPU and memory usage. Additionally, updating sidecars often requires service restarts, which may not be ideal for always-on services.&lt;/p&gt;

&lt;h2&gt;
  
  
  eBPF-based Instrumentation: The new kid on the Block
&lt;/h2&gt;

&lt;p&gt;Enter eBPF, a technology that has been around for a while in the networking and security space but has recently gained traction in observability.&lt;/p&gt;

&lt;p&gt;eBPF operates at the kernel level making it possible to observe system, infrastructure and application behavior from one place. eBPF programs can be dropped into the kernel without any code changes or releases. Moreover, they are non-intrusive, causing no impact on running workloads, and are checked for safety before execution, providing an added layer of security.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges with existing instrumentation methods
&lt;/h2&gt;

&lt;p&gt;The traditional code agent-based model is the most widely used - almost all observability solutions in the market today use agents. However, it has become unwieldy in modern distributed environments that have hundreds of services and hosts.&lt;/p&gt;

&lt;p&gt;In the agent world, each service/host is instrumented separately with an agent, requiring a release/restart. And for every maintenance update, this process is repeated for every component - dramatically increasing human overhead.&lt;/p&gt;

&lt;p&gt;The sidecar proxy-based model was once expected to emerge as a viable alternative to agents due to its ease of implementation (no code changes). However, sidecars still require restarts and add performance overhead, &lt;a href="https://matt-rickard.com/agent-vs-agentless"&gt;dampening the enthusiasm&lt;/a&gt; around them - the sidecar wave seems to be crashing before it even took off.&lt;/p&gt;

&lt;p&gt;Meanwhile, eBPF is emerging as the more performant, secure, scalable, and easy-to-implement alternative to sidecars in distributed environments (&lt;a href="https://www.youtube.com/watch?v=7ZVQSg9HX68"&gt;Is eBPF the end of Kubernetes sidecars?&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Let us evaluate these 3 types of instrumentation a bit more closely.&lt;/p&gt;

&lt;h2&gt;
  
  
  eBPF vs agents vs Sidecars - a comprehensive assessment
&lt;/h2&gt;

&lt;p&gt;First, what are the considerations around which we want to evaluate? What factors do we care about? Let's start with the following -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Data visibility/granularity&lt;/strong&gt; - what types of observability data can be obtained?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Intrusiveness&lt;/strong&gt; - is data collection inline with the running workload, or out-of-band?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Performance overhead&lt;/strong&gt; - what are the additional resource requirements and impact on running workloads?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Safety and security&lt;/strong&gt; - what are the guardrails around security, given these are added to production workloads?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Ease of implementation&lt;/strong&gt; - how easy is it to get started?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Ease of maintenance and updates&lt;/strong&gt; - what do maintenance and updates involve?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Scalability&lt;/strong&gt; - In high-scale systems, how does the instrumentation method perform in all of the parameters above?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now that we have the criteria, let us compare the 3 instrumentation methods against each of the above. See below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;eBPF&lt;/th&gt;
&lt;th&gt;Agents&lt;/th&gt;
&lt;th&gt;Sidecars&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Data Visibility/Granularity&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt; (but some gaps)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Intrusiveness&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Low&lt;/strong&gt; (out of band)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt; (inline)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt; (inline)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Performance overhead&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Safety and Security&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. Ease of implementation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. Ease of maintenance and updates&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7. Scalability&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;eBPF vs Code agents vs Sidecars: Comparison&lt;/p&gt;

&lt;p&gt;Based on our assessment, eBPF appears to outperform other instrumentation methods across nearly all parameters. Let us dive deeper and see how.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data visibility/ granularity
&lt;/h2&gt;

&lt;p&gt;The most important question with an instrumentation method is - what type of observability data can I get through this? How deep and wide can I go?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eBPF:&lt;/strong&gt; Unparalleled visibility into infrastructure and network behavior. Can also provide visibility into application behavior. Custom metrics can be implemented through custom probes without restarts or code changes. The unique vantage point of the kernel opens up additional use cases like low-overhead continuous profiling and &lt;a href="https://www.cncf.io/blog/2021/11/17/debugging-with-ebpf-part-1-tracing-go-function-arguments-in-prod/"&gt;live debugging&lt;/a&gt;. While the coverage is broad, the main gap today is distributed tracing (&lt;a href="https://github.com/deepflowio/deepflow"&gt;possible&lt;/a&gt;, but not yet mature), an area seeing rapid innovation from the community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt;: Agents provide comprehensive coverage of application behavior. Written in the same language as the application they monitor, agents provide deep visibility into application code execution and metrics. Agents written separately for infrastructure components provide visibility into infrastructure metrics. However, agents fall short in network observability (e.g., visibility into network events and packets) and system visibility (e.g., visibility into system calls or kernel data structures).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sidecar proxy&lt;/strong&gt;: Sidecar proxies can provide basic application metrics like latency, error rates, and throughput. They can even implement distributed tracing and access logs. However, they do not provide infrastructure visibility. They also do not have code-level visibility.&lt;/p&gt;

&lt;p&gt;See below for a summary of how the 3 mechanisms fare in providing visibility into common observability data types.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c1mppc2K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://static.wixstatic.com/media/ae6199_da5d90b9c65b44e4852bba02a2c4ea2f%257Emv2.png/v1/fill/w_1480%2Ch_772%2Cal_c%2Cq_90%2Cusm_0.66_1.00_0.01%2Cenc_auto/ae6199_da5d90b9c65b44e4852bba02a2c4ea2f%257Emv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c1mppc2K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://static.wixstatic.com/media/ae6199_da5d90b9c65b44e4852bba02a2c4ea2f%257Emv2.png/v1/fill/w_1480%2Ch_772%2Cal_c%2Cq_90%2Cusm_0.66_1.00_0.01%2Cenc_auto/ae6199_da5d90b9c65b44e4852bba02a2c4ea2f%257Emv2.png" alt="Obervability data types available throuth eBPF, sidecars and code agents" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Observability data types available&lt;/p&gt;

&lt;h2&gt;
  
  
  Intrusiveness
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;eBPF:&lt;/strong&gt; Least intrusive. Data collection is out-of-band, with no impact on the applications being executed - snapshots are taken from an isolated sandbox as data passes through the kernel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent:&lt;/strong&gt; Most intrusive - agents sit inline with the monitored components and run every time the workload runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sidecar proxy:&lt;/strong&gt; Also inline with the execution path, although somewhat less intrusive than agents as they don't sit within the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Overhead
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;eBPF&lt;/strong&gt;: eBPF really shines here and has the lowest performance overhead. This is because eBPF programs run as native machine code on the kernel and there is no context switching between user space and kernel space, driving &lt;a href="https://www.groundcover.com/blog/ebpf-observability-agent"&gt;near-zero overhead&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt;: High overhead (which can range from 10% to 100%+) that varies widely across individual agent implementations. Agents operate in user space, incurring heavy context switching, and sit inline, so they execute every time the code executes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sidecar proxy&lt;/strong&gt;: Sidecars fare poorest here. They are also inline with traffic and operate in user space like agents. In addition, they add overhead through the extra network hop they introduce and the resources they consume (see figure below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DP38VsoA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/WyzP9Uv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DP38VsoA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/WyzP9Uv.png" alt="High performance overhead: Additional network hop introduced by sidecars" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;High performance overhead: Additional network hop introduced by sidecars&lt;/p&gt;

&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;eBPF:&lt;/strong&gt; Most secure - sandboxed execution environment, which restricts eBPF programs' access to a limited set of kernel functions and resources. In addition, they undergo verification by the kernel through the eBPF verifier to ensure safety.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent:&lt;/strong&gt; Code injection increases the potential attack surface, as vulnerabilities in the agent code could be exploited by malicious actors. Requires careful vetting (and management) of agent code to ensure security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sidecar proxy:&lt;/strong&gt; Sidecar proxies introduce additional network components, increasing the potential attack surface. However, isolation between the proxy and the application can help mitigate some security risks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ease of Installation, maintenance, and updates
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;eBPF:&lt;/strong&gt; Easiest, as eBPF agents can be dropped in directly - one agent in the kernel covers all applications and infrastructure in one go, without any restarts. However, installation requires privileged access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent:&lt;/strong&gt; Hardest. Agents must be installed in every service and component, which requires code changes, releases, and restarts for each service and component. It's not just the installation but the management, maintenance, and debugging of all these disparate pieces that add to the significant overhead. However, emerging &lt;a href="https://opentelemetry.io/docs/instrumentation/java/automatic/"&gt;auto-instrumentation libraries&lt;/a&gt; are looking to reduce the effort.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WrhlhFWP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/5kfOEFU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WrhlhFWP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/5kfOEFU.png" alt="Code agents vs eBPF-based instrumentation" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Code agents vs eBPF-based instrumentation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sidecar proxy:&lt;/strong&gt; Sidecar proxies can be installed and updated with no code changes so are easier than agents. However they still require restarts, and the pod injection process can be problematic, slowing pod start-up times or causing &lt;a href="https://release-v1-1.docs.openservicemesh.io/docs/guides/troubleshooting/container_startup/"&gt;race conditions or instabilities&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scalability
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;eBPF:&lt;/strong&gt; Highly scalable, as eBPF programs can efficiently gather data from multiple sources without requiring additional agents or tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent:&lt;/strong&gt; Scalability is challenging as each application component and infra host requires its own monitoring agent. This increases resource consumption and management complexity in large-scale environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sidecar proxy:&lt;/strong&gt; Low scalability, as they consume additional resources for each application component. The increased resource consumption and network overhead can become significant in large-scale environments, especially when managing multiple sidecar instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary Pros and Cons
&lt;/h2&gt;

&lt;p&gt;In summary, eBPF-based instrumentation is significantly better than current instrumentation mechanisms -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broad coverage - across applications and infrastructure in one go&lt;/li&gt;
&lt;li&gt;Non-intrusive - out-of-band data collection has no impact on running workloads&lt;/li&gt;
&lt;li&gt;Easy to implement and maintain&lt;/li&gt;
&lt;li&gt;Highly performant&lt;/li&gt;
&lt;li&gt;More secure&lt;/li&gt;
&lt;li&gt;More scalable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That said, there are some limitations as of today -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed tracing with eBPF is not as mature&lt;/li&gt;
&lt;li&gt;Restricted to Linux environments (the Windows implementation is not yet mature)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference between eBPF and the agent-based approach is most visible in modern cloud-native environments (Kubernetes, microservices) - in performance overhead, scalability, security, and ease of installation &amp;amp; maintenance. This is likely one of the reasons why eBPF adoption over the last 5-6 years has been highest among large-scale technology companies with massive footprints.&lt;/p&gt;

&lt;h2&gt;
  
  
  So what are the implications for observability?
&lt;/h2&gt;

&lt;p&gt;Given the significant advantages eBPF-based instrumentation offers, the next generation of observability solutions is likely to be built with eBPF.&lt;/p&gt;

&lt;p&gt;There are already several emerging eBPF-native observability solutions, including &lt;a href="https://www.zerok.ai/"&gt;ZeroK&lt;/a&gt;. Meanwhile, traditional observability players like New Relic and Datadog are also investing in updating their instrumentation.&lt;/p&gt;

&lt;p&gt;Over time, as eBPF becomes the default instrumentation mechanism that everyone uses, we can expect innovation in this space to shift to higher levels of the observability value chain, like data processing, advanced analytics and &lt;a href="https://www.zerok.ai/"&gt;AI&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ebpf</category>
      <category>kubernetes</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Decoding eBPF Observability: How eBPF transforms Observability as we know it 🕵️🐝</title>
      <dc:creator>Samyukktha</dc:creator>
      <pubDate>Mon, 22 May 2023 12:01:58 +0000</pubDate>
      <link>https://forem.com/samyukktha/decoding-ebpf-observability-how-ebpf-transforms-observability-as-we-know-it-2418</link>
      <guid>https://forem.com/samyukktha/decoding-ebpf-observability-how-ebpf-transforms-observability-as-we-know-it-2418</guid>
      <description>&lt;p&gt;There has been a lot of chatter about eBPF in cloud-native communities over the last 2 years. eBPF was a &lt;a href="https://www.youtube.com/watch?v=KhPrMW5Rbbc"&gt;mainstay&lt;/a&gt; at KubeCon, &lt;a href="https://www.youtube.com/playlist?list=PLDg_GiBbAx-lZtLQtDaoj_eoMfmGzSmxo"&gt;eBPF days&lt;/a&gt; and &lt;a href="https://ebpf.io/summit-2022.html"&gt;eBPF summits&lt;/a&gt; are rapidly growing in popularity, companies like Google and Netflix have been &lt;a href="https://ebpf.io/"&gt;using eBPF&lt;/a&gt; for years, and new use cases are emerging all the time. Especially in observability, eBPF is expected to be a game changer.&lt;/p&gt;

&lt;p&gt;So let's look at eBPF - what is the technology, how is it impacting observability, how does it compare with existing observability practices, and what might the future hold?&lt;/p&gt;



&lt;h2&gt;
  
  
  What is eBPF really?
&lt;/h2&gt;



&lt;p&gt;eBPF is a programming framework that allows us to safely run sandboxed programs in the Linux kernel without changing kernel code.&lt;/p&gt;

&lt;p&gt;It was originally developed for Linux (which is still where the technology is most mature today), but Microsoft is rapidly evolving the &lt;a href="https://github.com/microsoft/ebpf-for-windows"&gt;eBPF implementation for Windows&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;eBPF programs are by design highly efficient and secure - they are verified by the kernel to ensure they don't risk the operating system's stability or security.&lt;/p&gt;



&lt;h2&gt;
  
  
  So why is eBPF a big deal?
&lt;/h2&gt;



&lt;p&gt;To understand this, we need to understand User space and Kernel space.&lt;/p&gt;

&lt;p&gt;User space is where all applications run. Kernel space sits between user space and the physical hardware. Applications in user space can't access hardware directly. Instead, they make system calls to the kernel, which then accesses the hardware.&lt;/p&gt;

&lt;p&gt;All memory access, file read/writes, and network traffic go through the kernel. The kernel also manages concurrent processes.&lt;/p&gt;

&lt;p&gt;Basically, everything goes through the kernel (see Figure below).&lt;/p&gt;
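<p>A quick way to see this for yourself: even trivial file I/O in a high-level language bottoms out in system calls mediated by the kernel. For example, in Python:</p>

```python
# Even simple file I/O goes through the kernel: os.write/os.read are thin
# wrappers over the write(2)/read(2) system calls.
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"hello kernel")    # write(2) system call
os.lseek(fd, 0, os.SEEK_SET)     # lseek(2): rewind the file offset
data = os.read(fd, 100)          # read(2) system call
os.close(fd)                     # close(2)
os.unlink(path)                  # unlink(2)
# An eBPF program attached to these syscalls would observe every one of
# these operations without modifying this code at all.
```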

&lt;p&gt;And eBPF provides a safe, secure way to extend kernel functionality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---gGlEep9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/j0jCqta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---gGlEep9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/j0jCqta.png" alt="What is user space and kernel space" width="800" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;User space and Kernel space&lt;/p&gt;

&lt;p&gt;Historically, for obvious reasons, changing anything in the kernel source code or operating systems layer has been super hard.&lt;/p&gt;

&lt;p&gt;The Linux kernel has &lt;a href="https://www.phoronix.com/news/Linux-5.12-rc1-Code-Size"&gt;30M lines of code&lt;/a&gt;, and it takes several years for any change to go from an idea to being available widely. First, the Linux community has to agree to it. Then, it has to become part of the official Linux release. Then, after a few months, it is picked up by distributions like Red Hat and Ubuntu, which take it to a wider audience.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Technically,&lt;/em&gt; one could load kernel modules to one's kernel and make changes directly, but this is &lt;em&gt;very&lt;/em&gt; high risk and involves complex kernel-level programming, so is almost universally avoided.&lt;/p&gt;

&lt;p&gt;eBPF comes along and solves this - and gives a &lt;strong&gt;secure and efficient&lt;/strong&gt; mechanism to attach and run programs in the kernel.&lt;/p&gt;

&lt;p&gt;Let's look at how eBPF ensures both security and performance.&lt;/p&gt;



&lt;h2&gt;
  
  
  Highly secure
&lt;/h2&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stringent verification&lt;/strong&gt; - Before any eBPF program can be loaded into the kernel, it is verified by the &lt;em&gt;eBPF verifier&lt;/em&gt;, which ensures the code is safe - e.g., no unbounded loops, invalid memory access, or unsafe operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sandboxed&lt;/strong&gt; - eBPF programs are run in a memory-isolated sandbox within the kernel, separate from other kernel components. This prevents unauthorized access to kernel memory, data structures, and kernel source code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limited operations&lt;/strong&gt; - eBPF programs typically have to be written in a small subset of the C language - a restricted instruction set. This limits the operations that eBPF programs can perform, reducing the risk of security vulnerabilities.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  High-performance / lightweight
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run as native machine code&lt;/strong&gt; - eBPF programs are run as native machine instructions on the CPU. This leads to faster execution and better performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No context switches&lt;/strong&gt; - A typical application frequently context-switches between user space and kernel space, which is resource-intensive. eBPF programs, running in the kernel layer, can directly access kernel data structures and resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event-driven&lt;/strong&gt; - eBPF programs typically run only in response to specific kernel events vs being always-on. This minimizes overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimized for hardware&lt;/strong&gt; - eBPF programs are compiled into machine code by the kernel's JIT (Just-In-Time) compiler just before execution, so the code is optimized for the specific hardware it runs on.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So eBPF provides a safe and efficient hook into the kernel for programming. And given everything goes through the kernel, this opens up several new possibilities that weren't possible until now.&lt;/p&gt;



&lt;h2&gt;
  
  
  Why is this a big deal only &lt;em&gt;now&lt;/em&gt;?
&lt;/h2&gt;



&lt;p&gt;The technology around eBPF has evolved over a long time and has been ~30 years in the making.&lt;/p&gt;

&lt;p&gt;In the last 7-8 years, eBPF has been used at scale by several large companies, and we are now entering an era where the use of eBPF is becoming mainstream. See &lt;a href="https://www.youtube.com/watch?v=DAvZH13725I"&gt;this video by Alexei Starovoitov&lt;/a&gt;, the co-creator of eBPF, on its evolution.&lt;/p&gt;



&lt;h3&gt;
  
  
  eBPF - a brief history
&lt;/h3&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;1993 - A &lt;a href="https://www.tcpdump.org/papers/bpf-usenix93.pdf"&gt;paper&lt;/a&gt; from Lawrence Berkeley National Lab explores using a kernel agent for packet filtering. This is where the name BPF ("Berkeley Packet Filter") comes from.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;1997 - BPF is officially introduced as part of the Linux kernel (version 2.1.75).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;1997-2014 - Several features are added to improve, stabilize and expand BPF capabilities.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;2014 - A significant update is introduced, called "extended Berkeley Packet Filter" (eBPF). This version makes big changes to BPF technology &amp;amp; makes it more widely usable - hence the word "extended".&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This release was a big deal because it made extending kernel functionality &lt;em&gt;easy&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A programmer could code more or less like they would a regular application - and the surrounding eBPF infrastructure takes care of the low-level verification, security, and efficiency.&lt;/p&gt;

&lt;p&gt;An entire supporting ecosystem and scaffolding around eBPF makes this possible (see figure below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vm7xZGag--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://static.wixstatic.com/media/ae6199_e4010dfea3894e21ace73cd84a3a8a8d%257Emv2.png/v1/fill/w_959%2Ch_495%2Cal_c%2Cq_90%2Cenc_auto/ae6199_e4010dfea3894e21ace73cd84a3a8a8d%257Emv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vm7xZGag--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://static.wixstatic.com/media/ae6199_e4010dfea3894e21ace73cd84a3a8a8d%257Emv2.png/v1/fill/w_959%2Ch_495%2Cal_c%2Cq_90%2Cenc_auto/ae6199_e4010dfea3894e21ace73cd84a3a8a8d%257Emv2.png" alt="eBPF ecosystem and stack" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://ebpf.io/what-is-ebpf/"&gt;https://ebpf.io/what-is-ebpf/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even better, eBPF programs could be loaded and unloaded from the kernel without any restarts.&lt;/p&gt;

&lt;p&gt;All this suddenly allowed for widespread adoption and application.&lt;/p&gt;



&lt;h2&gt;
  
  
  Widespread adoption in production systems
&lt;/h2&gt;



&lt;p&gt;eBPF's popularity has exploded in the last 7-8 years, with several large companies using it at scale in production systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;By 2016, Netflix was using eBPF widely for tracing. &lt;a href="https://www.brendangregg.com/"&gt;Brendan Gregg&lt;/a&gt;, who implemented it, became widely known in infrastructure &amp;amp; operations circles as an authority on eBPF.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;2017 - Facebook open-sourced &lt;a href="https://engineering.fb.com/2018/05/22/open-source/open-sourcing-katran-a-scalable-network-load-balancer/"&gt;Katran&lt;/a&gt;, their eBPF-based load balancer. Every single packet to &lt;a href="http://facebook.com/"&gt;Facebook.com&lt;/a&gt; since 2017 has passed through eBPF.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;2020 - Google made eBPF part of its Kubernetes offering. eBPF now powers the &lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/bringing-ebpf-and-cilium-to-google-kubernetes-engine"&gt;networking, security, and observability layer&lt;/a&gt; of GKE. By now there's also broad enterprise adoption in companies like &lt;a href="https://www.youtube.com/watch?v=hwOpCKBaJ-w"&gt;Capital One&lt;/a&gt; and &lt;a href="https://www.youtube.com/watch?v=7UQ2CU6UEGY"&gt;Adobe&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;2021 - Facebook, Google, Netflix, Microsoft &amp;amp; Isovalent came together to announce the &lt;a href="https://isovalent.com/blog/post/2021-08-ebpf-foundation-announcement/"&gt;eBPF foundation&lt;/a&gt; to manage the growth of eBPF technology.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now there are thousands of companies using eBPF and hundreds of eBPF projects coming up each year exploring different use cases.&lt;/p&gt;

&lt;p&gt;eBPF is now a separate subsystem within the Linux kernel with a wide community to support it. The technology itself has expanded considerably with several new additions.&lt;/p&gt;



&lt;h2&gt;
  
  
  So what can we do with eBPF?
&lt;/h2&gt;



&lt;p&gt;The most common use cases for eBPF are in 3 areas -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Networking&lt;/li&gt;
&lt;li&gt; Security&lt;/li&gt;
&lt;li&gt; Observability&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Security and networking have seen wider adoption and application, fuelled by projects like &lt;a href="https://github.com/cilium/cilium"&gt;Cilium&lt;/a&gt;. In comparison, eBPF-based observability offerings are earlier in their evolution and just getting started.&lt;/p&gt;

&lt;p&gt;Let's look at the use cases in security and networking first.&lt;/p&gt;



&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;



&lt;p&gt;Security is a highly popular use case for eBPF. Using eBPF, programs can observe everything happening at the kernel level, process events at a high speed to check for unexpected behavior, and raise alerts much more rapidly than otherwise.&lt;/p&gt;

&lt;p&gt;For example -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=l8jZ-8uLdVU"&gt;Google&lt;/a&gt; uses eBPF for intrusion detection at scale&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=6pVci31Mb6Q"&gt;Shopify&lt;/a&gt; uses eBPF to implement container security&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Several &lt;a href="https://www.traceable.ai/blog-post/ebpf-and-api-security-with-traceable"&gt;third-party security offerings&lt;/a&gt; now use eBPF for data gathering and monitoring.&lt;/p&gt;
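&lt;p&gt;To make the detection idea concrete, here is a minimal sketch of the kind of rule such a tool applies - written in user-space Python purely for illustration (a real eBPF detector evaluates rules like this against kernel events, in-kernel); the event fields and the allowlist are hypothetical:&lt;/p&gt;

```python
# Illustrative sketch, NOT eBPF itself: the kind of rule an eBPF-based
# security detector applies to a stream of kernel events, simulated here
# in user space. Event fields and the allowlist are hypothetical.

ALLOWED_BINARIES = {"/usr/bin/python3", "/usr/sbin/nginx"}

def check_event(event):
    """Flag exec events for binaries outside the expected set."""
    if event["type"] == "exec" and event["binary"] not in ALLOWED_BINARIES:
        return f"ALERT: unexpected exec of {event['binary']} (pid {event['pid']})"
    return None

# A simulated event stream, standing in for kernel-level exec events
events = [
    {"type": "exec", "binary": "/usr/sbin/nginx", "pid": 101},
    {"type": "exec", "binary": "/tmp/cryptominer", "pid": 102},
]

alerts = [a for a in (check_event(e) for e in events) if a]
print(alerts)
```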



&lt;h2&gt;
  
  
  Networking
&lt;/h2&gt;



&lt;p&gt;Networking is another widely applied use case. Operating at the kernel level gives eBPF comprehensive network observability - for example, visibility into the full network path, including every hop, along with source and destination IPs. With eBPF programs, one can process high-volume network events and manipulate network packets directly within the kernel with very low overhead.&lt;/p&gt;

&lt;p&gt;This enables various networking use cases such as load balancing, DDoS prevention, traffic shaping, and Quality of Service (QoS).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://legacy.netdevconf.info/2.1/session.html?bertin"&gt;Cloudflare&lt;/a&gt; uses eBPF to detect and prevent DDoS attacks, processing &lt;a href="https://blog.cloudflare.com/how-to-drop-10-million-packets/"&gt;10M packets per second&lt;/a&gt; without impacting network performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Meta's eBPF-based &lt;a href="https://engineering.fb.com/2018/05/22/open-source/open-sourcing-katran-a-scalable-network-load-balancer/"&gt;Katran&lt;/a&gt; does load-balancing for all of Facebook&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
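&lt;p&gt;The core per-packet decision such a load balancer makes - hash the connection's 5-tuple so every packet of a flow reaches the same backend - can be sketched as follows. This is a hedged illustration in Python, not Katran's actual logic (Katran uses consistent hashing; the plain hash-modulo here is only for illustration, and the backend IPs are made up):&lt;/p&gt;

```python
# Simplified sketch of the per-packet decision a kernel-level (XDP/eBPF)
# load balancer makes: hash the flow 5-tuple to pick a backend, so packets
# of one connection always reach the same server. Backend pool is invented.
import hashlib

BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical pool

def pick_backend(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    # Hash the flow identity (5-tuple) into a stable backend index
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{proto}".encode()
    digest = hashlib.sha256(key).digest()
    idx = int.from_bytes(digest[:4], "big") % len(BACKENDS)
    return BACKENDS[idx]

# The same flow deterministically maps to the same backend:
first = pick_backend("198.51.100.7", 51234, "203.0.113.10", 443)
second = pick_backend("198.51.100.7", 51234, "203.0.113.10", 443)
print(first == second)  # prints True
```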



&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;



&lt;p&gt;By now it should be clear how eBPF can be useful for observability.&lt;/p&gt;

&lt;p&gt;Everything passes through the kernel. And eBPF provides a highly performant and secure way to observe everything from the kernel.&lt;/p&gt;

&lt;p&gt;Let us dive deeper into observability and look at the implications of this technology.&lt;/p&gt;



&lt;h2&gt;
  
  
  How exactly does eBPF impact Observability?
&lt;/h2&gt;



&lt;p&gt;To explore this, let's step out of the eBPF universe and into the Observability universe and look at what makes up our standard observability solution.&lt;/p&gt;

&lt;p&gt;Any observability solution has 4 major components -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data collection&lt;/strong&gt; - Getting telemetry data from applications and infrastructure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data processing&lt;/strong&gt; - Filtering, indexing, and performing computations on the collected data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data storage&lt;/strong&gt; - Short-term and long-term storage of data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User experience layer&lt;/strong&gt; - Determining how data is consumed by the user&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Of these four, what eBPF impacts (as of today) is really just the data collection layer - the gathering of telemetry data directly from the kernel using eBPF.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7tceJoND--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/RSejSrJ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7tceJoND--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/RSejSrJ.png" alt="eBPF impact on observability" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;eBPF - Impact on observability&lt;/p&gt;

&lt;p&gt;So what we mean when we say "eBPF observability" today, is using eBPF as the instrumentation mechanism to gather telemetry data, instead of using other methods of instrumenting. Other components of an observability solution remain unaffected.&lt;/p&gt;
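&lt;p&gt;As a toy illustration of the four layers - with a plain list standing in for eBPF-collected telemetry, and all metric names invented for the example - the pipeline might look like:&lt;/p&gt;

```python
# Toy sketch of the four layers of an observability solution. Only the
# collection layer (the list below, standing in for eBPF-gathered events)
# is what eBPF changes; everything downstream is unaffected.

# 1. Data collection - stand-in for telemetry gathered via eBPF
collected = [
    {"metric": "http_latency_ms", "value": 120},
    {"metric": "http_latency_ms", "value": 480},
    {"metric": "cpu_pct", "value": 55},
]

# 2. Data processing - filter and aggregate the raw events
latencies = [e["value"] for e in collected if e["metric"] == "http_latency_ms"]
summary = {"count": len(latencies), "avg": sum(latencies) / len(latencies)}

# 3. Data storage - stand-in for a time-series store
store = {"http_latency_ms": summary}

# 4. User experience layer - render the stored data for the user
print(f"http_latency_ms: avg={store['http_latency_ms']['avg']:.0f}ms "
      f"over {store['http_latency_ms']['count']} samples")
```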



&lt;h2&gt;
  
  
  How eBPF Observability works
&lt;/h2&gt;



&lt;p&gt;To fully understand the underlying mechanisms behind eBPF observability, we need to understand the concept of hooks.&lt;/p&gt;

&lt;p&gt;As we saw earlier, eBPF programs are primarily event-driven - i.e., they are triggered any time a specific event occurs. For example, every time a function call is made, an eBPF program can be called to capture some data for observability purposes.&lt;/p&gt;

&lt;p&gt;First, these hooks can be in kernel space or user space, so eBPF can be used to monitor both user-space applications and kernel-level events.&lt;/p&gt;

&lt;p&gt;Second, these hooks can either be pre-determined/static, or inserted dynamically into a running system (without restarts!).&lt;/p&gt;

&lt;p&gt;Four distinct eBPF mechanisms allow for each of these combinations (see table below).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Predetermined/Manual&lt;/th&gt;
&lt;th&gt;Dynamic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kernel&lt;/td&gt;
&lt;td&gt;Kernel tracepoints&lt;/td&gt;
&lt;td&gt;kprobes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Userspace&lt;/td&gt;
&lt;td&gt;USDT&lt;/td&gt;
&lt;td&gt;uprobes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Static and dynamic eBPF hooks into user space and kernel space&lt;/p&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kernel tracepoints&lt;/strong&gt; - used to hook into events pre-defined by kernel developers (with TRACE_EVENT macros)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;USDT&lt;/strong&gt; - used to hook into predefined tracepoints set by developers in application code&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kprobes&lt;/strong&gt; (Kernel Probes) - used to dynamically hook into any part of the kernel code at runtime&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Uprobes&lt;/strong&gt; (User Probes) - used to dynamically hook into any part of a user-space application at runtime&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are several pre-defined hooks in kernel space that one can easily attach an eBPF program to (e.g., system calls, function entry/exit, network events, kernel tracepoints). Similarly, in user space, many language runtimes, database systems, and software stacks expose predefined tracepoints (USDT probes, used by tools like the Linux BCC collection) that eBPF programs can hook into.&lt;/p&gt;

&lt;p&gt;But what's more interesting is kprobes and uprobes. What if something is breaking in production, I don't have sufficient information, and I want to add instrumentation dynamically at runtime? That is where kprobes and uprobes allow for powerful observability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sFqos6TP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/k2fswDB.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sFqos6TP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/k2fswDB.png" alt="How eBPF kprobes and uprobes work " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;eBPF kprobes and uprobes&lt;/p&gt;

&lt;p&gt;For example, using uprobes, one can hook into a specific function within an application without modifying the application's code, &lt;em&gt;at runtime&lt;/em&gt;. Whenever the function is executed, an eBPF program can be triggered to capture required data. This allows for exciting possibilities like &lt;a href="https://www.cncf.io/blog/2021/11/17/debugging-with-ebpf-part-1-tracing-go-function-arguments-in-prod/"&gt;live&lt;/a&gt; debugging.&lt;/p&gt;
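&lt;p&gt;As a loose user-space analogy (not eBPF itself): Python's profiling hook can similarly observe calls to an existing function at runtime without touching its source. Real uprobes patch instructions from the kernel side, but the developer-visible effect - attach, observe, detach, no restarts - is comparable. The function and field names below are invented for the example:&lt;/p&gt;

```python
# Loose analogy only: like a uprobe, this observes calls to an existing
# function at runtime without editing its source. Real uprobes work at the
# instruction level from the kernel; sys.setprofile is just the closest
# pure-Python illustration of "attach a probe, capture data, detach".
import sys

def checkout(order_id):          # the "production" function we cannot edit
    return f"processed {order_id}"

calls = []

def probe(frame, event, arg):
    # Fire only on entry to the function we care about, and capture its argument
    if event == "call" and frame.f_code.co_name == "checkout":
        calls.append(frame.f_locals["order_id"])

sys.setprofile(probe)            # "attach" the probe - no restart needed
checkout(42)
checkout(43)
sys.setprofile(None)             # "detach"

print(calls)                     # prints [42, 43]
```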

&lt;p&gt;Now that we know how observability with eBPF works, let's look at use cases.&lt;/p&gt;



&lt;h2&gt;
  
  
  eBPF Observability use cases
&lt;/h2&gt;



&lt;p&gt;eBPF can be used for almost all common existing observability use cases, and in addition opens up new possibilities.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;System and Infrastructure Monitoring:&lt;/strong&gt; eBPF allows for deep monitoring of system-level events such as CPU usage, memory allocation, disk I/O, and network traffic. For example, &lt;a href="https://engineering.linkedin.com/blog/2022/skyfall--ebpf-agent-for-infrastructure-observability"&gt;LinkedIn uses eBPF for all their infra monitoring&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Container and Kubernetes Monitoring:&lt;/strong&gt; Visibility into Kubernetes-specific metrics, resource usage, and health of individual containers and pods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Application Performance Monitoring (APM):&lt;/strong&gt; Fine-grained observability into user-space applications and visibility into application throughput, error rates, latency, and traces.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Observability:&lt;/strong&gt; Visibility into custom metrics specific to applications or infra that may not be easily available without writing custom code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Advanced Observability:&lt;/strong&gt; eBPF can be used for advanced observability use cases such as &lt;a href="https://developers.redhat.com/articles/2023/02/13/how-debugging-go-programs-delve-and-ebpf-faster#the_flow_of_tracing_and_debugging_using_ebpf"&gt;live debugging&lt;/a&gt;, &lt;a href="https://www.youtube.com/watch?v=W7ja3jKd6EA"&gt;low-overhead application profiling&lt;/a&gt;, and &lt;a href="https://www.brendangregg.com/blog/2019-01-01/learn-ebpf-tracing.html"&gt;system call tracing&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are new applications of eBPF in Observability emerging every day.&lt;/p&gt;

&lt;p&gt;What does this mean for how observability is done today? Is eBPF likely to replace existing forms of instrumentation? Let's compare with existing options.&lt;/p&gt;



&lt;h2&gt;
  
  
  eBPF vs existing instrumentation methods
&lt;/h2&gt;



&lt;p&gt;Today, there are two main ways to instrument applications and infrastructure for Observability, apart from eBPF.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent-based instrumentation:&lt;/strong&gt; Independent software SDKs/ libraries integrated into application code or infrastructure nodes to collect telemetry data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sidecar proxy-based instrumentation&lt;/strong&gt;: Sidecars are lightweight, independent processes that run alongside an application or service. They are popular in microservices and container-based architectures such as Kubernetes.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a detailed comparison of how eBPF-based instrumentation compares against agents and sidecars, &lt;a href="https://www.zerok.ai/post/ebpf-vs-agents-vs-sidecars"&gt;see here&lt;/a&gt;. Below is a summary view -&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;eBPF&lt;/th&gt;
&lt;th&gt;Agents&lt;/th&gt;
&lt;th&gt;Sidecars&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Data Visibility/Granularity&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt;&lt;br&gt;(but some gaps)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Intrusiveness&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Low&lt;/strong&gt;&lt;br&gt;(out-of-band)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt;&lt;br&gt;(inline)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt;&lt;br&gt;(inline)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Performance overhead&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Safety and Security&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. Ease of implementation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. Ease of maintenance and updates&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7. Scalability&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;eBPF vs agents vs sidecars: Comparison&lt;/p&gt;

&lt;p&gt;As we can see, eBPF outperforms existing instrumentation methods across nearly all parameters. There are several benefits -&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Can cover everything in one go&lt;/strong&gt; (infrastructure, applications)&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Less intrusive&lt;/strong&gt; - eBPF does not sit inline with running workloads the way code agents do, which execute every time the workload runs. Data collection is out-of-band and sandboxed, so there is negligible impact on the running system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low performance overhead&lt;/strong&gt; - eBPF runs as native machine code and there is no context switching.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;More secure&lt;/strong&gt; - due to in-built security measures like verification.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Easy to install&lt;/strong&gt; - can be dropped in without any code change or restarts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Easy to maintain and update&lt;/strong&gt; - again no code change &amp;amp; restarts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;More scalable&lt;/strong&gt; - driven by easy implementation &amp;amp; maintenance, and low performance overhead&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In terms of cons, the primary gap with eBPF observability today is in distributed tracing (&lt;a href="https://github.com/deepflowio/deepflow"&gt;feasible&lt;/a&gt;, but the use case is still in early stages).&lt;/p&gt;

&lt;p&gt;On balance, given the significant advantages eBPF offers over existing instrumentation methods, we can reasonably expect eBPF to emerge as the default next-generation instrumentation platform.&lt;/p&gt;



&lt;h2&gt;
  
  
  Implications for observability
&lt;/h2&gt;



&lt;p&gt;What does this mean for the observability industry? What changes?&lt;/p&gt;

&lt;p&gt;Imagine an observability solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;that you can drop into the kernel in 5 minutes&lt;/li&gt;
&lt;li&gt;no code change or restarts&lt;/li&gt;
&lt;li&gt;covers everything in one go - infrastructure, applications, everything&lt;/li&gt;
&lt;li&gt;has near-zero overhead&lt;/li&gt;
&lt;li&gt;is highly secure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is what eBPF makes possible. And that is the reason why there is so much excitement around the technology.&lt;/p&gt;

&lt;p&gt;We can expect the next generation of observability solutions to all be instrumented with eBPF instead of code agents.&lt;/p&gt;

&lt;p&gt;Traditional players like Datadog and NewRelic are already investing in building eBPF-based instrumentation to augment their code-based agent portfolio. Meanwhile there are several next-generation vendors built on eBPF, solving both &lt;a href="https://www.parca.dev/"&gt;niche use-cases&lt;/a&gt; and for &lt;a href="https://www.zerok.ai/"&gt;complex observability&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While traditional players had to build individual code agents language by language and for each infrastructure component over several years, the new players can get to the same degree of coverage in a few months with eBPF. This allows them to also focus on innovating higher up the value chain like data processing, user experience, and even &lt;a href="https://www.zerok.ai/"&gt;AI&lt;/a&gt;. In addition, their data processing and user experience layers are also built ground-up to support the new use cases, volumes and frequency.&lt;/p&gt;

&lt;p&gt;All this should drive a large amount of innovation in this space and make observability more seamless, secure and easy to implement over the coming years.&lt;/p&gt;



&lt;h2&gt;
  
  
  Who should use eBPF observability?
&lt;/h2&gt;



&lt;p&gt;First, if you're in a modern cloud-native environment (Kubernetes, microservices), the differences between eBPF-based and agent-based approaches are most visible (performance overhead, security, ease of installation, etc.).&lt;/p&gt;

&lt;p&gt;Second, if you are operating at a large scale, then eBPF-based lightweight agents will drive dramatic improvements over status-quo. This is likely one of the reasons why eBPF adoption has been highest in technology companies with massive footprints like LinkedIn, Netflix, and Meta.&lt;/p&gt;

&lt;p&gt;Third, if you're short on engineering capacity and are looking for an observability solution that requires almost no effort to install and maintain, then go straight for an eBPF-based solution.&lt;/p&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;



&lt;p&gt;In summary, by offering a significantly better instrumentation mechanism, eBPF has the potential to fundamentally reshape our approach to observability in the years ahead.&lt;/p&gt;

&lt;p&gt;While this article primarily explored eBPF's application in data collection/instrumentation, future applications could see eBPF used in the data processing or even data storage layers. The possibilities are broad and as yet unexplored.&lt;/p&gt;



&lt;h2&gt;
  
  
  References
&lt;/h2&gt;



&lt;ol&gt;
&lt;li&gt; &lt;a href="https://www.oreilly.com/library/view/learning-ebpf/9781098135119/ch01.html"&gt;https://www.oreilly.com/library/view/learning-ebpf/9781098135119/ch01.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://ebpf.io/"&gt;https://ebpf.io/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://ebpf.io/summit-2022.html"&gt;https://ebpf.io/summit-2022.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://github.com/microsoft/ebpf-for-windows"&gt;https://github.com/microsoft/ebpf-for-windows&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://events.linuxfoundation.org/wp-content/uploads/2022/10/elena-zannoni-tracing-tutorial-LF-2021.pdf"&gt;https://events.linuxfoundation.org/wp-content/uploads/2022/10/elena-zannoni-tracing-tutorial-LF-2021.pdf&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>observability</category>
      <category>devops</category>
      <category>ebpf</category>
    </item>
  </channel>
</rss>
