<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shahar Azulay</title>
    <description>The latest articles on Forem by Shahar Azulay (@shaharazulay).</description>
    <link>https://forem.com/shaharazulay</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F890894%2Ffaa4ba22-0184-43b1-a648-2f872041e8f0.jpg</url>
      <title>Forem: Shahar Azulay</title>
      <link>https://forem.com/shaharazulay</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shaharazulay"/>
    <language>en</language>
    <item>
      <title>Navigate your way to production bliss with Caretta</title>
      <dc:creator>Shahar Azulay</dc:creator>
      <pubDate>Mon, 09 Jan 2023 10:38:10 +0000</pubDate>
      <link>https://forem.com/shaharazulay/navigate-your-way-to-production-bliss-with-caretta-1clg</link>
      <guid>https://forem.com/shaharazulay/navigate-your-way-to-production-bliss-with-caretta-1clg</guid>
      <description>&lt;p&gt;&lt;strong&gt;Get to know Caretta - a lightweight, standalone tool that instantly creates a visual network map of the services running in your cluster. Caretta leverages eBPF to efficiently map service network interactions in a K8s cluster, and Grafana to query &amp;amp; visualize the collected data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally posted on &lt;a href="https://www.groundcover.com/blog/caretta" rel="noopener noreferrer"&gt;groundcover's blog&lt;/a&gt; by Udi Rot, Founding Engineer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Scientists have long known that sea turtles, like many animals, navigate at sea by sensing the invisible lines of the magnetic field, much as sailors use latitude and longitude.&lt;br&gt;
But the common sea turtle does much more than that. Turtles also rely on the Earth's magnetic field to find their way home, using the unique magnetic signature of their birth coastline as an internal compass.&lt;/p&gt;

&lt;p&gt;Turtles have effectively tamed the wilderness of the open ocean. What some animals see as an infinite unknown is mapped to the finest detail inside the head of the sea turtle.&lt;/p&gt;

&lt;p&gt;From the sea to the cloud, it’s all too easy to get lost in a typical Kubernetes cluster. Gaining a decent understanding of the inter-dependencies between the different workloads running in the cluster is a complicated task, forcing teams to work hard for impactful, actionable insights such as identifying central points of failure or pinpointing security anomalies.&lt;/p&gt;

&lt;p&gt;One approach to tackle this issue is visualization: In many ways, a K8s cluster can be seen as a geographic area, with paths and trails forged by communications between different workloads. And just as a map helps familiarize you with your neighborhood and how to navigate around it, it can help you “get around” your K8s cluster.&lt;/p&gt;

&lt;p&gt;This is part of the mission of cloud-native observability tools - no APM product is complete without network tracing capabilities of some kind, and data from these traces can help one answer those aforementioned questions. But continuing the approach from our previous post on &lt;a href="https://www.groundcover.com/blog/murre" rel="noopener noreferrer"&gt;Murre&lt;/a&gt;, what if I only want a minimalistic, efficient solution?&lt;/p&gt;

&lt;p&gt;So, what is the easiest way we can map our cluster? &lt;/p&gt;

&lt;p&gt;Introducing Caretta - a standalone OSS tool that does just that. Let’s dive into how it works, how it leverages eBPF technology to be lightweight and frictionless, and the obstacles and challenges we encountered on our journey to building it. The end result can be consumed directly as raw Prometheus metrics, or integrated into your Grafana with some of our pre-made panels:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zib583oy8gsuvg3402k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zib583oy8gsuvg3402k.png" alt="Image description" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring the land
&lt;/h2&gt;

&lt;p&gt;The first thing we’ll need to figure out is how to get the network data. The naive approach would be to use a sniffing tool like &lt;em&gt;tcpdump&lt;/em&gt; to gain network observability. But that can be overkill - we don’t actually need to capture the network traffic; we just want to know it exists. As suggested above, we can use eBPF to probe just the data we need.&lt;/p&gt;

&lt;h2&gt;
  
  
  eBPF who?
&lt;/h2&gt;

&lt;p&gt;If you haven’t already heard of eBPF, I can tell you that people like Brendan Gregg call it an "invaluable technology" and compare it to JavaScript:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;So instead of a static HTML website, JavaScript lets you define mini programs that run on events like mouse clicks, which are run in a safe virtual machine in the browser. And with eBPF, instead of a fixed kernel, you can now write mini programs that run on events like disk I/O, which are run in a safe virtual machine in the kernel.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.groundcover.com/blog/what-is-ebpf" rel="noopener noreferrer"&gt;eBPF &lt;/a&gt;was introduced in early 2014, expanding on BPF's original architecture by providing tools that allow complex programs to run directly in the Linux kernel space.&lt;/p&gt;

&lt;p&gt;OK, now you’re probably asking yourself what running in the kernel space even means. Basically, it’s code running with higher privileges, as opposed to running in "user space" like standard applications. It allows the code to run very efficiently and to access low-level kernel resources that would otherwise be complicated and costly (in terms of resource overhead) to reach from user space. But most importantly: it lets you observe any and all programs running in user space – which is hard to do when relying on observability tools that operate in user space themselves.&lt;/p&gt;

&lt;p&gt;This is a big thing. It’s basically a new way of equipping the Linux kernel with a programmable, highly efficient virtual machine, giving programmers access to what was previously the sole realm of kernel developers.&lt;/p&gt;

&lt;p&gt;Observability is where eBPF shines. It allows teams to monitor their running applications with a completely out-of-band approach that requires zero code changes or R&amp;amp;D efforts. eBPF enables powerful advantages for observability applications by providing a faster, less resource-intensive, and holistic approach to gathering high-precision data.&lt;/p&gt;

&lt;p&gt;eBPF offers many ways to capture data from a running system. The kernel allows developers who use eBPF to attach their programs to various types of &lt;strong&gt;probes&lt;/strong&gt; - places in the code of the kernel or of applications that, when reached, will run the programs attached to them before or after executing their original code. This diagram shows some of the probes available for eBPF:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faf8rssr6a6alvkhdv9tm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faf8rssr6a6alvkhdv9tm.png" alt="Image description" width="304" height="428"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image from ebpf.io/what-is-ebpf&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting with the basics
&lt;/h2&gt;

&lt;p&gt;Working on this project, I was inspired by &lt;a href="https://www.brendangregg.com/blog/2016-11-30/linux-bcc-tcplife.html" rel="noopener noreferrer"&gt;tcplife&lt;/a&gt; - a nifty tool that calculates statistics and information about TCP session lifespans using only a single eBPF probe. The thing is, the kernel already does half of the job for us, as it maintains information, such as throughput, for each socket. By probing the &lt;em&gt;tcp_set_state&lt;/em&gt; kernel function, &lt;em&gt;tcplife&lt;/em&gt; is aware of active network connections and has access to the data the kernel maintains for them. Besides having a minimal footprint, another advantage of probing specific kernel functions is covering all possible data flows - compared to probing the many possible syscalls an application might use, such as read() or recv(). When application-level context is unnecessary, “sitting” close to the root of the tree lets you cover its branches more easily.&lt;/p&gt;

&lt;p&gt;Sounds great - but we’d still be missing something. &lt;em&gt;tcp_set_state&lt;/em&gt; is called when, obviously, the state of the TCP connection is being set - for example, when a server starts listening, when a server-client connection is established, or when a connection is closed. But TCP connections can go a long time without changing their state, and relying on &lt;em&gt;tcp_set_state&lt;/em&gt; alone would keep us blind to them.&lt;/p&gt;

&lt;p&gt;We set out to look for an additional probe to help us complete the picture. Here, too, eBPF proved useful - this time for exploring the Linux TCP stack. Tools like &lt;a href="https://github.com/iovisor/bcc/blob/master/examples/tracing/stacksnoop.py" rel="noopener noreferrer"&gt;stacksnoop&lt;/a&gt; or &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/stackcount.py" rel="noopener noreferrer"&gt;stackcount&lt;/a&gt; can be used to understand the flow a network packet goes through when it’s processed, and to compare different functions to see how “noisy” each one is. Searching for data probing locations is a constant trade-off between being too noisy and being blind, and we’re looking for the sweet spot in the middle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20o1s3bp2f6keqa58a8c.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20o1s3bp2f6keqa58a8c.jpg" alt="Image description" width="599" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our case, we found the &lt;em&gt;tcp_data_queue&lt;/em&gt; function suitable for our needs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38c8pum343p6ogn4jp6y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38c8pum343p6ogn4jp6y.jpg" alt="Image description" width="800" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This function is called as one of the final steps of &lt;em&gt;tcp_rcv_established&lt;/em&gt;, the function used when receiving packets in the ESTABLISHED state:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznlamhabrzjqzk2hq8c3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznlamhabrzjqzk2hq8c3.jpg" alt="Image description" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So it covers the blind spot of &lt;em&gt;tcp_set_state&lt;/em&gt;, and is called only if all the previous steps succeeded.&lt;/p&gt;

&lt;p&gt;Another advantage of using this function is that, like &lt;em&gt;tcp_set_state&lt;/em&gt; and most of the kernel’s TCP functions, its first argument is a struct sock object. This is a big, important kernel struct, and we’ll soon dive into the advantages and disadvantages of relying on it to retrieve data; for now, we’ll just say it’s convenient that both our probes share a consistent view of the data.&lt;br&gt;
To recap our collection mechanism - we’ll probe &lt;em&gt;tcp_data_queue&lt;/em&gt; to observe network sockets updating their stats in the established state, and &lt;em&gt;tcp_set_state&lt;/em&gt; to track their lifecycle.&lt;/p&gt;
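&lt;p&gt;As a rough sketch of that mechanism - written here in Python purely as a model of the logic, since the real probes are eBPF programs attached in the kernel, and this is not Caretta’s actual code - the two probes feed one shared table of active sockets:&lt;/p&gt;

```python
# Illustrative model of the two-probe collection scheme (not Caretta's code):
# tcp_set_state events track the socket lifecycle, while tcp_data_queue
# events count bytes on established sockets - including sockets that were
# opened before we started observing.

sockets = {}  # (src_ip, src_port, dst_ip, dst_port) -> bytes observed

def on_tcp_set_state(conn, new_state):
    """Probe 1: called whenever a TCP connection changes state."""
    if new_state in ("TCP_SYN_SENT", "TCP_SYN_RECV", "TCP_ESTABLISHED"):
        sockets.setdefault(conn, 0)   # socket opened
    elif new_state == "TCP_CLOSE":
        sockets.pop(conn, None)       # socket closed

def on_tcp_data_queue(conn, num_bytes):
    """Probe 2: called when data is queued on an established socket."""
    sockets[conn] = sockets.get(conn, 0) + num_bytes

conn = ("10.0.0.1", 51234, "10.0.0.2", 80)
on_tcp_set_state(conn, "TCP_ESTABLISHED")
on_tcp_data_queue(conn, 1500)
on_tcp_data_queue(conn, 500)
print(sockets[conn])  # 2000
```

&lt;p&gt;Note how the second handler also accepts connections it has never seen a state change for - exactly the blind spot of &lt;em&gt;tcp_set_state&lt;/em&gt; discussed above.&lt;/p&gt;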

&lt;h2&gt;
  
  
  Understanding the data
&lt;/h2&gt;

&lt;p&gt;As mentioned, our “supply” of data comes from internal kernel network stack objects of a struct called sock. Socks actually wrap other data structures for the TCP and INET layers, so having them means having the data for lower levels of the connection as well. Information like the connection’s 4-tuple (source IP, source port, destination IP and destination port) can easily be retrieved from the sock, especially using the CO-RE approach.&lt;/p&gt;

&lt;p&gt;Some issues arise from using internal kernel structures. For example, instead of allocating a new sock each time, the kernel may reuse closed structs. Sock reuse can apparently occur quite frequently, and if we don’t keep track of socks ending and starting, it can lead to collisions and misleading data.&lt;/p&gt;

&lt;p&gt;Another difficult issue in extracting data out of the sock was understanding the role of our host. The sock contains a source IP and a destination IP, but no information about which side is the client and which is the server. For new socks, we can see (in &lt;em&gt;tcp_set_state&lt;/em&gt;) whether the sock goes through TCP_SYN_RECV or TCP_SYN_SENT to understand which side of the equation it’s on; but if the connection was already established, we can’t see the previous states it has gone through.&lt;/p&gt;

&lt;p&gt;The solution we found was to examine a field in the sock struct called &lt;em&gt;sk_max_ack_backlog&lt;/em&gt;. It holds the maximal length of the accept queue, and is assigned as part of the connection’s listen() call. This means that if it holds a positive number, the sock went through a listening state and is therefore a server - and vice versa. That information gives us the direction of the link and the relevant port to store.&lt;/p&gt;
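&lt;p&gt;In Python terms (a toy illustration of the heuristic only - the field name comes from struct sock, but the surrounding code is not the actual eBPF implementation), the check boils down to:&lt;/p&gt;

```python
# Toy version of the role heuristic described above: listen() assigns the
# accept-queue limit, so a positive sk_max_ack_backlog means this sock went
# through a listening state and is therefore the server side of the link.

def infer_role(sk_max_ack_backlog: int) -> str:
    return "server" if sk_max_ack_backlog > 0 else "client"

print(infer_role(128))  # server
print(infer_role(0))    # client
```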

&lt;h2&gt;
  
  
  Aggregation
&lt;/h2&gt;

&lt;p&gt;So we’ve figured out how to deal with those sock structures. But we’re not really interested in individual sockets - we’re interested in the traffic at a higher level. If a client queries a server hundreds of times in a row, we’d like to treat that as a single edge in the graph. So the next thing we’re going to do is aggregate what we see in the kernel probes and start counting &lt;strong&gt;Links&lt;/strong&gt; (tuples of client, server and server-port) and their &lt;strong&gt;Throughputs&lt;/strong&gt;. The eBPF program will maintain a map of the sockets it observes, and a userland program with access to this map will poll it, generalize each connection to a link, and aggregate the throughputs. We’ll run this program on each node in our cluster, and every instance will publish its results as &lt;strong&gt;metrics&lt;/strong&gt; to be scraped by a centralized pod.&lt;/p&gt;

&lt;p&gt;Let’s talk about the structure of our data. We basically want a counter for each link, populated by aggregating smaller counters - the ones maintained by the kernel. Sockets come and go, so we need to keep track of the total throughput of past, closed sockets as well. Eventually, our metrics will expose the total throughput of each link observed since the program was launched.&lt;/p&gt;
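&lt;p&gt;A minimal sketch of that accounting, in Python rather than Caretta’s actual Go code: each link’s exposed counter is the sum of a “closed sockets” total plus the live counters of the sockets currently open.&lt;/p&gt;

```python
from collections import defaultdict

# link = (client, server, server_port); a link's throughput is the bytes from
# sockets that already closed plus the bytes on currently open sockets.
closed_totals = defaultdict(int)   # link -> bytes from closed sockets
open_sockets = {}                  # socket id -> (link, bytes so far)

def observe(sock_id, link, total_bytes):
    """Latest per-socket counter, as polled from the eBPF map."""
    open_sockets[sock_id] = (link, total_bytes)

def close_socket(sock_id):
    """Fold a closing socket's bytes into its link's persistent total."""
    link, nbytes = open_sockets.pop(sock_id)
    closed_totals[link] += nbytes

def link_throughput(link):
    live = sum(b for l, b in open_sockets.values() if l == link)
    return closed_totals[link] + live

link = ("checkoutservice", "productcatalogservice", 3550)
observe(1, link, 1200)
observe(2, link, 800)
close_socket(1)
print(link_throughput(link))  # 2000
```

&lt;p&gt;Folding closed sockets into a persistent total is also what keeps the reused-sock collisions mentioned earlier from corrupting the counters.&lt;/p&gt;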

&lt;p&gt;That information, scraped and consolidated by a Prometheus agent, can easily be analyzed with standard queries - sorting, calculating rates, filtering by namespace or time range, and of course, visualizing as a network map.&lt;/p&gt;
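&lt;p&gt;For example, a query in that spirit might look like the following PromQL - note that the metric and label names here are illustrative assumptions, not necessarily the exact names Caretta exposes:&lt;/p&gt;

```promql
# Per-link bytes-per-second over the last 5 minutes, one namespace only,
# sorted descending (metric and label names are illustrative).
sort_desc(
  sum by (client_name, server_name) (
    rate(caretta_links_observed{server_namespace="default"}[5m])
  )
)
```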

&lt;p&gt;Below is an example of a time series published by a Caretta agent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnercr6i0fi20xj4u5oou.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnercr6i0fi20xj4u5oou.jpg" alt="Image description" width="800" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the identifiers are resolved from IP addresses to the controllers owning the communicating workloads. For example, in this link we can see &lt;em&gt;checkoutservice&lt;/em&gt; sending 2537 bytes to a service named &lt;em&gt;productcatalogservice&lt;/em&gt;. Note that some of the labels are generated solely to comply with the format Grafana expects for displaying a Node Graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speaking in Kubernetes terms
&lt;/h2&gt;

&lt;p&gt;One thing is left before publishing the metrics: resolving the source and destination of each link into the language we want to speak - our cluster’s language, with its K8s-defined names. That might sound trivial - the information should be easily retrievable using the K8s API. But again, some obstacles arise.&lt;/p&gt;

&lt;p&gt;The first, and simplest, was API request overhead. Querying the API for each object we encounter, especially at the beginning of the program’s run, produces a large number of REST calls that simply take too much time. The solution was to issue a few large queries at the beginning of each polling iteration and store a snapshot of the cluster’s state. With the snapshot, we can build a mapping of the IPs used by objects in the cluster, and for each object walk up its ancestors in the hierarchy to consolidate the graph’s nodes into meaningful items like DaemonSets and Deployments.&lt;/p&gt;
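&lt;p&gt;The snapshot idea can be sketched like this (illustrative Python with made-up object shapes, not Caretta’s resolver - a real implementation would build the snapshot from a few large list calls to the Kubernetes API at the start of each polling iteration):&lt;/p&gt;

```python
# Illustrative sketch: resolve IPs to workload names from a one-shot snapshot
# instead of querying the Kubernetes API per address. Object shapes are
# made up for the example.

def build_ip_map(snapshot):
    """snapshot: list of (ip, pod_name, owner) tuples collected from a few
    large list queries at the start of a polling iteration."""
    return {ip: owner for ip, _pod, owner in snapshot}

snapshot = [
    ("10.1.0.4", "checkoutservice-7d9f", "checkoutservice"),  # Deployment
    ("10.1.0.9", "caretta-agent-x2kq", "caretta"),            # DaemonSet
]
ip_map = build_ip_map(snapshot)
print(ip_map.get("10.1.0.4", "external"))  # checkoutservice
print(ip_map.get("8.8.8.8", "external"))   # external
```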

&lt;p&gt;On the other hand, some cases are very hard to resolve. For example, Caretta doesn’t support resolution of Pods using “host” networking, as they use the node’s IP address directly. Keeping track of bound ports and their processes could solve this in the future, but it would make the program more complicated. The same problem occurs with traffic originating from NAT’d services and arriving at other pods in the cluster, which makes it impossible to associate the traffic with a workload given the source IP alone.&lt;/p&gt;

&lt;p&gt;We can also add reverse-DNS lookup to try and resolve external IP addresses. &lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;With resolved and aggregated metrics, all that’s left is running an agent pod as a DaemonSet in our cluster, along with a single metrics scraper. To visualize the map, we’ll use Grafana’s Node Graph feature - a fairly new panel, still in beta. Of course, there are countless libraries for rendering a map out of a description of relations, but we wanted to use a tool that is very likely already installed in your cluster.&lt;/p&gt;

&lt;p&gt;Caretta is equipped with a Grafana dashboard consisting of a Node Graph panel and some PromQL witchcraft to adjust the data to Node Graph’s specifications. After setting Caretta as a Prometheus data source in Grafana, the dashboard will display the map of your cluster.&lt;/p&gt;

&lt;p&gt;But actually, instead of trying to explain the architecture of Caretta with words, let’s use… Caretta itself! Here’s what an instance of Caretta shows when filtered on its own namespace:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4erozea73uc9vmujov8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4erozea73uc9vmujov8x.png" alt="Image description" width="800" height="696"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, we have a Grafana instance querying a VictoriaMetrics agent (caretta-vm) and displaying this map on its web UI; the Victoria agent scrapes metrics from the Caretta DaemonSet; and both the Victoria agent and Caretta make use of the Kubernetes API exposed by the kubernetes service. But I’m sure you’ve already figured that out :) Besides that, we have the Kubernetes node itself hitting Grafana’s and Victoria’s health probes, and Grafana pulling external resources from the web.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gaining landscape insights
&lt;/h2&gt;

&lt;p&gt;Along the way, analyzing (and debugging :)) the data collected by Caretta’s agents revealed some surprising behaviors in our dev cluster. Although the main purpose of creating this tool was understanding the communication between different services, the “background” traffic caught our attention and taught us a few things. Here are some examples of that usually-invisible traffic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Non-container networking: a lot of the traffic observed by Caretta wasn’t coming from a pod at all, but from “normal” processes running on the nodes themselves - in our case, the &lt;em&gt;kubelet&lt;/em&gt; process itself, alongside AWS-specific agents. This communication shouldn’t come as a surprise, as it is essential to controlling and running the cluster, but seeing its intensity clears some of the mist surrounding it.&lt;/li&gt;
&lt;li&gt; Control-plane networking: just as the &lt;em&gt;kubelet&lt;/em&gt; is a “transparent” component of the node, the control plane is a cluster-wide “transparent” component. We all know the control plane and its role, but have you ever wondered where it actually resides? The answer is vendor-specific, but on AWS and GCP the control plane is hosted on servers outside of your cluster, managed by the vendor itself and bridged to the cluster with dedicated interfaces.&lt;/li&gt;
&lt;li&gt; An endpoint with no targets: continuing with the control plane, the API server is exposed via the &lt;em&gt;kubernetes&lt;/em&gt; service in the default namespace. But as mentioned, the API server doesn’t actually reside in any workload inside the cluster; this endpoint merely forwards the traffic out to the control plane.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final Words
&lt;/h2&gt;

&lt;p&gt;Mapping the inter-dependencies between the different services running in a cluster is a complicated but critical task. Unlocking this rare type of information can help developers, DevOps, and SRE teams get a much clearer view into their production K8s clusters.&lt;/p&gt;

&lt;p&gt;eBPF is shaking the foundations of how we understand and experience observability. What was once almost impossible to get can now be obtained in just a few minutes on huge complex clusters running hundreds of different microservices, and Caretta implements exactly this modern lightweight approach.&lt;/p&gt;

&lt;p&gt;The next tools to rule the K8s ecosystem will be efficient, ad hoc and fast. Teams simply have too much on their plate to work hard for critical data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caretta is open-source, and we’d love to get your feedback and contributions. Head over to our &lt;a href="https://github.com/groundcover-com/caretta" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, create an issue with a bug or a feature request, or drop your thoughts on our community &lt;a href="https://groundcover-community.slack.com/join/shared_invite/zt-18mau1odv-PhR65rc1WHJ3Is8GVCA8PQ#/shared-invite/email" rel="noopener noreferrer"&gt;slack&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Cloud-Native Observability: Breaking the Link Between Cost and Depth in Monitoring</title>
      <dc:creator>Shahar Azulay</dc:creator>
      <pubDate>Tue, 15 Nov 2022 10:19:32 +0000</pubDate>
      <link>https://forem.com/shaharazulay/cloud-native-observability-breaking-the-link-between-cost-and-depth-in-monitoring-2fm9</link>
      <guid>https://forem.com/shaharazulay/cloud-native-observability-breaking-the-link-between-cost-and-depth-in-monitoring-2fm9</guid>
      <description>&lt;p&gt;Traditional observability architectures and tools, including those of distributed cloud-native systems, made the balance point between performance, affordability, and data depth, one of the first compromises to be made by engineering teams. And inevitably, left dev teams feeling like they’re getting the short end of the stick. But the application monitoring game has come a long way, opening up strong alternatives to these outdated solutions.&lt;br&gt;
By rethinking established monitoring fundamentals, it’s possible to build a monitoring workflow that provides deep insights both quickly and affordably, giving everybody what they need.&lt;/p&gt;

&lt;h2&gt;
  
  
  The good old days
&lt;/h2&gt;

&lt;p&gt;We’ve all heard the adage that you can only have two of these three options – good, fast, and cheap - but back in the day, that decision point just didn’t exist. When services ran on systems the size of family cars in single locations, there was no complexity to work with and no vast quantities of data to collect, ship, and store. Just grab an integration plugin and get siphoning those logs and metrics, straight out of the app. No problem to design, dead easy to do, super-quick, and the cost was almost non-existent.&lt;br&gt;
Sounds like a good thing, right? Well, there had to be something good about the old monolithic way of doing things. All those massive single points of failure couldn’t have been all bad.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monoliths to microservices
&lt;/h2&gt;

&lt;p&gt;Most technology teams worldwide have spent a good proportion of the last ten years migrating their legacy systems and services to the cloud. Once you start taking advantage of the cloud model to improve availability and performance while reducing TCO and enabling endless scale, it very quickly becomes apparent that there is a downside here too.&lt;br&gt;
What was a single integration with one giant system has just become dozens of individual microservice integrations in several different locations, each with its own set of requirements. Pulling logs direct from a single system has now become data collection from several endpoints, likely all very different.&lt;br&gt;
Just when you were wondering whether this cloud computing revolution was all it’s cracked up to be, along came egress charges - moving data, particularly between cloud providers, is not cheap - and then you’ve got the storage fees to deal with. Sure, there are cheaper storage tiers, but when you’re dealing with data at scale, even a lower unit cost soon mounts up, and fast.&lt;br&gt;
Wouldn’t it be great if you could speed all this up while reducing costs? It seems sensible to collect and transport only the data you need rather than pulling in every log you can find. But to do that you’d need to know what you want to analyze ahead of time, and you’d risk incomplete datasets and repetitive jobs that make the existing situation worse. And just like that, we’re back to the speed, cost, and quality conundrum we started with. But fear not!&lt;/p&gt;

&lt;h2&gt;
  
  
  Have the cake and eat it
&lt;/h2&gt;

&lt;p&gt;By rethinking what you need from your monitoring solution, and focusing on the outcomes rather than the inputs, it is completely achievable to break out of the cost/depth loop. It’s not complicated – instead of collecting every single dataset you can get hold of for analysis, or taking random cuts of data as it streams by, apply intelligence.&lt;br&gt;
Identify the right datasets at source and send only the data you need to your observability platform, translating that data into actionable metrics that trigger automated analysis processes on arrival. An instrumentation approach to monitoring means wrapping input data in code that implements the monitoring tool’s logic. If that logic is too complex, the overhead can kill the process – the more analysis that happens to the data before it’s sent out, the greater the impact. And this is why traditional monitoring tools keep it simple, either by pulling all the data or by taking random cuts.&lt;br&gt;
This can be overcome with eBPF, resulting in lower volumes of data that need to be transferred, analyzed, and stored. With the focus only on data that is relevant, the data you select for analysis will be the data you need for reliable insights.&lt;/p&gt;

&lt;h2&gt;
  
  
  The power of cloud-native observability
&lt;/h2&gt;

&lt;p&gt;Cloud-native data management is a complicated game, and the complexity and volume of logs, metrics, and trace data, will likely only increase. Maintaining visibility without paying ever-increasing storage and egress charges means a change in approach, and that’s what inspired &lt;a href="https://www.groundcover.com/"&gt;groundcover&lt;/a&gt;. &lt;br&gt;
There was a time when you couldn’t have it all, but now you can – an observability architecture that allows you to scale sustainably by targeting relevant data at source. It frees the power of the cloud and eliminates the cost, speed, and quality decision point by reducing overhead on source services, improving efficiency, eliminating irrelevant data, cutting network usage, and lowering costs.&lt;/p&gt;

</description>
      <category>ebpf</category>
      <category>kubernetes</category>
      <category>cloudnative</category>
      <category>observability</category>
    </item>
    <item>
      <title>From Valencia to Detroit: Observability is Stealing All the Thunder at KubeCon 2022</title>
      <dc:creator>Shahar Azulay</dc:creator>
      <pubDate>Wed, 13 Jul 2022 10:37:02 +0000</pubDate>
      <link>https://forem.com/shaharazulay/from-valencia-to-detroit-observability-is-stealing-all-the-thunder-at-kubecon-2022-316p</link>
      <guid>https://forem.com/shaharazulay/from-valencia-to-detroit-observability-is-stealing-all-the-thunder-at-kubecon-2022-316p</guid>
      <description>&lt;p&gt;In the world of Kubernetes, observability is the talk of the town. Not only have sessions about novel takes on observability become recurring features at KubeCon events, but the blogo- and Twitterspheres are also rife with thoughts and posts on improving the way teams observe Kubernetes environments – and taking better advantage of tools like eBPF, which weren't around when Kubernetes debuted eight years ago.&lt;/p&gt;

&lt;p&gt;This trend reflects what we are inclined to think of as an inflection point for K8s observability. Although the idea of observing K8s is obviously not new, practitioners and thought leaders are increasingly rethinking the tools, goals, and practices surrounding observability for Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes observability: The traditional approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When Kubernetes came along in 2014, most admins did the obvious thing by applying the same monitoring tools and practices to K8s clusters that had worked in other environments. Their approach boiled down to these basic steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collect as many metrics and logs as possible, typically with the help of monitoring agents that run inside the cluster.&lt;/li&gt;
&lt;li&gt;Dump those metrics and logs into some kind of time-series monitoring platform, like Prometheus.&lt;/li&gt;
&lt;li&gt;Spend a nice chunk of time studying dashboards and manually comparing data sets to identify and troubleshoot Kubernetes performance issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach to Kubernetes observability may have worked well enough in some cases, but it was never ideal. It required high effort and resulted in relatively low visibility. It also cost a lot of money to collect, analyze, and store all of that monitoring data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How the K8s community is rethinking observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Today, however, evidence abounds that Kubernetes admin teams are not satisfied with the traditional approach to K8s observability.&lt;/p&gt;

&lt;p&gt;For example, the Valencia KubeCon in May included a &lt;a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/program/schedule/"&gt;dozen sessions&lt;/a&gt; dedicated to observability – not to mention a co-located "&lt;a href="https://kccnceu2022.sched.com/event/z9cp/live-stream-cloud-native-ebpf-day-europe-hosted-by-cncf-additional-registration-499-fee-required-for-in-person-attendees?iframe=no&amp;amp;w=100%&amp;amp;sidebar=yes&amp;amp;bg=no"&gt;eBPF Day&lt;/a&gt;." The blogosphere is &lt;a href="https://devops.com/open-source-observability-comes-of-age-in-2022/"&gt;abuzz&lt;/a&gt; with &lt;a href="https://securityboulevard.com/2022/06/using-ebpf-to-enhance-monitoring-and-observability/"&gt;articles &lt;/a&gt;on the potential of eBPF. And it's increasingly hard to find discussions of &lt;a href="https://mobile.twitter.com/ogurechkin/status/1531073371382992896"&gt;Kubernetes observability&lt;/a&gt; on &lt;a href="https://twitter.com/kcdchennai/status/1532636124442939392"&gt;Twitter &lt;/a&gt;that don't mention eBPF. It's a pretty safe bet, too, that observability talks will be front and center at the next KubeCon in Detroit this fall (whose schedule has yet to be released).&lt;/p&gt;

&lt;p&gt;If you look at these wide-ranging discussions on how to improve Kubernetes observability, you'll notice a few key themes and goals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reducing cost&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, we are seeing an increased focus on Kubernetes observability strategies that reduce overall cost.&lt;/p&gt;

&lt;p&gt;One way to do this is to deploy lighter-weight observability stacks, which reduce the overhead that observability tools place on your cluster. For example, as Alex Jones &lt;a href="https://twitter.com/AlexJonesax/status/1543678834100297730"&gt;writes&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8FqFj4Sv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5r4swyac1ocqneudwe5k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8FqFj4Sv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5r4swyac1ocqneudwe5k.jpg" alt="Alex Jones Tweet" width="719" height="683"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Jones's thoughts are reinforced by Logz.io's latest &lt;a href="https://logz.io/blog/devops-pulse-2022/"&gt;DevOps Pulse survey&lt;/a&gt;, which found that 27 percent of respondents identified "total cost of ownership and the large volumes of data being ingested into the tools among their main challenges in maintaining effective observability into the cloud."&lt;/p&gt;

&lt;p&gt;Perhaps observability spending wasn't as much of a concern in the heady days of 2021, when the tech industry remained flush with cash. But in today's increasingly volatile economic environment, controlling cost is more and more of a priority – and K8s management teams are realizing that they can do their part by deploying lighter-weight observability stacks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The turn to eBPF&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A second K8s observability trend – and one that ties into the cost-reduction concerns we just discussed – is a turn toward &lt;a href="https://www.groundcover.com/blog/what-is-ebpf-anyway-and-why-should-kubernetes-admins-care"&gt;eBPF as the foundation for Kubernetes observability&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's not at all an exaggeration to say that eBPF, which lets lightweight programs run in the Linux kernel, provides revolutionary new opportunities when it comes to observability. eBPF makes it possible to collect deep, focused observability data from Kubernetes nodes (as long as they are Linux-based, at least) without burdening clusters with a full-scale observability stack, and without taking on extra work like deploying sidecar containers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More traces&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the many tasks that eBPF makes faster and cheaper to perform is tracing – which is, indeed, an increasingly important component of Kubernetes observability. The &lt;a href="https://logz.io/blog/devops-pulse-2022/"&gt;Logz.io survey&lt;/a&gt; found that 27 percent of practitioners are now including tracing in their observability strategies.&lt;/p&gt;

&lt;p&gt;The concept of tracing is not exactly new, of course. For several years, we've been told that traces are one of the three "pillars" of observability (the others are metrics and logs). But until tools like eBPF came along, actually implementing distributed tracing was a lot of work – and it imposed a high degree of overhead.&lt;/p&gt;

&lt;p&gt;But as teams look to gain more visibility than they can achieve via metrics and logs alone, and as eBPF makes it increasingly easy to look beyond metrics and logs, traces are evolving from a nice-to-have asset to a core component of Kubernetes observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change is going to come&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Thanks to the ongoing conversations about rethinking Kubernetes observability – and taking greater advantage of newer tools, like eBPF – admins can look forward to a future where &lt;a href="https://www.groundcover.com/blog/no-strings-attached-observability-with-limitless-data-depth"&gt;observability is easier, less costly and more meaningful than ever before&lt;/a&gt;. The new approaches that are being highlighted in contexts like KubeCon – and which, again, we expect to see even more of at upcoming K8s events – are underlining how it has become easier than ever to trace what is happening to Kubernetes clusters and apps from deep within the kernel that powers each node. By extension, deploying lightweight, low-cost Kubernetes observability tools has also become much simpler.&lt;/p&gt;

&lt;p&gt;For now, a lot of these discussions remain in the talking stages more than the action stages – teams are pioneering or validating new approaches, but they may not have implemented them fully. But the fact that there is so much buzz around K8s observability suggests that we're at an inflection point, and that within maybe a year's time, more and more organizations will be putting these new tools and approaches into practice.&lt;/p&gt;

&lt;p&gt;We certainly hope they do, because to reach its full potential, Kubernetes needs simple, effective, and low-cost observability – which only these new approaches can deliver.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>kubecon</category>
      <category>observability</category>
      <category>ebpf</category>
    </item>
  </channel>
</rss>
