<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ethan Graham</title>
    <description>The latest articles on Forem by Ethan Graham (@ethgraham).</description>
    <link>https://forem.com/ethgraham</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2596083%2F262fb187-42e2-40e2-b134-3dbda63d9102.jpg</url>
      <title>Forem: Ethan Graham</title>
      <link>https://forem.com/ethgraham</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ethgraham"/>
    <language>en</language>
    <item>
      <title>Snooping on your GPU: Using eBPF to Build Zero-instrumentation CUDA Monitoring</title>
      <dc:creator>Ethan Graham</dc:creator>
      <pubDate>Sun, 22 Dec 2024 17:00:00 +0000</pubDate>
      <link>https://forem.com/ethgraham/snooping-on-your-gpu-using-ebpf-to-build-zero-instrumentation-cuda-monitoring-2hh1</link>
      <guid>https://forem.com/ethgraham/snooping-on-your-gpu-using-ebpf-to-build-zero-instrumentation-cuda-monitoring-2hh1</guid>
      <description>&lt;p&gt;GPUprobe uses Linux's eBPF to monitor CUDA applications with zero code changes - no recompilation, no instrumentation, just attach and go. &lt;a href="https://github.com/GPUprobe/gpuprobe-daemon" rel="noopener noreferrer"&gt;Check the repository out&lt;/a&gt;!&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;When I started my first job as a software engineer, my first task was to fix a memory leak in a Go service that was causing it to crash and restart frequently. The problem was that the Go code called a C library through CGo, leaving the warm embrace of the garbage-collected Go runtime for the cold void of raw &lt;code&gt;malloc()&lt;/code&gt; and &lt;code&gt;free()&lt;/code&gt; calls - effectively rendering any Go profiling tools powerless.&lt;/p&gt;

&lt;p&gt;So I reached for Valgrind and similar tooling. This didn't work well with a Go binary at all: Valgrind instruments every memory access, slowing the binary down so much that I could barely even launch it, let alone debug it. I frantically searched for another solution...&lt;/p&gt;

&lt;p&gt;Enter &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/memleak.py" rel="noopener noreferrer"&gt;BCC memleak&lt;/a&gt;, an eBPF-based tool that finds leaking memory by attaching to a running process. All it took was compiling the leaking library with a debug flag and attaching memleak to my process - voilà. Within 30 minutes I had found the exact function call that was leaking memory, identified how much memory was leaking per call, and opened a PR for a one-line patch.&lt;/p&gt;

&lt;p&gt;A few months later, while mulling over potential side projects, I thought back to my experience with BCC memleak and how much time it had saved me. Could a tool like that work with GPU memory allocations? What other cool things could be done this way?&lt;/p&gt;

&lt;p&gt;And so &lt;strong&gt;GPUprobe&lt;/strong&gt; came to be - an eBPF-based observability tool for CUDA.&lt;/p&gt;

&lt;p&gt;Today, it provides insights on memory allocation patterns, memory leaks, kernel launch patterns, and more features to come.&lt;/p&gt;

&lt;h1&gt;
  
  
  What GPU Monitoring Lacks
&lt;/h1&gt;

&lt;p&gt;GPUs are expensive. In some cases, &lt;em&gt;really&lt;/em&gt; expensive. As a GPU user, you owe it to yourself and your wallet to squeeze all the performance you can out of your hardware.&lt;/p&gt;

&lt;p&gt;Furthermore, as with systems software in general, debugging and failure detection are non-trivial. CUDA helper functions normally return a &lt;code&gt;cudaError_t&lt;/code&gt;, which is just an enum value. Of course you should always handle errors in your code, but if you want any observability at runtime, you'll need to litter your code with statements like the following and then frantically check stdout to see if something has failed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;cudaError_t&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cudaGetLastError&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;cudaSuccess&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CUDA error: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cudaGetErrorString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are, naturally, existing solutions for GPU observability with their own strengths and trade-offs.&lt;/p&gt;

&lt;h2&gt;
  
  
  NVIDIA Nsight Systems
&lt;/h2&gt;

&lt;p&gt;Nsight Systems is NVIDIA's primary GPU profiling tool. It's incredibly powerful for development-time profiling and optimization, using CUPTI (the CUDA Profiling Tools Interface) to collect detailed metrics about GPU usage. Its workflow, however, is fundamentally different from continuous monitoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It requires explicitly starting a profiling session&lt;/li&gt;
&lt;li&gt;Profiling adds significant overhead (often 2-10x slowdown)&lt;/li&gt;
&lt;li&gt;Data analysis happens after the program finishes&lt;/li&gt;
&lt;li&gt;Not designed for continuous production monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it more as a GPU-specialized debugger than a monitoring solution - great for finding bottlenecks during development, but not suitable for ongoing production insights.&lt;/p&gt;

&lt;h2&gt;
  
  
  DCGM (Data Center GPU Manager)
&lt;/h2&gt;

&lt;p&gt;DCGM is NVIDIA's solution for GPU monitoring in data centers. It excels at collecting system-level metrics like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU utilization&lt;/li&gt;
&lt;li&gt;Memory usage (total used/free)&lt;/li&gt;
&lt;li&gt;Temperature and power consumption&lt;/li&gt;
&lt;li&gt;Hardware health status&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, DCGM operates at a high level and misses application-specific details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No visibility into per-process memory allocation patterns&lt;/li&gt;
&lt;li&gt;Can't track individual CUDA kernel launches&lt;/li&gt;
&lt;li&gt;Limited ability to detect memory leaks&lt;/li&gt;
&lt;li&gt;No insight into API-level behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where GPUprobe fits in
&lt;/h2&gt;

&lt;p&gt;GPUprobe fills a specific gap in GPU observability: lightweight, continuous monitoring at the application level. It provides:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Close to zero-overhead runtime monitoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses eBPF uprobes that hook into existing CUDA runtime calls&lt;/li&gt;
&lt;li&gt;No code instrumentation required&lt;/li&gt;
&lt;li&gt;Minimal impact on application performance (&amp;lt;4% in benchmarks)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Application-level insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track memory allocations and potential leaks per process&lt;/li&gt;
&lt;li&gt;Monitor CUDA kernel launch patterns&lt;/li&gt;
&lt;li&gt;See actual function names and call patterns&lt;/li&gt;
&lt;li&gt;Debug API-level issues in production&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Modern observability integration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exports Prometheus metrics for Grafana dashboards&lt;/li&gt;
&lt;li&gt;Continuous monitoring suitable for production&lt;/li&gt;
&lt;li&gt;Fits into existing monitoring stacks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;GPUprobe achieves this by using eBPF uprobes to monitor CUDA runtime API calls from the Linux kernel. This approach lets us gather detailed metrics without modifying application code or significantly impacting performance. Think of it as filling the middle ground between Nsight's deep but heavyweight profiling and DCGM's high-level system monitoring.&lt;/p&gt;
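
The Prometheus integration mentioned above boils down to exposing text-format metrics that a scraper can pick up. Here's a minimal sketch of what such output could look like - the metric and label names are illustrative inventions, not GPUprobe's actual ones:

```python
def render_metrics(outstanding_bytes_by_pid):
    """Render one gauge per PID in the Prometheus text exposition format.

    `outstanding_bytes_by_pid` maps a process ID to the number of device
    bytes it has allocated but not yet freed. Metric name is hypothetical.
    """
    lines = [
        "# HELP cuda_outstanding_alloc_bytes Device memory not yet freed, per process",
        "# TYPE cuda_outstanding_alloc_bytes gauge",
    ]
    for pid, nbytes in sorted(outstanding_bytes_by_pid.items()):
        lines.append(f'cuda_outstanding_alloc_bytes{{pid="{pid}"}} {nbytes}')
    return "\n".join(lines)

print(render_metrics({1234: 4096, 5678: 1048576}))
```

A Grafana dashboard can then graph this gauge per PID; a line that only ever climbs is the visual signature of a leak.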
&lt;h1&gt;
  
  
  &lt;a href="https://ebpf.io/" rel="noopener noreferrer"&gt;eBPF&lt;/a&gt; (and why it's cool)
&lt;/h1&gt;

&lt;p&gt;I won't detail all of the things that you can do with eBPF here - consider checking out &lt;a href="https://ebpf.io/" rel="noopener noreferrer"&gt;the eBPF website&lt;/a&gt; to learn more. Here's a snippet from the website:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;eBPF is a revolutionary technology with origins in the Linux kernel that can run sandboxed programs in a privileged context such as the operating system kernel. It is used to safely and efficiently extend the capabilities of the kernel without requiring to change kernel source code or load kernel modules. Historically, the operating system has always been an ideal place to implement observability, security, and networking functionality due to the kernel’s privileged ability to oversee and control the entire system. ...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Uprobes
&lt;/h2&gt;

&lt;p&gt;One of eBPF's most powerful features is its ability to attach to user-space programs through uprobes. Think of a uprobe as a microscopic breakpoint that you can attach to any function in a running program. When that function is called, the eBPF program is notified and can inspect or collect data about the call.&lt;/p&gt;

&lt;p&gt;They are particularly powerful because the program itself doesn't need to be modified, and the overhead is minimal compared to traditional instrumentation.&lt;/p&gt;
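
As an analogy only (this is not how uprobes are implemented in the kernel), a Python decorator captures the same observe-without-modifying idea: entry arguments and the return value are recorded while the wrapped function's body stays untouched:

```python
import functools

def probe(fn):
    """Analogy for a uprobe/uretprobe pair: record each call's
    arguments and return value without modifying the function body."""
    events = []

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        ret = fn(*args, **kwargs)   # the "function body" runs unmodified
        events.append((args, ret))  # the "probe" observes entry args + return
        return ret

    wrapper.events = events
    return wrapper

@probe
def fake_cuda_malloc(size):
    # stand-in for cudaMalloc; pretend it returns cudaSuccess (0)
    return 0

fake_cuda_malloc(1024)
print(fake_cuda_malloc.events)  # [((1024,), 0)]
```

The crucial difference is that a real uprobe needs no decorator at all: the kernel patches the target instruction in the already-running binary.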

&lt;p&gt;For GPUprobe, uprobes are the secret sauce. We attach them to the CUDA runtime API directly (&lt;code&gt;libcudart.so&lt;/code&gt;), and they are triggered by calls to functions like &lt;code&gt;cudaMalloc()&lt;/code&gt;, &lt;code&gt;cudaFree()&lt;/code&gt;, or &lt;code&gt;cudaLaunchKernel()&lt;/code&gt;. When calls are made by your program, our eBPF programs intercept them, collect the relevant data, and send it up to the monitoring pipeline - all without your program even knowing we're there.&lt;/p&gt;
&lt;h1&gt;
  
  
  Case study: implementing a memory leak detection tool
&lt;/h1&gt;

&lt;p&gt;Let's dive into how we implemented a tool for detecting CUDA memory leaks in real time. At a high level, we maintain per-process CUDA memory maps. A chunk of memory is allocated by a call to &lt;code&gt;cudaMalloc()&lt;/code&gt; and freed by an associated call to &lt;code&gt;cudaFree()&lt;/code&gt;. Here are the signatures of those functions for the unfamiliar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// allocate `size` bytes on device, device address is copied to `*devPtr`&lt;/span&gt;
&lt;span class="n"&gt;cudaError_t&lt;/span&gt; &lt;span class="n"&gt;cudaMalloc&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;/// free an allocation at device address `devPtr`&lt;/span&gt;
&lt;span class="n"&gt;cudaError_t&lt;/span&gt; &lt;span class="n"&gt;cudaFree&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idea behind monitoring this in eBPF is illustrated with this Python-like pseudo-code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryMaps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;memory_maps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryMaps&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;uprobe_cuda_malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;memory_maps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;make_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;uprobe_cuda_free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;memory_maps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;free_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_exits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;memory_maps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;free_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's really as simple as that! &lt;em&gt;Kinda&lt;/em&gt;...&lt;/p&gt;
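
The elided MemoryMaps class can be fleshed out along these lines - a simplified sketch that reuses the method names from the pseudo-code above, not GPUprobe's actual implementation:

```python
class MemoryMaps:
    """Per-process bookkeeping of outstanding device allocations:
    pid -> {device_address: size_in_bytes}."""

    def __init__(self):
        self.maps = {}

    def make_entry(self, pid, device_addr, size):
        # record a successful cudaMalloc for this process
        self.maps.setdefault(pid, {})[device_addr] = size

    def free_entry(self, pid, device_addr):
        # a free with no matching allocation would itself be suspicious
        self.maps.get(pid, {}).pop(device_addr, None)

    def free_all(self, pid):
        # process exited: the driver reclaims its device memory
        self.maps.pop(pid, None)

    def outstanding_bytes(self, pid):
        """Bytes allocated but never freed - a value that only grows
        over time is the signature of a leak."""
        return sum(self.maps.get(pid, {}).values())

memory_maps = MemoryMaps()
memory_maps.make_entry(42, 0x7F00, 1024)
memory_maps.make_entry(42, 0x8F00, 2048)
memory_maps.free_entry(42, 0x7F00)
print(memory_maps.outstanding_bytes(42))  # 2048
```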

&lt;p&gt;Firstly, we don't want our memory maps living in the kernel. Whenever we export metrics, we have to look up our data structure, traverse it to see which allocations are still outstanding, and clean up processes that have exited. Meanwhile, our uprobes contend for these same data structures - if they are triggered by a CUDA runtime function but have to wait on a lock, the monitored application itself is slowed down.&lt;/p&gt;

&lt;p&gt;So instead, we opt for an "event-based" system. We implement this with an eBPF queue, which our uprobes push to and the user-space program consumes. Each entry in the queue holds the information we may want to know about a call to &lt;code&gt;cudaMalloc()&lt;/code&gt; or &lt;code&gt;cudaFree()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/**
 * Wraps the arguments passed to `cudaMalloc` or `cudaFree`, and return code,
 * and some metadata
 */&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;memleak_event&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;__u64&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;__u64&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;device_addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;__u64&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;__u32&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;int32&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;memleak_event_t&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="cm"&gt;/**
 * Queue of memleak events that are updated from eBPF space, then dequeued
 * and processed from user-space by the GPUprobe daemon.
 */&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_MAP_TYPE_QUEUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;memleak_event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_entries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;memleak_events_queue&lt;/span&gt; &lt;span class="nf"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".maps"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A complication we encountered while implementing our &lt;code&gt;cudaMalloc()&lt;/code&gt; uprobe is that the device address copied into &lt;code&gt;*devPtr&lt;/code&gt; is only known when the function returns. So naturally, we reach for a uretprobe &lt;em&gt;(the same as a uprobe, but triggered when a function returns)&lt;/em&gt;. A uretprobe alone isn't sufficient either, though: uprobes and uretprobes read from a &lt;code&gt;struct pt_regs *ctx&lt;/code&gt;, i.e. a snapshot of the register state. By the time the function returns, the registers that held its arguments have been overwritten with arbitrary data, so the uretprobe on its own cannot recover &lt;code&gt;devPtr&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;An important note is that eBPF programs cannot call each other, nor can they call any function other than the allowed helper functions exposed by &lt;code&gt;bpf.h&lt;/code&gt;. Thus, to share data between the &lt;code&gt;cudaMalloc()&lt;/code&gt; uprobe and uretprobe, we use an eBPF hash-map that holds &lt;code&gt;devPtr&lt;/code&gt; for a given process. This relies on the assumption, supported by the CUDA documentation, that &lt;code&gt;cudaMalloc()&lt;/code&gt; is blocking and cannot be called twice concurrently from the same thread.&lt;/p&gt;

&lt;p&gt;At a high-level, the logic looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pid_to_devPtr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;uprobe_malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;pid_to_devPtr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;devPtr&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;uretprobe_malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# we must read from user-space to get the device address that was copied 
&lt;/span&gt;    &lt;span class="c1"&gt;# into `void** devPtr`. Think of this as a pointer deref, except that we
&lt;/span&gt;    &lt;span class="c1"&gt;# are deferencing something in user-space
&lt;/span&gt;    &lt;span class="n"&gt;device_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_from_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid_to_devPtr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, with some extra useful metadata collected along the way, the implementation looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;/// uprobe triggered by a call to `cudaMalloc`&lt;/span&gt;
&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"uprobe/cudaMalloc"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;memleak_cuda_malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;pt_regs&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;memleak_event&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;dev_ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;PT_REGS_PARM2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;dev_ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;PT_REGS_PARM1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;bpf_get_current_pid_tgid&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CUDA_MALLOC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_ktime_get_ns&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bpf_map_update_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;memleak_pid_to_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bpf_map_update_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;memleak_pid_to_dev_ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dev_ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;/// uretprobe triggered when `cudaMalloc` returns&lt;/span&gt;
&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"uretprobe/cudaMalloc"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;memleak_cuda_malloc_ret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;pt_regs&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;cuda_malloc_ret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;num_failures&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;memleak_event&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;dev_ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;***&lt;/span&gt;&lt;span class="n"&gt;map_ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;cuda_malloc_ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;PT_REGS_RC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;bpf_get_current_pid_tgid&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_map_lookup_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;memleak_pid_to_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cuda_malloc_ret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// lookup the value of `devPtr` passed to `cudaMalloc` by this process&lt;/span&gt;
    &lt;span class="n"&gt;map_ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;***&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;bpf_map_lookup_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;memleak_pid_to_dev_ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;map_ptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;dev_ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;map_ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// read the value copied into `*devPtr` by `cudaMalloc` from user-space&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bpf_probe_read_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;device_addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dev_ptr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_ktime_get_ns&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bpf_map_push_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;memleak_events_queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voila - now our user-space code can consume from &lt;code&gt;memleak_events_queue&lt;/code&gt; and update its memory maps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz70lyccerp28ez6541a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz70lyccerp28ez6541a.png" alt="High-level architecture diagram" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our implementations of the &lt;code&gt;cudaFree()&lt;/code&gt; uprobe and uretprobe are very similar to what we just presented for &lt;code&gt;cudaMalloc()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Naturally, a benefit of having our processing pipeline in user-space is access to the richer data structures exposed by user-space programming languages. These include, but are not limited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamically sized data structures: eBPF maps must have a fixed size, set either statically or explicitly at initialization, before a program is attached.&lt;/li&gt;
&lt;li&gt;Nested data structures: We implement our per-process memory maps as a hash-map of B-tree maps. This maps PIDs to their memory map, which is a B-Tree map maintaining an ordered range of CUDA device addresses with the associated size metadata.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our user-space processing pipeline is written in Rust, making use of &lt;code&gt;libbpf-rs&lt;/code&gt; which provides Rust bindings for &lt;a href="https://github.com/libbpf/libbpf" rel="noopener noreferrer"&gt;libbpf&lt;/a&gt;, exposing a nice API for attaching and managing the lifetime of eBPF programs, and accessing eBPF maps.&lt;/p&gt;

&lt;p&gt;The queue of events that have been generated from our uprobes is consumed when displaying to stdout or when exporting. A display is triggered at a fixed interval that is user-configurable (default is 5 seconds), and an export is triggered whenever a request is made to the metrics endpoint at &lt;code&gt;:9000/metrics&lt;/code&gt; (this port is also user-configurable).&lt;/p&gt;

&lt;p&gt;Consuming the queue is relatively straightforward - we just pop from the eBPF queue until there is nothing left to process. Note that this queue contains events generated by both our &lt;code&gt;cudaMalloc()&lt;/code&gt; and our &lt;code&gt;cudaFree()&lt;/code&gt; uprobes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="c1"&gt;// key size must be zero for BPF_MAP_TYPE_QUEUE&lt;/span&gt;
                       &lt;span class="c1"&gt;// `lookup_and_delete` calls.&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;opt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;
    &lt;span class="py"&gt;.skel&lt;/span&gt;
    &lt;span class="py"&gt;.skel&lt;/span&gt;
    &lt;span class="py"&gt;.maps&lt;/span&gt;
    &lt;span class="py"&gt;.memleak_events_queue&lt;/span&gt;
    &lt;span class="nf"&gt;.lookup_and_delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;event_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;opt&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(());&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="nn"&gt;MemleakEvent&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;event_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;GpuprobeError&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"unable to construct MemleakEvent from bytes"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="c1"&gt;// update CUDA state&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The state that we keep looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;MemleakState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;memory_map&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BTreeMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CudaMemoryAlloc&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;num_successful_mallocs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;num_failed_mallocs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;num_successful_frees&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;num_failed_frees&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;active_pids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashSet&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We maintain a set of the processes that we are tracking (&lt;code&gt;active_pids&lt;/code&gt;) so that we can perform aliveness checks on the CUDA programs that we have observed, and clean up any relevant process state when they exit.&lt;/p&gt;
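A minimal sketch of what such an aliveness check might look like, written in Python for illustration (the real pipeline is in Rust; `is_alive` and `cleanup_exited` are hypothetical names mirroring the fields above):

```python
import os

def is_alive(pid: int) -> bool:
    # on Linux, a process is alive iff /proc/[pid] exists
    return os.path.exists(f"/proc/{pid}")

def cleanup_exited(active_pids: set, memory_map: dict) -> None:
    # drop per-process state for any tracked PID that has exited
    for pid in list(active_pids):
        if not is_alive(pid):
            active_pids.discard(pid)
            memory_map.pop(pid, None)
```

Checking for the existence of `/proc/[pid]` is a cheap way to test liveness without sending a signal to the process.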

&lt;h2&gt;
  
  
  Observing memory leaks and kernel launches - live!
&lt;/h2&gt;

&lt;p&gt;To illustrate that our program is exhibiting correct behavior and catching &lt;code&gt;cudaMalloc()&lt;/code&gt;, &lt;code&gt;cudaFree()&lt;/code&gt; and &lt;code&gt;cudaLaunchKernel()&lt;/code&gt; events, we spin up an instance of &lt;code&gt;gpuprobe-daemon&lt;/code&gt; and launch a simple CUDA binary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// CUDA kernels&lt;/span&gt;
&lt;span class="n"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;optimized_convolution_part1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;optimized_convolution_part2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;dv_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;dv_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;dv_intermediate&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;cudaMalloc&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dv_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;cudaMalloc&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dv_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;cudaMalloc&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dv_intermediate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// both of these trigger a `cudaKernelLaunch()` call&lt;/span&gt;
        &lt;span class="n"&gt;optimized_convolution_part1&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;n_blocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_threads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared_mem&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dv_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dv_intermediate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;optimized_convolution_part2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;n_blocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_threads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared_mem&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dv_intermediate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dv_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Note how we forget to free `dv_intermediate`!&lt;/span&gt;
    &lt;span class="n"&gt;cudaFree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dv_input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;cudaFree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dv_output&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So what does our program output?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2024-21-12 16:32:46

num_successful_mallocs:  3
num_failed_mallocs:      0
num_successful_frees:    0
num_failed_frees:        0
per-process memory maps:
process 365159
        0x0000793a44000000: 8000000 Bytes
        0x0000793a48c00000: 8000000 Bytes
        0x0000793a49400000: 8000000 Bytes

total kernel launches: 1470
pid: 365159
        0x5de98f9fba50 (_Z27optimized_convolution_part1PdS_i) -&amp;gt; 735
        0x5de98f9fbb30 (_Z27optimized_convolution_part2PdS_i) -&amp;gt; 735

==============================

2024-21-12 16:32:51

num_successful_mallocs:  3
num_failed_mallocs:      0
num_successful_frees:    2
num_failed_frees:        0
per-process memory maps:
process 365159
        0x0000793a44000000: 8000000 Bytes
        0x0000793a48c00000: 0 Bytes
        0x0000793a49400000: 0 Bytes

total kernel launches: 2000
pid: 365159
        0x5de98f9fba50 (_Z27optimized_convolution_part1PdS_i) -&amp;gt; 1000
        0x5de98f9fbb30 (_Z27optimized_convolution_part2PdS_i) -&amp;gt; 1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aha! It looks to be working. So what exactly are we seeing here?&lt;/p&gt;

&lt;p&gt;In the first interval, we see the program in flight - all of the required memory is allocated on the GPU, and we see that we have launched each of our two CUDA kernels 735 times. &lt;/p&gt;

&lt;p&gt;Firstly, we see how many allocations have been made, their virtual addresses on the GPU, and how large each allocation was.&lt;/p&gt;

&lt;p&gt;Secondly, we see the function address of each kernel as it appears in the program binary, as well as its name &lt;em&gt;(which we obtained by resolving its symbol from the associated binary)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In the second interval, we see that both kernels have been launched 1000 times, i.e. the number of iterations that we set. We also see that two of our chunks of memory have been freed! But what about the third allocation that seems to still be holding onto GPU memory? Well if you look at the code carefully, you can see that we forgot to call &lt;code&gt;cudaFree(dv_intermediate)&lt;/code&gt;!&lt;/p&gt;

&lt;p&gt;Although this is a toy example, I think it illustrates the point well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have caught a memory leak during program execution.&lt;/li&gt;
&lt;li&gt;We see exactly how many times each CUDA kernel was launched.&lt;/li&gt;
&lt;li&gt;We didn't have to make &lt;em&gt;any&lt;/em&gt; code modifications, or even attach GPUprobe to a specific process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The example program that I introduced was a boiled-down version of a class project of mine from my undergrad, which is what I first used to test GPUprobe. Funnily enough, I didn't expect it to have a memory leak at all - I caught that later on using GPUprobe. If I'd had the tool back then, I certainly wouldn't have submitted a class assignment that leaked memory.&lt;/p&gt;

&lt;h1&gt;
  
  
  Bugs
&lt;/h1&gt;

&lt;p&gt;I'll share a bug that I've encountered and am still trying &lt;em&gt;(albeit not necessarily actively)&lt;/em&gt; to solve.&lt;/p&gt;

&lt;p&gt;We perform symbol resolution on CUDA kernels so that the name of the launched kernel is displayed to stdout. It almost seems like magic - but quite a bit of hacking was needed to get it working. In my opinion it's a super useful feature: it can be very difficult to relate a virtual function address back to the abstraction of a CUDA kernel.&lt;/p&gt;

&lt;p&gt;A CUDA kernel launch event generated by our uprobe contains the PID of the process launching the kernel. This is useful! Knowing the PID, we can go and peek around in &lt;code&gt;/proc/[pid]&lt;/code&gt; to get the path of the running binary, as well as its virtual base address &lt;em&gt;(which varies between executions due to ASLR)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;From here, we can do some good old symbol resolution to recover a function name from the virtual address of a CUDA kernel at runtime. We can do this because CUDA kernels are written like functions, and thus live in the &lt;code&gt;.text&lt;/code&gt; section of a binary. So voila - now we can show the user a more human-parsable output by telling them the names of the kernels being passed into &lt;code&gt;cudaLaunchKernel(const void* func, ...)&lt;/code&gt;. At a high level, resolving a symbol looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Map from binary offset -&amp;gt; symbol name
&lt;/span&gt;&lt;span class="n"&gt;symbols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# e.g. {0x1000: "my_cuda_kernel", ...}
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;resolve_symbol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;virtual_addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Adjust for ASLR by subtracting the binary's base address
&lt;/span&gt;    &lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;virtual_addr&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;get_virtual_base_offset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Get the symbol name for this offset
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works great in our case... &lt;em&gt;most of the time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Recall that events are consumed when a display or export is triggered every few seconds. We cache a symbol table per process so that we don't have to read through &lt;code&gt;/proc&lt;/code&gt; every time an event is recorded - only when an event for a previously unseen PID arrives.&lt;/p&gt;

&lt;p&gt;A problem arises when an event is recorded by a short-lived process that starts and exits between two intervals.&lt;/p&gt;

&lt;p&gt;While processing the events queue, we check whether there is a cached symbol table for &lt;code&gt;pid&lt;/code&gt; - if this is the first time the PID has been recorded, it naturally won't have an entry. So we go and look in &lt;code&gt;/proc/[pid]&lt;/code&gt; and... it doesn't exist. The process has already exited - so we know neither the virtual base offset of the binary nor the location of the binary that executed. We can't perform symbol resolution here.&lt;/p&gt;
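Sketched in Python, in the spirit of the resolution snippet above (`load_symbol_table` stands in for a hypothetical ELF-parsing helper):

```python
import os

symbol_tables = {}  # cache: pid -> {offset: symbol name}

def load_symbol_table(binary_path):
    # stand-in for parsing the symbol table out of the ELF binary
    return {}

def get_symbol_table(pid):
    # return the cached table, or try to build one by inspecting /proc
    if pid in symbol_tables:
        return symbol_tables[pid]
    try:
        binary = os.readlink(f"/proc/{pid}/exe")
    except OSError:
        # the process already exited: we can no longer find its binary,
        # so its kernels will be reported as "unknown kernel"
        return None
    symbol_tables[pid] = load_symbol_table(binary)
    return symbol_tables[pid]
```

The failure mode is the `OSError` branch: a lookup for a PID that no longer has a `/proc` entry can only return `None`.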

&lt;p&gt;So that's a bit of a bummer. We can still record and display our frequency histogram, but since we weren't able to resolve the kernels' symbols, we just display &lt;code&gt;unknown kernel&lt;/code&gt;, as in this example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total kernel launches: 25
pid: 365306
        0x5823e39efa50 (unknown kernel) -&amp;gt; 10
        0x5823e39efb30 (unknown kernel) -&amp;gt; 15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are, as always, other ways of implementing symbol resolution. A potential alternative would be to monitor process exits system-wide, and cache symbol tables for exiting processes in case we need them. I think this falls out of scope: we would very likely end up collecting heaps of redundant data, because most processes on a Linux system won't use the CUDA runtime API. The trade-off made here seems reasonable, because most CUDA jobs will run for longer than a single display/export interval - or at least any CUDA jobs that we'd want to monitor would.&lt;/p&gt;

&lt;h1&gt;
  
  
  Performance Benchmark
&lt;/h1&gt;

&lt;p&gt;Lastly, before concluding, I would like to discuss some benchmarks. First, it should be noted that uprobes are pretty expensive in general, because every hit traps into the kernel &lt;em&gt;(the kernel snoops on the function calls made by user-space)&lt;/em&gt;. The reason we still expect minimal overhead when using uprobes to monitor the CUDA runtime is that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CUDA runtime API calls into the CUDA driver, which lives in kernel space - so these calls already incur a transition into the kernel.&lt;/li&gt;
&lt;li&gt;Calls to CUDA runtime API functions will lead to communication with the GPU over PCIe &lt;em&gt;(or some other interconnect)&lt;/em&gt;, and most of these functions are blocking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the CUDA runtime API functions are expensive anyway, and they are normally used to tell the GPU what to do - a device optimized for throughput via batching, not for low-latency single-threaded performance. The idea is that the additional latency introduced by our uprobes is negligible compared to that of the CUDA runtime API calls, leading to very little relative overhead.&lt;/p&gt;

&lt;p&gt;I am going to keep the benchmarks very simple for this article, saving more rigorous benchmarking for later. I ran the benchmarks on my laptop with an NVIDIA Quadro P520 with 2048 MiB of VRAM.&lt;/p&gt;

&lt;p&gt;On my system I ran 5000 iterations of &lt;code&gt;cudaMalloc()&lt;/code&gt;/&lt;code&gt;cudaFree()&lt;/code&gt; pairs, and measured the average latency of each iteration. I kept the allocation size very small at 100 bytes so that the results would better reflect the overhead of the &lt;code&gt;cudaMalloc()&lt;/code&gt; call itself rather than the overhead related to allocating a large chunk of contiguous memory on the GPU. The first 500 iterations are discarded to account for warm-up effects.&lt;/p&gt;
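The measurement loop has roughly this shape, sketched in Python with a no-op standing in for the `cudaMalloc()`/`cudaFree()` pair (the actual benchmark is a CUDA program; `avg_latency` is a hypothetical name):

```python
import time

def avg_latency(op, iters=5000, warmup=500):
    # time each iteration, then discard the first `warmup` samples
    # to account for warm-up effects
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        op()  # stand-in for one cudaMalloc()/cudaFree() pair
        samples.append(time.perf_counter() - start)
    kept = samples[warmup:]
    return sum(kept) / len(kept)  # average latency in seconds
```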

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;without GPUprobe&lt;/th&gt;
&lt;th&gt;with GPUprobe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;avg. latency&lt;/td&gt;
&lt;td&gt;255μs&lt;/td&gt;
&lt;td&gt;265μs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What we observe is a 3.92% increase in latency when instrumenting all &lt;br&gt;
&lt;code&gt;cudaMalloc()&lt;/code&gt;/&lt;code&gt;cudaFree()&lt;/code&gt; calls.&lt;/p&gt;
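That percentage follows directly from the two averages in the table:

```python
base, probed = 255e-6, 265e-6  # average latencies in seconds, from the table
overhead = (probed - base) / base
print(f"{overhead:.2%}")  # prints 3.92%
```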

&lt;p&gt;As for the overhead of monitoring &lt;code&gt;cudaLaunchKernel()&lt;/code&gt; calls, I benchmarked the program presented in the case study, which runs 1000 iterations and launches two CUDA kernels in each. In this case, I found no measurable performance impact when running with GPUprobe versus without it.&lt;/p&gt;

&lt;p&gt;What we learn from these benchmarks is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring many rapid calls to &lt;code&gt;cudaMalloc()/cudaFree()&lt;/code&gt; incurs a ~4% overhead. However, this pattern isn't very realistic - in general, we allocate a chunk of memory on-device and &lt;em&gt;then&lt;/em&gt; perform many operations on it.&lt;/li&gt;
&lt;li&gt;Monitoring &lt;code&gt;cudaLaunchKernel()&lt;/code&gt; calls incurs negligible runtime overhead in this simple case. This makes sense - our uprobe for monitoring kernel launches is a lot simpler &lt;em&gt;(no intermediate state required)&lt;/em&gt; than our uprobes for recording memory allocation events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage, I am thus confident in saying that GPUprobe introduces very little overhead to running CUDA applications.&lt;/p&gt;

&lt;p&gt;Future benchmarking work will include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testing with larger, real-world ML workloads&lt;/li&gt;
&lt;li&gt;Measuring impact on memory-intensive applications&lt;/li&gt;
&lt;li&gt;Benchmarking across different GPU architectures and CUDA versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But for now, these results suggest that GPUprobe's overhead is minimal enough for practical use in local or production environments.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this write-up, I've introduced &lt;strong&gt;GPUprobe&lt;/strong&gt; - a zero-instrumentation tool for monitoring GPU behavior through eBPF-based inspection of CUDA runtime API calls. We explored how GPUprobe fills a specific niche in the GPU monitoring landscape, combining low overhead with detailed application-level insights that tools like Nsight and DCGM don't provide.&lt;/p&gt;

&lt;p&gt;Through a deep dive into the memory leak detector's implementation, we saw how eBPF uprobes can be used to track GPU memory allocations without modifying application code. We tackled interesting technical challenges like symbol resolution for CUDA kernels at runtime, and demonstrated through benchmarking that the overhead is minimal even for allocation-heavy workloads.&lt;/p&gt;

&lt;p&gt;My next steps will focus on testing GPUprobe with real-world ML workloads to better understand where it fits in the observability landscape. I'm particularly interested in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exploring use cases in ML infrastructure monitoring&lt;/li&gt;
&lt;li&gt;Adding support for more CUDA runtime API functions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're interested in GPU observability or eBPF, check out the repository! And consider leaving a star - it helps spread the word about the project :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/GPUprobe/gpuprobe-daemon" rel="noopener noreferrer"&gt;link to the repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>linux</category>
      <category>performance</category>
      <category>cuda</category>
    </item>
  </channel>
</rss>
