<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ethan Graham</title>
    <description>The latest articles on Forem by Ethan Graham (@ethgraham).</description>
    <link>https://forem.com/ethgraham</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2596083%2F262fb187-42e2-40e2-b134-3dbda63d9102.jpg</url>
      <title>Forem: Ethan Graham</title>
      <link>https://forem.com/ethgraham</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ethgraham"/>
    <language>en</language>
    <item>
      <title>Snooping on your GPU: Using eBPF to Build Zero-instrumentation CUDA Monitoring</title>
      <dc:creator>Ethan Graham</dc:creator>
      <pubDate>Sun, 22 Dec 2024 17:00:00 +0000</pubDate>
      <link>https://forem.com/ethgraham/snooping-on-your-gpu-using-ebpf-to-build-zero-instrumentation-cuda-monitoring-2hh1</link>
      <guid>https://forem.com/ethgraham/snooping-on-your-gpu-using-ebpf-to-build-zero-instrumentation-cuda-monitoring-2hh1</guid>
      <description>&lt;p&gt;GPUprobe uses Linux's eBPF to monitor CUDA applications with zero code changes - no recompilation, no instrumentation, just attach and go. &lt;a href="https://github.com/GPUprobe/gpuprobe-daemon" rel="noopener noreferrer"&gt;Check the repository out&lt;/a&gt;!&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;When I started my first job as a software engineer, my first task was to fix a memory leak in a Go service that was causing it to crash and restart frequently. The problem was that the Go code called a C library through CGo, leaving the warm embrace of the garbage-collected Go runtime for the cold void of raw &lt;code&gt;malloc()&lt;/code&gt; and &lt;code&gt;free()&lt;/code&gt; calls - effectively rendering any Go profiling tools powerless.&lt;/p&gt;

&lt;p&gt;So I reached for Valgrind and similar tooling. This didn't work well with a Go binary at all: Valgrind instruments every memory access, slowing the binary down so much that I could barely even launch it, let alone debug it. I frantically searched for another solution...&lt;/p&gt;

&lt;p&gt;Enter &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/memleak.py" rel="noopener noreferrer"&gt;BCC memleak&lt;/a&gt;, an eBPF-based tool that finds leaking memory by attaching to a running process. All it took was compiling the leaking library with a debug flag and attaching memleak to my process - voilà. Within 30 minutes I had found the exact function call that was leaking memory, identified how much memory was leaking per call, and opened a PR for a one-line patch.&lt;/p&gt;

&lt;p&gt;A few months later, while mulling over potential side projects, I thought back to my experience with BCC memleak and how much time it had saved me. Could a tool like that work with GPU memory allocations? What other cool things could be done this way?&lt;/p&gt;

&lt;p&gt;And so &lt;strong&gt;GPUprobe&lt;/strong&gt; came to be - an eBPF-based observability tool for CUDA.&lt;/p&gt;

&lt;p&gt;Today, it provides insights on memory allocation patterns, memory leaks, kernel launch patterns, and more features to come.&lt;/p&gt;

&lt;h1&gt;
  
  
  What GPU Monitoring Lacks
&lt;/h1&gt;

&lt;p&gt;GPUs are expensive. In some cases, &lt;em&gt;really&lt;/em&gt; expensive. As a GPU user, you owe it to yourself and your wallet to squeeze all the performance you can out of your hardware.&lt;/p&gt;

&lt;p&gt;Furthermore, as with systems software in general, debugging and failure detection are non-trivial. CUDA helper functions normally return a &lt;code&gt;cudaError_t&lt;/code&gt;, which is just an enum value. Of course you should always handle errors in your code, but if you want any observability at runtime, you'll need to litter your code with statements like the following and then frantically check stdout to see if something has failed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;cudaError_t&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cudaGetLastError&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;cudaSuccess&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CUDA error: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cudaGetErrorString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are, naturally, existing solutions for GPU observability with their own strengths and trade-offs.&lt;/p&gt;

&lt;h2&gt;
  
  
  NVIDIA Nsight Systems
&lt;/h2&gt;

&lt;p&gt;Nsight Systems is NVIDIA's primary GPU profiling tool. It's incredibly powerful for development-time profiling and optimization, using CUPTI (the CUDA Profiling Tools Interface) to collect detailed metrics about GPU usage. Its workflow, however, is fundamentally different from continuous monitoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It requires explicitly starting a profiling session&lt;/li&gt;
&lt;li&gt;Profiling adds significant overhead (often 2-10x slowdown)&lt;/li&gt;
&lt;li&gt;Data analysis happens after the program finishes&lt;/li&gt;
&lt;li&gt;Not designed for continuous production monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it more as a GPU-specialized debugger than a monitoring solution - great for finding bottlenecks during development, but not suitable for ongoing production insights.&lt;/p&gt;

&lt;h2&gt;
  
  
  DCGM (Data Center GPU Manager)
&lt;/h2&gt;

&lt;p&gt;DCGM is NVIDIA's solution for GPU monitoring in data centers. It excels at collecting system-level metrics like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU utilization&lt;/li&gt;
&lt;li&gt;Memory usage (total used/free)&lt;/li&gt;
&lt;li&gt;Temperature and power consumption&lt;/li&gt;
&lt;li&gt;Hardware health status&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, DCGM operates at a high level and misses application-specific details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No visibility into per-process memory allocation patterns&lt;/li&gt;
&lt;li&gt;Can't track individual CUDA kernel launches&lt;/li&gt;
&lt;li&gt;Limited ability to detect memory leaks&lt;/li&gt;
&lt;li&gt;No insight into API-level behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where GPUprobe fits in
&lt;/h2&gt;

&lt;p&gt;GPUprobe fills a specific gap in GPU observability: lightweight, continuous monitoring at the application level. It provides:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Close to zero-overhead runtime monitoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses eBPF uprobes that hook into existing CUDA runtime calls&lt;/li&gt;
&lt;li&gt;No code instrumentation required&lt;/li&gt;
&lt;li&gt;Minimal impact on application performance (&amp;lt;4% in benchmarks)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Application-level insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track memory allocations and potential leaks per process&lt;/li&gt;
&lt;li&gt;Monitor CUDA kernel launch patterns&lt;/li&gt;
&lt;li&gt;See actual function names and call patterns&lt;/li&gt;
&lt;li&gt;Debug API-level issues in production&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Modern observability integration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exports Prometheus metrics for Grafana dashboards&lt;/li&gt;
&lt;li&gt;Continuous monitoring suitable for production&lt;/li&gt;
&lt;li&gt;Fits into existing monitoring stacks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;GPUprobe achieves this by using eBPF uprobes to monitor CUDA runtime API calls from the Linux kernel. This approach lets us gather detailed metrics without modifying application code or significantly impacting performance. Think of it as filling the middle ground between Nsight's deep but heavyweight profiling and DCGM's high-level system monitoring.&lt;/p&gt;
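
The Prometheus integration mentioned above boils down to exposing text-format metrics that a scraper can pick up. Here's a minimal sketch of what such output could look like - the metric and label names are illustrative inventions, not GPUprobe's actual ones:

```python
def render_metrics(outstanding_bytes_by_pid):
    """Render one gauge per PID in the Prometheus text exposition format.

    `outstanding_bytes_by_pid` maps a process ID to the number of device
    bytes it has allocated but not yet freed. Metric name is hypothetical.
    """
    lines = [
        "# HELP cuda_outstanding_alloc_bytes Device memory not yet freed, per process",
        "# TYPE cuda_outstanding_alloc_bytes gauge",
    ]
    for pid, nbytes in sorted(outstanding_bytes_by_pid.items()):
        lines.append(f'cuda_outstanding_alloc_bytes{{pid="{pid}"}} {nbytes}')
    return "\n".join(lines)

print(render_metrics({1234: 4096, 5678: 1048576}))
```

A Grafana dashboard can then graph this gauge per PID; a line that only ever climbs is the visual signature of a leak.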
&lt;h1&gt;
  
  
  &lt;a href="https://ebpf.io/" rel="noopener noreferrer"&gt;eBPF&lt;/a&gt; (and why it's cool)
&lt;/h1&gt;

&lt;p&gt;I won't detail all of the things that you can do with eBPF here - consider checking out &lt;a href="https://ebpf.io/" rel="noopener noreferrer"&gt;the eBPF website&lt;/a&gt; to learn more. Here's a snippet from the website:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;eBPF is a revolutionary technology with origins in the Linux kernel that can run sandboxed programs in a privileged context such as the operating system kernel. It is used to safely and efficiently extend the capabilities of the kernel without requiring to change kernel source code or load kernel modules. Historically, the operating system has always been an ideal place to implement observability, security, and networking functionality due to the kernel’s privileged ability to oversee and control the entire system. ...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Uprobes
&lt;/h2&gt;

&lt;p&gt;One of eBPF's most powerful features is its ability to attach to user-space programs through uprobes. Think of a uprobe as a microscopic breakpoint that you can attach to any function in a running program. When that function is called, the eBPF program is notified and can inspect or collect data about the call.&lt;/p&gt;

&lt;p&gt;They are particularly powerful because the program itself doesn't need to be modified, and the overhead is minimal compared to traditional instrumentation.&lt;/p&gt;
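
As an analogy only (this is not how uprobes are implemented in the kernel), a Python decorator captures the same observe-without-modifying idea: entry arguments and the return value are recorded while the wrapped function's body stays untouched:

```python
import functools

def probe(fn):
    """Analogy for a uprobe/uretprobe pair: record each call's
    arguments and return value without modifying the function body."""
    events = []

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        ret = fn(*args, **kwargs)   # the "function body" runs unmodified
        events.append((args, ret))  # the "probe" observes entry args + return
        return ret

    wrapper.events = events
    return wrapper

@probe
def fake_cuda_malloc(size):
    # stand-in for cudaMalloc; pretend it returns cudaSuccess (0)
    return 0

fake_cuda_malloc(1024)
print(fake_cuda_malloc.events)  # [((1024,), 0)]
```

The crucial difference is that a real uprobe needs no decorator at all: the kernel patches the target instruction in the already-running binary.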

&lt;p&gt;For GPUprobe, uprobes are the secret sauce. We attach them to the CUDA runtime API directly (&lt;code&gt;libcudart.so&lt;/code&gt;), and they are triggered by calls to functions like &lt;code&gt;cudaMalloc()&lt;/code&gt;, &lt;code&gt;cudaFree()&lt;/code&gt;, or &lt;code&gt;cudaLaunchKernel()&lt;/code&gt;. When calls are made by your program, our eBPF programs intercept them, collect the relevant data, and send it up to the monitoring pipeline - all without your program even knowing we're there.&lt;/p&gt;
&lt;h1&gt;
  
  
  Case study: implementing a memory leak detection tool
&lt;/h1&gt;

&lt;p&gt;Let's dive into how we implemented a tool for detecting CUDA memory leaks in real time. At a high level, we maintain per-process CUDA memory maps. A chunk of memory is allocated by a call to &lt;code&gt;cudaMalloc()&lt;/code&gt; and freed by an associated call to &lt;code&gt;cudaFree()&lt;/code&gt;. Here are the signatures of those functions for the unfamiliar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// allocate `size` bytes on device, device address is copied to `*devPtr`&lt;/span&gt;
&lt;span class="n"&gt;cudaError_t&lt;/span&gt; &lt;span class="n"&gt;cudaMalloc&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;/// free an allocation at device address `devPtr`&lt;/span&gt;
&lt;span class="n"&gt;cudaError_t&lt;/span&gt; &lt;span class="n"&gt;cudaFree&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idea behind monitoring this in eBPF is illustrated with this Python-like pseudo-code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryMaps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;memory_maps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryMaps&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;uprobe_cuda_malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;memory_maps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;make_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;uprobe_cuda_free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;memory_maps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;free_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_exits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;memory_maps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;free_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's really as simple as that! &lt;em&gt;Kinda&lt;/em&gt;...&lt;/p&gt;
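
The elided MemoryMaps class can be fleshed out along these lines - a simplified sketch that reuses the method names from the pseudo-code above, not GPUprobe's actual implementation:

```python
class MemoryMaps:
    """Per-process bookkeeping of outstanding device allocations:
    pid -> {device_address: size_in_bytes}."""

    def __init__(self):
        self.maps = {}

    def make_entry(self, pid, device_addr, size):
        # record a successful cudaMalloc for this process
        self.maps.setdefault(pid, {})[device_addr] = size

    def free_entry(self, pid, device_addr):
        # a free with no matching allocation would itself be suspicious
        self.maps.get(pid, {}).pop(device_addr, None)

    def free_all(self, pid):
        # process exited: the driver reclaims its device memory
        self.maps.pop(pid, None)

    def outstanding_bytes(self, pid):
        """Bytes allocated but never freed - a value that only grows
        over time is the signature of a leak."""
        return sum(self.maps.get(pid, {}).values())

memory_maps = MemoryMaps()
memory_maps.make_entry(42, 0x7F00, 1024)
memory_maps.make_entry(42, 0x8F00, 2048)
memory_maps.free_entry(42, 0x7F00)
print(memory_maps.outstanding_bytes(42))  # 2048
```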

&lt;p&gt;Firstly, we don't want our memory maps living in the kernel. Whenever we export metrics, we have to look up our data structure, traverse it to see which allocations are still outstanding, and clean up processes that have exited. Meanwhile, our uprobes contend for these same data structures - if they are triggered by a CUDA runtime function but have to wait on a lock, the monitored application itself is slowed down.&lt;/p&gt;

&lt;p&gt;So instead, we opt for an "event-based" system. We implement this with an eBPF queue, which our uprobes push to and the user-space program consumes. Each entry in the queue holds the information we may want to know about a call to &lt;code&gt;cudaMalloc()&lt;/code&gt; or &lt;code&gt;cudaFree()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/**
 * Wraps the arguments passed to `cudaMalloc` or `cudaFree`, and return code,
 * and some metadata
 */&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;memleak_event&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;__u64&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;__u64&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;device_addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;__u64&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;__u32&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;int32&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;memleak_event_t&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="cm"&gt;/**
 * Queue of memleak events that are updated from eBPF space, then dequeued
 * and processed from user-space by the GPUprobe daemon.
 */&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_MAP_TYPE_QUEUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;memleak_event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_entries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;memleak_events_queue&lt;/span&gt; &lt;span class="nf"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".maps"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A complication we encountered while implementing our &lt;code&gt;cudaMalloc()&lt;/code&gt; uprobe is that the device address copied into &lt;code&gt;*devPtr&lt;/code&gt; is only known when the function returns. So naturally, we reach for a uretprobe &lt;em&gt;(the same as a uprobe, but triggered when a function returns)&lt;/em&gt;. A uretprobe alone isn't sufficient either, though: uprobes and uretprobes read from a &lt;code&gt;struct pt_regs *ctx&lt;/code&gt;, i.e. a snapshot of the register state. By the time the function returns, the registers that held its arguments have been overwritten with arbitrary data, so the uretprobe on its own cannot recover &lt;code&gt;devPtr&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;An important note is that eBPF programs cannot call each other, nor can they call any function other than the allowed helper functions exposed by &lt;code&gt;bpf.h&lt;/code&gt;. Thus, to share data between the &lt;code&gt;cudaMalloc()&lt;/code&gt; uprobe and uretprobe, we use an eBPF hash-map that holds &lt;code&gt;devPtr&lt;/code&gt; for a given process. This relies on the assumption, supported by the CUDA documentation, that &lt;code&gt;cudaMalloc()&lt;/code&gt; is blocking and cannot be called twice concurrently from the same thread.&lt;/p&gt;

&lt;p&gt;At a high-level, the logic looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pid_to_devPtr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;uprobe_malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;pid_to_devPtr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;devPtr&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;uretprobe_malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;devPtr&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# we must read from user-space to get the device address that was copied 
&lt;/span&gt;    &lt;span class="c1"&gt;# into `void** devPtr`. Think of this as a pointer deref, except that we
&lt;/span&gt;    &lt;span class="c1"&gt;# are deferencing something in user-space
&lt;/span&gt;    &lt;span class="n"&gt;device_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_from_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid_to_devPtr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, with some extra useful metadata collected along the way, the implementation looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;/// uprobe triggered by a call to `cudaMalloc`&lt;/span&gt;
&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"uprobe/cudaMalloc"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;memleak_cuda_malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;pt_regs&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;memleak_event&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;dev_ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;PT_REGS_PARM2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;dev_ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;PT_REGS_PARM1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;bpf_get_current_pid_tgid&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CUDA_MALLOC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_ktime_get_ns&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bpf_map_update_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;memleak_pid_to_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bpf_map_update_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;memleak_pid_to_dev_ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dev_ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;/// uretprobe triggered when `cudaMalloc` returns&lt;/span&gt;
&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"uretprobe/cudaMalloc"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;memleak_cuda_malloc_ret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;pt_regs&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;cuda_malloc_ret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;num_failures&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;memleak_event&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;dev_ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;***&lt;/span&gt;&lt;span class="n"&gt;map_ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;cuda_malloc_ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;PT_REGS_RC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;bpf_get_current_pid_tgid&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_map_lookup_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;memleak_pid_to_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cuda_malloc_ret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// lookup the value of `devPtr` passed to `cudaMalloc` by this process&lt;/span&gt;
    &lt;span class="n"&gt;map_ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;***&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;bpf_map_lookup_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;memleak_pid_to_dev_ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;map_ptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;dev_ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;map_ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// read the value copied into `*devPtr` by `cudaMalloc` from user-space&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bpf_probe_read_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;device_addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dev_ptr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_ktime_get_ns&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bpf_map_push_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;memleak_events_queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voila - now our user-space code can consume from &lt;code&gt;memleak_events_queue&lt;/code&gt; and update its memory maps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz70lyccerp28ez6541a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz70lyccerp28ez6541a.png" alt="High-level architecture diagram" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our implementations of the &lt;code&gt;cudaFree()&lt;/code&gt; uprobe and uretprobe are very similar to what we just presented for &lt;code&gt;cudaMalloc()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Naturally, a benefit of having our processing pipeline in user-space is access to the richer data structures exposed by user-space programming languages. These include, but are not limited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamically sized data structures: eBPF maps must have a fixed size, set either statically or explicitly at initialization, before a program is attached.&lt;/li&gt;
&lt;li&gt;Nested data structures: We implement our per-process memory maps as a hash-map of B-tree maps. This maps PIDs to their memory map, which is a B-Tree map maintaining an ordered range of CUDA device addresses with the associated size metadata.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our user-space processing pipeline is written in Rust, making use of &lt;code&gt;libbpf-rs&lt;/code&gt; which provides Rust bindings for &lt;a href="https://github.com/libbpf/libbpf" rel="noopener noreferrer"&gt;libbpf&lt;/a&gt;, exposing a nice API for attaching and managing the lifetime of eBPF programs, and accessing eBPF maps.&lt;/p&gt;

&lt;p&gt;The queue of events that have been generated from our uprobes is consumed when displaying to stdout or when exporting. A display is triggered at a fixed interval that is user-configurable (default is 5 seconds), and an export is triggered whenever a request is made to the metrics endpoint at &lt;code&gt;:9000/metrics&lt;/code&gt; (this port is also user-configurable).&lt;/p&gt;

&lt;p&gt;Consuming the queue is relatively straightforward - we just pop from the eBPF queue until there is nothing left to process. Note that this queue contains events generated by both our &lt;code&gt;cudaMalloc()&lt;/code&gt; and our &lt;code&gt;cudaFree()&lt;/code&gt; uprobes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="c1"&gt;// key size must be zero for BPF_MAP_TYPE_QUEUE&lt;/span&gt;
                       &lt;span class="c1"&gt;// `lookup_and_delete` calls.&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;opt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;
    &lt;span class="py"&gt;.skel&lt;/span&gt;
    &lt;span class="py"&gt;.skel&lt;/span&gt;
    &lt;span class="py"&gt;.maps&lt;/span&gt;
    &lt;span class="py"&gt;.memleak_events_queue&lt;/span&gt;
    &lt;span class="nf"&gt;.lookup_and_delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;event_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;opt&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(());&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="nn"&gt;MemleakEvent&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;event_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;GpuprobeError&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"unable to construct MemleakEvent from bytes"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="c1"&gt;// update CUDA state&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The state that we keep looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;MemleakState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;memory_map&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BTreeMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CudaMemoryAlloc&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;num_successful_mallocs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;num_failed_mallocs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;num_successful_frees&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;num_failed_frees&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;active_pids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashSet&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We maintain a set of the processes that we are tracking (&lt;code&gt;active_pids&lt;/code&gt;) so that we can perform aliveness checks on the CUDA programs that we have observed, and clean up any relevant process state when they exit.&lt;/p&gt;
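A minimal sketch of what such an aliveness check might look like, written in Python for illustration (the real pipeline is in Rust; `is_alive` and `cleanup_exited` are hypothetical names mirroring the fields above):

```python
import os

def is_alive(pid: int) -> bool:
    # on Linux, a process is alive iff /proc/[pid] exists
    return os.path.exists(f"/proc/{pid}")

def cleanup_exited(active_pids: set, memory_map: dict) -> None:
    # drop per-process state for any tracked PID that has exited
    for pid in list(active_pids):
        if not is_alive(pid):
            active_pids.discard(pid)
            memory_map.pop(pid, None)
```

Checking for the existence of `/proc/[pid]` is a cheap way to test liveness without sending a signal to the process.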

&lt;h2&gt;
  
  
  Observing memory leaks and kernel launches - live!
&lt;/h2&gt;

&lt;p&gt;To illustrate that our program is exhibiting correct behavior and catching &lt;code&gt;cudaMalloc()&lt;/code&gt;, &lt;code&gt;cudaFree()&lt;/code&gt; and &lt;code&gt;cudaLaunchKernel()&lt;/code&gt; events, we spin up an instance of &lt;code&gt;gpuprobe-daemon&lt;/code&gt; and launch a simple CUDA binary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// CUDA kernels&lt;/span&gt;
&lt;span class="n"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;optimized_convolution_part1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;optimized_convolution_part2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;dv_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;dv_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;dv_intermediate&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;cudaMalloc&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dv_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;cudaMalloc&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dv_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;cudaMalloc&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dv_intermediate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// both of these trigger a `cudaKernelLaunch()` call&lt;/span&gt;
        &lt;span class="n"&gt;optimized_convolution_part1&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;n_blocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_threads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared_mem&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dv_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dv_intermediate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;optimized_convolution_part2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;n_blocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_threads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared_mem&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dv_intermediate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dv_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Note how we forget to free `dv_intermediate`!&lt;/span&gt;
    &lt;span class="n"&gt;cudaFree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dv_input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;cudaFree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dv_output&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So what does our program output?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2024-21-12 16:32:46

num_successful_mallocs:  3
num_failed_mallocs:      0
num_successful_frees:    0
num_failed_frees:        0
per-process memory maps:
process 365159
        0x0000793a44000000: 8000000 Bytes
        0x0000793a48c00000: 8000000 Bytes
        0x0000793a49400000: 8000000 Bytes

total kernel launches: 1470
pid: 365159
        0x5de98f9fba50 (_Z27optimized_convolution_part1PdS_i) -&amp;gt; 735
        0x5de98f9fbb30 (_Z27optimized_convolution_part2PdS_i) -&amp;gt; 735

==============================

2024-21-12 16:32:51

num_successful_mallocs:  3
num_failed_mallocs:      0
num_successful_frees:    2
num_failed_frees:        0
per-process memory maps:
process 365159
        0x0000793a44000000: 8000000 Bytes
        0x0000793a48c00000: 0 Bytes
        0x0000793a49400000: 0 Bytes

total kernel launches: 2000
pid: 365159
        0x5de98f9fba50 (_Z27optimized_convolution_part1PdS_i) -&amp;gt; 1000
        0x5de98f9fbb30 (_Z27optimized_convolution_part2PdS_i) -&amp;gt; 1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aha! It looks to be working. So what exactly are we seeing here?&lt;/p&gt;

&lt;p&gt;In the first interval, we see the program in flight - all of the required memory is allocated on the GPU, and we see that we have launched each of our two CUDA kernels 735 times. &lt;/p&gt;

&lt;p&gt;Firstly, we see how many allocations have been made, their virtual addresses on the GPU, and how large each allocation was.&lt;/p&gt;

&lt;p&gt;Secondly, we see the function address of each kernel as it appears in the program binary, as well as its name &lt;em&gt;(which we obtained by resolving its symbol from the associated binary)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In the second interval, we see that both kernels have been launched 1000 times, i.e. the number of iterations that we set. We also see that two of our chunks of memory have been freed! But what about the third allocation that seems to still be holding onto GPU memory? Well if you look at the code carefully, you can see that we forgot to call &lt;code&gt;cudaFree(dv_intermediate)&lt;/code&gt;!&lt;/p&gt;

&lt;p&gt;Although this is a toy example, I think it illustrates the point well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have caught a memory leak during program execution.&lt;/li&gt;
&lt;li&gt;We see exactly how many times each CUDA kernel was launched.&lt;/li&gt;
&lt;li&gt;We didn't have to make &lt;em&gt;any&lt;/em&gt; code modifications, or even attach GPUprobe to a specific process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The example program that I introduced was a boiled-down version of a class project of mine from my undergrad, which is what I first used to test GPUprobe. Funnily enough, I didn't expect it to have a memory leak at all - I caught that later on using GPUprobe. If I'd had the tool back then, I certainly wouldn't have submitted a class assignment that leaked memory.&lt;/p&gt;

&lt;h1&gt;
  
  
  Bugs
&lt;/h1&gt;

&lt;p&gt;I'll share a bug that I've encountered and am still trying &lt;em&gt;(albeit not necessarily actively)&lt;/em&gt; to solve.&lt;/p&gt;

&lt;p&gt;We perform symbol resolution on CUDA kernels so that the name of the launched kernel is displayed to stdout. It almost seems like magic - but quite a bit of hacking was needed to get it working. In my opinion it's a super useful feature: it can be very difficult to relate a virtual function address back to the abstraction of a CUDA kernel.&lt;/p&gt;

&lt;p&gt;A CUDA kernel launch event generated by our uprobe contains the PID of the process launching the kernel. This is useful! Knowing the PID, we can go and peek around in &lt;code&gt;/proc/[pid]&lt;/code&gt; to get the path of the running binary, as well as its virtual base address &lt;em&gt;(which varies between executions due to ASLR)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;From here, we can do some good old symbol resolution to recover a function name from the virtual address of a CUDA kernel at runtime. We can do this because CUDA kernels are written like functions, and thus live in the &lt;code&gt;.text&lt;/code&gt; section of a binary. So voila - now we can show the user a more human-parsable output by telling them the names of the kernels being passed into &lt;code&gt;cudaLaunchKernel(const void* func, ...)&lt;/code&gt;. At a high level, resolving a symbol looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Map from binary offset -&amp;gt; symbol name
&lt;/span&gt;&lt;span class="n"&gt;symbols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# e.g. {0x1000: "my_cuda_kernel", ...}
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;resolve_symbol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;virtual_addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Adjust for ASLR by subtracting the binary's base address
&lt;/span&gt;    &lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;virtual_addr&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;get_virtual_base_offset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Get the symbol name for this offset
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works great in our case... &lt;em&gt;most of the time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Recall that events are consumed when a display or export is triggered every few seconds. We cache a symbol table per process so that we don't have to read through &lt;code&gt;/proc&lt;/code&gt; every time an event is recorded - only when an event for a previously unseen PID arrives.&lt;/p&gt;

&lt;p&gt;A problem arises when an event is recorded by a short-lived process that starts and exits between two intervals.&lt;/p&gt;

&lt;p&gt;While processing the events queue, we check whether there is a cached symbol table for &lt;code&gt;pid&lt;/code&gt; - if this is the first time the PID has been recorded, it naturally won't have an entry. So we go and look in &lt;code&gt;/proc/[pid]&lt;/code&gt; and... it doesn't exist. The process has already exited - so we know neither the virtual base offset of the binary nor the location of the binary that executed. We can't perform symbol resolution here.&lt;/p&gt;
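Sketched in Python, in the spirit of the resolution snippet above (`load_symbol_table` stands in for a hypothetical ELF-parsing helper):

```python
import os

symbol_tables = {}  # cache: pid -> {offset: symbol name}

def load_symbol_table(binary_path):
    # stand-in for parsing the symbol table out of the ELF binary
    return {}

def get_symbol_table(pid):
    # return the cached table, or try to build one by inspecting /proc
    if pid in symbol_tables:
        return symbol_tables[pid]
    try:
        binary = os.readlink(f"/proc/{pid}/exe")
    except OSError:
        # the process already exited: we can no longer find its binary,
        # so its kernels will be reported as "unknown kernel"
        return None
    symbol_tables[pid] = load_symbol_table(binary)
    return symbol_tables[pid]
```

The failure mode is the `OSError` branch: a lookup for a PID that no longer has a `/proc` entry can only return `None`.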

&lt;p&gt;So that's a bit of a bummer. We can still record and display our frequency histogram, but since we weren't able to resolve the kernels' symbols, we just display &lt;code&gt;unknown kernel&lt;/code&gt;, as in this example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total kernel launches: 25
pid: 365306
        0x5823e39efa50 (unknown kernel) -&amp;gt; 10
        0x5823e39efb30 (unknown kernel) -&amp;gt; 15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are, as always, other ways of implementing symbol resolution. A potential alternative would be to monitor process exits system-wide, and cache symbol tables for exiting processes in case we need them. I think this falls out of scope: we would very likely end up collecting heaps of redundant data, because most processes on a Linux system won't use the CUDA runtime API. The trade-off made here seems reasonable, because most CUDA jobs will run for longer than a single display/export interval - or at least any CUDA jobs that we'd want to monitor would.&lt;/p&gt;

&lt;h1&gt;
  
  
  Performance Benchmark
&lt;/h1&gt;

&lt;p&gt;Lastly, before concluding, I would like to discuss some benchmarks. First, it should be noted that uprobes are pretty expensive in general, because every hit traps into the kernel &lt;em&gt;(the kernel snoops on the function calls made by user-space)&lt;/em&gt;. The reason we still expect minimal overhead when using uprobes to monitor the CUDA runtime is that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CUDA runtime API calls into the CUDA driver, which lives in kernel space - so these calls already incur a transition into the kernel.&lt;/li&gt;
&lt;li&gt;Calls to CUDA runtime API functions will lead to communication with the GPU over PCIe &lt;em&gt;(or some other interconnect)&lt;/em&gt;, and most of these functions are blocking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the CUDA runtime API functions are expensive anyway, and they are normally used to tell the GPU what to do - a device optimized for throughput via batching, not for low-latency single-threaded performance. The idea is that the additional latency introduced by our uprobes is negligible compared to that of the CUDA runtime API calls, leading to very little relative overhead.&lt;/p&gt;

&lt;p&gt;I am going to keep the benchmarks very simple for this article, saving more rigorous benchmarking for later. I ran the benchmarks on my laptop with an NVIDIA Quadro P520 with 2048 MiB of VRAM.&lt;/p&gt;

&lt;p&gt;On my system I ran 5000 iterations of &lt;code&gt;cudaMalloc()&lt;/code&gt;/&lt;code&gt;cudaFree()&lt;/code&gt; pairs, and measured the average latency of each iteration. I kept the allocation size very small at 100 bytes so that the results would better reflect the overhead of the &lt;code&gt;cudaMalloc()&lt;/code&gt; call itself rather than the overhead related to allocating a large chunk of contiguous memory on the GPU. The first 500 iterations are discarded to account for warm-up effects.&lt;/p&gt;
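The measurement loop has roughly this shape, sketched in Python with a no-op standing in for the `cudaMalloc()`/`cudaFree()` pair (the actual benchmark is a CUDA program; `avg_latency` is a hypothetical name):

```python
import time

def avg_latency(op, iters=5000, warmup=500):
    # time each iteration, then discard the first `warmup` samples
    # to account for warm-up effects
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        op()  # stand-in for one cudaMalloc()/cudaFree() pair
        samples.append(time.perf_counter() - start)
    kept = samples[warmup:]
    return sum(kept) / len(kept)  # average latency in seconds
```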

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;without GPUprobe&lt;/th&gt;
&lt;th&gt;with GPUprobe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;avg. latency&lt;/td&gt;
&lt;td&gt;255μs&lt;/td&gt;
&lt;td&gt;265μs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What we observe is a 3.92% increase in latency when instrumenting all &lt;br&gt;
&lt;code&gt;cudaMalloc()&lt;/code&gt;/&lt;code&gt;cudaFree()&lt;/code&gt; calls.&lt;/p&gt;
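That percentage follows directly from the two averages in the table:

```python
base, probed = 255e-6, 265e-6  # average latencies in seconds, from the table
overhead = (probed - base) / base
print(f"{overhead:.2%}")  # prints 3.92%
```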

&lt;p&gt;As for the overhead of monitoring &lt;code&gt;cudaLaunchKernel()&lt;/code&gt; calls, I benchmarked the program presented in the case study, which runs 1000 iterations and launches two CUDA kernels in each. In this case, I found no measurable performance impact when running with GPUprobe versus without it.&lt;/p&gt;

&lt;p&gt;What we learn from these benchmarks is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring many rapid calls to &lt;code&gt;cudaMalloc()/cudaFree()&lt;/code&gt; incurs a ~4% overhead. However, this pattern isn't very realistic - in general, we allocate a chunk of memory on-device and &lt;em&gt;then&lt;/em&gt; perform many operations on it.&lt;/li&gt;
&lt;li&gt;Monitoring &lt;code&gt;cudaLaunchKernel()&lt;/code&gt; calls incurs negligible runtime overhead in this simple case. This makes sense - our uprobe for monitoring kernel launches is a lot simpler &lt;em&gt;(no intermediate state required)&lt;/em&gt; than our uprobes for recording memory allocation events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage, I am thus confident in saying that GPUprobe introduces very little overhead to running CUDA applications.&lt;/p&gt;

&lt;p&gt;Future benchmarking work will include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testing with larger, real-world ML workloads&lt;/li&gt;
&lt;li&gt;Measuring impact on memory-intensive applications&lt;/li&gt;
&lt;li&gt;Benchmarking across different GPU architectures and CUDA versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But for now, these results suggest that GPUprobe's overhead is minimal enough for practical use in local or production environments.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this write-up, I've introduced &lt;strong&gt;GPUprobe&lt;/strong&gt; - a zero-instrumentation tool for monitoring GPU behavior through eBPF-based inspection of CUDA runtime API calls. We explored how GPUprobe fills a specific niche in the GPU monitoring landscape, combining low overhead with detailed application-level insights that tools like Nsight and DCGM don't provide.&lt;/p&gt;

&lt;p&gt;Through a deep dive into the memory leak detector's implementation, we saw how eBPF uprobes can be used to track GPU memory allocations without modifying application code. We tackled interesting technical challenges like symbol resolution for CUDA kernels at runtime, and demonstrated through benchmarking that the overhead is minimal even for allocation-heavy workloads.&lt;/p&gt;

&lt;p&gt;My next steps will focus on testing GPUprobe with real-world ML workloads to better understand where it fits in the observability landscape. I'm particularly interested in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exploring use cases in ML infrastructure monitoring&lt;/li&gt;
&lt;li&gt;Adding support for more CUDA runtime API functions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're interested in GPU observability or eBPF, check out the repository! And consider leaving a star - it helps spread the word about the project :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/GPUprobe/gpuprobe-daemon" rel="noopener noreferrer"&gt;link to the repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>linux</category>
      <category>performance</category>
      <category>cuda</category>
    </item>
  </channel>
</rss>
