<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vu Nguyen</title>
    <description>The latest articles on Forem by Vu Nguyen (@vuvietnguyenit).</description>
    <link>https://forem.com/vuvietnguyenit</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3855344%2F35efee60-e709-4a5f-9d7f-fc83e60f3fb9.png</url>
      <title>Forem: Vu Nguyen</title>
      <link>https://forem.com/vuvietnguyenit</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vuvietnguyenit"/>
    <language>en</language>
    <item>
      <title>I Couldn’t Debug My AI/ML GPU Incident - So I Built gpuxray</title>
      <dc:creator>Vu Nguyen</dc:creator>
      <pubDate>Thu, 02 Apr 2026 04:03:44 +0000</pubDate>
      <link>https://forem.com/vuvietnguyenit/i-couldnt-debug-my-aiml-gpu-incident-so-i-built-my-own-tool-33n9</link>
      <guid>https://forem.com/vuvietnguyenit/i-couldnt-debug-my-aiml-gpu-incident-so-i-built-my-own-tool-33n9</guid>
      <description>&lt;p&gt;Several weeks ago, I encountered some problems with ML jobs running on my GPU server. I received alerts triggered at midnight, and one of the jobs failed due to GPU memory usage.&lt;/p&gt;

&lt;p&gt;The next morning, I performed a root-cause analysis to understand what had happened the night before. However, I couldn’t identify the issue because I only had access to the GPU’s current overall usage metrics. I used &lt;code&gt;nvidia-smi&lt;/code&gt; and &lt;code&gt;nvtop&lt;/code&gt; to inspect the current state, but they left no trace of the incident from the previous night. I realized I needed a solution to prevent similar problems from happening in the future.&lt;/p&gt;

&lt;p&gt;I tried using &lt;a href="https://github.com/NVIDIA/dcgm-exporter" rel="noopener noreferrer"&gt;DCGM exporter&lt;/a&gt; to expose GPU metrics, but &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html#limitations" rel="noopener noreferrer"&gt;it couldn’t provide PID-level metrics&lt;/a&gt;. I also tested it in a Kubernetes environment to get pod-level metrics, but it didn’t work because our GPUs only support &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html#time-slicing-gpus-in-kubernetes" rel="noopener noreferrer"&gt;time-slicing&lt;/a&gt; mode. &lt;/p&gt;

&lt;p&gt;Therefore, I developed an open-source tool called &lt;a href="https://github.com/vuvietnguyenit/gpuxray" rel="noopener noreferrer"&gt;gpuxray&lt;/a&gt; to monitor GPUs at the process level. gpuxray has helped our team significantly when observing and investigating bottlenecks in AI/ML processes running on Linux servers. It exposes metrics in Prometheus format, which we use to build Grafana dashboards for visualizing resource usage at the process level.&lt;/p&gt;
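&lt;p&gt;As a rough illustration of the output format (the metric name and labels below are hypothetical, not gpuxray’s actual metric names), a per-process gauge in the Prometheus text exposition format can be rendered like this in Go:&lt;/p&gt;

```go
package main

import "fmt"

// processUsage holds per-process GPU memory usage as collected by an exporter.
type processUsage struct {
	pid       int
	comm      string
	usedBytes uint64
}

// renderMetrics emits samples in the Prometheus text exposition format,
// one line per process, labeled by PID and process name.
func renderMetrics(samples []processUsage) string {
	out := "# TYPE gpuxray_process_gpu_memory_used_bytes gauge\n"
	for _, s := range samples {
		out += fmt.Sprintf(
			"gpuxray_process_gpu_memory_used_bytes{pid=%q,comm=%q} %d\n",
			fmt.Sprint(s.pid), s.comm, s.usedBytes)
	}
	return out
}

func main() {
	fmt.Print(renderMetrics([]processUsage{
		{pid: 21437, comm: "python3", usedBytes: 536870912},
	}))
}
```

&lt;p&gt;Prometheus scrapes this endpoint on an interval, which is exactly what turns point-in-time tools like &lt;code&gt;nvidia-smi&lt;/code&gt; into a historical record you can query after an incident.&lt;/p&gt;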

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28v6bg22y6qk2oooa6ym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28v6bg22y6qk2oooa6ym.png" alt="dashboard" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We &lt;a href="https://github.com/vuvietnguyenit/gpuxray/blob/main/docs/kubernetes.md" rel="noopener noreferrer"&gt;deployed gpuxray in a Kubernetes cluster&lt;/a&gt; as a &lt;code&gt;DaemonSet&lt;/code&gt; on all GPU nodes that need to be monitored.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; kube-operators get daemonset/gpuxray 
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
gpuxray   2         2         2       2            2           node.k8s.cluster/gpu&lt;span class="o"&gt;=&lt;/span&gt;exists   20d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the setup &lt;a href="https://github.com/vuvietnguyenit/gpuxray?tab=readme-ov-file#install" rel="noopener noreferrer"&gt;described here&lt;/a&gt;, we can easily enable per-process GPU observability.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;gpuxray achieves high performance while consuming minimal resources because it is built on &lt;strong&gt;eBPF&lt;/strong&gt; to trace GPU memory-related events. eBPF is powerful here because it lets us observe what is happening inside the kernel for a specific use case; in this case, we create probes that attach to CUDA API calls.&lt;br&gt;
The project is built on a solid codebase, making it easy to extend in the future. If you have ideas, feel free to discuss or open a pull request.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Design and Architecture
&lt;/h3&gt;

&lt;p&gt;Now, I will describe the architecture of gpuxray to help you understand how it works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foek6hp4eu8n1ooznu77q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foek6hp4eu8n1ooznu77q.png" alt="Architecture" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Basically, the &lt;code&gt;userspace-code&lt;/code&gt; handles the main logic and is written in Go. The &lt;code&gt;eBPF-program&lt;/code&gt; is attached to &lt;code&gt;CUDA API&lt;/code&gt; calls, so events are captured whenever these APIs are invoked. The &lt;code&gt;eBPF-program&lt;/code&gt; performs lightweight processing at the kernel level, updates eBPF maps, and sends events to the &lt;code&gt;ring-buffer&lt;/code&gt;. The &lt;code&gt;userspace-code&lt;/code&gt; then consumes events from the &lt;code&gt;ring-buffer&lt;/code&gt;, processes them, and produces the final metrics output.&lt;/p&gt;
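&lt;p&gt;To make the ring-buffer hand-off concrete, here is a minimal Go sketch of the userspace side decoding one fixed-size event record. The field layout is an assumption for illustration, not gpuxray’s actual event struct:&lt;/p&gt;

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// gpuEvent mirrors a hypothetical fixed-size record that the eBPF program
// writes into the ring buffer (the layout here is illustrative only).
type gpuEvent struct {
	Pid       uint32 // process that issued the CUDA call
	EventType uint32 // e.g. 0 = alloc, 1 = free
	Size      uint64 // allocation size in bytes
}

// decodeEvent parses one raw ring-buffer record into a gpuEvent.
func decodeEvent(raw []byte) (gpuEvent, error) {
	var ev gpuEvent
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &ev)
	return ev, err
}

func main() {
	// Simulate a record as the kernel side would emit it.
	var buf bytes.Buffer
	binary.Write(&buf, binary.LittleEndian, gpuEvent{Pid: 4242, EventType: 0, Size: 1 << 20})

	ev, err := decodeEvent(buf.Bytes())
	if err != nil {
		panic(err)
	}
	fmt.Printf("pid=%d type=%d size=%d\n", ev.Pid, ev.EventType, ev.Size)
}
```

&lt;p&gt;Keeping per-event work in the kernel cheap and pushing aggregation into Go is what keeps the probe overhead low even under bursts of CUDA calls.&lt;/p&gt;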

&lt;h3&gt;
  
  
  Performance and Resource Usage
&lt;/h3&gt;

&lt;p&gt;With the &lt;code&gt;mon&lt;/code&gt; option, gpuxray consumes almost no resources on the GPU server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fioo6482mhuvg3nvt4fmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fioo6482mhuvg3nvt4fmk.png" alt="perf and resource usage by mon" width="732" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When tracing memory leaks with the &lt;code&gt;memtrace&lt;/code&gt; option for a specific PID, I used a Python script to generate more than 2,000 malloc/free calls per second on the GPU and observed the resource usage: gpuxray consumed only about 8% of a single CPU core (on a server with 32 CPU cores and 125 GB of RAM).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwj4spfr2g01ztcsx107t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwj4spfr2g01ztcsx107t.png" alt="memtrace perf and resource usage" width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is impressive because ~2,000 malloc/free operations per second is far heavier than a typical real-world workload. As a result, we don’t need to worry about performance or resource overhead when using gpuxray.&lt;/p&gt;

&lt;p&gt;Feel free to explore the project, try it out, and contribute your ideas:&lt;br&gt;
&lt;a href="https://github.com/vuvietnguyenit/gpuxray" rel="noopener noreferrer"&gt;https://github.com/vuvietnguyenit/gpuxray&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gpu</category>
      <category>opensource</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
