<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ingero Team</title>
    <description>The latest articles on Forem by Ingero Team (@ingero).</description>
    <link>https://forem.com/ingero</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3853036%2F403f610f-f2f0-4fed-af9b-7362de7c9ee4.png</url>
      <title>Forem: Ingero Team</title>
      <link>https://forem.com/ingero</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ingero"/>
    <language>en</language>
    <item>
      <title>Agent + MCP + eBPF: 10,869 CUDA Kernel Events, Now Queryable</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Tue, 21 Apr 2026 13:30:00 +0000</pubDate>
      <link>https://forem.com/ingero/agent-mcp-ebpf-10869-cuda-kernel-events-now-queryable-35p4</link>
      <guid>https://forem.com/ingero/agent-mcp-ebpf-10869-cuda-kernel-events-now-queryable-35p4</guid>
      <description>&lt;p&gt;A vLLM inference server handles hundreds of requests per second. Then one request with &lt;code&gt;n_completions=8&lt;/code&gt; and &lt;code&gt;logprobs=20&lt;/code&gt; arrives, and every other request blocks for 9-11 seconds. GPU utilization monitors stay green. Kubernetes reports healthy pods. Latency dashboards show a spike but no why. An eBPF trace of every CUDA call is the only view that catches this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filaogi59lwdl6er0fbc1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filaogi59lwdl6er0fbc1.png" alt="Agent + MCP + eBPF cover" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://ingero.io/ebpf-trace-cuda-mcp-queryable/" rel="noopener noreferrer"&gt;ingero.io&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a real issue (&lt;a href="https://github.com/vllm-project/vllm/issues/37343" rel="noopener noreferrer"&gt;vLLM #37343&lt;/a&gt;). We reproduced it on an RTX 4090 running vLLM 0.18.0 with Qwen3.5 (27B parameters). Ingero's eBPF trace captured everything: 10,869 events, 550 cudaLaunchKernel calls, 7,757 context switches, 6 causal chains. All stored in a 1.2 MB SQLite database.&lt;/p&gt;

&lt;p&gt;Then we pointed Claude at it via MCP.&lt;/p&gt;

&lt;h2&gt;
  
  
  The session
&lt;/h2&gt;

&lt;p&gt;The entire investigation below is real. These are actual MCP tool calls, actual data, actual conclusions. Nothing was edited except formatting for readability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's first move&lt;/strong&gt; - it called &lt;code&gt;get_trace_stats&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10,869 events traced:
  7,757 sched_switch (context switches)
    550 cudaLaunchKernel   p50: 20us  p99: 5,079us
  1,672 cuLaunchKernel     p50: 15us  p99: 892us
    734 cudaMemcpyAsync
     10 cudaDeviceSync     p50: 19us  p99: 4,720us (239x median)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two numbers jump out: cudaLaunchKernel p99 is 254x the p50, and there are 7,757 context switches. That ratio is not normal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's second move&lt;/strong&gt; - it called &lt;code&gt;get_causal_chains&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;The most severe chain:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;cudaLaunchKernel p99 = 13.1ms (784x the p50 of 16.7us), correlated with 428 context switches totaling 8.9 seconds off-CPU.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the signal. The kernel launch itself takes microseconds. But the CPU thread that feeds the GPU is getting preempted 428 times, accumulating 8.9 seconds of idle time. The GPU sits waiting.&lt;/p&gt;
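&lt;p&gt;That off-CPU accounting can be reconstructed from raw &lt;code&gt;sched_switch&lt;/code&gt; records. A sketch, using an illustrative &lt;code&gt;(timestamp_us, prev_tid, next_tid)&lt;/code&gt; event layout rather than Ingero's actual one:&lt;/p&gt;

```python
def off_cpu_time_us(events, tid):
    """Sum the time a thread spent switched out.

    events: time-ordered (timestamp_us, prev_tid, next_tid) sched_switch
    records. A switch whose prev_tid is our thread opens an off-CPU
    interval; the next switch whose next_tid is our thread closes it.
    Field layout is illustrative - the real trace schema may differ.
    """
    total, switched_out_at = 0, None
    for ts, prev_tid, next_tid in events:
        if prev_tid == tid:            # engine thread preempted
            switched_out_at = ts
        elif next_tid == tid and switched_out_at is not None:
            total += ts - switched_out_at   # back on CPU
            switched_out_at = None
    return total
```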

&lt;p&gt;&lt;strong&gt;Claude's third move&lt;/strong&gt; - without being asked, it called &lt;code&gt;get_stacks&lt;/code&gt; to look at the actual code paths:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Max Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;fused_add_rms_norm&lt;/td&gt;
&lt;td&gt;118&lt;/td&gt;
&lt;td&gt;54 us&lt;/td&gt;
&lt;td&gt;3.0 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;reshape_and_cache_flash&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;42 us&lt;/td&gt;
&lt;td&gt;161 us&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;silu_and_mul&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;44 us&lt;/td&gt;
&lt;td&gt;156 us&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rotary_embedding&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;38 us&lt;/td&gt;
&lt;td&gt;222 us&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All running under &lt;code&gt;VLLM::EngineCore&lt;/code&gt;, the single-threaded engine coroutine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's diagnosis&lt;/strong&gt; (47 seconds from first tool call):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The vLLM engine coroutine is running on a CPU shared with other processes. When those processes get scheduled, the coroutine is preempted, the GPU gets no new work, and throughput collapses. The CPU thread driving the GPU was context-switched out 5,347 times, causing 18.4 seconds of CPU unavailability. The root cause is CPU contention, not GPU contention."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Claude's fix recommendation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pin the engine thread to dedicated CPU cores (&lt;code&gt;taskset&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Reduce priority of background processes (&lt;code&gt;nice -n 19&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Do NOT use &lt;code&gt;CUDA_LAUNCH_BLOCKING&lt;/code&gt; (it will make things worse)&lt;/li&gt;
&lt;li&gt;Do NOT trust &lt;code&gt;nvidia-smi&lt;/code&gt; (it shows utilization, not causation)&lt;/li&gt;
&lt;/ul&gt;
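&lt;p&gt;The first fix can also be applied from Python via &lt;code&gt;os.sched_setaffinity&lt;/code&gt; (Linux only); the core IDs below are placeholders, not a recommendation for any specific host:&lt;/p&gt;

```python
import os

def pin_to_cores(pid, cores):
    """Pin a process to dedicated CPU cores, like `taskset -cp`.

    pid=0 means the calling process. Core IDs here are placeholders -
    pick cores that background daemons are steered away from.
    """
    os.sched_setaffinity(pid, set(cores))
    return os.sched_getaffinity(pid)

# Example (hypothetical engine PID):
# pin_to_cores(engine_pid, {2, 3})
```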

&lt;h2&gt;
  
  
  What happened here
&lt;/h2&gt;

&lt;p&gt;No SSH. No log files. No dashboard hopping. No "let me check nvidia-smi on each node."&lt;/p&gt;

&lt;p&gt;An AI agent made 4 MCP tool calls against a 1.2 MB SQLite database containing kernel-level eBPF traces. It identified the root cause (CPU scheduling contention), the specific code path (EngineCore coroutine), and the fix (CPU pinning) - all in under a minute.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;code&gt;nvidia-smi&lt;/code&gt; would have shown 100% GPU utilization during this entire incident. The GPU was "utilized" - it was executing the work it was given. The problem was that it wasn't being given work fast enough because the CPU thread feeding it was being preempted. That distinction - between "GPU is busy" and "GPU is being fed work efficiently" - is invisible to every standard GPU monitoring tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What made this possible
&lt;/h2&gt;

&lt;p&gt;This is not a wrapper around &lt;code&gt;nvidia-smi&lt;/code&gt;. The eBPF trace attaches uprobes directly to &lt;code&gt;libcudart.so&lt;/code&gt; (CUDA Runtime) and &lt;code&gt;libcuda.so&lt;/code&gt; (CUDA Driver), plus tracepoints on the Linux kernel scheduler (&lt;code&gt;sched_switch&lt;/code&gt;, &lt;code&gt;sched_wakeup&lt;/code&gt;), memory allocator (&lt;code&gt;mm_page_alloc&lt;/code&gt;), and I/O subsystem. Every CUDA API call is captured with nanosecond precision. Every context switch that preempted a GPU-feeding thread is recorded. The causal chain engine connects them automatically.&lt;/p&gt;

&lt;p&gt;The MCP server exposes this data through 10 tools. The AI agent decides what to query. There is no pre-aggregation layer, no dashboard, no human selecting which metrics to look at. The agent gets the raw events and builds the diagnosis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try the eBPF trace yourself
&lt;/h2&gt;

&lt;p&gt;The trace database is in the Ingero repo. The investigation works with any MCP-compatible AI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone and build&lt;/span&gt;
git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build

&lt;span class="c"&gt;# 2. With Claude Code&lt;/span&gt;
claude &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'{"mcpServers":{"ingero":{"command":"./bin/ingero","args":["mcp","--db","investigations/vllm-37343-logprobs-amplification.db"]}}}'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# 3. With Ollama (any open model)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;mcp-client-for-ollama
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5:27b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Type &lt;code&gt;/investigate&lt;/code&gt; to start the guided workflow. The AI will walk through the same investigation you just read.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern repeats
&lt;/h2&gt;

&lt;p&gt;This is not a one-off. We have traced dozens of GPU performance issues. The pattern is consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ingero.io/124x-slower-pytorch-dataloader-kernel-level/" rel="noopener noreferrer"&gt;124x slower PyTorch DataLoader&lt;/a&gt;&lt;/strong&gt; - kernel tracing revealed 191,000 context switches and 299,000 page allocations in 40 seconds. The GPU was starved because DataLoader workers were fighting for CPU cores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ingero.io/tracing-13x-pytorch-slowdown-hidden-numpy-synchronization/" rel="noopener noreferrer"&gt;13x PyTorch slowdown from hidden NumPy sync&lt;/a&gt;&lt;/strong&gt; - a &lt;code&gt;tensor.cpu().numpy()&lt;/code&gt; call in a masking function triggered B x 2 implicit &lt;code&gt;cudaStreamSynchronize&lt;/code&gt; calls per forward pass. On faster GPUs, the bottleneck got worse, not better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;GPU 97% utilized but training 3x slower&lt;/a&gt;&lt;/strong&gt; - &lt;code&gt;nvidia-smi&lt;/code&gt; reported healthy utilization while Prometheus node exporter and Fluent Bit were consuming 51.7% of available CPU time through 14,504 context switches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of these follows the same pattern: the GPU is fast, the host is the bottleneck, and standard GPU metrics cannot see it. The causal chain from host event to CUDA API call is the missing link.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for GPU debugging
&lt;/h2&gt;

&lt;p&gt;The traditional approach: alert fires, SSH into the machine, check &lt;code&gt;nvidia-smi&lt;/code&gt;, check &lt;code&gt;dmesg&lt;/code&gt;, check logs, open profiler, wait for reproduction, analyze flame graphs, correlate across tools. Hours.&lt;/p&gt;

&lt;p&gt;The MCP-native approach: point an AI agent at the kernel traces, let it query what it needs, read the diagnosis. Minutes.&lt;/p&gt;

&lt;p&gt;We are not saying the AI is smarter than a senior SRE. We are saying it has access to data the SRE cannot see (kernel scheduling decisions, per-CUDA-call latency distributions, automated causal chains) and it can query that data faster than a human can navigate dashboards.&lt;/p&gt;

&lt;p&gt;The investigation databases are open source. The agent is open source. Try it locally.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, &amp;lt;2% overhead. Apache 2.0 + GPL-2.0.&lt;/em&gt; Star us on &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt;&lt;/strong&gt; if you are seeing vLLM or CUDA runtime issues. Investigation DB: &lt;a href="https://github.com/ingero-io/ingero/tree/main/investigations" rel="noopener noreferrer"&gt;investigations/vllm-cuda-kernel-events.db&lt;/a&gt;. Original issue: &lt;a href="https://github.com/vllm-project/vllm/issues/37343" rel="noopener noreferrer"&gt;vllm-project/vllm#37343&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>ebpf</category>
      <category>mcp</category>
      <category>observability</category>
    </item>
    <item>
      <title>11-Second Time to First Token on a Healthy vLLM Server</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Tue, 21 Apr 2026 13:30:00 +0000</pubDate>
      <link>https://forem.com/ingero/11-second-time-to-first-token-on-a-healthy-vllm-server-e0c</link>
      <guid>https://forem.com/ingero/11-second-time-to-first-token-on-a-healthy-vllm-server-e0c</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A vLLM health endpoint says "ok." nvidia-smi says 95% utilization. But a user just waited 11 seconds for their first token. We reproduced a real vLLM issue on an RTX 4090 and traced every CUDA API call and Linux kernel event to find the root cause: head-of-line blocking during prefix caching. This is invisible to standard monitoring. The trace databases are available in the &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/vllm-37308-hol-blocking.db" rel="noopener noreferrer"&gt;Ingero repo&lt;/a&gt; for independent investigation. We traced a production case of vLLM latency spikes down to kernel-level scheduling contention.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Can See
&lt;/h2&gt;

&lt;p&gt;vLLM's continuous batching is one of the best things to happen to LLM serving. It lets the engine process multiple requests simultaneously, filling GPU capacity that would otherwise sit idle between sequential requests.&lt;/p&gt;

&lt;p&gt;But continuous batching has a dark side: when requests compete for GPU resources inside the same batch, one expensive request can silently starve all others. No error. No health check failure. No metric spike. Just users waiting 10x-250x longer than expected for their first token.&lt;/p&gt;

&lt;p&gt;We investigated a real vLLM issue reported in the last week (&lt;a href="https://github.com/vllm-project/vllm/issues/37308" rel="noopener noreferrer"&gt;#37308&lt;/a&gt;) to understand what happens at the kernel level during these silent latency spikes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;The investigation used the server configuration from the issue report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; vllm.entrypoints.openai.api_server &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--model&lt;/span&gt; Qwen/Qwen2.5-0.5B-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000 &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.95 &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--enable-prefix-caching&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hardware: RTX 4090 (24GB), 4 vCPUs, Ubuntu 22.04, vLLM 0.17.1.&lt;/p&gt;

&lt;p&gt;We ran &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;Ingero&lt;/a&gt; alongside each test to trace CUDA Runtime/Driver API calls and host kernel events (scheduler context switches, memory allocations) simultaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prefix Caching Head-of-Line Blocking
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Issue&lt;/strong&gt;: &lt;a href="https://github.com/vllm-project/vllm/issues/37308" rel="noopener noreferrer"&gt;vllm-project/vllm#37308&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens
&lt;/h3&gt;

&lt;p&gt;6 concurrent requests arrive within 40ms. 4 are heavy (2048-token prompts, 128-512 output tokens) and 2 are light (128-token prompts, 32-64 output tokens). All share a 32-token prefix so the prefix cache groups them together.&lt;/p&gt;
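&lt;p&gt;For reference, the workload shape can be approximated like this. Prompt sizes are word-count approximations of the token counts, the output lengths are representative values from the reported ranges, and the payload fields follow the OpenAI-style completions API - treat it as a sketch, not the exact reproduction script:&lt;/p&gt;

```python
def build_batch(prefix_tokens=32):
    """Approximate the mixed workload: 4 heavy + 2 light requests
    sharing a common prefix so vLLM's prefix cache groups them.

    Token counts are approximated by word count; 256 and 32 are
    representative output lengths from the reported 128-512 and
    32-64 ranges.
    """
    prefix = " ".join(["ctx"] * prefix_tokens)
    heavy = [{"prompt": prefix + " " + " ".join(["w"] * 2016),
              "max_tokens": 256} for _ in range(4)]
    light = [{"prompt": prefix + " " + " ".join(["w"] * 96),
              "max_tokens": 32} for _ in range(2)]
    return heavy + light
```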

&lt;p&gt;The light requests should complete in under 100ms. Instead:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;r08 (128 tok)&lt;/th&gt;
&lt;th&gt;r05 (128 tok)&lt;/th&gt;
&lt;th&gt;r07 (2048 tok)&lt;/th&gt;
&lt;th&gt;r02 (2048 tok)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,131ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,406ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,654ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,851ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;54ms&lt;/td&gt;
&lt;td&gt;129ms&lt;/td&gt;
&lt;td&gt;258ms&lt;/td&gt;
&lt;td&gt;234ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;66ms&lt;/td&gt;
&lt;td&gt;177ms&lt;/td&gt;
&lt;td&gt;175ms&lt;/td&gt;
&lt;td&gt;156ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Run 1 is catastrophic: the light requests are 11-14x over the 100ms threshold. Subsequent runs settle to 2-4x because the prefix cache warms up. But that first cold-cache batch is brutal.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the tracer shows
&lt;/h3&gt;

&lt;p&gt;3 causal chains detected. The most revealing one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[MEDIUM] cudaLaunchKernel p99=444us (6.4x p50) - 371 sched_switch events
 Timeline:
 [HOST ] 371 context switches (5.9s off-CPU)
 [CUDA ] p99=444us (6.4x p50=70us)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The per-process breakdown tells the full story:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VLLM::EngineCore&lt;/strong&gt; (the GPU scheduling loop):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;24,347 context switches, max stall &lt;strong&gt;2.5 seconds&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;40,632 cuLaunchKernel calls, avg 29us but max &lt;strong&gt;34ms&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;34,087 cudaLaunchKernel calls, avg 96us but max &lt;strong&gt;356ms&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engine core process – the single-threaded loop that decides which requests get GPU time – was descheduled for 2.5 seconds in the worst case. During that stall, the GPU kernel queue drained and the light requests had nothing submitted on their behalf.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;356ms cudaLaunchKernel spike&lt;/strong&gt; (3,700x the average) is the smoking gun. That's not the GPU being slow. That's the CPU failing to submit work to the GPU because the scheduling loop was preempted.&lt;/p&gt;
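&lt;p&gt;A gap detector over the stream of launch timestamps surfaces such spikes mechanically. A minimal sketch; the timestamps and threshold are illustrative microsecond values, not Ingero's built-in logic:&lt;/p&gt;

```python
def find_submission_stalls(launch_ts_us, threshold_us=10_000):
    """Flag gaps between consecutive kernel-launch timestamps that
    exceed the threshold. Long gaps mean the CPU thread stopped
    submitting work - the GPU-side kernels themselves may be fast.
    Returns (gap_start_us, gap_length_us) pairs."""
    return [(a, b - a)
            for a, b in zip(launch_ts_us, launch_ts_us[1:])
            if b - a > threshold_us]
```

&lt;p&gt;Applied to a launch stream with a 356ms hole, it returns exactly one stall whose length matches the spike - distinguishing "slow launch" from "no launches at all".&lt;/p&gt;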

&lt;h3&gt;
  
  
  Why nvidia-smi misses this
&lt;/h3&gt;

&lt;p&gt;nvidia-smi shows high utilization because the GPU IS working – on the heavy requests' prefills. The light requests are starving, but from the GPU's perspective there's always a kernel to run. The starvation is in the CPU-side scheduling loop, not on the GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Standard Tools Show vs What Kernel Tracing Shows
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;nvidia-smi&lt;/th&gt;
&lt;th&gt;vLLM /health&lt;/th&gt;
&lt;th&gt;vLLM metrics&lt;/th&gt;
&lt;th&gt;Kernel tracing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU utilization&lt;/td&gt;
&lt;td&gt;95%+&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;95%+ (but wrong work)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server health&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;"ok"&lt;/td&gt;
&lt;td&gt;requests_running=5&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTFT regression&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;Visible in histograms&lt;/td&gt;
&lt;td&gt;Visible + root cause&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine stall (2.5s)&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;24,347 sched_switch events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel launch drop (80%)&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;1,051 -&amp;gt; 208 ops/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory pressure&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;43,606 mm_page_alloc&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which process is blocked&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;Not visible&lt;/td&gt;
&lt;td&gt;VLLM::EngineCore PID 2438&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;GPU utilization was high because the GPU was doing work. It was just doing the wrong work&lt;/strong&gt; – processing heavy prefills or computation while light requests starved. No GPU-side metric can distinguish "GPU is busy computing my request" from "GPU is busy computing someone else's request while mine waits."&lt;/p&gt;

&lt;h2&gt;
  
  
  Implications for Production vLLM
&lt;/h2&gt;

&lt;p&gt;If you're running vLLM in production with mixed workloads (different prompt sizes, some requests with &lt;code&gt;n&lt;/code&gt; or &lt;code&gt;logprobs&lt;/code&gt; set), you're likely experiencing these silent regressions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor TTFT per-request, not just aggregate throughput.&lt;/strong&gt; Aggregate metrics hide the tail – your p99 might be 100x worse than p50 during batch contention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Be careful with &lt;code&gt;logprobs&lt;/code&gt;.&lt;/strong&gt; A single request with &lt;code&gt;n=8&lt;/code&gt; and &lt;code&gt;logprobs=20&lt;/code&gt; can block your entire server for 11+ seconds on a cold cache. Consider routing these to dedicated instances.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;First-request-after-idle is the worst case.&lt;/strong&gt; This issue showed the most extreme regression on Run 1 (cold prefix cache). If your traffic is bursty, the first batch after a quiet period will hit hardest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GPU utilization is not a proxy for request health.&lt;/strong&gt; Your dashboards might show 95% utilization while individual users experience 256x TTFT regression.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
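&lt;p&gt;Point 1 is straightforward to implement: collect a TTFT sample per request and flag individual offenders rather than averaging. A sketch (the 100ms threshold matches this post's light-request budget; your SLO will differ):&lt;/p&gt;

```python
def ttft_outliers(samples, threshold_ms=100.0):
    """samples: (request_id, ttft_ms) pairs. Returns requests over the
    per-request threshold, worst first. Aggregate p50 can look healthy
    while individual requests regress 10-250x, so check per request."""
    return sorted(((rid, t) for rid, t in samples if t > threshold_ms),
                  key=lambda x: x[1], reverse=True)
```

&lt;p&gt;Feeding it the run-1 numbers from the table above flags all six requests, with the 2048-token prompts worst.&lt;/p&gt;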

&lt;h2&gt;
  
  
  Investigate It Yourself
&lt;/h2&gt;

&lt;p&gt;The trace database from this investigation is in the Ingero repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build

&lt;span class="c"&gt;# View the causal chains&lt;/span&gt;
./bin/ingero explain &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/vllm-37308-hol-blocking.db &lt;span class="nt"&gt;--since&lt;/span&gt; 5m

&lt;span class="c"&gt;# Per-process breakdown&lt;/span&gt;
./bin/ingero explain &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/vllm-37308-hol-blocking.db &lt;span class="nt"&gt;--per-process&lt;/span&gt; &lt;span class="nt"&gt;--since&lt;/span&gt; 5m

&lt;span class="c"&gt;# Connect your AI assistant for interactive investigation&lt;/span&gt;
./bin/ingero mcp &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/vllm-37308-hol-blocking.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Investigate with AI (recommended)
&lt;/h2&gt;

&lt;p&gt;You can point any MCP-compatible AI client at the trace database and ask questions directly. No code required.&lt;/p&gt;

&lt;p&gt;First, create the MCP config file at &lt;code&gt;/tmp/ingero-mcp-vllm.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ingero"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./bin/ingero"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"investigations/vllm-37308-hol-blocking.db"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Ollama (local &amp;amp; free: no data sent outside):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install ollmcp (MCP client for Ollama)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ollmcp

&lt;span class="c"&gt;# Investigate with a local model (no data leaves your machine)&lt;/span&gt;
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5:27b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp-vllm.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Claude Code (data is sent to Anthropic's remote models):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; /tmp/ingero-mcp-vllm.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then type &lt;code&gt;/investigate&lt;/code&gt; and let the model explore. Follow up with questions like "what was the root cause?" or "which kernel calls had the highest latency spikes?"&lt;/p&gt;

&lt;p&gt;Ask your AI assistant: "What caused the 80% throughput drop?" or "Which process had the most context switches?" The trace data has the full story.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/vllm-37308-hol-blocking.db" rel="noopener noreferrer"&gt;investigation database&lt;/a&gt; from this post is available for download.&lt;/em&gt; &lt;em&gt;Investigations performed on TensorDock RTX 4090 (24GB), Ubuntu 22.04, vLLM 0.17.1, Qwen/Qwen2.5-0.5B-Instruct with prefix caching enabled.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub (give us a star!):&lt;/strong&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;github.com/ingero-io/ingero&lt;/a&gt;. No NVIDIA SDK, no code changes, production-safe by design.&lt;/p&gt;

&lt;p&gt;If you are seeing vLLM issues in your own workloads, we'd love to take a look. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Drop an issue on GitHub&lt;/a&gt;&lt;/strong&gt; and we will gladly dive into it together.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ingero is free &amp;amp; open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, &amp;lt;2% overhead.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ingero.io/debugging-vllm-latency-minimax-ollama-mcp/" rel="noopener noreferrer"&gt;debugging vLLM latency with eBPF and MCP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;GPU showing 97% utilization while training runs 3x slower&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/gpu-problem-1-why-your-pytorch-training-runs-out-of-gpu-memory-and-how-to-actually-debug-it/" rel="noopener noreferrer"&gt;debugging PyTorch GPU out-of-memory errors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>vllm</category>
      <category>observability</category>
      <category>ebpf</category>
      <category>mcp</category>
    </item>
    <item>
      <title>What Happens When an AI Agent Gets Kernel-Level GPU Traces</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Thu, 16 Apr 2026 16:48:26 +0000</pubDate>
      <link>https://forem.com/ingero/what-happens-when-an-ai-agent-gets-kernel-level-gpu-traces-a2d</link>
      <guid>https://forem.com/ingero/what-happens-when-an-ai-agent-gets-kernel-level-gpu-traces-a2d</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A GPU trace of a PyTorch DataLoader bottleneck (114x slower than direct indexing) was loaded into an MCP server and handed to Claude for investigation. The AI identified the root cause in under 30 seconds: 3,676 CPU context switches starving the GPU of data. Below is the full investigation session, with the trace database available for independent reproduction - a real case of Claude + MCP GPU debugging, from raw eBPF traces to root-cause identification.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbk9hqtiq9apjae448d8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbk9hqtiq9apjae448d8.png" alt="Ai-investigate GPU and kernel events" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;GPU performance debugging usually goes like this: training is slow, nvidia-smi shows nothing useful, print statements get added, hours pass. What happens when raw trace data gets handed to an AI assistant with the question “what went wrong?”&lt;/p&gt;

&lt;p&gt;That’s what the MCP server enables. The tracer captures CUDA API calls and Linux kernel events, stores them in a SQLite database, then exposes them to AI assistants via the Model Context Protocol (MCP). The AI can query the data, read causal chains, inspect per-process breakdowns, and run custom SQL through natural conversation.&lt;/p&gt;

&lt;p&gt;We tested this on a real investigation: a &lt;a href="https://github.com/pytorch/pytorch/issues/154318" rel="noopener noreferrer"&gt;PyTorch DataLoader bottleneck&lt;/a&gt; where DataLoader was 114x slower than direct tensor indexing on an RTX 4090. The trace database is &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/pytorch-dataloader-starvation.db" rel="noopener noreferrer"&gt;in the Ingero repo&lt;/a&gt; for independent reproduction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup (2 minutes)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Get the trace database
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The investigation database is at &lt;code&gt;investigations/pytorch-dataloader-starvation.db&lt;/code&gt; (764KB).&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Start the MCP server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/ingero mcp &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/pytorch-dataloader-starvation.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This starts the MCP server in stdio mode. To connect Claude Desktop, add this to your MCP config (&lt;code&gt;~/.config/claude/claude_desktop_config.json&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ingero"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/ingero"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/investigations/pytorch-dataloader-starvation.db"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Claude Code (CLI), the MCP server is automatically available when running from the repo directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Investigation Session
&lt;/h2&gt;

&lt;p&gt;Here’s the actual conversation. We started with a simple question and let the AI dig deeper.&lt;/p&gt;

&lt;h3&gt;
  
  
  “What happened in this trace?”
&lt;/h3&gt;

&lt;p&gt;The AI calls &lt;code&gt;get_trace_stats&lt;/code&gt; and gets back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3,807 events over 77 seconds
CUDA Runtime: cudaMemcpyAsync (14 calls, avg 31.5ms, max 65ms)
              cudaLaunchKernel (16 calls, avg 1.7ms, max 25.8ms)
              cudaStreamSync (21 calls, avg 31us)
              cudaDeviceSync (3 calls, avg 17.2ms)
Host: sched_switch (3,676 calls, avg 401us, max 23.9ms)
      process_fork (11 calls)
      process_exit (5 calls)
Driver: cuLaunchKernel (41 calls, avg 40us)
        cuMemAlloc (10 calls, avg 307us)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Right away, two things stand out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cudaMemcpyAsync averages 31.5ms&lt;/strong&gt; per call. That’s host-to-device data transfer. It should be microseconds for the batch sizes involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3,676 sched_switch events&lt;/strong&gt; in 77 seconds. That’s the Linux scheduler preempting processes: something is fighting for CPU.&lt;/p&gt;
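&lt;p&gt;A quick sanity check on those numbers (plain arithmetic, nothing Ingero-specific):&lt;/p&gt;

```python
# Back-of-envelope check on the scheduling pressure in the trace.
total_switches = 3_676   # sched_switch events captured
window_s = 77            # trace duration in seconds

rate = total_switches / window_s
print(f"{rate:.1f} context switches per second")    # ~47/s
print(f"{1000 / rate:.1f} ms between preemptions")  # roughly 21 ms
```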

&lt;h3&gt;
  
  
  “What are the causal chains?”
&lt;/h3&gt;

&lt;p&gt;The AI calls &lt;code&gt;get_causal_chains&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;40 causal chains detected (all MEDIUM severity)

Recurring pattern:
  cudaLaunchKernel p99=25.8ms (352x p50=73us) - N sched_switch events
  Context switches escalate from 34 -&amp;gt; 106 -&amp;gt; 268 -&amp;gt; 308 -&amp;gt; 319 over time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chains show a clear escalation: context switches grow from 34 to 319 over the trace window. The DataLoader workers are creating more and more scheduling pressure as the benchmark runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Which processes are involved?”
&lt;/h3&gt;

&lt;p&gt;The AI calls &lt;code&gt;run_sql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;process_names&lt;/span&gt; &lt;span class="n"&gt;pn&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3          3,799 events
pt_data_worker     (visible in fork/exit events)
pt_data_pin        (visible in fork/exit events)
systemd-journal       17 events
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main &lt;code&gt;python3&lt;/code&gt; process owns almost all events. But the DataLoader workers (&lt;code&gt;pt_data_worker&lt;/code&gt;, &lt;code&gt;pt_data_pin&lt;/code&gt;) are visible through fork and scheduling events: they’re being created, fighting for CPU, and getting descheduled.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Why is cudaMemcpyAsync so slow?”
&lt;/h3&gt;

&lt;p&gt;The AI connects the dots:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The 14 cudaMemcpyAsync calls average 31.5ms (max 65ms). These are host-to-device data transfers that should complete in microseconds for batch sizes under 100MB.&lt;/p&gt;

&lt;p&gt;The 3,676 sched_switch events show the main process being preempted 47 times per second. Each preemption delays the data pipeline. The DataLoader workers (pt_data_worker) are competing with the main process for CPU: on a 4-vCPU machine, 4 workers plus the main process plus system processes means constant context switching.&lt;/p&gt;

&lt;p&gt;The GPU is receiving cudaMemcpyAsync calls, but the source data isn’t ready in host memory because the DataLoader worker that was supposed to prepare it got descheduled. The GPU waits.&lt;/p&gt;

&lt;p&gt;Root cause: CPU scheduling contention between DataLoader workers and the main training process. The GPU is fast; it’s the host that can’t keep up.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  “What would you recommend?”
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Reduce num_workers to 2 (on a 4-core machine, 4 workers plus the main process over-subscribe the CPU)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For in-memory datasets, skip DataLoader entirely: direct tensor indexing is 114x faster because it avoids the multiprocessing overhead&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If DataLoader is required, use persistent_workers=True to avoid fork overhead on each epoch&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
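&lt;p&gt;The first recommendation boils down to a simple heuristic. A sketch of it in Python (the helper and the reserve of 2 cores are our illustration, not Ingero or PyTorch API; &lt;code&gt;num_workers&lt;/code&gt;, &lt;code&gt;persistent_workers&lt;/code&gt;, and &lt;code&gt;pin_memory&lt;/code&gt; are real DataLoader parameters):&lt;/p&gt;

```python
def suggest_num_workers(total_cores: int, reserved: int = 2) -> int:
    # Leave cores free for the training loop and system processes;
    # over-subscription is what produced the 3,676 sched_switch events.
    return max(1, total_cores - reserved)

# Hypothetical kwargs for torch.utils.data.DataLoader (parameter names
# are real DataLoader arguments; torch is deliberately not imported here):
loader_kwargs = dict(
    num_workers=suggest_num_workers(4),  # 2 on the traced 4-vCPU machine
    persistent_workers=True,             # avoid fork overhead each epoch
    pin_memory=True,                     # faster host-to-device copies
)
print(loader_kwargs)
```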

&lt;p&gt;Total investigation time: &lt;strong&gt;under 30 seconds&lt;/strong&gt; from first question to root cause + fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the AI Has Access To
&lt;/h2&gt;

&lt;p&gt;The MCP server exposes 7 tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_check&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;System diagnostics (kernel, GPU, CUDA, driver)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_trace_stats&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CUDA + host statistics (p50/p95/p99 per operation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_causal_chains&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Automated root cause chains with severity ranking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_stacks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Resolved call stacks (symbols + Python source lines)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;run_demo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run synthetic demo scenarios&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;get_test_report&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GPU integration test results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;run_sql&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read-only SQL against the trace database&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;run_sql&lt;/code&gt; tool is the most powerful: the AI can write arbitrary read-only queries against the &lt;code&gt;events&lt;/code&gt; table, joining with process names, ops, and sources.&lt;/p&gt;
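&lt;p&gt;Under the hood this is ordinary SQLite, so the same queries run outside MCP too. A minimal sketch against a stand-in database (the &lt;code&gt;duration_us&lt;/code&gt; column name is an assumption; the table name follows the query shown above):&lt;/p&gt;

```python
import sqlite3

# Tiny stand-in so the query is runnable anywhere; against the real
# trace you would connect to investigations/pytorch-dataloader-starvation.db.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (pid INTEGER, op TEXT, duration_us REAL)")
db.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(100, "cudaMemcpyAsync", 31_500.0),
     (100, "cudaMemcpyAsync", 65_000.0),
     (100, "cudaLaunchKernel", 73.0)],
)

# "Show me the slowest host-to-device copies" as plain SQL:
rows = db.execute(
    "SELECT op, duration_us FROM events "
    "WHERE op = 'cudaMemcpyAsync' ORDER BY duration_us DESC"
).fetchall()
print(rows)
```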

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The trace database from this investigation is in the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build

&lt;span class="c"&gt;# Quick analysis (no MCP needed)&lt;/span&gt;
./bin/ingero explain &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/pytorch-dataloader-starvation.db &lt;span class="nt"&gt;--since&lt;/span&gt; 5m

&lt;span class="c"&gt;# Interactive AI investigation via MCP&lt;/span&gt;
./bin/ingero mcp &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/pytorch-dataloader-starvation.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  With Claude Desktop
&lt;/h3&gt;

&lt;p&gt;Add to &lt;code&gt;~/.config/claude/claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ingero"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./bin/ingero"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"investigations/pytorch-dataloader-starvation.db"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask Claude: “What caused the GPU performance problem in this trace?”&lt;/p&gt;

&lt;h3&gt;
  
  
  With Any MCP Client
&lt;/h3&gt;

&lt;p&gt;The MCP server works with any MCP-compatible client: Cursor, Windsurf, or custom implementations. The stdio transport is universal.&lt;/p&gt;
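&lt;p&gt;“Universal” here means newline-delimited JSON-RPC 2.0 over stdin/stdout. A sketch of the three messages any MCP client sends (method names and the &lt;code&gt;initialize&lt;/code&gt; shape follow the MCP specification; only &lt;code&gt;get_trace_stats&lt;/code&gt; is Ingero-specific):&lt;/p&gt;

```python
import json

# MCP messages are JSON-RPC 2.0 objects, one per line, over stdio.
initialize = {
    "jsonrpc": "2.0", "id": 1, "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",   # a published MCP revision
        "capabilities": {},
        "clientInfo": {"name": "demo-client", "version": "0.1"},
    },
}
# After the handshake, the client discovers and calls tools:
list_tools = {"jsonrpc": "2.0", "id": 2, "method": "tools/list"}
call_stats = {
    "jsonrpc": "2.0", "id": 3, "method": "tools/call",
    "params": {"name": "get_trace_stats", "arguments": {}},
}

for msg in (initialize, list_tools, call_stats):
    print(json.dumps(msg))   # each line goes to the server's stdin
```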

&lt;h3&gt;
  
  
  Investigate with AI (recommended)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# With Ollama (local, free)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ollmcp
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5:27b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp-dataloader.json

&lt;span class="c"&gt;# With Claude Code&lt;/span&gt;
claude &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; /tmp/ingero-mcp-dataloader.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Type &lt;code&gt;/investigate&lt;/code&gt; and let the model explore.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Traditional GPU debugging is manual: run &lt;code&gt;nvidia-smi&lt;/code&gt;, add print statements, read logs, guess. The AI-assisted approach is different:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The tracer captures everything&lt;/strong&gt; at the kernel level: CUDA API calls, host scheduling, memory events, with zero code changes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The trace database is self-contained&lt;/strong&gt;: no need to reproduce the issue, no need for the original hardware&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The AI asks the right follow-up questions&lt;/strong&gt;: it sees the context switches, connects them to CUDA latency, and identifies the root cause pattern&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This turns GPU debugging from “spend hours staring at logs” into “ask a question, get an answer.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigation DB&lt;/strong&gt;: &lt;a href="https://github.com/ingero-io/ingero/tree/main/investigations" rel="noopener noreferrer"&gt;investigations/pytorch-dataloader-starvation.db&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original issue&lt;/strong&gt;: &lt;a href="https://github.com/pytorch/pytorch/issues/154318" rel="noopener noreferrer"&gt;pytorch/pytorch#154318&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;github.com/ingero-io/ingero&lt;/a&gt;. No NVIDIA SDK, no code changes, production-safe by design.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ingero is free &amp;amp; open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, &amp;lt;2% overhead.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ingero.io/mcp-observability-interface-ai-agents-kernel-tracepoints/" rel="noopener noreferrer"&gt;MCP as an observability interface for kernel tracepoints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/124x-slower-pytorch-dataloader-kernel-level/" rel="noopener noreferrer"&gt;124x slower PyTorch DataLoader traced at kernel level&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;GPU showing 97% utilization while training runs 3x slower&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>gpu</category>
      <category>ebpf</category>
      <category>observability</category>
      <category>gpuobservability</category>
    </item>
    <item>
      <title>MCP as Observability Interface: Connecting AI Agents to Kernel Tracepoints</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Thu, 16 Apr 2026 07:35:33 +0000</pubDate>
      <link>https://forem.com/ingero/mcp-as-observability-interface-connecting-ai-agents-to-kernel-tracepoints-4gaa</link>
      <guid>https://forem.com/ingero/mcp-as-observability-interface-connecting-ai-agents-to-kernel-tracepoints-4gaa</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;MCP is becoming the interface between AI agents and infrastructure data. Datadog shipped an MCP Server connecting dashboards to AI agents. Qualys flagged MCP servers as the new shadow IT risk. We think both are right, and that the architecture should go further: the MCP server should not wrap an existing observability platform. It should BE the observability layer. This post explores how MCP can serve as a direct observability interface to kernel tracepoints, bypassing traditional metric pipelines entirely.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3f20seiz58vsbd8004d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3f20seiz58vsbd8004d.png" alt="MCP for Kernel and GPU Events" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three signals in one week
&lt;/h2&gt;

&lt;p&gt;Three things happened in the same week of March 2026 that signal where observability is headed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.datadoghq.com/bits_ai/mcp_server/" rel="noopener noreferrer"&gt;Datadog shipped an MCP Server&lt;/a&gt;&lt;/strong&gt; &lt;br&gt;
Their implementation connects real-time observability data to AI agents for automated detection and remediation. An AI agent can now query Datadog dashboards, pull metrics, and trigger responses through the Model Context Protocol. This is a big company validating a small protocol.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.qualys.com/product-tech/2026/03/19/mcp-servers-shadow-it-ai-qualys-totalai-2026" rel="noopener noreferrer"&gt;Qualys published a security analysis of MCP&lt;br&gt;
servers&lt;/a&gt;.&lt;/strong&gt; &lt;br&gt;
Their TotalAI team called MCP servers “the new shadow IT for AI” and&lt;br&gt;
found that over 53% of servers rely on static secrets for&lt;br&gt;
authentication. They recommended adding observability to MCP servers:&lt;br&gt;
logging capability discovery events, monitoring invocation patterns,&lt;br&gt;
alerting on anomalies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Native Now covered eBPF for Kubernetes network observability.&lt;/strong&gt; Microsoft Retina deploys as a DaemonSet, captures network telemetry via eBPF without application changes, and provides kernel-level drop reasons. The article draws a clear line between “monitoring” (predefined questions) and “observability” (asking questions nobody planned for).&lt;/p&gt;

&lt;p&gt;The thread connecting all three: AI agents need direct access to infrastructure telemetry, and MCP is becoming the way they get it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Two approaches to MCP observability
&lt;/h2&gt;

&lt;p&gt;There are two ways to connect observability data to AI agents via MCP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 1: Wrap existing platforms.&lt;/strong&gt; This is Datadog’s strategy: take existing metrics, logs, and traces, already collected and aggregated, and expose them through MCP tools. The AI agent queries the dashboard API, gets pre-processed data, and acts on it. This makes sense for teams with a mature observability stack that want to add AI-powered automation on top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 2: Build MCP-native observability.&lt;/strong&gt; This is what we did with the tracer. Instead of wrapping an existing platform, we built an eBPF agent that traces CUDA Runtime and Driver APIs via uprobes, stores the results in SQLite, and exposes everything through 7 MCP tools. The MCP interface is not an adapter layer; it is the primary interface.&lt;/p&gt;

&lt;p&gt;Neither approach is wrong. They solve different problems.&lt;/p&gt;

&lt;p&gt;The wrapper approach works well for aggregate analysis: “What was the p99 latency for service X over the last hour?” The data is already summarized, indexed, and queryable.&lt;/p&gt;

&lt;p&gt;The native approach works better for root-cause investigation: “Why did this specific GPU request take 14.5x longer than expected?” That requires raw kernel events, CUDA call stacks, and causal chains – not summaries. The AI agent needs to drill down, not roll up.&lt;/p&gt;
&lt;h2&gt;
  
  
  What MCP-native observability looks like in practice
&lt;/h2&gt;

&lt;p&gt;Here is a concrete example. We traced a vLLM TTFT regression where the first token took 14.5x longer than baseline. The trace database captured every CUDA API call, every kernel context switch, every memory allocation.&lt;/p&gt;

&lt;p&gt;When Claude connects to the MCP server and loads this database, it can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;get_trace_stats&lt;/strong&gt; – See the full trace summary: 12,847 CUDA
events, 4 causal chains, total GPU time&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;get_causal_chains&lt;/strong&gt; – Read the causal chains that explain why
latency spiked, in plain English&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;run_sql&lt;/strong&gt; – Run custom queries against the raw event data (“show
me all cudaMemcpyAsync calls over 100ms”)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;get_stacks&lt;/strong&gt; – Inspect call stacks for any flagged event&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Claude identified the root cause in under 30 seconds: logprobs computation was blocking the decode loop, creating a 256x slowdown on the critical path. That root cause was not visible in any aggregate metric. It only appeared in the raw causal chain between specific CUDA API calls.&lt;/p&gt;

&lt;p&gt;A dashboard MCP adapter could not have found this. The data granularity does not survive aggregation.&lt;/p&gt;
&lt;h2&gt;
  
  
  The security angle matters too
&lt;/h2&gt;

&lt;p&gt;Qualys raised valid concerns about MCP server security. Their finding that 53% of servers rely on static secrets is alarming. Their recommendation to log discovery and invocation events is exactly right.&lt;/p&gt;

&lt;p&gt;For MCP servers that touch GPU infrastructure, the attack surface is different. An MCP server with access to CUDA traces can expose timing information, memory layouts, and model architecture details. The security model needs to account for this.&lt;/p&gt;

&lt;p&gt;In Ingero, the MCP server runs inside the same process as the eBPF tracing pipeline. There is no separate data layer between the AI agent and the kernel-level telemetry: the MCP tools query the same event store that the eBPF probes write to. This is why Ingero can answer causal questions in real time: the AI agent has direct access to raw kernel and CUDA events, not a pre-aggregated summary.&lt;/p&gt;
&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The project is open source. The &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/pytorch-dataloader-starvation.db" rel="noopener noreferrer"&gt;investigation database&lt;/a&gt; from this post is available for download. Claude (or any MCP client) can connect to it and run an investigation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build
./bin/ingero mcp &lt;span class="nt"&gt;--db&lt;/span&gt; investigations/pytorch-dataloader-starvation.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Investigate with AI (recommended)
&lt;/h3&gt;

&lt;p&gt;You can point any MCP-compatible AI client at the trace database and ask questions directly. No code required.&lt;/p&gt;

&lt;p&gt;First, create the MCP config file at &lt;code&gt;/tmp/ingero-mcp-dataloader.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ingero"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./bin/ingero"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"investigations/pytorch-dataloader-starvation.db"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Ollama (local, free):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install ollmcp (MCP client for Ollama)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ollmcp

&lt;span class="c"&gt;# Investigate with a local model (no data leaves your machine)&lt;/span&gt;
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5:27b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp-dataloader.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Claude Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; /tmp/ingero-mcp-dataloader.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then type &lt;code&gt;/investigate&lt;/code&gt; and let the model explore. Follow up with questions like “what was the root cause?” or “which processes were competing for CPU time?”&lt;/p&gt;

&lt;p&gt;The MCP server exposes 7 tools. Claude will figure out the rest.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ingero is free &amp;amp; open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, &amp;lt;2% overhead.&lt;/em&gt; Give us a star at &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ingero.io/ai-agent-kernel-level-gpu-traces/" rel="noopener noreferrer"&gt;AI agent investigation of kernel-level GPU traces&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/gpu-incident-response-in-60-seconds-an-sres-guide-to-ebpf-based-gpu-observability/" rel="noopener noreferrer"&gt;GPU incident response in 60 seconds with eBPF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/tracing-torch-cuda-empty-cache-rtx-4090/" rel="noopener noreferrer"&gt;tracing torch.cuda.empty_cache() on an RTX 4090&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ebpf</category>
      <category>mcp</category>
      <category>gpuobservability</category>
    </item>
    <item>
      <title>One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:18:24 +0000</pubDate>
      <link>https://forem.com/ingero/one-query-four-gpus-tracing-a-distributed-training-stall-across-nodes-2jbd</link>
      <guid>https://forem.com/ingero/one-query-four-gpus-tracing-a-distributed-training-stall-across-nodes-2jbd</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF – no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The problem we kept hitting
&lt;/h2&gt;

&lt;p&gt;We’ve been building &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;Ingero&lt;/a&gt; – an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well.&lt;/p&gt;

&lt;p&gt;But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? &lt;code&gt;nvidia-smi&lt;/code&gt; on each machine reports healthy utilization. &lt;code&gt;dstat&lt;/code&gt; shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening.&lt;/p&gt;

&lt;p&gt;We wanted cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works?&lt;/p&gt;

&lt;h2&gt;
  
  
  What we shipped in v0.9.1
&lt;/h2&gt;

&lt;p&gt;Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Node identity
&lt;/h3&gt;

&lt;p&gt;Every event now carries a node tag. The agent stamps each event with a name from a &lt;code&gt;--node&lt;/code&gt; flag, an &lt;code&gt;ingero.yaml&lt;/code&gt; config value, or the hostname as fallback:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ingero trace &lt;span class="nt"&gt;--node&lt;/span&gt; gpu-node-01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Event IDs become node-namespaced (&lt;code&gt;gpu-node-01:4821&lt;/code&gt;) so databases from different nodes can merge without collisions. For &lt;code&gt;torchrun&lt;/code&gt; workloads, rank and world size are auto-detected from environment variables (&lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;LOCAL_RANK&lt;/code&gt;, &lt;code&gt;WORLD_SIZE&lt;/code&gt;) – no extra configuration needed.&lt;/p&gt;
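&lt;p&gt;The resolution order and the ID scheme are simple enough to sketch. Here is an illustrative Python model (Ingero itself is Go, and these function names are ours, not the agent’s):&lt;/p&gt;

```python
import os
import socket

def resolve_node_name(flag=None, config_value=None):
    """Precedence: --node flag, then the ingero.yaml value,
    then the machine hostname as a fallback."""
    return flag or config_value or socket.gethostname()

def namespace_event_id(node, event_id):
    """Prefix event IDs with the node name so databases from
    different nodes can merge without collisions."""
    return f"{node}:{event_id}"

def detect_torchrun_identity(env=os.environ):
    """torchrun exports RANK / LOCAL_RANK / WORLD_SIZE; pick them
    up automatically when present, no extra configuration."""
    if "RANK" not in env:
        return None
    return {
        "rank": int(env["RANK"]),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }

node = resolve_node_name(flag="gpu-node-01")
print(namespace_event_id(node, 4821))  # gpu-node-01:4821
```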

&lt;h3&gt;
  
  
  2. Fleet fan-out queries
&lt;/h3&gt;

&lt;p&gt;Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided). The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a &lt;code&gt;node&lt;/code&gt; column prepended. For production clusters, the client supports mTLS – &lt;code&gt;--ca-cert&lt;/code&gt;, &lt;code&gt;--client-cert&lt;/code&gt;, &lt;code&gt;--client-key&lt;/code&gt; – so both sides authenticate. Plain HTTP is available via &lt;code&gt;--no-tls&lt;/code&gt; but requires an explicit opt-in, and even then it’s intended for trusted VPC networks only.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--nodes&lt;/code&gt; flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into &lt;code&gt;ingero.yaml&lt;/code&gt; once and every command picks it up automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;fleet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gpu-node-01:8080&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gpu-node-02:8080&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gpu-node-03:8080&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gpu-node-04:8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A full example config is in &lt;a href="https://github.com/ingero-io/ingero/blob/main/configs/ingero.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;configs/ingero.yaml&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
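&lt;p&gt;The fan-out pattern itself is small. A minimal Python sketch of the client side (the real client is Go; &lt;code&gt;fetch&lt;/code&gt; here is a stand-in for the HTTPS call to one node’s dashboard API):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(nodes, sql, fetch):
    """Send the same query to every node in parallel. `fetch` is a
    callable (node, sql) -> list of row dicts. Results come back
    merged with a `node` column prepended; an unreachable node
    becomes a warning instead of failing the whole query."""
    rows, warnings = [], []

    def one(node):
        try:
            return node, fetch(node, sql), None
        except Exception as exc:  # partial failure is first-class
            return node, [], f"{node}: {exc}"

    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        for node, result, warn in pool.map(one, nodes):
            if warn:
                warnings.append(warn)
            rows.extend({"node": node, **r} for r in result)
    return rows, warnings
```

&lt;p&gt;No query planner, no coordinator: concurrent requests, local concatenation, and a warning list for whatever didn’t answer.&lt;/p&gt;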

&lt;p&gt;Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \
    "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us
     FROM events GROUP BY node, source"

node              source  cnt    avg_us
----------------  ------  -----  ------
gpu-node-01       4       11009  5.2
gpu-node-01       3       847    18400  # ← 9x higher than peers
gpu-node-02       4       10892  5.1
gpu-node-02       3       412    2100
gpu-node-03       4       10847  5.3
gpu-node-03       3       398    1900
gpu-node-04       4       10901  5.0
gpu-node-04       3       421    2200

  8 rows from 4 node(s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms. One more command to see the causal chains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080

FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s)

[HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O
  Root cause: 847 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs

[MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O
  Root cause: 855 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both chains are on &lt;code&gt;gpu-node-01&lt;/code&gt;. The other three nodes have zero issues. The root cause: CPU contention from block I/O – checkpoint writes preempting the training process.&lt;/p&gt;

&lt;p&gt;Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.”&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Offline merge and Perfetto export
&lt;/h3&gt;

&lt;p&gt;Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints – there are real reasons the network path isn’t always available.&lt;/p&gt;

&lt;p&gt;For those cases, &lt;code&gt;ingero merge&lt;/code&gt; combines SQLite databases from each node into a single queryable file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Collect traces from each node&lt;/span&gt;
scp gpu-node-01:~/.ingero/ingero.db node-01.db
scp gpu-node-02:~/.ingero/ingero.db node-02.db

&lt;span class="c"&gt;# 2. Merge and analyze&lt;/span&gt;
ingero merge node-01.db node-02.db &lt;span class="nt"&gt;-o&lt;/span&gt; cluster.db
ingero explain &lt;span class="nt"&gt;-d&lt;/span&gt; cluster.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with &lt;code&gt;--force-node&lt;/code&gt;.&lt;/p&gt;
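&lt;p&gt;The merge is ordinary SQLite work. A simplified Python sketch with an illustrative two-table schema (the real database has more columns); stack deduplication falls out of an &lt;code&gt;INSERT OR IGNORE&lt;/code&gt; on the hash primary key, and node-namespaced event IDs make the same trick safe for events:&lt;/p&gt;

```python
import sqlite3

def merge_dbs(paths, out_path):
    """Merge per-node trace databases into one queryable file.
    Assumes event IDs are already node-namespaced (node:id)."""
    out = sqlite3.connect(out_path)
    out.execute("CREATE TABLE IF NOT EXISTS events "
                "(id TEXT PRIMARY KEY, node TEXT, data TEXT)")
    out.execute("CREATE TABLE IF NOT EXISTS stacks "
                "(hash TEXT PRIMARY KEY, frames TEXT)")
    for path in paths:
        src = sqlite3.connect(path)
        out.executemany("INSERT OR IGNORE INTO events VALUES (?, ?, ?)",
                        src.execute("SELECT id, node, data FROM events"))
        # identical stacks from different nodes share a hash,
        # so OR IGNORE deduplicates them
        out.executemany("INSERT OR IGNORE INTO stacks VALUES (?, ?)",
                        src.execute("SELECT hash, frames FROM stacks"))
        src.close()
    out.commit()
    return out
```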

&lt;p&gt;For visual timeline analysis, &lt;code&gt;ingero export --format perfetto&lt;/code&gt; produces a Chrome Trace Event Format JSON that opens in &lt;a href="https://ui.perfetto.dev" rel="noopener noreferrer"&gt;ui.perfetto.dev&lt;/a&gt;. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline.&lt;/p&gt;
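&lt;p&gt;The Chrome Trace Event Format that Perfetto reads is plain JSON, so the export is mechanical. A minimal sketch, assuming merged events carry &lt;code&gt;node&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, start, and duration fields: one complete event (&lt;code&gt;"ph": "X"&lt;/code&gt;) per operation, plus a &lt;code&gt;process_name&lt;/code&gt; metadata record per node so each node gets its own process track:&lt;/p&gt;

```python
import json

def to_chrome_trace(events):
    """Convert merged trace rows to Chrome Trace Event Format JSON.
    Timestamps and durations are in microseconds, per the format."""
    pids, trace = {}, []
    for ev in events:
        pid = pids.setdefault(ev["node"], len(pids) + 1)
        trace.append({"name": ev["name"], "ph": "X",  # complete event
                      "ts": ev["ts_us"], "dur": ev["dur_us"],
                      "pid": pid, "tid": 1})
    # one named process track per node
    for node, pid in pids.items():
        trace.append({"name": "process_name", "ph": "M",
                      "pid": pid, "args": {"name": node}})
    return json.dumps({"traceEvents": trace})
```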

&lt;h2&gt;
  
  
  Why we built it this way
&lt;/h2&gt;

&lt;p&gt;The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb – the well-trodden path.&lt;/p&gt;

&lt;p&gt;We deliberately avoided that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No new infrastructure.&lt;/strong&gt; Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure – the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client-side fan-out is simple and sufficient.&lt;/strong&gt; The CLI sends concurrent HTTP requests, collects results, and merges them locally. A &lt;code&gt;sync.WaitGroup&lt;/code&gt;, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partial failure is first-class.&lt;/strong&gt; If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble – and knowing which nodes failed is diagnostic information in itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clock skew is measured, not ignored.&lt;/strong&gt; eBPF timestamps come from &lt;code&gt;bpf_ktime_get_ns()&lt;/code&gt; (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query – 3 samples per node, median filter. On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query.&lt;/p&gt;
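&lt;p&gt;The offset estimator is the textbook NTP calculation. A sketch of the idea, where &lt;code&gt;probe&lt;/code&gt; is a hypothetical callable standing in for one query round-trip that returns the local send time, the remote node’s clock reading, and the local receive time:&lt;/p&gt;

```python
from statistics import median

def estimate_offset(probe, samples=3):
    """NTP-style clock offset against one node.
    For each sample: offset = ((t1 - t0) + (t1 - t3)) / 2, where
    t0 = local send time, t1 = remote clock reading, t3 = local
    receive time. A median over a few samples rejects outliers."""
    offsets = []
    for _ in range(samples):
        t0, t1, t3 = probe()
        offsets.append(((t1 - t0) + (t1 - t3)) / 2.0)
    return median(offsets)
```

&lt;p&gt;With a symmetric round trip the network delay cancels out of the two differences, which is why sub-millisecond LAN RTTs give comfortably sub-10ms precision.&lt;/p&gt;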

&lt;p&gt;&lt;strong&gt;Offline merge covers air-gapped environments.&lt;/strong&gt; Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate. The merge path also serves as a permanent record of the cluster state at investigation time.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP: AI-driven fleet investigation
&lt;/h2&gt;

&lt;p&gt;The fleet is also accessible through Ingero’s MCP server via the &lt;code&gt;query_fleet&lt;/code&gt; tool. Here’s what the raw tool output looks like for a &lt;code&gt;chains&lt;/code&gt; query across the same 4-node cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;query_fleet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chains&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Fleet&lt;/span&gt; &lt;span class="n"&gt;Chains&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="nf"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;HIGH&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;cuLaunchKernel&lt;/span&gt; &lt;span class="n"&gt;p99&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;843&lt;/span&gt;&lt;span class="nf"&gt;us &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;63.9&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="n"&gt;p50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;847&lt;/span&gt; &lt;span class="n"&gt;sched_switch&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;heavy&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;O&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MEDIUM&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;cuMemAlloc&lt;/span&gt; &lt;span class="n"&gt;p99&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;932&lt;/span&gt;&lt;span class="nf"&gt;us &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="n"&gt;p50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;855&lt;/span&gt; &lt;span class="n"&gt;sched_switch&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;heavy&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;O&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s the complete response – an AI assistant gets this back from one tool call, no SSH access to each node, no manual SQL. The tool supports four actions: &lt;code&gt;chains&lt;/code&gt; (causal analysis), &lt;code&gt;sql&lt;/code&gt; (arbitrary queries), &lt;code&gt;ops&lt;/code&gt; (operation breakdown per node), and &lt;code&gt;overview&lt;/code&gt; (event counts). Clock skew warnings are prepended automatically when detected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this stands
&lt;/h2&gt;

&lt;p&gt;v0.9.1 is a first step toward cluster-level tracing, not the destination.&lt;/p&gt;

&lt;p&gt;What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export – these are diagnostic tools for after the fact.&lt;/p&gt;

&lt;p&gt;We’re actively working on cross-node correlation and straggler detection – more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor.&lt;/p&gt;

&lt;p&gt;The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/ingero-io/ingero/tree/main/investigations" rel="noopener noreferrer"&gt;investigations/&lt;/a&gt; directory has ready-to-query databases for trying this without a GPU cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sample-gpu-node-01.db&lt;/code&gt;, &lt;code&gt;sample-gpu-node-02.db&lt;/code&gt;, &lt;code&gt;sample-gpu-node-03.db&lt;/code&gt; – individual node traces from a 3-node cluster&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sample-cluster.db&lt;/code&gt; – all three merged into one (600 events, 6 chains, 9 stacks)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;GitHub (give us a star!):&lt;/strong&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;github.com/ingero-io/ingero&lt;/a&gt;. No NVIDIA SDK, no code changes, production-safe by design.&lt;/p&gt;

&lt;p&gt;If you are facing distributed training issues in your own workloads, we’d love to take a look. &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;Drop an issue on GitHub&lt;/a&gt;&lt;/strong&gt; and we will gladly dive into it together.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ingero is free &amp;amp; open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, &amp;lt;2% overhead.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/gpu-incident-response-in-60-seconds-an-sres-guide-to-ebpf-based-gpu-observability/" rel="noopener noreferrer"&gt;GPU incident response in 60 seconds with eBPF&lt;/a&gt; – single-node investigation workflow that the fleet feature extends&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/11-second-time-to-first-token-healthy-vllm-server/" rel="noopener noreferrer"&gt;11-second time to first token on a healthy vLLM server&lt;/a&gt; – kernel-level scheduling contention causing hidden latency, similar to the straggler root cause in this post&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;GPU showing 97% utilization while training runs 3x slower&lt;/a&gt; – why nvidia-smi metrics alone miss the real story&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gpu</category>
      <category>ebpf</category>
      <category>distributedcomputing</category>
    </item>
    <item>
      <title>CUDA Graphs: The 8-Year Overnight Success and the Observability Gap</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Wed, 08 Apr 2026 13:40:15 +0000</pubDate>
      <link>https://forem.com/ingero/cuda-graphs-the-8-year-overnight-success-and-the-observability-gap-3hn6</link>
      <guid>https://forem.com/ingero/cuda-graphs-the-8-year-overnight-success-and-the-observability-gap-3hn6</guid>
      <description>&lt;h1&gt;
  
  
  CUDA Graphs: The 8-Year Overnight Success and the Observability Gap
&lt;/h1&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;CUDA graphs shipped in 2018 but only became critical infrastructure in the past two years, driven by LLM inference demands and framework automation. They also create an observability blind spot: hundreds of kernel launches collapse into one opaque &lt;code&gt;cudaGraphLaunch&lt;/code&gt;. A &lt;a href="https://arxiv.org/abs/2503.19779" rel="noopener noreferrer"&gt;March 2025 study&lt;/a&gt; found 25% of CUDA graphs in PyTorch workloads actually degrade performance. We traced graph lifecycle events (capture, instantiate, launch) via eBPF uprobes and correlated them with CPU scheduling and I/O pressure to detect graph pool exhaustion, re-capture storms, and CPU contention during dispatch. The investigation database and reproduction steps are included below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;CUDA graphs shipped in 2018. For five years, almost nobody used them. Today, they power every token generated by vLLM, SGLang, and TensorRT-LLM. The technology didn’t change. The world did.&lt;/p&gt;

&lt;p&gt;We’ve been investigating what changed, why it matters, and what the CUDA graphs observability gap looks like in practice: the industry’s most widely adopted GPU optimization hides hundreds of kernel launches behind a single API call, and existing tools can’t see into it.&lt;/p&gt;




&lt;h2&gt;
  
  
  CUDA Graphs: Quick Context
&lt;/h2&gt;

&lt;p&gt;Every CUDA kernel launch costs 20-200 microseconds of CPU-side work: Python interpreter overhead, framework dispatch, driver processing, hardware submission. CUDA graphs record a sequence of GPU operations into a DAG, instantiate it once, and replay it with a single API call. The CPU tax is paid once for the entire graph instead of per kernel.&lt;/p&gt;

&lt;p&gt;The mechanism has been available since CUDA 10. The interesting question is why adoption stayed flat for five years and then went vertical.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Forces: Why Now?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. GPUs Got Too Fast for CPUs to Keep Up
&lt;/h3&gt;

&lt;p&gt;GPU FP16 throughput grew 47x from Pascal (GP100, 21.2 TFLOPS) to Hopper (H100, ~1,000 TFLOPS with Tensor Cores). Kernel execution times collapsed from milliseconds to single-digit microseconds. CPU-side launch overhead stayed at 20-140µs per operation (higher in Python frameworks).&lt;/p&gt;

&lt;p&gt;The PyGraph team measured a segment of DALL-E 2 inference that launches 740+ kernels with a combined GPU time of 3.4ms. End-to-end latency: 14ms. &lt;strong&gt;For 75% of the wall-clock time, the GPU sits idle&lt;/strong&gt;, waiting for the CPU to submit the next operation.&lt;/p&gt;
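&lt;p&gt;The arithmetic behind those figures is easy to verify:&lt;/p&gt;

```python
# PyGraph's DALL-E 2 segment: 740+ kernel launches, 3.4 ms of
# combined GPU time inside a 14 ms end-to-end window.
kernels, gpu_ms, wall_ms = 740, 3.4, 14.0

# fraction of wall-clock time the GPU spends waiting for the CPU
idle_fraction = (wall_ms - gpu_ms) / wall_ms
print(f"GPU idle for {idle_fraction:.1%} of wall-clock time")

# the 10.6 ms gap spread over 740 launches: exposed CPU-side
# overhead per kernel, roughly 14 us each
per_launch_us = (wall_ms - gpu_ms) * 1000 / kernels
print(f"~{per_launch_us:.0f} us of launch overhead per kernel")
```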

&lt;p&gt;CUDA graphs collapse all of that per-kernel overhead into a single graph launch, as low as ~2.5µs since CUDA 12.6.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Workload That Needs Graphs Most Didn’t Exist Yet
&lt;/h3&gt;

&lt;p&gt;LLM autoregressive decode (generating one token at a time with fixed compute shapes and small batch sizes) is the &lt;em&gt;perfect&lt;/em&gt; CUDA graph workload. Static shapes, repeated execution, CPU overhead dominance.&lt;/p&gt;

&lt;p&gt;This workload category barely existed before ChatGPT launched in November 2022. By 2026, inference is projected to account for roughly two-thirds of all AI compute spending (Deloitte TMT Predictions). The economic pressure to optimize it became enormous.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Frameworks Made It a Flag Flip
&lt;/h3&gt;

&lt;p&gt;Before PyTorch 2.x, using CUDA graphs meant writing manual capture code in C++/CUDA: stream capture semantics, fixed memory addresses, graph instantiation. Expert territory.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;torch.compile(mode="reduce-overhead")&lt;/code&gt; lowered that barrier to one line of Python. vLLM and SGLang built graph capture directly into their serving pipelines. Adoption shifted from “CUDA expert” to “set a flag.”&lt;/p&gt;

&lt;h3&gt;
  
  
  4. NVIDIA Made the API Actually Usable
&lt;/h3&gt;

&lt;p&gt;The original CUDA graph API was rigid: no conditional logic, no dynamic control flow. Real workloads with variable batch sizes or branching paths couldn’t use graphs without ugly workarounds.&lt;/p&gt;

&lt;p&gt;NVIDIA shipped a steady stream of fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CUDA 12.4&lt;/strong&gt; (2024): Conditional nodes (IF, WHILE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CUDA 12.6&lt;/strong&gt;: Constant-time graph launch (~2.5µs + ~1ns/node)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CUDA 12.8&lt;/strong&gt;: IF/ELSE, SWITCH nodes; Blackwell support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nsight Compute 2025.3+&lt;/strong&gt;: CUDA Graph Viewer and graph-aware profiling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By 12.8, the API covered the majority of real-world control flow patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The Economics Made It Non-Optional
&lt;/h3&gt;

&lt;p&gt;At the scale of billions of inference requests per day, a 2.3x throughput improvement from CUDA graphs (measured on LLaMA-2 7B by Fireworks AI) translates to cutting the GPU fleet, or the cloud bill, nearly in half. That’s millions of dollars.&lt;/p&gt;

&lt;p&gt;Within two years, CUDA graphs moved from optional optimization to baseline infrastructure for inference serving.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Observability Gap
&lt;/h2&gt;

&lt;p&gt;The performance story is well-documented. The observability story is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CUDA graphs create an observability black hole.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a graph launches, hundreds of individual kernel launches, memory copies, and synchronization points collapse into a single &lt;code&gt;cudaGraphLaunch&lt;/code&gt; call. From any external observer (profiler, monitoring, eBPF probes) there is one event where there used to be hundreds.&lt;/p&gt;

&lt;p&gt;In our investigation work, we’ve found several ways this gap manifests in production:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graphs can silently hurt performance.&lt;/strong&gt; A March 2025 paper from the &lt;a href="https://arxiv.org/abs/2503.19779" rel="noopener noreferrer"&gt;PyGraph project&lt;/a&gt; found that &lt;strong&gt;25% of CUDA graphs (29 of 116 analyzed) in PyTorch workloads actually degraded performance&lt;/strong&gt;, with individual graphs reaching up to ~5x slowdown (397% degradation). The culprits: parameter copy overhead eating up to 24% of execution time, memory garbage collection after replay, RNG state resets. Without graph-aware tracing, these costs are invisible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph re-capture is expensive and hard to detect.&lt;/strong&gt; (&lt;a href="https://docs.nvidia.com/dl-cuda-graph/cuda-graph-basics/cuda-graph.html" rel="noopener noreferrer"&gt;NVIDIA best practices&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2503.19779" rel="noopener noreferrer"&gt;PyGraph 2025&lt;/a&gt;) When a new batch size arrives that doesn’t match any pre-captured graph, the framework re-captures. That’s a costly operation that blocks inference. In vLLM, this can cause latency spikes that cascade to all co-scheduled requests. From standard monitoring, it looks like a random latency blip.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU contention during graph dispatch is invisible.&lt;/strong&gt; (&lt;a href="https://developer.nvidia.com/blog/cuda-graphs/" rel="noopener noreferrer"&gt;NVIDIA CUDA Graphs blog&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2503.19779" rel="noopener noreferrer"&gt;PyGraph 2025&lt;/a&gt;) A CUDA graph launch is fast (~2.5µs on CUDA 12.6+), but only if the CPU thread gets to run uninterrupted. If &lt;code&gt;logrotate&lt;/code&gt;, a DataLoader worker, or a noisy neighbor preempts the thread during dispatch, the graph launch stalls. &lt;code&gt;nvidia-smi&lt;/code&gt; sees nothing. The GPU utilization dashboard stays green. But your p99 latency just spiked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph pool exhaustion has no standard alert.&lt;/strong&gt; (&lt;a href="https://docs.nvidia.com/dl-cuda-graph/troubleshooting/memory-issues.html" rel="noopener noreferrer"&gt;NVIDIA memory troubleshooting&lt;/a&gt;, &lt;a href="https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html" rel="noopener noreferrer"&gt;CUDA pool API&lt;/a&gt;) When the pool of instantiated graphs fills up, launch rates drop. In one of our traces, graph launch rate dropped from 163 to 2 launches/second, a 99% collapse, with no warning from any standard monitoring tool. The root cause: a batch size change triggered re-capture, and the pool couldn’t keep up.&lt;/p&gt;
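&lt;p&gt;Once launch events are recorded, detecting that kind of collapse is straightforward. A simplified sketch (not Ingero’s actual detector) that buckets graph-launch timestamps into one-second windows:&lt;/p&gt;

```python
from collections import Counter

def launch_rate_collapse(timestamps, window_s=1.0):
    """Bucket launch timestamps (seconds) into fixed windows and
    report the drop from the busiest window to the quietest one."""
    buckets = Counter(int(ts / window_s) for ts in timestamps)
    peak, trough = max(buckets.values()), min(buckets.values())
    return peak, trough, 1.0 - trough / peak

# Synthetic trace shaped like the incident described here:
# 163 launches in the first second, 2 in the next.
ts = [i / 163 for i in range(163)] + [1.5, 1.6]
peak, trough, drop = launch_rate_collapse(ts)
print(f"{peak} -> {trough} launches/sec ({drop:.0%} drop)")  # 163 -> 2 launches/sec (99% drop)
```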

&lt;p&gt;The irony is sharp: &lt;strong&gt;the optimization that eliminates per-kernel CPU overhead also eliminates per-kernel visibility.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Tracing Graph Lifecycle Events with eBPF
&lt;/h2&gt;

&lt;p&gt;To close this observability gap, we added CUDA graph lifecycle tracing to &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;Ingero&lt;/a&gt; (open-source, eBPF-based GPU observability) in v0.9.0. The approach uses eBPF uprobes on the CUDA runtime, which means no CUPTI dependency, no Nsight session, and no application code changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Live Graph Tracing
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ingero trace&lt;/code&gt; attaches uprobes to &lt;code&gt;cudaStreamBeginCapture&lt;/code&gt;, &lt;code&gt;cudaStreamEndCapture&lt;/code&gt;, &lt;code&gt;cudaGraphInstantiate&lt;/code&gt;, and &lt;code&gt;cudaGraphLaunch&lt;/code&gt;, alongside the standard CUDA API, driver API, and host kernel events.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xllhargia8p2hc2el9p.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xllhargia8p2hc2el9p.gif" alt="Fig. 1: " width="1050" height="901"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above: a &lt;code&gt;torch.compile&lt;/code&gt; inference workload traced live. The CUDA Runtime table shows &lt;code&gt;graphLaunch&lt;/code&gt; (985 launches, p50=9.9µs), &lt;code&gt;graphBeginCapture&lt;/code&gt; (2 captures), &lt;code&gt;graphEndCapture&lt;/code&gt; (2 completions), right alongside &lt;code&gt;cudaLaunchKernel&lt;/code&gt;, &lt;code&gt;cudaStreamSync&lt;/code&gt;, and the rest of the standard CUDA operations. Host context shows 50,000+ &lt;code&gt;sched_switch&lt;/code&gt; events with CPU at 100%.&lt;/p&gt;

&lt;p&gt;The bottom of the display fires real-time anomaly correlation: &lt;em&gt;“cudaStreamSync p99=1.4ms (171.7x p50), correlated with 46,507 sched_switch events.”&lt;/em&gt; And the graph-specific finding: &lt;em&gt;“[MEDIUM] CPU contention delaying graph dispatch (985 launches, 184,619 sched_switch).”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This kind of cross-layer correlation (CUDA API latency tied to host scheduler pressure) is what we found missing from existing GPU profiling workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Causal Chain Diagnosis
&lt;/h3&gt;

&lt;p&gt;After tracing, &lt;code&gt;ingero explain&lt;/code&gt; reads the recorded events from SQLite and assembles causal chains: cross-layer correlations that explain &lt;em&gt;why&lt;/em&gt; something went wrong, not just &lt;em&gt;what&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmipslletcso3qxdqlou7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmipslletcso3qxdqlou7.gif" alt="Fig. 2: “ingero explain” CLI producing causal chains linking CUDA Graph events to CPU contention and I/O pressure" width="800" height="836"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The incident report finds 8 causal chains (6 HIGH, 2 MEDIUM). The graph-specific findings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[MEDIUM] CPU contention delaying graph dispatch (12 launches, 2,251 sched_switch)

  Fix: Pin the inference process to dedicated CPU cores;
       reduce background CPU load during inference.

[MEDIUM] Graph launch rate dropped 99% (exec 0x0, PID 1789)
  Rate dropped from 163 to 2 launches/sec

  Root cause: graph pool exhaustion, likely re-capture triggered by new batch size.

  Fix: Pre-warm all expected batch sizes during model startup;
       set max_num_batched_tokens to limit batch size variability.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The causal chain links four layers: system context (CPU 100%) to host events (124,241 context switches, 7.9s off-CPU) to CUDA API latency spikes (cudaDeviceSync p99=20.8ms, 436x normal) to graph dispatch stalls. One chain, one root cause.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI-Assisted Investigation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fingero.io%2Fwp-content%2Fuploads%2F2026%2F04%2Fdemo-graph-ai-mcp-real.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fingero.io%2Fwp-content%2Fuploads%2F2026%2F04%2Fdemo-graph-ai-mcp-real.gif" alt="Fig. 3: Claude Code investigating CUDA graph trace data via Ingero MCP tools" width="720" height="685"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ingero also exposes an MCP (Model Context Protocol) server. Any MCP-compatible AI assistant (Claude Code, Cursor, local models via Ollama) can query a trace database directly.&lt;/p&gt;

&lt;p&gt;Above: Claude Code runs broad diagnostics first (&lt;code&gt;get_check&lt;/code&gt;, &lt;code&gt;get_trace_stats&lt;/code&gt;, &lt;code&gt;get_causal_chains&lt;/code&gt; in parallel), then drills into &lt;code&gt;graph_lifecycle&lt;/code&gt;, &lt;code&gt;graph_frequency&lt;/code&gt;, and &lt;code&gt;get_stacks&lt;/code&gt;. &lt;strong&gt;Then the agent correlates the results across all layers and produces a root cause analysis with fix recommendations, without manual SQL or log parsing.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Two ways to reproduce or extend this investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Path 1: Instant Investigation (No GPU Needed)
&lt;/h3&gt;

&lt;p&gt;Download the pre-captured investigation database and query it immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download the CUDA graph investigation database&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; cuda-graph.db &lt;span class="se"&gt;\&lt;/span&gt;
  https://raw.githubusercontent.com/ingero-io/ingero/main/investigations/cuda-graph-cpu-contention.db

&lt;span class="c"&gt;# View causal chains&lt;/span&gt;
ingero explain &lt;span class="nt"&gt;--db&lt;/span&gt; cuda-graph.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This database contains a real &lt;code&gt;torch.compile&lt;/code&gt; inference workload under CPU contention: graph captures, instantiations, 985 launches, pool exhaustion, and the full causal chain explaining the 99% launch rate drop.&lt;/p&gt;
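&lt;p&gt;If the trace file is a plain SQLite database (the &lt;code&gt;.db&lt;/code&gt; extension and the article's mention of SQL suggest so, though that is an assumption), you can poke at it with nothing but the Python standard library before reaching for &lt;code&gt;ingero explain&lt;/code&gt; or an agent. A hedged sketch that discovers table names at runtime rather than assuming Ingero's schema:&lt;/p&gt;

```python
import sqlite3

def inspect_trace_db(path):
    """Summarize a trace database: table names and row counts.
    Nothing about the schema is assumed; it is discovered at runtime."""
    con = sqlite3.connect(path)
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    counts = {t: con.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
              for t in tables}
    con.close()
    return counts
```

&lt;p&gt;Running this against &lt;code&gt;cuda-graph.db&lt;/code&gt; should surface the event tables behind the graph captures, instantiations, and launches described above.&lt;/p&gt;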

&lt;h3&gt;
  
  
  Investigate with AI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# With Claude Code:&lt;/span&gt;
claude mcp add &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nb"&gt;local &lt;/span&gt;ingero &lt;span class="nt"&gt;--&lt;/span&gt; ingero mcp &lt;span class="nt"&gt;--db&lt;/span&gt; cuda-graph.db

claude
&lt;span class="c"&gt;# Ask: "Use ingero tools to investigate this CUDA graph trace"&lt;/span&gt;

&lt;span class="c"&gt;# Install ollmcp (MCP client for Ollama):&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ollmcp

&lt;span class="c"&gt;# Or with Ollama:&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/ingero-mcp.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "mcpServers": {
    "ingero": {
      "command": "ingero",
      "args": ["mcp", "--db", "cuda-graph.db"]
    }
  }
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Local model (no data leaves your machine)&lt;/span&gt;
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5:27b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp.json

&lt;span class="c"&gt;# Or use a cloud-hosted model via Ollama (faster, data sent to provider)&lt;/span&gt;
ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; minimax-m2.7:cloud &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Path 2: Full End-to-End (Any NVIDIA GPU + Linux)
&lt;/h3&gt;

&lt;p&gt;Reproduce the entire investigation from scratch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ingero&lt;/span&gt;
&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.9.1
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="s2"&gt;"https://github.com/ingero-io/ingero/releases/download/v&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/ingero_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_linux_amd64.tar.gz"&lt;/span&gt; | &lt;span class="nb"&gt;tar &lt;/span&gt;xz
&lt;span class="nb"&gt;sudo mv &lt;/span&gt;ingero /usr/local/bin/

&lt;span class="c"&gt;# Run the CUDA graph demo workload (requires PyTorch 2.x)&lt;/span&gt;
python tests/workloads/cuda_graph_demo.py &amp;amp;

&lt;span class="c"&gt;# Add CPU contention to trigger the interesting behavior&lt;/span&gt;
stress-ng &lt;span class="nt"&gt;--cpu&lt;/span&gt; 2 &lt;span class="nt"&gt;--timeout&lt;/span&gt; 30s &amp;amp;

&lt;span class="c"&gt;# Trace&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ingero trace &lt;span class="nt"&gt;--pid&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;pgrep &lt;span class="nt"&gt;-f&lt;/span&gt; cuda_graph_demo&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;--db&lt;/span&gt; demo.db &lt;span class="nt"&gt;--duration&lt;/span&gt; 30s

&lt;span class="c"&gt;# Investigate&lt;/span&gt;
ingero explain &lt;span class="nt"&gt;--db&lt;/span&gt; demo.db

&lt;span class="c"&gt;# Or let AI investigate&lt;/span&gt;
claude mcp add &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nb"&gt;local &lt;/span&gt;ingero &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;ingero mcp &lt;span class="nt"&gt;--db&lt;/span&gt; demo.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reproduces the same graph lifecycle events, causal chains, and root cause analysis on your own hardware and workload.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;CUDA graphs aren’t going away. Every new NVIDIA toolkit release makes them more capable: conditional nodes, device-side launch, tighter framework integration. The workloads that depend on them (LLM inference, diffusion models, real-time serving) are only growing.&lt;/p&gt;

&lt;p&gt;Observability for these workloads needs to keep pace. If you’re running &lt;code&gt;torch.compile&lt;/code&gt; or serving models with vLLM, your GPU workload is already using CUDA graphs, and the community needs better tooling to see what they’re doing under the hood. Because it hooks in at the Linux kernel level, the same eBPF architecture also lays the groundwork for tracing these host-side bottlenecks across heterogeneous hardware.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The &lt;a href="https://github.com/ingero-io/ingero/blob/main/investigations/cuda-graph-cpu-contention.db" rel="noopener noreferrer"&gt;investigation database&lt;/a&gt; from this post is available for download.&lt;br&gt;
Investigation performed on EC2 g4dn.xlarge (Tesla T4), Ubuntu 24.04, kernel 6.17, NVIDIA 580.126.09, PyTorch 2.10+CUDA 12.0. Also validated on RTX 4090, A100, H100, and GH200.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub (give us a star!):&lt;/strong&gt; &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;github.com/ingero-io/ingero&lt;/a&gt;. No NVIDIA SDK, no code changes, production-safe by design.&lt;/p&gt;

&lt;p&gt;Interested in CUDA Graphs tracing? Drop us a message at info(@)ingero.io or &lt;strong&gt;&lt;a href="https://github.com/ingero-io/ingero/issues/new/choose" rel="noopener noreferrer"&gt;create an issue on GitHub&lt;/a&gt;&lt;/strong&gt; and we will gladly dive into it together.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ingero is free &amp;amp; open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, &amp;lt;2% overhead.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ingero.io/gpu-incident-at-3am-ebpf-tracing-from-page-to-root-cause-in-60-seconds/" rel="noopener noreferrer"&gt;GPU Incident at 3am: eBPF Tracing from Page to Root Cause in 60 Seconds&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/tracing-torch-cuda-empty_cache-on-an-rtx-4090-where-do-the-53-mb-go/" rel="noopener noreferrer"&gt;Tracing torch.cuda.empty_cache() on an RTX 4090&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ingero.io/your-gpu-is-97-utilized-but-your-training-is-3x-slower-than-expected/" rel="noopener noreferrer"&gt;GPU 97% Utilized But Training 3x Slower: What nvidia-smi Misses&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cuda</category>
      <category>gpu</category>
      <category>ebpf</category>
      <category>ai</category>
    </item>
    <item>
      <title>124x Slower: What PyTorch DataLoader Actually Does at the Kernel Level</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Wed, 01 Apr 2026 16:40:55 +0000</pubDate>
      <link>https://forem.com/ingero/124x-slower-what-pytorch-dataloader-actually-does-at-the-kernel-level-3o3a</link>
      <guid>https://forem.com/ingero/124x-slower-what-pytorch-dataloader-actually-does-at-the-kernel-level-3o3a</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; PyTorch's DataLoader can be 50-124x slower than direct tensor indexing for in-memory GPU workloads. We reproduced a real PyTorch issue on an RTX 4090 and traced every CUDA API call and Linux kernel event to find the root cause. The GPU wasn't slow - it was starving. DataLoader workers generated 200,000 CPU context switches and 300,000 page allocations in 40 seconds, leaving the GPU waiting an average of 301ms per data transfer that should take microseconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;A PyTorch user reported that DataLoader was 7-22x slower than direct tensor indexing for a simple MLP inference workload. Even with &lt;code&gt;num_workers=12&lt;/code&gt;, &lt;code&gt;pin_memory=True&lt;/code&gt;, and &lt;code&gt;prefetch_factor=12&lt;/code&gt;, the gap remained massive. GPU utilization sat at 10-20%.&lt;/p&gt;

&lt;p&gt;We reproduced it. The gap was even worse on our hardware:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;vs Direct&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direct tensor indexing&lt;/td&gt;
&lt;td&gt;0.39s&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DataLoader (shuffle=True)&lt;/td&gt;
&lt;td&gt;48.49s&lt;/td&gt;
&lt;td&gt;124x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DataLoader (optimized, 4 workers, pin_memory)&lt;/td&gt;
&lt;td&gt;43.29s&lt;/td&gt;
&lt;td&gt;111x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The workload is trivial: 7M samples, 100 features, 2-layer MLP, batch size 1M. The model processes a batch in milliseconds. So where does the time go?&lt;/p&gt;

&lt;h2&gt;
  
  
  What nvidia-smi Shows
&lt;/h2&gt;

&lt;p&gt;Nothing useful. GPU utilization flickers between 0% and 30%. Memory usage is stable. Temperature is fine. The GPU is clearly underutilized, but nvidia-smi can't explain why.&lt;/p&gt;

&lt;h2&gt;
  
  
  What torch.profiler Shows
&lt;/h2&gt;

&lt;p&gt;The reporter tried PyTorch's built-in profiler and "obtained no meaningful trace data." This is a common frustration - application-level profilers can show what CUDA kernels are running, but they cannot see the host-side scheduling, memory, and process lifecycle events that determine whether data arrives at the GPU on time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Kernel-Level Tracing Shows
&lt;/h2&gt;

&lt;p&gt;We ran the benchmark while tracing both CUDA API calls (via eBPF uprobes on libcudart.so) and Linux kernel events (scheduler context switches, memory page allocations, process forks) simultaneously. The results tell the complete story.&lt;/p&gt;

&lt;h3&gt;
  
  
  4 HIGH-severity causal chains
&lt;/h3&gt;

&lt;p&gt;The causal chain engine detected 4 high-severity patterns, all with the same root cause:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[HIGH] cudaStreamSync p99=42ms (1,638x p50=25us) - CPU 100% + 1,880 sched_switch events
Timeline:
  [SYSTEM] CPU 100%
  [HOST  ] 1,880 context switches (21s off-CPU)
  [CUDA  ] p99=42ms (1,638x p50=25us)
Root cause: DataLoader workers fighting for CPU, massive page allocation pressure

[HIGH] cudaLaunchKernel p99=24.67ms (349x p50=70us) - CPU 100%
Root: 34 sched_switch events

[HIGH] cuMemAlloc p99=627us (4.0x p50) - CPU 100%

[HIGH] cuLaunchKernel p99=106us (4.0x p50) - CPU 100%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;cudaStreamSync&lt;/code&gt; p99 is 1,638 times the p50. That's not GPU slowness - that's the GPU waiting for data that never arrives on time.&lt;/p&gt;
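&lt;p&gt;The p99/p50 ratio is the right alarm here because a handful of multi-millisecond stalls barely move the median but explode the 99th percentile. A quick illustration with synthetic latencies shaped like the trace above:&lt;/p&gt;

```python
import statistics

def tail_ratio(samples):
    """Return (p50, p99, p99/p50) for a list of latency samples."""
    qs = statistics.quantiles(samples, n=100)
    p50, p99 = qs[49], qs[98]
    return p50, p99, p99 / p50

# 99% of syncs take 25us; 1% stall for 42ms
latencies_us = [25] * 990 + [42_000] * 10
p50, p99, ratio = tail_ratio(latencies_us)
```

&lt;p&gt;The median stays at 25us, so an average-latency dashboard sees almost nothing, while the tail ratio jumps past three orders of magnitude.&lt;/p&gt;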

&lt;h3&gt;
  
  
  The Per-Process Breakdown
&lt;/h3&gt;

&lt;p&gt;This is where the picture sharpens. The main process and its 4 DataLoader workers show up as separate entities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main process:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cudaMemcpyAsync&lt;/code&gt; (host-to-device transfer): avg 301ms, max 2.9 seconds&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cudaStreamSync&lt;/code&gt;: p99 = 42ms (normally 25us)&lt;/li&gt;
&lt;li&gt;1,567 context switches, avg 16ms off-CPU, worst stall 5 seconds&lt;/li&gt;
&lt;li&gt;799,018 page allocations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DataLoader worker 1:&lt;/strong&gt; 52,863 context switches, 89,338 page allocations, worst stall 5s&lt;br&gt;
&lt;strong&gt;DataLoader worker 2:&lt;/strong&gt; 50,638 context switches, 83,509 page allocations, worst stall 5s&lt;br&gt;
&lt;strong&gt;DataLoader worker 3:&lt;/strong&gt; 49,361 context switches, 70,035 page allocations, worst stall 5s&lt;br&gt;
&lt;strong&gt;DataLoader worker 4:&lt;/strong&gt; 38,862 context switches, 56,354 page allocations, worst stall 5s&lt;/p&gt;

&lt;p&gt;Total across workers: ~191,000 context switches and ~299,000 page allocations in 40 seconds.&lt;/p&gt;
&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;The DataLoader workers are doing three expensive things that direct indexing avoids entirely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shuffling and indexing.&lt;/strong&gt; DataLoader with &lt;code&gt;shuffle=True&lt;/code&gt; generates a random permutation of indices, then each worker selects its chunk. This requires random memory access across the full 7M-sample tensor - terrible for cache locality and triggers page faults.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collation and copying.&lt;/strong&gt; Each worker gathers scattered samples into a contiguous batch tensor. This means allocating new memory (page allocations), copying data from random locations (cache misses), and serializing the result back to the main process via shared memory or a queue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Competing for CPU.&lt;/strong&gt; Four workers + the main process on a 4-vCPU machine means constant preemption. Each worker gets descheduled 50,000 times. The worst-case stall is 5 seconds - during which the GPU has nothing to process.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With direct indexing: &lt;code&gt;X[i:i+batch_size]&lt;/code&gt; is a zero-copy view of a contiguous tensor already in memory. &lt;code&gt;.to(device)&lt;/code&gt; triggers one DMA transfer from a single contiguous region. No workers, no shuffling, no collation, no cross-process copies, no context switches. The GPU gets data in microseconds, not hundreds of milliseconds.&lt;/p&gt;
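&lt;p&gt;The zero-copy claim is easy to verify. NumPy and PyTorch share the same indexing semantics here: a basic slice is a view over the original buffer, while fancy (index-array) indexing materializes a copy:&lt;/p&gt;

```python
import numpy as np

X = np.random.randn(1_000, 100).astype(np.float32)

view = X[0:256]                                    # basic slice: zero-copy view
gathered = X[np.random.permutation(1_000)[:256]]   # fancy indexing: full copy

assert np.shares_memory(view, X)
assert not np.shares_memory(gathered, X)
```

&lt;p&gt;A gather through an index array is still a copy, but it is one in-process copy from contiguous host RAM, with no workers, queues, or context switches in the path.&lt;/p&gt;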
&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;For in-memory GPU workloads where the entire dataset fits in RAM:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Don't use DataLoader.&lt;/strong&gt; Direct indexing with a pre-shuffled index array is simpler and 100x faster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randperm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. When DataLoader is necessary,&lt;/strong&gt; size &lt;code&gt;num_workers&lt;/code&gt; so that the workers plus the main process fit within the physical core count. On a 4-core machine, &lt;code&gt;num_workers=2&lt;/code&gt; leaves a core each for the main process and the OS, which reduces contention. Add &lt;code&gt;persistent_workers=True&lt;/code&gt; to avoid re-forking workers every epoch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. For larger-than-memory datasets&lt;/strong&gt; where DataLoader is necessary, the real bottleneck shifts to disk I/O. Use &lt;code&gt;prefetch_factor=2&lt;/code&gt; (not higher - more prefetching means more memory pressure) and ensure the storage subsystem can keep up.&lt;/p&gt;
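&lt;p&gt;One defensive way to encode the worker-count rule (the reserved-core count is a heuristic for this kind of workload, not a DataLoader requirement):&lt;/p&gt;

```python
import os

def pick_num_workers(reserved=2):
    """Leave `reserved` cores for the main process and the OS.
    Never goes below zero; 0 means DataLoader loads in the main process."""
    return max(0, (os.cpu_count() or 1) - reserved)

# e.g. DataLoader(dataset, num_workers=pick_num_workers(), pin_memory=True)
```

&lt;p&gt;On the 4-vCPU machine from this investigation, this yields 2 workers instead of the 4 that saturated the box. Note that &lt;code&gt;persistent_workers=True&lt;/code&gt; requires a nonzero worker count.&lt;/p&gt;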

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;This investigation illustrates a pattern we see constantly in GPU workloads: the GPU is fast, the host is the bottleneck, and GPU metrics can't see it. nvidia-smi reported low utilization but couldn't explain why. torch.profiler captured CUDA kernels but missed the 200,000 context switches happening in userspace.&lt;/p&gt;

&lt;p&gt;The only way to see the full picture was to trace both sides simultaneously - CUDA API calls at the library level and Linux kernel scheduling events - and correlate them by time and process ID.&lt;/p&gt;
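&lt;p&gt;The correlation itself is conceptually a join on process ID and time proximity. A toy sketch of the idea (the event shape and the 1 ms window are illustrative assumptions, not Ingero's internals):&lt;/p&gt;

```python
def correlate(cuda_events, kernel_events, window_ns=1_000_000):
    """For each CUDA API event, collect kernel events from the same PID
    that fired within the preceding `window_ns` nanoseconds."""
    pairs = []
    for c in cuda_events:
        near = [k for k in kernel_events
                if k["pid"] == c["pid"]
                # k happened at most window_ns before c (int membership test)
                and (c["ts"] - k["ts"]) in range(0, window_ns + 1)]
        pairs.append((c, near))
    return pairs
```

&lt;p&gt;A slow &lt;code&gt;cudaMemcpyAsync&lt;/code&gt; preceded by a burst of &lt;code&gt;sched_switch&lt;/code&gt; events in the same process is exactly the pattern the causal chain engine surfaces.&lt;/p&gt;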

&lt;p&gt;The causal chain &lt;code&gt;CPU 100% -&amp;gt; 1,880 sched_switch -&amp;gt; cudaMemcpyAsync 301ms -&amp;gt; cudaStreamSync 42ms&lt;/code&gt; tells the complete story in one line. Without cross-stack tracing, this would have remained a mystery - as it was for the original reporter who spent weeks debugging it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Reproduce the benchmark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.utils.data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt;

&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReLU&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReLU&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Fast path
&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1_048_576&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1_048_576&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Direct: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Slow path
&lt;/span&gt;&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1_048_576&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DataLoader: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trace with Ingero to see what's happening under the hood:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ingero-io/ingero.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ingero &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make build
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./bin/ingero trace &lt;span class="nt"&gt;--duration&lt;/span&gt; 60s   &lt;span class="c"&gt;# in one terminal&lt;/span&gt;
python3 benchmark.py                      &lt;span class="c"&gt;# in another terminal&lt;/span&gt;
./bin/ingero explain &lt;span class="nt"&gt;--since&lt;/span&gt; 60s          &lt;span class="c"&gt;# after benchmark completes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;github.com/ingero-io/ingero&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Original issue: &lt;a href="https://github.com/pytorch/pytorch/issues/154318" rel="noopener noreferrer"&gt;pytorch/pytorch#154318&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Investigation performed on TensorDock RTX 4090 (24GB), Ubuntu 22.04, PyTorch 2.10.0+cu128.&lt;/p&gt;

</description>
      <category>pytorch</category>
      <category>gpu</category>
      <category>python</category>
      <category>cuda</category>
    </item>
    <item>
      <title>Tracing a 13x PyTorch Slowdown to a Hidden NumPy Synchronization</title>
      <dc:creator>Ingero Team</dc:creator>
      <pubDate>Tue, 31 Mar 2026 14:53:32 +0000</pubDate>
      <link>https://forem.com/ingero/tracing-a-13x-pytorch-slowdown-to-a-hidden-numpy-synchronization-4fho</link>
      <guid>https://forem.com/ingero/tracing-a-13x-pytorch-slowdown-to-a-hidden-numpy-synchronization-4fho</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; A &lt;code&gt;.cpu().numpy()&lt;/code&gt; call buried inside a forward pass was forcing a full CPU-GPU synchronization on every batch, every loop iteration. The GPU would finish its work in milliseconds, then sit idle for ~2 seconds waiting for Python and NumPy to catch up. Replacing the NumPy logic with pure PyTorch ops gave a 6.4x speedup on a T4 and 13x on an RTX 5080. The fix is two lines of code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bug
&lt;/h2&gt;

&lt;p&gt;Swin-MAE is a masked autoencoder built on Swin Transformers. A user training on 5.2 million images with an RTX 5090 noticed that GPU utilization kept dropping to ~30% during the forward pass. The model would spike, stall, spike, stall.&lt;/p&gt;

&lt;p&gt;The problem was in &lt;code&gt;window_masking&lt;/code&gt;, the function that decides which image patches to mask during training. Here is what the code looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The hot loop: runs once per batch, every forward pass
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;index_mask&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdiff1d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index_all&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_keep&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;x_masked&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_mask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mask_token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two Python for loops. Two &lt;code&gt;.cpu().numpy()&lt;/code&gt; calls per iteration. On a batch size of B, that is B x 2 implicit &lt;code&gt;cudaStreamSynchronize&lt;/code&gt; calls per forward pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this kills performance
&lt;/h2&gt;

&lt;p&gt;Every time you call &lt;code&gt;.cpu().numpy()&lt;/code&gt; on a CUDA tensor, PyTorch has to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Flush the GPU pipeline.&lt;/strong&gt; Any queued CUDA operations must finish before the data can be read back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer data over PCIe.&lt;/strong&gt; The tensor moves from GPU VRAM to system RAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Block the Python thread.&lt;/strong&gt; Nothing else happens until the transfer completes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On older GPUs, this penalty was small enough to go unnoticed. The Swin-MAE maintainer confirmed the code was written four years ago. But on modern hardware (RTX 5080, 5090), the GPU finishes its batch computation in milliseconds. The NumPy detour takes ~2 seconds. The faster the GPU gets, the worse this anti-pattern becomes.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;np.setdiff1d&lt;/code&gt; call is doing set-difference math on the CPU to figure out which patches to mask. This is something PyTorch can do entirely on the GPU without ever leaving CUDA.&lt;/p&gt;
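&lt;p&gt;The device-side equivalent is a boolean mask rather than an explicit set difference. In plain Python the idea looks like this (the torch version builds the mask with &lt;code&gt;torch.ones(N, dtype=torch.bool)&lt;/code&gt;, zeroes the kept indices, and uses boolean indexing, all without leaving the GPU):&lt;/p&gt;

```python
def mask_complement(num_patches, index_keep):
    """Indices NOT in `index_keep`, computed via a boolean mask.
    This is the pattern that replaces np.setdiff1d on the GPU."""
    mask = [True] * num_patches
    for i in index_keep:
        mask[i] = False
    return [i for i, masked in enumerate(mask) if masked]
```

&lt;p&gt;No host round-trip, no per-batch Python loop over set operations.&lt;/p&gt;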

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;Drop NumPy entirely. Keep everything as PyTorch tensor operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: B x 2 implicit cudaStreamSync per forward pass
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;index_mask&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdiff1d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index_all&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_keep&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;x_masked&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_mask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mask_token&lt;/span&gt;

&lt;span class="c1"&gt;# After: zero CPU transfers
&lt;/span&gt;&lt;span class="n"&gt;index_mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ids_shuffle&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;mask_len&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="n"&gt;x_masked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_mask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;expand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;mask_token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;code&gt;ids_shuffle&lt;/code&gt; already contains the full permutation on the GPU. The masked indices are just the tail end of that shuffle. No need to compute a set difference at all, and &lt;code&gt;scatter_&lt;/code&gt; handles the masked token assignment without leaving CUDA.&lt;/p&gt;
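&lt;p&gt;The two forms are easy to check against each other on toy shapes (the sizes below are made up, and the check runs the same on CPU):&lt;/p&gt;

```python
import torch

# Toy sizes (hypothetical): B batches, L patches, C channels, mask_len masked.
B, L, C, mask_len = 2, 8, 4, 3
torch.manual_seed(0)

ids_shuffle = torch.stack([torch.randperm(L) for _ in range(B)])
x = torch.randn(B, L, C)
mask_token = torch.full((C,), -1.0)

# Loop version: per-sample fancy indexing (minus the NumPy round-trips).
x_loop = x.clone()
for i in range(B):
    x_loop[i, ids_shuffle[i, -mask_len:], :] = mask_token

# Vectorized version: the masked indices are the tail of the shuffle, and
# scatter_ writes mask_token into those rows in one on-device op.
x_vec = x.clone()
index_mask = ids_shuffle[:, -mask_len:]
x_vec.scatter_(1, index_mask.unsqueeze(-1).expand(-1, -1, C),
               mask_token.expand(B, mask_len, C).contiguous())

assert torch.equal(x_loop, x_vec)
```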

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;T4 (AWS EC2)&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;6.4x faster&lt;/td&gt;
&lt;td&gt;6.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 5080&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;13x faster&lt;/td&gt;
&lt;td&gt;13x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The RTX 5080 benefits more because its raw compute is faster, which makes the CPU round-trip penalty proportionally larger.&lt;/p&gt;

&lt;h2&gt;
  
  
  This pattern is everywhere
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;.cpu().numpy()&lt;/code&gt; anti-pattern is not unique to Swin-MAE. We see it constantly in training code, especially in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom masking logic (MAE, BEiT, any masked pretraining)&lt;/li&gt;
&lt;li&gt;Dynamic batching and padding (NLP sequence collation)&lt;/li&gt;
&lt;li&gt;Custom augmentation pipelines that mix PyTorch and NumPy/SciPy&lt;/li&gt;
&lt;li&gt;Metric computation mid-training (accuracy checks that pull tensors to CPU every N steps)&lt;/li&gt;
&lt;/ul&gt;
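&lt;p&gt;The metric case has the simplest fix: accumulate on the device and read back once. A hedged sketch with toy shapes (none of these names come from the original code):&lt;/p&gt;

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Accumulate correctness counts on-device; call .item() only once at the end.
correct = torch.zeros((), dtype=torch.long, device=device)
total = 0

for _ in range(5):  # stand-in for training steps
    logits = torch.randn(32, 10, device=device)
    labels = torch.randint(0, 10, (32,), device=device)
    correct += (logits.argmax(dim=1) == labels).sum()  # no transfer here
    total += labels.numel()

accuracy = correct.item() / total  # single implicit sync, not one per step
```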

&lt;p&gt;The tell is always the same: GPU utilization that spikes and drops in a sawtooth pattern. The GPU is not slow. It is waiting.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to find this deterministically
&lt;/h2&gt;

&lt;p&gt;The manual way to find this bug is to grep the codebase for &lt;code&gt;.cpu()&lt;/code&gt; and &lt;code&gt;.numpy()&lt;/code&gt; and hope the culprit is actually inside the hot loop. The slightly better way is to run a standard profiler, stare at a wall of timelines, and try to manually correlate host-side Python threads with GPU stream synchronizations.&lt;/p&gt;
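&lt;p&gt;PyTorch also ships an in-process middle ground: &lt;code&gt;torch.cuda.set_sync_debug_mode&lt;/code&gt; warns (or raises) on operations that force a host-device synchronization. It only sees the code paths your run actually executes, and exactly which copies get flagged varies by PyTorch version, but it requires no external tooling:&lt;/p&gt;

```python
import torch

# Guarded: the sync debugger is only meaningful with a CUDA device present.
if torch.cuda.is_available():
    torch.cuda.set_sync_debug_mode("warn")   # or "error" to hard-fail instead
    x = torch.randn(4, device="cuda")
    _ = x.cpu().numpy()  # the article's pattern: a blocking GPU-to-CPU read-back
    torch.cuda.set_sync_debug_mode("default")
```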

&lt;p&gt;To catch these implicit syncs deterministically without code changes, we take it to the kernel level. &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;Ingero&lt;/a&gt;, an open-source eBPF tracer, attaches uprobes directly to &lt;code&gt;libcudart.so&lt;/code&gt; and &lt;code&gt;libcuda.so&lt;/code&gt;. Instead of polling metrics, it captures every CUDA API call with nanosecond precision and builds causal chains connecting host OS events directly to GPU stalls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Investigate with AI
&lt;/h2&gt;

&lt;p&gt;After running &lt;code&gt;ingero trace&lt;/code&gt;, you can point any MCP-compatible AI client at the resulting trace database and ask questions directly. No code required.&lt;/p&gt;

&lt;p&gt;Create the MCP config file at &lt;code&gt;/tmp/ingero-mcp-pytorch.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ingero"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./bin/ingero"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ingero-trace.db"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;ingero-trace.db&lt;/code&gt; with the path to the trace database created by &lt;code&gt;ingero trace&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With Ollama (local, free):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollmcp &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5:27b &lt;span class="nt"&gt;-j&lt;/span&gt; /tmp/ingero-mcp-pytorch.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--mcp-config&lt;/span&gt; /tmp/ingero-mcp-pytorch.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then type &lt;code&gt;/investigate&lt;/code&gt; and let the model explore.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;If you are mixing NumPy and PyTorch inside a training loop, you are probably paying a synchronization tax on every batch. Modern GPUs are fast enough that a single &lt;code&gt;.cpu().numpy()&lt;/code&gt; call can dominate your total training time.&lt;/p&gt;

&lt;p&gt;Check your forward pass. Check your masking logic. Check your custom collation functions. If the GPU is waiting, the fix might be two lines.&lt;/p&gt;
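&lt;p&gt;A quick way to check a suspect region is to time it with an explicit synchronize on either side, so asynchronous kernel launches cannot hide behind the clock. A hypothetical harness (&lt;code&gt;timed&lt;/code&gt; is our name, not an API):&lt;/p&gt;

```python
import time
import torch

def timed(fn, *args):
    """Wall-clock a callable, synchronizing around it so queued CUDA work
    is actually counted. The synchronize calls are no-ops on CPU-only boxes."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

x = torch.randn(64, 64)
result, seconds = timed(torch.matmul, x, x)
```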




&lt;p&gt;GitHub: &lt;a href="https://github.com/ingero-io/ingero" rel="noopener noreferrer"&gt;github.com/ingero-io/ingero&lt;/a&gt;&lt;br&gt;
Original issue: &lt;a href="https://github.com/Zian-Xu/Swin-MAE/issues/24" rel="noopener noreferrer"&gt;Zian-Xu/Swin-MAE#24&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ingero is open source (Apache 2.0). If it saves you a debugging session, star it on GitHub.&lt;/p&gt;

</description>
      <category>pytorch</category>
      <category>cuda</category>
      <category>python</category>
      <category>gpu</category>
    </item>
  </channel>
</rss>
