Forem: Ingero Team

MCP Shows What the Agent Did. eBPF Shows Why the GPU Stalled.

Ingero Team — Mon, 11 May 2026 13:00:00 +0000

MCP exposes the agent’s tool calls. eBPF exposes the kernel events that explain why those tool calls returned what they returned.

TL;DR

The Model Context Protocol (MCP) is converging on an industry standard. In the past 10 days, eight observability and security platforms have shipped MCP servers (Grafana, SAS Viya, AWS Bedrock AgentCore, Optro, Command Zero, BlueCat, DBmaestro, the open-source CVE MCP). All of them expose roughly the same shape: governed tool calls that an agent can invoke against the platform’s data plane. That answers the question “what did the agent do?” It does not answer the question “why was the underlying system slow when the agent did it?” That second question lives in the kernel, on every machine, and only kernel-level instrumentation can answer it. We walk through a concrete trace where MCP and eBPF together close the loop.

What MCP gives the agent

Anthropic’s MCP is a small JSON-RPC protocol with a fixed shape: a server exposes a set of tools (named functions with typed arguments and return values), the agent calls them, and the agent receives structured responses. The protocol is deliberately minimal. The interesting part is what the tools do.

Looking at the MCP servers shipped in the past ten days:

Grafana Cloud Remote MCP lets the agent query metrics, logs, and traces across a Grafana stack, plus the new o11y-bench evaluation benchmark.
AWS Bedrock AgentCore custom MCP proxies give the agent access to enterprise data sources, gated by IAM.
DBmaestro MCP exposes release automation, source control, CI/CD orchestration, and compliance workflows as MCP tools, all running inside the existing permission model.
Command Zero MCP opens an autonomous-SOC platform: investigation management, remediation execution, schema introspection.
BlueCat MCP Servers connect network DDI / DNS / IPAM data to AI agents.
Optro MCP exposes governed GRC data access.
CVE MCP Server wraps 27 tools across 21 vulnerability-triage APIs.
Ingero MCP exposes seven read-only tools against an eBPF trace database (get_check, get_trace_stats, get_causal_chains, get_stacks, run_demo, get_test_report, run_sql).

Every one of these answers a question of the form “what is in the data plane I already own, and what action would I like the agent to take on it?” None of them, by themselves, can answer “why is the underlying system that produced this data behaving the way it is?”

That is the gap.

Two questions, two layers

Take a concrete example. An agent investigating a vLLM latency spike calls a Grafana MCP tool and gets back a metric series: TTFT (time to first token) jumped from 200ms to 11s for a five-minute window. The agent then calls a logs tool and surfaces the relevant request IDs. So far, MCP has done its job: the agent now knows what happened in the application layer.

What it does not know:

Was the GPU busy or idle during that window?
If busy, was it busy with the right kernels?
If the right kernels, were they bandwidth-bound, compute-bound, or waiting on data?
If waiting, was the wait an explicit cudaDeviceSynchronize, an all-reduce on a slow rank, or a host-side context switch on the dispatcher thread?
If host-side, which other process took the CPU and for how long?

Those are kernel-level questions. They live in libcudart.so uprobes, libcuda.so uprobes, and Linux scheduler tracepoints. No application-layer telemetry can answer them, because no application-layer telemetry sees them.

This is the boundary between what and why. MCP fits naturally on the what side, because MCP is a tool-call surface over the data the platform already collected. eBPF is the why side, because eBPF is what collects the data nobody else does.

Closing the loop with two MCP servers in the same chat

In a recent investigation we wired two MCP servers into the same Claude Code session: a Grafana-style metrics MCP at the top of the stack, and the Ingero MCP at the bottom. The agent’s transcript, condensed:

Engineer: TTFT just spiked on the inference cluster. What happened?

Agent (calls metrics MCP): Median TTFT was 200ms before 14:32 UTC; from 14:32 to 14:37 it averaged 8.9s with p99 at 11.2s. Request volume did not change. CPU on the host was 12% mean. GPU utilization on the relevant nvidia-smi panel was 96% throughout. The “what” is: the GPU appeared busy, but request latency collapsed for five minutes.

Agent (calls Ingero get_trace_stats): During the 14:32-14:37 window, cudaLaunchKernel p50 was 17us; p99 was 13.1ms. That is a 770x spread. cudaDeviceSynchronize p99 was 4.7s.

Agent (calls Ingero get_causal_chains): Three HIGH-severity causal chains in the window. The dominant one: 428 sched_switch events on PID 84217 (vLLM engine thread) accumulating 8.9 seconds of off-CPU time, correlated with the cudaLaunchKernel tail. The chain points to a futex_wait blocking call from a sibling Python thread that is contending with the engine.

Agent (calls Ingero get_stacks): The top stack on the engine thread shows the call coming from vllm/engine/llm_engine.py:step() -> cudaLaunchKernel (libtorch_cuda.so). The off-CPU events resolve to schedule() -> futex_wait_queue_me events triggered by a co-scheduled tokenizer worker.

Agent (synthesizes): TTFT spiked because the vLLM engine thread was being preempted by a co-scheduled CPU-bound process for ~3.7% of wall time, and 18% of cudaLaunchKernel calls in the window experienced an off-CPU event between syscall entry and exit. The metrics MCP showed the GPU was busy (true, the dispatcher kept feeding it kernels). The kernel-level trace showed why the busy-ness did not translate into throughput.

Engineer: What’s the fix?

Agent: Pin the vLLM engine to dedicated CPU cores via taskset or cgroup cpuset, isolate it from the tokenizer workers. Re-run with the same workload and check if the off-CPU events on PID 84217 fall below 0.5%.

Two MCP servers, in series, on a single agent session. The metrics MCP narrowed the problem to a five-minute window. The eBPF MCP told the agent why the GPU was idle inside that window even though the duty-cycle counter said 96%.

The shape that closes the loop is not “agent-tracing on every host” or “yet another counter dashboard.” It is two complementary MCP surfaces, one over the application layer and one over the kernel layer, with the agent doing the synthesis.

Why the kernel layer needs eBPF specifically

A few teams have asked us why we ship the cause-side data through eBPF rather than through an application SDK. The short answer: every application SDK requires you to instrument the application, which means you cannot observe what the application doesn’t know about itself, and you cannot observe applications you don’t own.

eBPF doesn’t have either limitation. Uprobes attach to libcudart.so and libcuda.so from outside the process. They see every CUDA call regardless of which framework made it (PyTorch, TensorFlow, vLLM, SGLang, Triton, custom CUDA). Tracepoints on sched_switch, block:block_rq_issue, tcp:tcp_retransmit_skb see every host event regardless of which container produced it. The cost is a small fixed kernel overhead (under 2% on the workloads we have measured), independent of the number of processes.

That is what makes the why-layer agent-callable across vendors. An MCP tool over an eBPF database can answer the same question for vLLM and for a custom CUDA C++ binary, because eBPF treats both the same.

What this means for the MCP wave

Eight MCP servers in ten days is a strong signal that the protocol is settling. The category-vocabulary window is forming around “MCP server = governed agent control surface for X domain.” Most of the eight are over the what layer (metrics, logs, network state, security alerts, database release pipelines, vulnerability data). That’s the right layer to start: it’s where structured platform data already lives.

The next round of MCP servers will be over the why layer. The interesting design constraints are different there:

Read-only tool calls only (the agent can investigate, not remediate).
Schema is event-shaped, not metric-shaped. Aggregations come from run_sql against the captured events table, not from a pre-bucketed time series.
Causal chains are first-class. The MCP tool returns “kernel A on thread B was blocked because thread B was off-CPU because process C was holding futex D,” not just a count or a percentile.
Per-host data, not per-cluster. The cluster view is a fan-out of per-host calls, not a centralized index.

Ingero’s MCP server was an early example. Whatever the next eBPF-over-MCP servers look like, the ones that actually move agent investigations forward will share these properties.

More MCP servers shipped in the same window

Three public MCP launches from the same 10-day window worth tracking alongside the eight named above: PagerDuty’s AI SRE Agent (Slack-resident, MCP-native, 30+ AI tools); Grafana Cloud Remote MCP (announced GrafanaCON 2026, metrics + logs + traces tool surface); and SAS Viya MCP Server (April 28, governance-first design). All sit on the what-layer of the stack: governed tool calls over data the platform already collected.

Where the why-layer goes next

MCP gave agents a clean way to ask “what happened in the system I already monitor?” eBPF is what produces the data behind “why did it happen at the kernel layer?” The two are complementary, not overlapping. The investigation that took two MCP calls + a follow-up question above would have taken a senior SRE several hours of SSH-and-grep without either layer. With both, an agent does it in seconds, with the engineer reviewing the steps.

If the eight-MCP-servers-in-ten-days pattern continues, the next wave of platform integrations will not be “yet another what-layer dashboard.” It will be the why-layer. eBPF is where that layer is built.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. *GitHub ⭐** · Open an issue if you are wiring AI agents into infrastructure observability and trying to close the gap between application-layer telemetry and kernel-level causes.

Investigation DB: investigations/vllm-37343-logprobs-amplification.db*

MCP Tools Are New API Surfaces. eBPF Sees What They Actually Touch.

Ingero Team — Thu, 07 May 2026 13:00:00 +0000

An MCP tool call is a tiny line of agent code that fans out to syscalls, library calls, and kernel paths the agent has no view of.

TL;DR

Through April and early May, vendors shipped MCP servers in batches: Datadog, BlueCat, Command Zero, DBmaestro, the public CVE MCP, Grafana Cloud Remote MCP, SAS Viya MCP. The agent-side abstraction is small (a tool name and a JSON schema). The kernel-side surface that runs when the agent calls the tool is large and unstated. eBPF fills in what the tool actually touches.

What an MCP tool call looks like to the agent

An MCP tool, from the agent’s perspective, is a function with a name and an input schema. The agent calls it; a JSON-RPC payload goes to the tool server; a result returns. The MCP specification covers transport and discovery, not what the tool does between request and response.

That gap is fine when the tool wraps a pure HTTP API. It widens fast for tools that wrap a database client, a cloud SDK, a filesystem helper, or a GPU runtime. A “run query” tool can spawn a subprocess, open a unix socket, hit an SDK that maintains a connection pool, and trigger a kernel scheduling event the agent will never see.

What a syscall trace shows for the same call

Point an eBPF tracer at the tool-server process while the agent issues the call. The trace records the syscalls the tool made, the libraries it pulled in (resolved via /proc/[pid]/maps), the network endpoints it opened, and the on-CPU time spent in user vs kernel mode. Now the call is no longer an opaque box. The agent’s “run analysis” maps to a concrete path through the host.

# capture an MCP tool's real footprint while the agent calls it
ingero trace --pid $(pgrep -f mcp-server) --duration 30s \
  --out /tmp/mcp-tool-trace.db

# then ask the trace what the tool touched
ingero query /tmp/mcp-tool-trace.db \
  "SELECT comm, syscall, COUNT(*) AS n
     FROM host_events
    GROUP BY comm, syscall
    ORDER BY n DESC LIMIT 20"

Why this matters more for GPU MCP tools

On a GPU host the unstated kernel-side surface is wider. A tool that “reads GPU utilization” might call nvidia-smi (a fork+exec), might open /dev/nvidia*, might link libnvidia-ml.so. DCGM exporters running alongside add their own surface. The agent still sees one tool name; the kernel sees many distinct paths.

When an MCP-driven workflow is slow or wrong, the question “which tool call is responsible” stops at the JSON layer. eBPF on the tool-server process pushes that question through to a syscall and a library, and often to a CUDA driver call.

Try it locally

Pick any MCP server you already run (Filesystem, Postgres, the Anthropic reference servers). Start the server. Run an agent against it. In another shell:

# 1. install
curl -fsSL https://github.com/ingero-io/ingero/releases/latest/download/install.sh | sh

# 2. capture the tool server's footprint for one minute
ingero trace --pid $(pgrep -f your-mcp-server) --duration 60s \
  --out /tmp/mcp.db

# 3. inspect what the tool actually did
ingero query /tmp/mcp.db "SELECT * FROM cuda_events LIMIT 20"
ingero query /tmp/mcp.db "SELECT * FROM net_events LIMIT 20"
ingero query /tmp/mcp.db "SELECT * FROM io_events  LIMIT 20"

Three queries against the same DB cover the three surfaces an MCP tool most often hides: GPU runtime calls, network calls, and disk I/O.

Smaller surface, same investigation

MCP narrows the agent-facing API. It does not narrow the host-side path a tool runs through. Treating an MCP call as a syscall pattern, not a JSON message, is what keeps a multi-MCP agent investigable when one of the tools is the slow or broken one.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. *GitHub ⭐** · Open an issue if you are shipping or operating MCP servers and want a kernel-level view of what your tools actually touch.*

A Cluster Stall Looks Healthy on Every Host. The Cause Is in the Pattern Across Hosts.

Ingero Team — Wed, 06 May 2026 13:30:00 +0000

Eight ranks on two hosts. Every per-host metric reads healthy. Rank 5 enters the barrier 290ms late. The cause lives in a cross-rank query, not in any single host’s trace.

TL;DR

Eight ranks on two hosts run an all-reduce. Token throughput drops 4x. Every per-host nvidia-smi reads 95-99% utilization. Every per-host eBPF trace looks clean. The cause is rank 5 on node B taking 380ms on a step that the other seven ranks finish in 90ms. The other seven ranks spend 290ms blocked inside ncclAllReduce, which counts as a running kernel and reports as healthy on every per-host metric. The diagnosis lives in a cross-rank query, not in any single host’s trace. We shipped Ingero Echo (a cluster-wide AI-investigation tool that auto-collects OTLP from every node and exposes it as MCP-over-DuckDB) to make those queries answerable for AI agents directly. This post walks through the proof: 2,000 events from two nodes, fan-in into one DuckDB, queries that surface the straggler.

What per-host tracing cannot see

A typical 8-GPU all-reduce on H100s runs at ~80GB/s ring bandwidth. The synchronization barrier is ncclAllReduce. Every rank enters at roughly the same wall-clock; the rank that finishes last sets the wall time for all eight. When one rank is slow, the other seven do not idle visibly. They sit inside ncclAllReduce, which is itself a CUDA kernel. nvidia-smi sees a kernel running. DCGM’s SM_ACTIVE ticks. Per-host eBPF sees cudaLaunchKernel -> ncclAllReduce -> cudaStreamSynchronize complete normally. The local trace is clean.

What each rank does NOT see:

That the seven other ranks are also blocked inside the same ncclAllReduce for the same window.
That one specific rank entered the barrier 290ms later than the others.
That this same rank-5 pattern repeats every 14 steps, hinting at a memory-fragmentation cycle on the slow host.

These are facts about the cluster, not facts about any host. A monitoring stack that ships per-host samples to a centralized dashboard can render them as time series, but a time-series view does not answer “which rank’s call stack caused the 290ms wait the other ranks observed?” That is a relational query across host boundaries.

The cluster-level question

The question that matters during a stall is:

For each step where end-to-end throughput dropped, which rank’s ncclAllReduce started later than its peers, and what was that rank doing in the previous 500ms?

To answer it, you need:

Every rank’s ncclAllReduce enter/exit events, timestamped with a clock that is consistent across hosts.
Every rank’s CUDA call stack and off-CPU events for the 500ms before each ncclAllReduce enter.
A causal chain identifier that links a single training step’s events across all ranks.
A single store you can SQL.

Ingero’s per-host agent already produces those events. The agent attaches uprobes to libcudart.so and libnccl.so, captures kernel-scheduler events with eBPF, and emits OTLP. The missing piece in v0.12.4 was the cluster-level destination: a place where every host’s stream could fan in, where the events keep their resource attributes (cluster ID, node ID, rank, nranks), and where SQL can join across ranks. v0.12.5 ships that piece.

Echo, in one paragraph

Ingero Echo is a cluster-wide AI-investigation tool for GPU observability. It auto-collects OTLP/gRPC streams from every Fleet collector in the cluster into embedded DuckDB (one writer, single-statement read-only SQL), then exposes that data through a Model Context Protocol server with four tools: fleet.cluster.event_history, fleet.cluster.find_outlier_nodes, fleet.cluster.run_analysis (SQL-only, gated by a lexical guard), and fleet.cluster.get_cost. AI agents (Claude, Cursor, ollmcp, any MCP client) drive cross-rank investigations through this MCP-over-DuckDB surface without ever touching the database directly. Echo ships as a single-binary StatefulSet behind a ClusterIP service. The event-store path holds a flock(2) so a rolling-update force-detach does not corrupt the WAL. The receiver enforces bearer-token auth on the OTLP listener. The image is 87MB. One Echo per cluster.

The fan-in correctness proof

Before shipping Echo as a single source of truth for cluster-level queries, we wanted a hard answer to one question:

When N agents push concurrently from N hosts, do all events land, do causal-chain identifiers survive, and can a SQL query distinguish the straggler from the healthy ranks?

The proof is cmd/ingero-echo/integration_test.go plus a hardware run on Lambda Cloud. Eight concurrent producers, 250 events each, OTLP/gRPC into a DuckDB-backed Echo. Mixed across two cluster IDs. Causal-chain markers injected every 25th event. Stragglers (synthetic low-health-score events) injected every 100th. Total: 2,000 events.

The test asserts:

All events land: SELECT COUNT(*) FROM events == 2000.
Per-rank counts are correct: each of the 8 producer node IDs has exactly 250 events.
Causal chains preserved across the wire: the 80 distinct causal_chain_id markers we inserted at the producer come back from the store. None lost. None merged.
Stragglers surface in a cross-cluster query: SELECT node_id, COUNT(*) FROM events WHERE value_double < 0.4 GROUP BY node_id returns the 18 events we injected and only those.
Burst write does not lose events under contention: a separate test runs 5,000 events at ~1k EPS through WriteEvents from 8 goroutines simultaneously, all serialized via Echo’s writer mutex. Zero loss; throughput floor met.

The integration test (TestEchoFanIn_AllEventsLand) runs in under 21 seconds on a CI runner. The hardware run on the populated DB is reproducible from the artifacts attached to this post.

The hardware run

We provisioned an A100 (40GB SXM4) on Lambda us-east-1 and an H100 (80GB SXM5) on Lambda us-south-3 to play the role of two GPU nodes. Echo ran on the A100. A simple OTLP/gRPC stress client (cmd/echo-stress/) pushed 1,000 events from each node into Echo. Cross-region OTLP from H100 to A100 was blocked by Lambda’s default firewall, so the H100 stream was simulated from the A100 host with --node-id=node-h100. Echo’s fan-in path treats both streams identically; the test exercises the same RPC handler, the same writer mutex, and the same DuckDB schema. The DB attached to this post (echo-fanin-demo.db, 1.0 MiB) is the result.

The exact commands are in commands.md next to this post. The runbook starts Echo with a bearer token, runs two echo-stress invocations with different node IDs, and validates with three DuckDB queries.

The queries that matter

These are the SQL queries that produced the assertions above. All four run against the attached DB.

Per-node event count.

SELECT cluster_id, node_id, COUNT(*) AS events
FROM events
GROUP BY cluster_id, node_id
ORDER BY node_id;

demo-cluster | node-a100 | 1000
demo-cluster | node-h100 | 1000

Causal chains preserved.

SELECT COUNT(DISTINCT json_extract_string(attrs_json, '$.causal_chain_id')) AS chains
FROM events
WHERE attrs_json LIKE '%causal_chain_id%';

Stragglers per node.

SELECT node_id, COUNT(*) AS straggler_events
FROM events
WHERE value_double < 0.40
GROUP BY node_id;

node-a100 | 9
node-h100 | 9

Median and p95 health by node.

SELECT node_id,
       quantile_cont(value_double, 0.5)  AS median_health,
       quantile_cont(value_double, 0.95) AS p95_health
FROM events
GROUP BY node_id;

The same shape of query is what an agent calls through the MCP run_analysis tool. The lexical SQL gate (sqlguard) rejects any query that touches the filesystem (read_csv, read_parquet, the READ_* family, the SNIFF_* family, bare-quoted FROM table sources, the httpfs/s3/gcs schemes) and any query that introspects DuckDB’s own catalog (duckdb_settings, duckdb_tables, the duckdb_* family). The guard runs once at the MCP boundary and once again inside the store, so the gate cannot drift between layers.

What this changes for AI agents

The MCP server is the part that matters for agents. An agent investigating “throughput dropped at 14:32 UTC, every rank reports healthy, why” can now ask Echo directly:

fleet.cluster.find_outlier_nodes(window="14:30-14:35", metric="ingero.health.score", threshold=0.4)

and receive back the ranked node list. It can then ask:

fleet.cluster.event_history(cluster_id="...", node_id="<outlier>", window="14:30-14:35", limit=1000)

to pull the call stack. It can finally call:

fleet.cluster.run_analysis(sql="SELECT ... causal_chain_id ... GROUP BY node_id ...")

to pivot the data into the shape that answers the question. The agent never sees the populated DB directly. It sees four tools, each returning JSON-shaped responses bounded by the lexical guard.

This is the gap that per-host MCP servers cannot close. A per-host MCP server can answer “what did the agent do on this host?” but it cannot answer “what was the cluster doing when the agent observed the spike?” Cross-rank causal questions need a cross-rank AI-investigation surface. Echo is that surface.

Try it locally

Two paths, depending on whether you want to run the demo end-to-end or just inspect the recorded output.

Reproduce the fan-in scenario from scratch. The integration test in cmd/ingero-echo/integration_test.go spins up Echo backed by a fresh DuckDB in a per-test temp directory, fans in 8 concurrent agents pushing 250 events each (2,000 events total), and asserts that all events landed, the planted outlier surfaces in the MCP query, and causal-chain events are preserved with all attributes. Each invocation produces its own DB.

git clone https://github.com/ingero-io/ingero-fleet.git
cd ingero-fleet/cmd/ingero-echo
go test -run TestEchoFanIn_AllEventsLand ./...

The test takes under 10 seconds on a developer laptop. Requirement: a Go toolchain plus DuckDB’s CGO build dependencies (libstdc++).

To inspect the populated DB after the test runs, set ECHO_BLOG_ARTIFACT=1 in the environment and the test will copy the final DB to /tmp/echo-fanin-demo.db. Then:

ECHO_BLOG_ARTIFACT=1 go test -run TestEchoFanIn_AllEventsLand ./...
duckdb /tmp/echo-fanin-demo.db

Run any of the queries from “The queries that matter” section above against this freshly captured DB; the schema is identical, only the random event IDs differ.

Inspect the published demo DB without running anything. The same DB referenced earlier in this post is published in the public Fleet repo. 2,000 events, 2 clusters, 80 causal chains preserved across the wire, 18 stragglers detected end-to-end.

curl -fsSL -o echo-fanin-demo.db \
  https://github.com/ingero-io/ingero-fleet/raw/main/investigations/echo-fanin-demo.db

duckdb echo-fanin-demo.db

Open it and run the same queries from “The queries that matter” section above. The Echo schema is documented in cmd/ingero-echo/store/schema.go: one row per OTLP data point, dedicated columns for cluster_id / node_id / metric_name / rank / nranks / value_double / value_int, and an attrs VARCHAR holding the rest as JSON. Two indexes target the most-used filters ((cluster_id, timestamp_ns) and (node_id, timestamp_ns)).

The two paths are independent: the test reproduction does not read the published DB, and the published DB does not require the test to be run. Both demonstrate the same Echo store schema, so a query that works on one works on the other.

Total wall time is under five minutes; total cost on Lambda was about $0.50 for the A100 alone.

Across hosts, not on hosts

A per-host trace can be perfectly correct and still useless. The pattern across hosts is what carries the cause. Eight ranks blocked inside ncclAllReduce looks identical to eight ranks running healthy work; the only thing that distinguishes the two is whether one rank entered late. That fact lives in a join, not in a single host’s events.

Echo’s job is to be the AI-investigation surface where the join can run, with the same OTLP semantic conventions on the wire, the same DuckDB schema underneath, and the same MCP shape that agents are already learning to use. The fan-in correctness proof is the gate before the rest of the work depends on the store. With v0.12.5 it is shipped, tested, and reproducible.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. *GitHub ⭐** · Open an issue if you are running distributed training or inference and seeing throughput drop while every rank reads healthy.

Investigation DB: investigations/echo-fanin-demo.db*

GPU Utilization Is a Counter, Not a Cause

Ingero Team — Mon, 04 May 2026 17:08:26 +0000

nvidia-smi reads 97% the entire window. The red gaps in the cause-side timeline are the throughput the GPU lost while the counter sat green.

TL;DR

A vLLM server reads 97% GPU utilization on nvidia-smi for an 8-minute window. Token throughput drops 3x in the middle of that window. Both statements are true, and both come from the same workload. The reason is that GPU utilization as nvidia-smi reports it is a duty-cycle counter (percent of time at least one kernel was running), not a measure of useful work. Five different failure modes score 100% on that counter while throughput collapses. Causal observability lives in the layer below: kernel runtime distributions, off-CPU time on the dispatcher thread, NCCL waits, I/O stalls.

The mystery

We were running an internal repro of a vLLM latency spike on a TensorDock RTX 4090 (vLLM 0.18.0, Qwen2.5-0.5B-Instruct). Two metrics from the same 8-minute window:

nvidia-smi: 97% GPU utilization (sampled every second, range 92-99%, never below 90%)
Token throughput: started at 2,180 tok/s, dropped to 730 tok/s by minute 4, recovered by minute 7

Nothing on the GPU dashboard moved. The fan curve was flat. Memory was steady. Power draw stayed at 320W. By every counter on the host, the workload was healthy.

It wasn’t.

The actual root cause was an n_completions=8 logprobs=20 request that expanded each decode step into 8 sequences with full-vocabulary softmax (~150K tokens). That request blocked every co-scheduled request for 9-11 seconds at a time. The GPU stayed “utilized” the entire window because some kernel was always running. None of those kernels were producing user-visible tokens.

This is not an exotic edge case. It is the standard failure mode of GPU monitoring when the only metric in the loop is utilization.

What nvidia-smi actually counts

NVIDIA’s own documentation defines GPU-Util as: percent of time over the past sample period during which one or more kernels was executing on the GPU. That is a duty-cycle measurement. It says nothing about whether the running kernel is doing useful work, whether it is bandwidth-bound, whether it is the right kernel, whether it is blocking other kernels, or whether the dispatcher thread on the host is feeding it efficiently.

DCGM exposes the same number with finer granularity (DCGM_FI_DEV_GPU_UTIL), plus per-engine counters (SM_ACTIVE, TENSOR_ACTIVE, MEM_COPY_UTIL). The deeper counters help, but they remain counters. A kernel that runs at 5% of peak FLOPS for 100ms still scores 100% on SM_ACTIVE for that interval.

Five ways to score 100% utilization with broken throughput

We have traced each of these on real workloads. The pattern is consistent: the counter is high, the throughput is low, the dashboard tells nobody anything.

1. Prefill/decode imbalance. vLLM, SGLang, and TGI all batch prefill (input tokens) and decode (output tokens) on the same hardware. When prefill is 100x more compute-heavy than decode, a single long-context request stalls every short-context request behind it. GPU utilization stays at 100% because prefill kernels are saturating the SMs. Decode latency for the queued requests is unbounded.

2. Collective-communication wait in distributed training. A 4-GPU all-reduce that waits on the slowest rank shows 100% utilization on every fast rank (the kernel that implements the wait is itself a kernel). Throughput is bounded by the slow rank, not by the average. We wrote this up in detail in a prior post on cross-rank straggler detection.

3. I/O stall on the dataloader. When PyTorch’s DataLoader does index permutation on the main process and the iteration becomes single-threaded, the GPU runs the same forward kernel over and over while the next batch is gated on a cudaStreamSync. The kernel runs at full speed; the next launch is blocked. We wrote this up in the DataLoader post.

4. CPU contention on the engine thread. vLLM’s engine loop is single-threaded. When the OS context-switches it for any reason (kernel work on a neighboring core, an interrupt, an unfortunate cgroup), cudaLaunchKernel from that thread blocks. We have measured cudaLaunchKernel p99 at 13.1ms (against a p50 of 16.7us, a 784x spread) on an otherwise-idle host, all attributable to context switches. The GPU continues running whatever kernel was launched before the stall, so utilization stays high.

5. Memory-bandwidth saturation. A kernel that streams more data than the SMs can consume scores 100% on SM_ACTIVE while running at a small fraction of peak FLOPS. The metric that matters here is DRAM bandwidth, not utilization.

In all five cases, the symptom is identical (high utilization, low throughput). The cause is in a different layer.

What the cause-side metrics look like

A useful question is: “what was the GPU waiting on, second by second?” Answering that requires four data sources, correlated by timestamp on the same host:

CUDA Runtime API calls (libcudart.so uprobe set: cudaLaunchKernel, cudaMemcpyAsync, cudaStreamSynchronize, cudaDeviceSynchronize)
CUDA Driver API calls (libcuda.so uprobe set: cuLaunchKernel for cuBLAS / cuDNN paths)
Linux scheduler tracepoints (sched_switch, sched_wakeup)
Per-thread off-CPU time accumulated against the dispatcher PID

Concretely, here is what the trace from the workload above looks like once those four sources are correlated:

Window:  minute 3 -> minute 7 (the 3x throughput drop)
GPU-Util: 95% mean
Cause-side metrics on the engine thread:

cudaLaunchKernel p50:        17us
cudaLaunchKernel p99:    13,100us       (770x spread)
cudaLaunchKernel n calls:  4,420
sched_switch events:       2,180        on the engine thread (PID 84217)
off-CPU time:                 8.9 s     accumulated across the window
total wall time on thread:  240   s
fraction off-CPU:           3.7%        of wall time, but
fraction of cudaLaunchKernel calls
  with off-CPU between
  start and finish:         18%

Top blocking call stacks (off-CPU):
  - schedule() -> futex_wait_queue_me   (1,840 events, mean 4.1ms)
  - schedule() -> io_schedule           (212 events, mean 19ms)
  - schedule() -> rwsem_down_read_slow  (128 events, mean 7.2ms)

The 18% of cudaLaunchKernel calls that experienced an off-CPU event between the syscall enter and exit is the actual root cause. The GPU sat idle for those microseconds because the dispatcher thread was off-CPU. The kernel that runs after the dispatcher returns scores its 100% on SM_ACTIVE. The damage was already done.

This is the kind of question utilization counters cannot answer. They were never built to.

Counter vs. cause, by metric

What you see	What it is	What it does not tell you
`GPU-Util` from nvidia-smi	Duty cycle: percent of time >= 1 kernel was running	Whether the kernel is doing useful work, whether dispatch is timely
`SM_ACTIVE` from DCGM	Per-SM duty cycle	Same gap, finer granularity
`TENSOR_ACTIVE` from DCGM	Tensor-core duty cycle	Whether tensor cores are bandwidth-starved
`MEM_COPY_UTIL` from DCGM	DMA engine duty cycle	Whether transfers gate compute
Token throughput	End-to-end work	Where the throughput went when it dropped

What you want underneath:

Cause-side signal	What it tells you
Kernel-runtime distribution per kernel name (p50, p99)	Is the same kernel taking 100x longer some calls than others?
`cudaLaunchKernel` p50/p99 spread	Is the dispatcher thread being preempted?
`sched_switch` count on dispatcher PID	How many context switches stole CPU from dispatch
Off-CPU time per dispatcher PID, decomposed by kernel call stack	What system event blocked the thread (futex, I/O, semaphore)
NCCL wait time per rank	Which rank is the straggler
I/O wait time on the dataloader process	Whether the dataloader is gating the GPU

These are the metrics that change when throughput changes. Utilization mostly does not.

Try it locally

Run a vLLM server on a single GPU. Hit it with a mixed workload (8 short prompts + 1 long prefill). Watch nvidia-smi. The utilization counter will sit between 90% and 99% for the entire window. Token throughput will drop sharply when the long prefill is in flight.

The investigation database from the vLLM repro described above is in the source repo at investigations/vllm-37343-logprobs-amplification.db. You can either reproduce the trace yourself or query the captured DB directly.

# 1. Capture a fresh trace (Linux, recent kernel, NVIDIA driver, root or CAP_BPF + CAP_PERFMON)
sudo ingero check
sudo ingero trace --duration 120s --db /tmp/vllm.db

# 2. Or skip the capture and query the prebuilt DB
git clone https://github.com/ingero-io/ingero.git
cd ingero

To investigate via an AI agent (Claude Code, Cursor, or a local model), point the Ingero MCP server at the DB and ask questions:

# Local model with no data leaving the machine
pip install mcp-client-for-ollama
cat > /tmp/ingero-mcp.json << 'EOF'
{"mcpServers":{"ingero":{"command":"./bin/ingero","args":["mcp","--db","investigations/vllm-37343-logprobs-amplification.db"]}}}
EOF
ollmcp -m qwen3:32b -j /tmp/ingero-mcp.json

The agent can call get_trace_stats to see the p50/p99 spread on every CUDA operation, get_causal_chains to surface the ranked stalls and their root causes, and run_sql for ad-hoc questions against the events table. The MCP server exposes seven tools in total; full list in the Ingero MCP docs.

Smoke and fire

Utilization is the smoke. The cause is what made the smoke. A monitor that reports the smoke is helpful for waking somebody up. It is not enough to point at the fire.

This is the gap that vendor-agent counters cannot close, because the questions they answer are duty-cycle questions (“was the GPU busy?”) rather than causal ones (“what was the GPU waiting on, and which thread on the host owns the wait?”). Those causal questions live one layer down, in the CUDA API and the kernel scheduler. eBPF can read both at production overhead. That combination is the difference between “the dashboard is green” and “we know why throughput fell at minute 4.”

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. *GitHub ⭐** · Open an issue if you are running production GPU workloads and seeing utilization counters disagree with throughput.
Investigation DB: investigations/vllm-37343-logprobs-amplification.db*

CUDA Out of Memory at 60% Utilization: Tracing PyTorch GPU Memory Fragmentation

Ingero Team — Mon, 04 May 2026 13:30:00 +0000

TL;DR

A PyTorch training job crashes with CUDA error: out of memory at 60-70% GPU memory utilization. nvidia-smi says there is free memory. torch.cuda.memory_summary() shows fragmented blocks. But neither tool explains why it happened or when it started. Tracing every cudaMalloc and cudaFree call at the kernel level via eBPF uprobes reveals the exact allocation pattern that caused fragmentation and which code path triggered it.

The Problem

A model trains fine for hours, then suddenly:

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 256.00 MiB (GPU 0; 15.90 GiB total capacity;
10.24 GiB already allocated; 1.89 GiB free; 11.52 GiB reserved)

Wait. 1.89 GiB free, but can't allocate 256 MiB? That's memory fragmentation. The free memory exists, but it's scattered across hundreds of small non-contiguous blocks. No single block is large enough.

This is the #1 GPU debugging pain point for ML engineers. Everyone hits it. The standard advice is "reduce batch size" , but that's treating the symptom, not the cause.

What nvidia-smi Shows

+-------------------------------------------+
| GPU  Name        | Memory-Usage            |
|==================+=========================|
|   0  Tesla T4    | 10240MiB / 15360MiB     |
+-------------------------------------------+

66% utilization. Looks fine. nvidia-smi has no concept of fragmentation. It only reports total used vs. total available. It cannot show: how many individual allocations exist, what sizes they are, which ones are creating fragmentation, or when the fragmentation pattern started.

What torch.cuda.memory_summary() Shows

>>> print(torch.cuda.memory_summary())

|        Metric         | Cur Usage  | Peak Usage |
|-----------------------|------------|------------|
| Allocated memory      |  10240 MiB |  14336 MiB |
| Active memory         |   8192 MiB |  12288 MiB |
| GPU reserved memory   |  11520 MiB |  15360 MiB |
| Non-releasable memory |   3328 MiB |   4096 MiB |

Better: the gap is visible between allocated and reserved. But this is a snapshot. It doesn't show: the temporal pattern (when did fragmentation start?), which code path is causing the problematic allocations, whether host-side events (CPU contention, memory pressure) contributed, or the allocation/free cadence that led to fragmentation.

What the Trace Shows

The tracer traces every cudaMalloc and cudaFree call via eBPF uprobes on libcudart.so, with zero code changes and <2% overhead. Here's what a real investigation looks like.

Step 1: See the allocation pattern

$ ingero explain --per-process --since 300s

Process: train.py (PID 4821)
  cudaMalloc    | 5,012 calls | p50=65µs  | p99=2.1ms  | total: 406 GB allocated
  cudaFree      | 4,806 calls | p50=12µs  | p99=890µs  | total: 392 GB freed
  cudaStreamSync| 1,203 calls | p50=1.2ms | p99=45ms   |
  ⚠ malloc/free imbalance: 206 allocations without corresponding free
  ⚠ cudaMalloc p99 (2.1ms) is 32x p50 (65µs): fragmentation pressure

That 206-allocation imbalance over 5 minutes means memory is slowly leaking. And the p99/p50 ratio of 32x on cudaMalloc shows the allocator is struggling to find contiguous blocks.

Step 2: Find the causal chain

$ ingero explain --since 300s

Causal Chains (last 5 min):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[HIGH] Memory fragmentation → cudaMalloc latency spike
  Root: 5,012 cudaMalloc calls in 300s (16.7/sec), sizes 4KB-256MB
  Effect: cudaMalloc p99 climbed from 65µs → 2.1ms over 5 minutes
  Compounding: 4 DataLoader workers competing for CPU during alloc
  Fix: Use torch.cuda.memory.set_per_process_memory_fraction()
       or pre-allocate with torch.cuda.caching_allocator_alloc()

Step 3: Drill into the timeline with MCP

Using the MCP server (works with Claude, Cursor, or any MCP client):

SELECT
  (timestamp / 30000000000) * 30 as window_sec,
  COUNT(*) as allocs,
  AVG(duration_ns)/1000 as avg_us,
  MAX(duration_ns)/1000 as max_us,
  SUM(arg0)/1048576 as total_mb
FROM events
WHERE op = 'cudaMalloc'
GROUP BY window_sec
ORDER BY window_sec;

window_sec | allocs | avg_us | max_us  | total_mb
-----------|--------|--------|---------|----------
0          | 312    | 52     | 180     | 24,576
30         | 340    | 68     | 420     | 27,200
60         | 356    | 95     | 890     | 28,800
90         | 389    | 180    | 1,400   | 31,200
120        | 401    | 320    | 2,100   | 32,800   ← fragmentation visible

The average allocation latency is climbing monotonically. By window 120s, average cudaMalloc is 6x slower than at startup. This is the fragmentation building up in real-time, something no other tool reveals in production.

Step 4: Find the Python source line

With -stack enabled, The tracer captures the full call stack including CPython frames:

Top cudaMalloc callers:
  alloc_stress.py:74  → cudaMalloc | 4,009 calls | avg 1.0ms
  alloc_stress.py:74  → cuMemAlloc | 1,718 calls | avg 0.9ms  (FFI bypass)
  torch.cuda.empty_cache() → cudaMalloc | 156 calls | avg 0.7ms

Line 74 of the training script is doing tight cudaFree→cudaMalloc loops that fragment the memory pool. The FFI bypass path (1,718 calls going through cuMemAlloc directly) means some allocations skip PyTorch's caching allocator entirely.

The Fix

Once the cause is identified as fragmentation from rapid alloc/free cycling, the fix is straightforward:

Use PyTorch's memory pool: Replace manual torch.cuda.empty_cache() calls with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Pre-allocate at startup: Create your largest tensors early with torch.empty(…, device='cuda') so the caching allocator grabs contiguous blocks before memory fragments
Set memory fraction: torch.cuda.set_per_process_memory_fraction(0.8) prevents runaway allocation
Reduce DataLoader workers: In the investigation above, 4 workers competing for CPU during cudaMalloc created scheduling delays that compounded the fragmentation

Try It Yourself

Ingero runs on any Linux machine with a 5.15+ kernel. No GPU required for the demo:

git clone https://github.com/ingero-io/ingero.git && cd ingero 
bash scripts/install-deps.sh && source ~/.bashrc && make
# See a causal chain form in real-time
./bin/ingero demo incident 
# run ./bin/ingero demo (no other args) to see more demos

For real GPU training load tracing:

# in terminal #1
sudo ./bin/ingero trace
# in terminal #2: 
# run the training job ...
# in terminal #1, CTRL+C to stop tracing, then
./bin/ingero explain --per-process --since 300s

GitHub (give us a star!): github.com/ingero-io/ingero No NVIDIA SDK, no code changes, no CUPTI overhead.

If you are seeing CUDA memory fragmentation in your own workloads, we'd love to take a look. Drop an issue on GitHub and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

26 Seconds to Find a Straggler: Fleet v0.10 End-to-End on A100 and GH200

Ingero Team — Mon, 27 Apr 2026 18:08:56 +0000

TL;DR

Ingero Fleet v0.10 FOSS is live. We validated the full pipeline end-to-end on two 3-node Lambda Cloud clusters: 3x A100 SXM4 (x86_64) and 3x GH200 (aarch64, 64k pages, Grace kernel 6.8.0-1013-nvidia-64k). Same Fleet + agent + straggler-sink stack on both. One straggler per cluster, injected by removing the matmul workload from one node.

	A100	GH200
Region	us-east-1	us-east-3
Kernel	`6.8.0-60-generic`, 4k pages	`6.8.0-1013-nvidia-64k`, 64k pages
Steady-state fleet threshold	0.88	0.88
Time to STRAGGLER after injection	26 s	~30 s
Sink events (state / resolved)	23 / 1	79 / 5
Sink parse errors	0	0

One thing was not identical: the released multi-arch agent image’s BPF objects did not relocate against Lambda’s Grace kernel. We rebuild on-host (scripted), and we’re shipping the proper fix in v0.10.1 next week.

What Fleet v0.10 actually is

Fleet is an OpenTelemetry Collector distribution with two custom components. Agents on each GPU node push a health score (0.0 to 1.0) over OTLP every 5 seconds. A processor computes a peer-relative threshold using MAD (Median Absolute Deviation, 50% breakdown point against outliers). An extension serves that threshold back to agents via response headers (piggyback) and a fallback GET endpoint. Agents compare their own score against the threshold and emit a straggler event over a local Unix socket if they cross it. A reference straggler-sink sidecar converts the stream into Prometheus counters.

Nothing in v0.10 acts on straggler events. v0.10 is observability only, FOSS, Apache 2.0. Remediation (pause the NCCL collective, pin a new job to a different topology, whatever) is separate and not part of this ship.

The test

Three nodes per cluster. Node 01 is the k3s control plane AND a GPU worker. Fleet Deployment (replicaCount=1) lives on node 01. The agent runs as two DaemonSets on every GPU node: trace (writes signals to a local SQLite DB) and fleet-push (reads the DB, pushes OTLP to Fleet, consumes threshold, emits to the sink UDS). The sink is a sidecar in the fleet-push pod. An Alloy Deployment remote-writes Fleet self-metrics and per-node sink counters to a Grafana Cloud stack.

Baseline load: a 4096×4096 f32 CUDA matmul loop in a PyTorch container on every GPU node. After ~2 minutes the peer-relative threshold stabilizes around 0.88 with quorum_met=true.

Injection: delete the matmul pod from one node and taint the node so it does not reschedule. The agent on that node stops seeing CUDA activity. Its health score collapses. Fleet’s processor sees the divergence in the MAD. The agent on the divergent node polls the threshold, notices its local score is below it, and writes a straggler_state transition event to the sink UDS.

Numbers from the A100 run

Captured in ingero-fleet/examples/lambda-e2e/a100-artifacts/. Edited to the relevant lines:

2026-04-21 13:17:46  Fleet boots, OTLP gRPC on :4317, HTTP on :4318
2026-04-21 13:19:22  All 3 agents push_interval=5s, quorum_met=true, threshold=0.88
2026-04-21 13:26:14  Peer-relative stable: median=0.989  mad=0.000040
2026-04-21 13:26:20  Straggler injected (kubectl delete pod matmul-baseline-...)
2026-04-21 13:26:46  STRAGGLER fires at T+26s: score=0.8598 threshold=0.8767 mode=fleet

The MAD visibly spikes within 10 seconds of injection and the median drops from 0.989 to as low as 0.839 during active divergence. At T+26s the detector crosses threshold and writes an event.

Numbers from the GH200 run

Same test, same baseline. Captured in ingero-fleet/examples/lambda-e2e/gh200-artifacts/:

2026-04-21 14:45:11  3 GH200 nodes (aarch64), kernel 6.8.0-1013-nvidia-64k, 64k pages
2026-04-21 14:52:55  Peer quorum, threshold=0.88
2026-04-21 14:58:23  Straggler injected
2026-04-21 14:58:56  STRAGGLER at score=0.8292 threshold=0.9 mode=fleet

Detection fires around T+33s. Slightly slower than A100 but within the same push-interval noise band. Across the run the sink booked 79 straggler_state events and 5 straggler_resolved events (we let the workload drift around the threshold) with 0 parse errors.

The one GH200 wrinkle

The agent loads eBPF uprobes against libcudart.so and libcuda.so at startup. On GH200, the CO-RE relocation for uprobe_cuda_free failed:

loading eBPF objects: field UprobeCudaFree: program uprobe_cuda_free:
load program: bad CO-RE relocation: invalid func unknown#195896080

The BPF objects baked into the multi-arch image were compiled against a kernel BTF that doesn’t match Lambda’s 6.8.0-1013-nvidia-64k Grace kernel. CO-RE’s function-ID relocation didn’t resolve.

We worked around it by building on-host. One GH200 VM gets clang, llvm, libbpf-dev, linux-tools-$(uname -r), and Go. make generate build in the agent repo detects BPF_TARGET_ARCH=arm64, regenerates bpf/headers/vmlinux.h from /sys/kernel/btf/vmlinux (the exact kernel the agent will run on), recompiles the BPF objects, and links the binary. Then we repackage into an Alpine image and push it to our registry. The other GH200 nodes pull normally. Whole thing is scripted: examples/lambda-e2e/scripts/build-arm64-on-host.sh.

We’re shipping the proper fix in v0.10.1 soon: runtime libbpf compile from /sys/kernel/btf/vmlinux at agent startup, same pattern Cilium and Tetragon use. One image, any kernel with BTF. No on-host rebuild step.

What the detection pipeline actually does (and does not)

We’re deliberately narrow in v0.10:

3+ agent quorum before the peer-relative threshold is considered valid. Below quorum, agents fall back to a local rolling baseline.
MAD smoothed with EMA. A single straggler cannot shift the threshold (breakdown point is 50%).
Fail-open. If Fleet is unreachable, agents use their cached threshold first, then local baseline. Straggler detection degrades gracefully, never blocks workloads.
Stateless Fleet. Restart rebuilds state from incoming pushes in about 10 seconds. No database, no disk.

And deliberately out of scope:

No remediation orchestration. Straggler events land as Prometheus counters at the sink; nothing tries to act.
No multi-replica Fleet routing today. replicaCount: 1 is the recommended default. Multi-replica needs an L7 LB with consistent-hash on cluster_id. Native consistent-hash routing is a future release.
No long-soak proof yet. We ran ~1 hour across both clusters.
No real NCCL workload validation yet. We used a synthetic matmul for v0.10; a real NCCL all-reduce test is in the v0.10+ plan.

Try It Yourself

Full repro kit: https://github.com/ingero-io/ingero-fleet/examples/lambda-e2e

Prerequisites:

Lambda Cloud account with API token and SSH key registered (or swap provision.sh for any other GPU VM provider).
GHCR read token (CR_READ_PAT, read:packages).
Grafana Cloud free-tier stack with a MetricsPublisher API key (optional, wires the dashboard).
Local: curl, jq, ssh, python3, helm, clones of ingero-io/ingero and ingero-io/ingero-fleet.

A100 walkthrough:

source scripts/00-env.sh
export CLUSTER_ID=lambda-a100 REGION=us-east-1
export INSTANCE_TYPE=gpu_1x_a100_sxm4 NODE_COUNT=3
./scripts/provision.sh
source lambda-instances.env
./scripts/10-bootstrap-k3s.sh     # k3s + NVIDIA device plugin + RuntimeClass
./scripts/20-deploy-stack.sh      # Fleet + agent + Alloy
./scripts/30-baseline.sh          # matmul on every node, wait for peer quorum
./scripts/40-inject-straggler.sh  # remove matmul from one node, watch detection
REC_DIR=./artifacts ./scripts/50-record-artifacts.sh
./scripts/60-teardown.sh

GH200 is the same, with one extra step after 10-bootstrap-k3s.sh:

./scripts/build-arm64-on-host.sh          # rebuild agent against host BTF, push to your registry
export AGENT_IMAGE=ghcr.io/&lt;you&gt;/ingero:v0.10.0-gh200
./scripts/20-deploy-stack.sh
# ... rest same as A100

Full Lambda burn for both clusters was under $11 (about 1 hour each).

Artifacts

Both runs are archived and live:

Live Grafana dashboard: https://ingero.grafana.net/public-dashboards/11d240020d394fa382c4b9facb9fde69 (both clusters overlaid, cluster_id dropdown lets you view one at a time; time range locked to the 2026-04-21 run window)
Asciinema casts: A100 run and GH200 run (about 3 minutes each, pause and rewind in the player)
Lambda E2E kit with scripts and captured run artifacts: ingero-fleet/examples/lambda-e2e/
Reference dashboard JSON (import into your own Grafana Cloud or OSS Grafana): ingero-fleet/examples/grafana/v0.10.json
Release: https://github.com/ingero-io/ingero-fleet/releases/tag/v0.10.0

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. *GitHub ⭐** · Open an issue if you are running multi-node GPU training and want to measure straggler waste across A100, H100, or GH200 fleets.*

Production GPU Training is 34% Slower. Show Me Why

Ingero Team — Thu, 23 Apr 2026 14:05:34 +0000

A single slow GPU – a straggler – in a 1,000-node training cluster idles 999 healthy GPUs at every AllReduce barrier. The job does not crash. There is no error message. GPU stragglers just make training run slower than it should – sometimes for hours.

This is not hypothetical. Production data from the largest GPU operators tells a consistent story.

The numbers

ByteDance (FALCON, 2024): A five-month study of 3,079 training jobs across clusters of 128 to 5,000+ GPUs found that 60% of large-scale jobs (512-1024 GPUs) experienced fail-slow events. The average duration of a fail-slow: 72 minutes. One in five affected jobs was delayed by more than 50% of its intended compute time. The average job completion time was extended by 34.59%.

Meta (Llama 3, 2024): Training on 16,384 H100 GPUs over 54 days produced 419 unexpected failures – roughly one every three hours. 78% were attributed to hardware degradation. The team achieved greater than 90% effective training time, but only through unprecedented custom tooling, rapid checkpoint recovery, and constant manual oversight. Most organizations cannot replicate this.

ByteDance (ByteRobust, 2025): Hardware failure approximately every 2.78 hours on 16,000 GPUs. 38,236 explicit failures plus 5,948 implicit failures across 778,135 jobs over three months.

Projected at scale (Epoch AI): At 100,000 GPUs, one failure every 30 minutes. At 1,000,000 GPUs, one failure every 3 minutes.

Why nobody detects them

Fail-slow events are the worst kind of failure. The training job continues to make progress. The GPU reports healthy utilization. The process does not crash. No watchdog fires. No alert triggers.

The standard detection mechanism – NCCL’s built-in watchdog – has a default timeout of 30 minutes. It is designed to catch fail-stop events (total crashes), not fail-slow events (gradual degradation). A GPU that is 40% slower will never trigger a 30-minute timeout. It will just make the entire cluster 40% slower, silently, for the duration of the run.

nvidia-smi will show that GPU at 100% utilization. Because it is 100% utilized – it is executing the work it receives. The problem is that it takes longer to execute, and every other GPU waits for it at every synchronization barrier.

What causes them

Fail-slows are not exotic failures. They are ordinary infrastructure events that happen to affect a single node more than others:

Thermal throttling – one GPU hits 83C, clocks down from 1755 MHz to 345 MHz. The rest of the cluster is at 75C. Onset: under 10 seconds. Duration: entire training run. No error, no alert. Just 25-50% slower on that one node.

CPU contention – a background process (apt-daily, logrotate, a monitoring agent) steals CPU cycles from the DataLoader on one node. The GPU is starved of data. Onset: immediate. Duration: until the process finishes. This is the root cause we see most frequently in our investigations.

NUMA misalignment – GPU 0 is on NUMA node 0 but its CPU affinity is set to NUMA node 1 cores. Every memory access crosses the NUMA boundary. 10-30% slower, permanently, from job start.

Network asymmetry – one InfiniBand link runs at half bandwidth (firmware bug, bad cable, transceiver degradation). NCCL AllReduce on that rank takes twice as long. Every other rank waits.

ECC memory degradation – correctable ECC errors consume memory bandwidth. A few per day is normal. Hundreds per hour signals HBM degradation. The GPU still functions, but 5-15% slower. No error is logged.

The compounding effect: in a 1,000-GPU cluster, there are probably 5-10 GPU stragglers at any given time, each for different reasons. The worst of the GPU stragglers sets the pace. Fix it, and the second-worst becomes the new bottleneck.

The math

At 1,000 H100 GPUs with cloud pricing around $3.50/GPU-hour:

A single fail-slow event lasting 72 minutes (FALCON average) with a 65% performance degradation (Google’s estimate for a single GPU straggler):

72 min x 1,000 GPUs x $3.50/hr x 0.65 waste fraction / 60 = $2,730 per event

With FALCON’s finding that 60% of large jobs are affected, and assuming 2-3 events per job:

A multi-week training run accumulates $50,000-$100,000 in GPU stragglers waste that appears nowhere in any dashboard.

At 10,000 GPUs, multiply by 10.

How distributed systems solved this

This problem is not new. Every distributed system that depends on synchronous coordination has faced it.

Kafka solved replica lag detection by moving from message-count-based thresholds (brittle, requires tuning per topic) to time-based detection. The lesson: measure outcomes (time to process), not inputs (messages behind). Their evolution from replica.lag.max.messages (removed) to replica.lag.time.max.ms took years of production pain.

Elasticsearch solved query routing to degraded replicas with Adaptive Replica Selection, using EWMA of response times with a cubic penalty for queue depth. Based on the C3 paper (USENIX NSDI 2015). Result: 113% throughput improvement, 65% p90 latency reduction.

Redis Cluster solved node failure detection with a two-phase protocol: PFAIL (one node suspects) escalated to FAIL (majority of masters agree). No single observer’s opinion is sufficient.

Envoy solved backend health with statistical outlier detection: compare each backend’s success rate against the group’s standard deviation. A backend that is statistically worse than its peers gets ejected. No threshold tuning required.

The pattern across all of these: compare against peers, not against absolute thresholds. A node is unhealthy not because it crossed a magic number, but because it is performing worse than the nodes around it.

GPU clusters have not adopted any of these patterns. The standard approach is still: set a static threshold on DCGM metrics in Prometheus, hope someone is watching the dashboard, and debug manually when a job takes too long.

The detection gap

Time	What happens
T+0s	GPU starts thermal throttling (or CPU contention begins, or IB link degrades)
T+0s to T+72min	Training runs slower. No alert. nvidia-smi shows 100%. NCCL timeout (30 min) does not fire because the job is making progress, just slowly.
T+72min (average)	Either: the condition resolves on its own, or the job eventually fails/times out hours later, or nobody ever notices and the run just costs more than it should.

Between T+0 and T+72min, an automated system could have detected the straggler and acted: redistributed micro-batches (~30 seconds, per FALCON’s research), adjusted pipeline topology (~1 minute), or checkpointed and restarted on a healthy node (~2.5 minutes with FlashRecovery).

But you cannot remediate what you cannot detect. And today, most GPU clusters cannot detect a GPU straggler until the job fails or a human notices the throughput graph trending down.

What it would take

A detection system for GPU stragglers needs:

Per-node health scoring – not just GPU utilization, but a composite signal that captures throughput, compute efficiency, memory pressure, and CPU availability.
Peer-relative comparison – not “is this node below 80%?” but “is this node worse than the others in its cluster?” MAD (Median Absolute Deviation) works here because it resists outlier contamination. The same math that Datadog uses for fleet outlier detection.
Adaptive baselines – workloads change. A training node and an inference node have different “normal.” The baseline must adapt, not be configured.
Speed – detection in seconds, not minutes. The FALCON data shows fail-slows average 72 minutes. If you detect in 30 seconds, you have 71.5 minutes to remediate before the job is impacted. If you detect in 30 minutes (NCCL timeout), you have already lost most of the value.

Every one of these techniques exists in production distributed systems today. They have not been applied to GPU clusters.

The question is not whether this problem is solvable. It is whether GPU operators will continue paying the GPU stragglers tax because no one has built the tool.

Ingero – open-source eBPF agent for GPU debugging that traces the full chain from kernel events to CUDA API calls. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. *GitHub ⭐** · Open an issue if you are running GPU training at scale and want to measure your actual straggler waste.*

Originally published on ingero.io.

Agent + MCP + eBPF: 10,869 CUDA Kernel Events, Now Queryable

Ingero Team — Tue, 21 Apr 2026 13:30:00 +0000

A vLLM inference server handles hundreds of requests per second. Then one request with n_completions=8 and logprobs=20 arrives, and every other request blocks for 9-11 seconds. GPU utilization monitors stay green. Kubernetes reports healthy pods. Latency dashboards show a spike but no why. An eBPF trace of every CUDA call is the only view that catches this.

Originally published at ingero.io.

This is a real issue (vLLM #37343). We reproduced it on an RTX 4090 running vLLM 0.18.0 with Qwen3.5 with 27B params. Ingero's eBPF trace captured everything: 10,869 events, 550 cudaLaunchKernel calls, 7,757 context switches, 6 causal chains. All stored in a 1.2 MB SQLite database.

Then we pointed Claude at it via MCP.

The session

The entire investigation below is real. These are actual MCP tool calls, actual data, actual conclusions. Nothing was edited except formatting for readability.

Claude's first move - it called get_trace_stats:

10,869 events traced:
  7,757 sched_switch (context switches)
    550 cudaLaunchKernel   p50: 20us  p99: 5,079us
  1,672 cuLaunchKernel     p50: 15us  p99: 892us
    734 cudaMemcpyAsync
     10 cudaDeviceSync     p50: 19us  p99: 4,720us (239x median)

Two numbers jump out: cudaLaunchKernel p99 is 254x the p50, and there are 7,757 context switches. That ratio is not normal.

Claude's second move - it called get_causal_chains:

The most severe chain:

cudaLaunchKernel p99 = 13.1ms (784x the p50 of 16.7us), correlated with 428 context switches totaling 8.9 seconds off-CPU.

This is the signal. The kernel launch itself takes microseconds. But the CPU thread that feeds the GPU is getting preempted 428 times, accumulating 8.9 seconds of idle time. The GPU sits waiting.

Claude's third move - without being asked, it called get_stacks to look at the actual code paths:

Kernel	Calls	Avg Latency	Max Latency
fused_add_rms_norm	118	54 us	3.0 ms
reshape_and_cache_flash	75	42 us	161 us
silu_and_mul	69	44 us	156 us
rotary_embedding	62	38 us	222 us

All running under VLLM::EngineCore, the single-threaded engine coroutine.

Claude's diagnosis (47 seconds from first tool call):

"The vLLM engine coroutine is running on a CPU shared with other processes. When those processes get scheduled, the coroutine is preempted, the GPU gets no new work, and throughput collapses. The CPU thread driving the GPU was context-switched out 5,347 times, causing 18.4 seconds of CPU unavailability. The root cause is CPU contention, not GPU contention."

Claude's fix recommendation:

Pin the engine thread to dedicated CPU cores (taskset)
Reduce priority of background processes (nice -n 19)
Do NOT use CUDA_LAUNCH_BLOCKING (it will make things worse)
Do NOT trust nvidia-smi (it shows utilization, not causation)

What happened here

No SSH. No log files. No dashboard hopping. No "let me check nvidia-smi on each node."

An AI agent made 4 MCP tool calls against a 1.2 MB SQLite database containing kernel-level eBPF traces. It identified the root cause (CPU scheduling contention), the specific code path (EngineCore coroutine), and the fix (CPU pinning) - all in under a minute.

The key insight: nvidia-smi would have shown 100% GPU utilization during this entire incident. The GPU was "utilized" - it was executing the work it was given. The problem was that it wasn't being given work fast enough because the CPU thread feeding it was being preempted. That distinction - between "GPU is busy" and "GPU is being fed work efficiently" - is invisible to every standard GPU monitoring tool.

What made this possible

This is not a wrapper around nvidia-smi. The eBPF trace attaches uprobes directly to libcudart.so (CUDA Runtime) and libcuda.so (CUDA Driver), plus tracepoints on the Linux kernel scheduler (sched_switch, sched_wakeup), memory allocator (mm_page_alloc), and I/O subsystem. Every CUDA API call is captured with nanosecond precision. Every context switch that preempted a GPU-feeding thread is recorded. The causal chain engine connects them automatically.

The MCP server exposes this data through 10 tools. The AI agent decides what to query. There is no pre-aggregation layer, no dashboard, no human selecting which metrics to look at. The agent gets the raw events and builds the diagnosis.

Try the eBPF trace yourself

The trace database is in the Ingero repo. The investigation works with any MCP-compatible AI:

# 1. Clone and build
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

# 2. With Claude Code
claude --mcp-config <(echo '{"mcpServers":{"ingero":{"command":"./bin/ingero","args":["mcp","--db","investigations/vllm-37343-logprobs-amplification.db"]}}}')

# 3. With Ollama (any open model)
pip install mcp-client-for-ollama
ollmcp -m qwen3.5:27b -j /tmp/ingero-mcp.json

Type /investigate to start the guided workflow. The AI will walk through the same investigation you just read.

The pattern repeats

This is not a one-off. We have traced dozens of GPU performance issues. The pattern is consistent:

124x slower PyTorch DataLoader - kernel tracing revealed 191,000 context switches and 299,000 page allocations in 40 seconds. The GPU was starved because DataLoader workers were fighting for CPU cores.
13x PyTorch slowdown from hidden NumPy sync - a tensor.cpu().numpy() call in a masking function triggered B x 2 implicit cudaStreamSynchronize calls per forward pass. On faster GPUs, the bottleneck got worse, not better.
GPU 97% utilized but training 3x slower - nvidia-smi reported healthy utilization while Prometheus node exporter and Fluent Bit were consuming 51.7% of available CPU time through 14,504 context switches.

Every one of these follows the same pattern: the GPU is fast, the host is the bottleneck, and standard GPU metrics cannot see it. The causal chain from host event to CUDA API call is the missing link.

What this means for GPU debugging

The traditional approach: alert fires, SSH into the machine, check nvidia-smi, check dmesg, check logs, open profiler, wait for reproduction, analyze flame graphs, correlate across tools. Hours.

The MCP-native approach: point an AI agent at the kernel traces, let it query what it needs, read the diagnosis. Minutes.

We are not saying the AI is smarter than a senior SRE. We are saying it has access to data the SRE cannot see (kernel scheduling decisions, per-CUDA-call latency distributions, automated causal chains) and it can query that data faster than a human can navigate dashboards.

The investigation databases are open source. The agent is open source. Try it locally.

Ingero - open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. *GitHub** star - Open an issue if you are seeing vLLM or CUDA runtime issues. Investigation DB: investigations/vllm-cuda-kernel-events.db - Original issue: vllm-project/vllm#37343.*

11-Second Time to First Token on a Healthy vLLM Server

Ingero Team — Tue, 21 Apr 2026 13:30:00 +0000

TL;DR

A vLLM health endpoint says "ok." nvidia-smi says 95% utilization. But a user just waited 11 seconds for their first token. We reproduced a real vLLM issue on an RTX 4090 and traced every CUDA API call and Linux kernel event to find the root causes: head-of-line blocking during prefix caching. This is invisible to standard monitoring. The trace databases are available in the Ingero repo for independent investigation. We traced a production case of vLLM latency spikes down to kernel-level scheduling contention.

The Problem Nobody Can See

vLLM's continuous batching is one of the best things to happen to LLM serving. It lets the engine process multiple requests simultaneously, filling GPU capacity that would otherwise sit idle between sequential requests.

But continuous batching has a dark side: when requests compete for GPU resources inside the same batch, one expensive request can silently starve all others. No error. No health check failure. No metric spike. Just users waiting 10x-250x longer than expected for their first token.

We investigated a real vLLM issue reported in the last week (#37308) to understand what happens at the kernel level during these silent latency spikes.

Setup

The investigation used the same server configuration:

python -m vllm.entrypoints.openai.api_server \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --port 8000 \
 --gpu-memory-utilization 0.95 \
 --max-model-len 32768 \
 --enable-prefix-caching

Hardware: RTX 4090 (24GB), 4 vCPUs, Ubuntu 22.04, vLLM 0.17.1.

We ran Ingero alongside each test to trace CUDA Runtime/Driver API calls and host kernel events (scheduler context switches, memory allocations) simultaneously.

Prefix Caching Head-of-Line Blocking

Issue: vllm-project/vllm#37308

What happens

6 concurrent requests arrive within 40ms. 4 are heavy (2048-token prompts, 128-512 output tokens) and 2 are light (128-token prompts, 32-64 output tokens). All share a 32-token prefix so the prefix cache groups them together.

The light requests should complete in under 100ms. Instead:

Run	r08 (128 tok)	r05 (128 tok)	r07 (2048 tok)	r02 (2048 tok)
1	1,131ms	1,406ms	1,654ms	1,851ms
2	54ms	129ms	258ms	234ms
3	66ms	177ms	175ms	156ms

Run 1 is catastrophic: the light requests are 14x over threshold. Subsequent runs settle to 2-4x because the prefix cache warms up. But that first cold-cache batch is brutal.

What the tracer shows

3 causal chains detected. The most revealing one:

[MEDIUM] cudaLaunchKernel p99=444us (6.4x p50) - 371 sched_switch events
 Timeline:
 [HOST ] 371 context switches (5.9s off-CPU)
 [CUDA ] p99=444us (6.4x p50=70us)

The per-process breakdown tells the full story:

VLLM::EngineCore (the GPU scheduling loop):

24,347 context switches, max stall 2.5 seconds
40,632 cuLaunchKernel calls, avg 29us but max 34ms
34,087 cudaLaunchKernel calls, avg 96us but max 356ms

The engine core process – the single-threaded loop that decides which requests get GPU time – was descheduled for 2.5 seconds in the worst case. During that stall, the GPU kernel queue drained and the light requests had nothing submitted on their behalf.

The 356ms cudaLaunchKernel spike (3,700x the average) is the smoking gun. That's not the GPU being slow. That's the CPU failing to submit work to the GPU because the scheduling loop was preempted.

Why nvidia-smi misses this

nvidia-smi shows high utilization because the GPU IS working – on the heavy requests' prefills. The light requests are starving, but from the GPU's perspective there's always a kernel to run. The starvation is in the CPU-side scheduling loop, not on the GPU.

What Standard Tools Show vs What Kernel Tracing Shows

Signal	nvidia-smi	vLLM /health	vLLM metrics	Kernel tracing
GPU utilization	95%+	–	–	95%+ (but wrong work)
Server health	–	"ok"	requests_running=5	–
TTFT regression	–	–	Visible in histograms	Visible + root cause
Engine stall (2.5s)	Not visible	Not visible	Not visible	24,347 sched_switch events
Kernel launch drop (80%)	Not visible	Not visible	Not visible	1,051 -> 208 ops/s
Memory pressure	Not visible	Not visible	Not visible	43,606 mm_page_alloc
Which process is blocked	Not visible	Not visible	Not visible	VLLM::EngineCore PID 2438

The key insight: GPU utilization was high because the GPU was doing work. It was just doing the wrong work – processing heavy prefills or computation while light requests starved. No GPU-side metric can distinguish "GPU is busy computing my request" from "GPU is busy computing someone else's request while mine waits."

Implications for Production vLLM

If you're running vLLM in production with mixed workloads (different prompt sizes, some requests with or ), you're likely experiencing these silent regressions:

Monitor TTFT per-request, not just aggregate throughput. Aggregate metrics hide the tail – your p99 might be 100x worse than p50 during batch contention.
Be careful with . A single request with n=8 and =20 can block your entire server for 11+ seconds on a cold cache. Consider routing these to dedicated instances.
First-request-after-idle is the worst case. This issue showed the most extreme regression on Run 1 (cold prefix cache). If your traffic is bursty, the first batch after a quiet period will hit hardest.
GPU utilization is not a proxy for request health. Your dashboards might show 95% utilization while individual users experience 256x TTFT regression.

Investigate It Yourself

The trace database from this investigations are in the Ingero repo:

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

# View the causal chains
./bin/ingero explain --db investigations/vllm---amplification.db --since 5m

# Per-process breakdown
./bin/ingero explain --db investigations/vllm---amplification.db --per-process --since 5m

# Connect your AI assistant for interactive investigation
./bin/ingero mcp --db investigations/vllm---amplification.db

Investigate with AI (recommended)

You can point any MCP-compatible AI client at the trace database and ask questions directly. No code required.

First, create the MCP config file at /tmp/ingero-mcp-vllm.json:

{
  "mcpServers": {
    "ingero": {
      "command": "./bin/ingero",
      "args": ["mcp", "--db", "investigations/vllm-37308-hol-blocking.db"]
    }
  }
}

With Ollama (local & free: no data sent outside):

# Install ollmcp (MCP client for Ollama)
pip install ollmcp

# Investigate with a local model (no data leaves your machine)
ollmcp -m qwen3.5:27b -j /tmp/ingero-mcp-vllm.json

With Claude Code (with data sent to remote models / Anthropic):

claude --mcp-config /tmp/ingero-mcp-vllm.json

Then type /investigate and let the model explore. Follow up with questions like "what was the root cause?" or "which kernel calls had the highest latency spikes?"

Ask your AI assistant: "What caused the 80% throughput drop?" or "Which process had the most context switches?" The trace data has the full story.

The investigation database from this post is available for download. Investigations performed on TensorDock RTX 4090 (24GB), Ubuntu 22.04, vLLM 0.17.1, Qwen/Qwen2.5-0.5B-Instruct with prefix caching enabled.

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are seeing vLLM issues in your own workloads, we'd love to take a look. Drop an issue on GitHub and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

What Happens When an AI Agent Gets Kernel-Level GPU Traces

Ingero Team — Thu, 16 Apr 2026 16:48:26 +0000

TL;DR

A GPU trace of a PyTorch DataLoader bottleneck (114x slower than direct indexing) was loaded into an MCP server and handed to Claude for investigation. The AI identified the root cause in under 30 seconds: 3,676 CPU context switches starving the GPU of data. Below is the full investigation session with the trace database available for independent reproduction. We walked through a real case of Claude MCP GPU debugging, from raw eBPF traces to root cause identification.

The Idea

GPU performance debugging usually goes like this: training is slow, nvidia-smi shows nothing useful, print statements get added, hours pass. What happens when raw trace data gets handed to an AI assistant with the question “what went wrong?”

That’s what the MCP server enables. The tracer traces CUDA API calls and Linux kernel events, stores them in a SQLite database, then exposes them to AI assistants via the Model Context Protocol (MCP). The AI can query the data, read causal chains, inspect per-process breakdowns, and run custom SQL through natural conversation.

We tested this on a real investigation: a PyTorch DataLoader bottleneck where DataLoader was 114x slower than direct tensor indexing on an RTX 4090. The trace database is in the Ingero repo for independent reproduction.

Setup (2 minutes)

1. Get the trace database

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

The investigation database is at investigations/pytorch-dataloader-starvation.db (764KB).

2. Start the MCP server

./bin/ingero mcp --db investigations/pytorch-dataloader-starvation.db

This starts the MCP server in stdio mode. To connect Claude Desktop, add this to your MCP config (~/.config/claude/claude_desktop_config.json):

{
  "mcpServers": {
    "ingero": {
      "command": "/path/to/ingero",
      "args": ["mcp", "--db", "/path/to/investigations/pytorch-dataloader-starvation.db"]
    }
  }
}

For Claude Code (CLI), the MCP server is automatically available when running from the repo directory.

The Investigation Session

Here’s the actual conversation. We started with a simple question and let the AI dig deeper.

“What happened in this trace?”

The AI calls get_trace_stats and gets back:

3,807 events over 77 seconds
CUDA Runtime: cudaMemcpyAsync (14 calls, avg 31.5ms, max 65ms)
              cudaLaunchKernel (16 calls, avg 1.7ms, max 25.8ms)
              cudaStreamSync (21 calls, avg 31us)
              cudaDeviceSync (3 calls, avg 17.2ms)
Host: sched_switch (3,676 calls, avg 401us, max 23.9ms)
      process_fork (11 calls)
      process_exit (5 calls)
Driver: cuLaunchKernel (41 calls, avg 40us)
        cuMemAlloc (10 calls, avg 307us)

Right away, two things stand out:

cudaMemcpyAsync averages 31.5ms per call. That’s host-to-device data transfer. It should be microseconds for the batch sizes involved.

3,676 sched_switch events in 77 seconds. That’s the Linux scheduler preempting processes, something is fighting for CPU.

“What are the causal chains?”

The AI calls get_causal_chains:

40 causal chains detected (all MEDIUM severity)

Recurring pattern:
  cudaLaunchKernel p99=25.8ms (352x p50=73us) - N sched_switch events
  Context switches escalate from 34 -> 106 -> 268 -> 308 -> 319 over time

The chains show a clear escalation: context switches grow from 34 to 319 over the trace window. The DataLoader workers are creating more and more scheduling pressure as the benchmark runs.

“Which processes are involved?”

The AI calls run_sql:

SELECT pn.name, COUNT(*) as events
FROM events e
JOIN process_names pn ON e.pid = pn.pid
GROUP BY pn.name
ORDER BY events DESC

python3          3,799 events
pt_data_worker     (visible in fork/exit events)
pt_data_pin        (visible in fork/exit events)
systemd-journal       17 events

The main python3 process owns almost all events. But the DataLoader workers (pt_data_worker, pt_data_pin) are visible through fork and scheduling events, they’re being created, fighting for CPU, and getting descheduled.

“Why is cudaMemcpyAsync so slow?”

The AI connects the dots:

The 14 cudaMemcpyAsync calls average 31.5ms (max 65ms). These are host-to-device data transfers that should complete in microseconds for batch sizes under 100MB.

The 3,676 sched_switch events show the main process being preempted 47 times per second. Each preemption delays the data pipeline. The DataLoader workers (pt_data_worker) are competing with the main process for CPU, on a 4-vCPU machine, 4 workers + the main process + system processes means constant context switching.

The GPU is receiving cudaMemcpyAsync calls, but the source data isn’t ready in host memory because the DataLoader worker that was supposed to prepare it got descheduled. The GPU waits.

Root cause: CPU scheduling contention between DataLoader workers and the main training process. The GPU is fast, it’s the host that can’t keep up.

“What would you recommend?”

Reduce num_workers to 2 (on a 4-core machine, 4 workers causes over-subscription)

For in-memory datasets, skip DataLoader entirely, direct tensor indexing is 114x faster because it avoids the multiprocessing overhead

If DataLoader is required, use persistent_workers=True to avoid fork overhead on each epoch

Total investigation time: under 30 seconds from first question to root cause + fix.

What the AI Has Access To

The MCP server exposes 7 tools:

Tool	What It Does
`get_check`	System diagnostics (kernel, GPU, CUDA, driver)
`get_trace_stats`	CUDA + host statistics (p50/p95/p99 per operation)
`get_causal_chains`	Automated root cause chains with severity ranking
`get_stacks`	Resolved call stacks (symbols + Python source lines)
`run_demo`	Run synthetic demo scenarios
`get_test_report`	GPU integration test results
`run_sql`	Read-only SQL against the trace database

The run_sql tool is the most powerful: the AI can write arbitrary queries against the event table, joining with process names, ops, and sources.

Try It Yourself

The trace database from this investigation is in the repo:

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

# Quick analysis (no MCP needed)
./bin/ingero explain --db investigations/pytorch-dataloader-starvation.db --since 5m

# Interactive AI investigation via MCP
./bin/ingero mcp --db investigations/pytorch-dataloader-starvation.db

With Claude Desktop

Add to ~/.config/claude/claude_desktop_config.json:

{
  "mcpServers": {
    "ingero": {
      "command": "./bin/ingero",
      "args": ["mcp", "--db", "investigations/pytorch-dataloader-starvation.db"]
    }
  }
}

Then ask Claude: “What caused the GPU performance problem in this trace?”

With Any MCP Client

The MCP server works with any MCP-compatible client: Cursor, Windsurf, or custom implementations. The stdio transport is universal.

Investigate with AI (recommended)

# With Ollama (local, free)
pip install ollmcp
ollmcp -m qwen3.5:27b -j /tmp/ingero-mcp-dataloader.json

# With Claude Code
claude --mcp-config /tmp/ingero-mcp-dataloader.json

Type /investigate and let the model explore.

Why This Matters

Traditional GPU debugging is manual: run nvidia-smi, add print statements, read logs, guess. The AI-assisted approach is different:

The tracer captures everything at the kernel level: CUDA API calls, host scheduling, memory events, with zero code changes
The trace database is self-contained: no need to reproduce the issue, no need for the original hardware
The AI asks the right follow-up questions: it sees the context switches, connects them to CUDA latency, and identifies the root cause pattern

This turns GPU debugging from “spend hours staring at logs” into “ask a question, get an answer.”

Investigation DB: investigations/pytorch-dataloader-starvation.db

Original issue: pytorch/pytorch#154318

GitHub: github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

MCP as Observability Interface: Connecting AI Agents to Kernel Tracepoints

Ingero Team — Thu, 16 Apr 2026 07:35:33 +0000

TL;DR

MCP is becoming the interface between AI agents and infrastructure
data. Datadog shipped an MCP
Server connecting dashboards to AI agents.
Qualys flagged MCP servers as the new shadow IT risk.
We think both are right, and we think the architecture should
go further: the MCP server should not wrap an existing observability
platform. It should BE the observability layer. This post explores how
MCP can serve as a direct observability interface to kernel
tracepoints, bypassing traditional metric pipelines entirely.

Three signals in one week

Three things happened in the same week of March 2026 that signal where
observability is headed.

Datadog shipped an MCP Server
Their implementation connects real-time observability data to AI agents for automated detection and remediation. An AI agent can now query Datadog dashboards, pull metrics, and trigger responses through the Model Context Protocol. This is a big company validating a small protocol.

Qualys published a security analysis of MCP
servers.
Their TotalAI team called MCP servers “the new shadow IT for AI” and
found that over 53% of servers rely on static secrets for
authentication. They recommended adding observability to MCP servers:
logging capability discovery events, monitoring invocation patterns,
alerting on anomalies.

Cloud Native Now covered eBPF for Kubernetes network observability.
Microsoft Retina deploys as a DaemonSet, captures network telemetry via
eBPF without application changes, and provides kernel-level drop reasons. The article draws a clear line between “monitoring” (predefined questions) and “observability” (asking questions nobody planned for).

The thread connecting all three: AI agents need direct access to
infrastructure telemetry, and MCP is becoming the way they get it.

Two approaches to MCP observability

There are two ways to connect observability data to AI agents via MCP.

Approach 1: Wrap existing platforms. Datadog’s strategy. Take
existing metrics, logs, and traces, already collected and aggregated,
and expose them through MCP tools. The AI agent queries the dashboard
API, gets pre-processed data, and acts on it. This makes sense for teams
with a mature observability stack that want to add AI-powered automation
on top.

Approach 2: Build MCP-native observability. This is what we did with
the tracer. Instead of wrapping an existing platform, we built an eBPF
agent that traces CUDA Runtime and Driver APIs via uprobes, stores the
results in SQLite, and exposes everything through 7 MCP tools. The MCP
interface is not an adapter layer; it is the primary interface.

Neither approach is wrong. They solve different problems.

The wrapper approach works well for aggregate analysis: “What was the
p99 latency for service X over the last hour?” The data is already
summarized, indexed, and queryable.

The native approach works better for root-cause investigation: “Why did
this specific GPU request take 14.5x longer than expected?” That
requires raw kernel events, CUDA call stacks, and causal chains – not
summaries. The AI agent needs to drill down, not roll up.

What MCP-native observability looks like in practice

Here is a concrete example. We traced a vLLM TTFT regression where the
first token took 14.5x longer than baseline. The trace database captured
every CUDA API call, every kernel context switch, every memory
allocation.

When Claude connects to the MCP server and loads this database, it can:

get_trace_stats – See the full trace summary: 12,847 CUDA events, 4 causal chains, total GPU time
get_causal_chains – Read the causal chains that explain why latency spiked, in plain English
run_sql – Run custom queries against the raw event data (“show me all cudaMemcpyAsync calls over 100ms”)
get_stacks – Inspect call stacks for any flagged event

Claude identified the root cause in under 30 seconds: logprobs
computation was blocking the decode loop, creating a 256x slowdown on
the critical path. That root cause was not visible in any aggregate
metric. It only appeared in the raw causal chain between specific CUDA
API calls.

A dashboard MCP adapter could not have found this. The data granularity
does not survive aggregation.

The security angle matters too

Qualys raised valid concerns about MCP server security. Their finding
that 53% of servers rely on static secrets is alarming. Their
recommendation to log discovery and invocation events is exactly right.

For MCP servers that touch GPU infrastructure, the attack surface is
different. An MCP server with access to CUDA traces can expose timing
information, memory layouts, and model architecture details. The
security model needs to account for this.

In Ingero, the MCP server runs inside the same process as the eBPF tracing pipeline. There is no separate data layer between the AI agent and the kernel-level telemetry - the MCP tools query the same event store that the eBPF probes write to. This is why Ingero can answer causal questions in real time: the AI agent has direct access to raw kernel and CUDA events, not a pre-aggregated summary.

Try It Yourself

The project is open source. The investigation database from this post is available for download. Claude (or any MCP client) can connect to it and run an investigation:

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero mcp --db investigations/pytorch-dataloader-starvation.db

Investigate with AI (recommended)

You can point any MCP-compatible AI client at the trace database and ask questions directly. No code required.

First, create the MCP config file at /tmp/ingero-mcp-dataloader.json:

{
  "mcpServers": {
    "ingero": {
      "command": "./bin/ingero",
      "args": ["mcp", "--db", "investigations/pytorch-dataloader-starvation.db"]
    }
  }
}

With Ollama (local, free):

# Install ollmcp (MCP client for Ollama)
pip install ollmcp

# Investigate with a local model (no data leaves your machine)
ollmcp -m qwen3.5:27b -j /tmp/ingero-mcp-dataloader.json

With Claude Code:

claude --mcp-config /tmp/ingero-mcp-dataloader.json

Then type /investigate and let the model explore. Follow up with questions like “what was the root cause?” or “which processes were competing for CPU time?”

The MCP server exposes 7 tools. Claude will figure out the rest.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead. Give us a star at GitHub!

One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes

Ingero Team — Mon, 13 Apr 2026 17:18:24 +0000

TL;DR

A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF – no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine.

The problem we kept hitting

We’ve been building Ingero – an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well.

But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? nvidia-smi on each machine reports healthy utilization. dstat shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening.

We wanted cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works?

What we shipped in v0.9.1

Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports.

1. Node identity

Every event now carries a node tag. The agent stamps each event with a name from a --node flag, an ingero.yaml config value, or the hostname as fallback:

sudo ingero trace --node gpu-node-01

Event IDs become node-namespaced (gpu-node-01:4821) so databases from different nodes can merge without collisions. For torchrun workloads, rank and world size are auto-detected from environment variables (RANK, LOCAL_RANK, WORLD_SIZE) – no extra configuration needed.

2. Fleet fan-out queries

Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided). The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a node column prepended. For production clusters, the client supports mTLS – --ca-cert, --client-cert, --client-key – so both sides authenticate. Plain HTTP is available via --no-tls but requires an explicit opt-in, and even then it’s intended for trusted VPC networks only.

The --nodes flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into ingero.yaml once and every command picks it up automatically:

fleet:
  nodes:
    - gpu-node-01:8080
    - gpu-node-02:8080
    - gpu-node-03:8080
    - gpu-node-04:8080

A full example config is in configs/ingero.yaml.

Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving:

$ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \
    "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us
     FROM events GROUP BY node, source"

node              source  cnt    avg_us
----------------  ------  -----  ------
gpu-node-01       4       11009  5.2
gpu-node-01       3       847    18400  # ← 9x higher than peers
gpu-node-02       4       10892  5.1
gpu-node-02       3       412    2100
gpu-node-03       4       10847  5.3
gpu-node-03       3       398    1900
gpu-node-04       4       10901  5.0
gpu-node-04       3       421    2200

  8 rows from 4 node(s)

Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms. One more command to see the causal chains:

$ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080

FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s)

[HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O
  Root cause: 847 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs

[MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O
  Root cause: 855 sched_switch events + heavy block I/O
  Fix: Pin training process to dedicated cores with taskset

Both chains are on gpu-node-01. The other three nodes have zero issues. The root cause: CPU contention from block I/O – checkpoint writes preempting the training process.

Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.”

3. Offline merge and Perfetto export

Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints – there are real reasons the network path isn’t always available.

For those cases, ingero merge combines SQLite databases from each node into a single queryable file:

# 1. Collect traces from each node
scp gpu-node-01:~/.ingero/ingero.db node-01.db
scp gpu-node-02:~/.ingero/ingero.db node-02.db

# 2. Merge and analyze
ingero merge node-01.db node-02.db -o cluster.db
ingero explain -d cluster.db

Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with --force-node.

For visual timeline analysis, ingero export --format perfetto produces a Chrome Trace Event Format JSON that opens in ui.perfetto.dev. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline.

Why we built it this way

The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb – the well-trodden path.

We deliberately avoided that.

No new infrastructure. Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure – the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet.

Client-side fan-out is simple and sufficient. The CLI sends concurrent HTTP requests, collects results, and merges them locally. A sync.WaitGroup, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity.

Partial failure is first-class. If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble – and knowing which nodes failed is diagnostic information in itself.

Clock skew is measured, not ignored. eBPF timestamps come from bpf_ktime_get_ns() (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query – 3 samples per node, median filter. On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query.

Offline merge covers air-gapped environments. Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate. The merge path also serves as a permanent record of the cluster state at investigation time.

MCP: AI-driven fleet investigation

The fleet is also accessible through Ingero’s MCP server via the query_fleet tool. Here’s what the raw tool output looks like for a chains query across the same 4-node cluster:

query_fleet(action="chains", since="5m")

Fleet Chains: 2 chain(s)
[HIGH] gpu-node-01 | cuLaunchKernel p99=843us (63.9x p50) | 847 sched_switch events + heavy block I/O
[MEDIUM] gpu-node-01 | cuMemAlloc p99=932us (5.0x p50) | 855 sched_switch events + heavy block I/O

That’s the complete response – an AI assistant gets this back from one tool call, no SSH access to each node, no manual SQL. The tool supports four actions: chains (causal analysis), sql (arbitrary queries), ops (operation breakdown per node), and overview (event counts). Clock skew warnings are prepended automatically when detected.

Where this stands

v0.9.1 is the initial step in cluster-level tracing, not the destination.

What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export – these are diagnostic tools for after the fact.

We’re actively working on cross-node correlation and straggler detection – more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor.

The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments.

We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on GitHub.

The investigations/ directory has ready-to-query databases for trying this without a GPU cluster:

sample-gpu-node-01.db, sample-gpu-node-02.db, sample-gpu-node-03.db – individual node traces from a 3-node cluster
sample-cluster.db – all three merged into one (600 events, 6 chains, 9 stacks)

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are facing distributed training issues in your own workloads, we’d love to take a look. Drop an issue on GitHub and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

Forem: Ingero Team

MCP Shows What the Agent Did. eBPF Shows Why the GPU Stalled.

TL;DR

What MCP gives the agent

Two questions, two layers

Closing the loop with two MCP servers in the same chat

Why the kernel layer needs eBPF specifically

What this means for the MCP wave

More MCP servers shipped in the same window

Where the why-layer goes next

Related reading

MCP Tools Are New API Surfaces. eBPF Sees What They Actually Touch.

TL;DR

What an MCP tool call looks like to the agent

What a syscall trace shows for the same call

Why this matters more for GPU MCP tools

Try it locally

Smaller surface, same investigation

Related reading

A Cluster Stall Looks Healthy on Every Host. The Cause Is in the Pattern Across Hosts.

TL;DR

What per-host tracing cannot see

The cluster-level question

Echo, in one paragraph

The fan-in correctness proof

The hardware run

The queries that matter

What this changes for AI agents

Try it locally

Across hosts, not on hosts

Related reading

GPU Utilization Is a Counter, Not a Cause

TL;DR

The mystery

What nvidia-smi actually counts

Five ways to score 100% utilization with broken throughput

What the cause-side metrics look like

Counter vs. cause, by metric

Try it locally

Further reading on GPU utilization metrics

Smoke and fire

Related reading

CUDA Out of Memory at 60% Utilization: Tracing PyTorch GPU Memory Fragmentation

TL;DR

The Problem

What nvidia-smi Shows

What torch.cuda.memory_summary() Shows

What the Trace Shows

Step 1: See the allocation pattern

Step 2: Find the causal chain

Step 3: Drill into the timeline with MCP

Step 4: Find the Python source line

The Fix

Try It Yourself

Related reading

26 Seconds to Find a Straggler: Fleet v0.10 End-to-End on A100 and GH200

TL;DR

What Fleet v0.10 actually is

The test

Numbers from the A100 run

Numbers from the GH200 run

The one GH200 wrinkle

What the detection pipeline actually does (and does not)

Try It Yourself

Artifacts

Related reading

Production GPU Training is 34% Slower. Show Me Why

The numbers

Why nobody detects them

What causes them

The math

How distributed systems solved this

The detection gap

What it would take

Agent + MCP + eBPF: 10,869 CUDA Kernel Events, Now Queryable

The session

What happened here

What made this possible

Try the eBPF trace yourself

The pattern repeats