<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mayank Ketkar</title>
    <description>The latest articles on Forem by Mayank Ketkar (@mketkar).</description>
    <link>https://forem.com/mketkar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3760623%2F9be3928f-38db-49e7-9b2c-c79c3ef5cd70.png</url>
      <title>Forem: Mayank Ketkar</title>
      <link>https://forem.com/mketkar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mketkar"/>
    <language>en</language>
    <item>
      <title>How to Read GPU Profiling Logs: A Ground-Up Guide</title>
      <dc:creator>Mayank Ketkar</dc:creator>
      <pubDate>Sun, 15 Feb 2026 17:47:38 +0000</pubDate>
      <link>https://forem.com/mketkar/how-to-read-gpu-profiling-logs-a-ground-up-guide-3akl</link>
      <guid>https://forem.com/mketkar/how-to-read-gpu-profiling-logs-a-ground-up-guide-3akl</guid>
      <description>&lt;p&gt;You ran &lt;code&gt;nsys profile&lt;/code&gt;, got a 2GB &lt;code&gt;.nsys-rep&lt;/code&gt; file, exported it to SQLite, and found yourself staring at 88 tables with names like &lt;code&gt;CUPTI_ACTIVITY_KIND_KERNEL&lt;/code&gt; and &lt;code&gt;ENUM_WDDM_PAGING_QUEUE_TYPE&lt;/code&gt;. The kernel names are integer IDs. The timestamps are in nanoseconds. Nothing is human-readable. You closed the file and went back to guessing.&lt;/p&gt;

&lt;p&gt;This post exists so you never have to guess again.&lt;/p&gt;

&lt;p&gt;I'm going to teach you to read any nsys trace in under 10 minutes — using four tables, one SQL join pattern, and four queries. Then we'll use those tools to solve a real mystery: why does a model give different results at batch size 1 vs batch size 16, even though both traces show exactly 8,955 kernel launches?&lt;/p&gt;




&lt;h2&gt;
  
  
  What nsys actually records
&lt;/h2&gt;

&lt;p&gt;Imagine you're standing in a factory watching an assembly line. You have a stopwatch, and your job is to write down: &lt;em&gt;when&lt;/em&gt; each machine started, &lt;em&gt;when&lt;/em&gt; it stopped, &lt;em&gt;which&lt;/em&gt; machine it was, and &lt;em&gt;what&lt;/em&gt; it was building.&lt;/p&gt;

&lt;p&gt;That's what NVIDIA Nsight Systems does for your GPU. It records every kernel launch, every memory copy, every synchronization event — with nanosecond timestamps.&lt;/p&gt;

&lt;p&gt;The output is two files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;profile.nsys-rep     ← visual report (open in the Nsight GUI)
profile.sqlite       ← raw data in a database (query with SQL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;.sqlite&lt;/code&gt; file is the gold mine. Everything in the &lt;code&gt;.nsys-rep&lt;/code&gt; is derived from it.&lt;/p&gt;




&lt;h2&gt;
  
  
  88 tables, but only 4 matter
&lt;/h2&gt;

&lt;p&gt;Open an nsys SQLite export and you'll find dozens of tables (88 in this trace). Most are enum lookup tables or metadata. You need these four:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Table&lt;/th&gt;
&lt;th&gt;What it records&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CUPTI_ACTIVITY_KIND_KERNEL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Every GPU kernel execution: start, end, grid, block, name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CUPTI_ACTIVITY_KIND_MEMCPY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Every memory transfer: CPU→GPU, GPU→CPU, GPU→GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CUPTI_ACTIVITY_KIND_MEMSET&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Every memory fill (zeroing buffers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NVTX_EVENTS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Human-readable markers programmers add to their code&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plus one helper table: &lt;strong&gt;&lt;code&gt;StringIds&lt;/code&gt;&lt;/strong&gt; — the Rosetta Stone that maps integer IDs to actual names.&lt;/p&gt;

&lt;p&gt;Here's how to discover them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;
&lt;span class="c1"&gt;# nsys export --type=sqlite profile.nsys-rep  -&amp;gt; produces profile.sqlite
&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;profile.sqlite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT name FROM sqlite_master WHERE type=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tables&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total tables: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 88 tables!
# But only 4 matter:
#   CUPTI_ACTIVITY_KIND_KERNEL  ← GPU kernel executions
#   CUPTI_ACTIVITY_KIND_MEMCPY  ← memory transfers
#   CUPTI_ACTIVITY_KIND_MEMSET  ← memory fills
#   NVTX_EVENTS                 ← human-readable markers
#   + StringIds                 ← name lookup table
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The #1 gotcha: names are integers
&lt;/h2&gt;

&lt;p&gt;This is the single most confusing thing when you first look at nsys data. Kernel names are &lt;strong&gt;not&lt;/strong&gt; stored as strings. They're stored as integer foreign keys into the &lt;code&gt;StringIds&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;When you query the kernel table, you'll see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;demangledName = 58
shortName     = 59
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are pointers. To get the actual name, you JOIN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;StringIds&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shortName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And suddenly &lt;code&gt;59&lt;/code&gt; becomes &lt;code&gt;vectorized_elementwise_kernel&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every query you write will include this JOIN. It becomes muscle memory after your second trace.&lt;/p&gt;
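&lt;p&gt;The pattern is easy to try outside a real trace. Here's a minimal sketch against a toy in-memory database — the real tables have many more columns, and the IDs and names below are made up for illustration:&lt;/p&gt;

```python
import sqlite3

# Toy in-memory stand-in for a real trace; schemas and IDs are simplified.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (start INT, "end" INT, shortName INT);
INSERT INTO StringIds VALUES
  (58, 'void at::native::vectorized_elementwise_kernel(...)'),
  (59, 'vectorized_elementwise_kernel');
INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (1000, 5000, 59);
""")

# The JOIN that turns an integer ID into a name. SQLite accepts the
# unquoted column name "end", which the real exports rely on.
row = conn.execute("""
    SELECT s.value, k.end - k.start AS dur_ns
    FROM CUPTI_ACTIVITY_KIND_KERNEL k
    JOIN StringIds s ON k.shortName = s.id
""").fetchone()
print(row)  # ('vectorized_elementwise_kernel', 4000)
```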




&lt;h2&gt;
  
  
  Decoding kernel names: the Rosetta Stone
&lt;/h2&gt;

&lt;p&gt;Once you run that JOIN, you'll see names like &lt;code&gt;nvjet_tst_192x192_64x4_2x1_v_bz_coopB_TNN&lt;/code&gt;. This isn't gibberish — every character means something:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvjet_tst _ 192x192 _ 64x4 _ 2x1 _ v _ bz _ coopB _ TNN
   │          │        │      │     │    │     │       │
   │          │        │      │     │    │     │       └─ transpose: T=yes, N=no
   │          │        │      │     │    │     │          TNN = A^T × B → C
   │          │        │      │     │    │     │
   │          │        │      │     │    │     └─ cooperative mode
   │          │        │      │     │    │        coopA/coopB = how SMs share work
   │          │        │      │     │    │
   │          │        │      │     │    └─ block-zero init strategy
   │          │        │      │     │
   │          │        │      │     └─ layout: v=vertical, h=horizontal
   │          │        │      │        (how tiles map to SMs)
   │          │        │      │
   │          │        │      └─ warp tiling: 2 warps in M, 1 in N
   │          │        │
   │          │        └─ block tile: 64×4 threads per block
   │          │
   │          └─ output tile: 192×192 chunk per SM
   │
   └─ nvjet_tst = NVIDIA JIT persistent kernel (deterministic!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Think of it like a car's VIN. Once you know the format, you can read any GPU kernel name at a glance.&lt;/p&gt;
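&lt;p&gt;If you read these names often, a tiny parser saves the squinting. This is a hypothetical helper (not part of nsys) that simply splits the fields according to the diagram above; it only handles names with this exact shape:&lt;/p&gt;

```python
def parse_nvjet_name(name):
    """Split an nvjet GEMM kernel name into its labeled fields.

    Hypothetical helper: field meanings follow the annotated diagram above
    and assume the nine-part nvjet_tst naming shape."""
    parts = name.split("_")
    return {
        "family":      "_".join(parts[:2]),  # nvjet_tst = NVIDIA persistent GEMM
        "output_tile": parts[2],             # output chunk per SM, e.g. 192x192
        "block_tile":  parts[3],             # threads per block, e.g. 64x4
        "warp_tiling": parts[4],             # warps in M x N, e.g. 2x1
        "layout":      parts[5],             # v=vertical, h=horizontal
        "init":        parts[6],             # block-zero init strategy
        "coop":        parts[7],             # coopA/coopB cooperative mode
        "transpose":   parts[8],             # e.g. TNN
    }

print(parse_nvjet_name("nvjet_tst_192x192_64x4_2x1_v_bz_coopB_TNN"))
```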

&lt;p&gt;Besides &lt;code&gt;nvjet_tst_*&lt;/code&gt;, you'll encounter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;device_kernel&lt;/code&gt;&lt;/strong&gt; — output of &lt;code&gt;torch.compile&lt;/code&gt;. Opaque, but often 70%+ of GPU time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;vectorized_elementwise_kernel&lt;/code&gt;&lt;/strong&gt; — PyTorch's generic ops (add, multiply, cast).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;rms_norm_kernel&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;act_and_mul_kernel&lt;/code&gt;&lt;/strong&gt; — normalization and activation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;*_splitK_*&lt;/code&gt;&lt;/strong&gt; — Split-K GEMM with atomic reduction. &lt;em&gt;Potential non-determinism source&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Grid and block: mapping kernels to hardware
&lt;/h2&gt;

&lt;p&gt;The kernel table has &lt;code&gt;gridX/Y/Z&lt;/code&gt; and &lt;code&gt;blockX/Y/Z&lt;/code&gt; columns. These map to physical GPU hardware.&lt;/p&gt;

&lt;p&gt;An H200 has 132 Streaming Multiprocessors (SMs) — 132 independent assembly lines. Each can run one or more thread blocks at a time; the simplest picture, used below, is one block per SM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  GPU with 132 SMs:
  ┌─────┬─────┬─────┬─────┬─── ── ──┬─────┐
  │SM 0 │SM 1 │SM 2 │SM 3 │  ...    │SM131│
  │blk 0│blk 1│blk 2│blk 3│         │blk  │
  │128  │128  │128  │128  │         │131  │
  │thrds│thrds│thrds│thrds│         │thrds│
  └─────┴─────┴─────┴─────┴─── ── ──┴─────┘

  grid=132 × block=128 = 16,896 total threads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gridX=1&lt;/strong&gt;: One SM active, 131 idle. Tiny work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gridX=132&lt;/strong&gt;: Every SM busy. What persistent GEMM targets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gridX=64,000&lt;/strong&gt;: Blocks queue in waves. GPU stays saturated.&lt;/li&gt;
&lt;/ul&gt;
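&lt;p&gt;The arithmetic behind those three cases is worth automating. A rough sketch, assuming the simplified one-block-per-SM picture (real occupancy also depends on registers and shared memory):&lt;/p&gt;

```python
import math

SM_COUNT = 132  # H200

def launch_shape(grid_x, block_x, sm_count=SM_COUNT):
    """Total threads and number of block 'waves' for a 1-D launch.

    Assumes one block per SM at a time -- a simplification."""
    total_threads = grid_x * block_x
    waves = math.ceil(grid_x / sm_count)
    return total_threads, waves

print(launch_shape(1, 128))      # (128, 1): one SM active, 131 idle
print(launch_shape(132, 128))    # (16896, 1): one block on every SM
print(launch_shape(64000, 128))  # (8192000, 485): blocks queue in waves
```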




&lt;h2&gt;
  
  
  Four SQL queries that answer every question
&lt;/h2&gt;

&lt;p&gt;Every GPU investigation follows the same four-step pattern: &lt;strong&gt;The Census&lt;/strong&gt;, &lt;strong&gt;The Lineup&lt;/strong&gt;, &lt;strong&gt;The Stakeout&lt;/strong&gt;, and &lt;strong&gt;The Timeline&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1: The Census — "What's slow?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Where is the GPU spending its time?&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;kernel_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;call_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_us&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;CUPTI_ACTIVITY_KIND_KERNEL&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;StringIds&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shortName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_ms&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On our H200 running vLLM inference:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;th&gt;Total (ms)&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;device_kernel (torch.compile)&lt;/td&gt;
&lt;td&gt;748&lt;/td&gt;
&lt;td&gt;877.38&lt;/td&gt;
&lt;td&gt;70.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vectorized_elementwise_kernel&lt;/td&gt;
&lt;td&gt;883&lt;/td&gt;
&lt;td&gt;60.98&lt;/td&gt;
&lt;td&gt;4.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nvjet_tst_192x192_64x4_2x1...&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;51.94&lt;/td&gt;
&lt;td&gt;4.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;70% of GPU time in one kernel type — the compiled vision encoder.&lt;/p&gt;
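&lt;p&gt;You can dry-run the Census without a 2GB trace. A sketch against a toy in-memory database — schemas simplified, kernel names real but timings invented:&lt;/p&gt;

```python
import sqlite3

# Toy trace: two kernels, three launches, timestamps in nanoseconds.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (start INT, "end" INT, shortName INT);
INSERT INTO StringIds VALUES (1, 'device_kernel'), (2, 'rms_norm_kernel');
INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES
  (0, 3000000, 1), (4000000, 9000000, 1), (9000000, 9500000, 2);
""")

# The Census: total and count per kernel name, slowest first.
rows = conn.execute("""
    SELECT s.value, COUNT(*) AS calls,
           ROUND(SUM(k.end - k.start) / 1e6, 2) AS total_ms
    FROM CUPTI_ACTIVITY_KIND_KERNEL k
    JOIN StringIds s ON k.shortName = s.id
    GROUP BY s.value
    ORDER BY total_ms DESC
""").fetchall()
for name, calls, ms in rows:
    print(f"{name:20s} {calls:5d} {ms:8.2f} ms")
```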

&lt;h3&gt;
  
  
  Level 2: The Lineup — "What category?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;CASE&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%nvjet%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Persistent GEMM'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%cublas%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'cuBLAS (DANGER)'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%device_kernel%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'torch.compile'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%flash%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Flash Attention'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%norm%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Normalization'&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'Other'&lt;/span&gt;
  &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_ms&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;CUPTI_ACTIVITY_KIND_KERNEL&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;StringIds&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shortName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_ms&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results: 17.3% Persistent GEMM, 7.7% Elementwise, 2% Normalization, and... &lt;strong&gt;0.1% Flash Attention&lt;/strong&gt;. Just 1.2ms. Barely a blip.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Remember that 0.1%. It becomes the 5.15x smoking gun.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 3: The Stakeout — "What changed?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Drill into every launch of a specific kernel&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dur_us&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gridX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gridY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gridZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blockX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;registersPerThread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dynamicSharedMemory&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;CUPTI_ACTIVITY_KIND_KERNEL&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;StringIds&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shortName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'reshape_and_cache_flash_kernel'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Level 4: The Timeline — "What happened when?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Unified GPU timeline: kernels + memory transfers&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'KERNEL'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dur_us&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;CUPTI_ACTIVITY_KIND_KERNEL&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;StringIds&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shortName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'MEMCPY'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="s1"&gt;'copyKind='&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copyKind&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;CUPTI_ACTIVITY_KIND_MEMCPY&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;start&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Case study: the cascade attention smoking gun
&lt;/h2&gt;

&lt;p&gt;Here's the mystery: We're running vLLM inference with batch-invariant mode, which &lt;em&gt;guarantees&lt;/em&gt; bitwise-identical results regardless of batch size. BS=1 works perfectly. BS=16 gives different results. Both traces: exactly 8,955 kernel launches. Where's the bug?&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: The Census — nothing obvious
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;BS=1&lt;/th&gt;
&lt;th&gt;BS=16&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total kernels&lt;/td&gt;
&lt;td&gt;8,955&lt;/td&gt;
&lt;td&gt;8,955&lt;/td&gt;
&lt;td&gt;1.00x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total GPU time&lt;/td&gt;
&lt;td&gt;1.25s&lt;/td&gt;
&lt;td&gt;1.37s&lt;/td&gt;
&lt;td&gt;1.10x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memcpy ops&lt;/td&gt;
&lt;td&gt;1,099&lt;/td&gt;
&lt;td&gt;1,147&lt;/td&gt;
&lt;td&gt;1.04x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memset ops&lt;/td&gt;
&lt;td&gt;345&lt;/td&gt;
&lt;td&gt;429&lt;/td&gt;
&lt;td&gt;1.24x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 2: The Lineup — the outlier appears
&lt;/h3&gt;

&lt;p&gt;Most categories grow modestly. Persistent GEMM +35%, Normalization +52%. Expected.&lt;/p&gt;

&lt;p&gt;But Flash Attention: &lt;strong&gt;1.20ms → 6.17ms — 5.15x increase&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In absolute terms that's only 6ms. But the &lt;em&gt;ratio&lt;/em&gt; is an extreme outlier: no other category comes close.&lt;/p&gt;
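&lt;p&gt;This kind of ratio screen is easy to script. A sketch: only the Flash Attention numbers below come from the actual traces; the other values are illustrative, chosen to match the +35%/+52% growth mentioned above:&lt;/p&gt;

```python
def ratio_outliers(before_ms, after_ms, threshold=2.0):
    """Flag categories whose total GPU time grew by more than threshold-x."""
    flags = {}
    for cat, b in before_ms.items():
        a = after_ms.get(cat, 0.0)
        if b and a / b >= threshold:
            flags[cat] = round(a / b, 2)
    return flags

# Flash Attention figures are from the traces; the rest are illustrative.
bs1  = {"Persistent GEMM": 100.0, "Normalization": 10.0, "Flash Attention": 1.20}
bs16 = {"Persistent GEMM": 135.0, "Normalization": 15.2, "Flash Attention": 6.17}
print(ratio_outliers(bs1, bs16))  # {'Flash Attention': 5.14}
```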

&lt;h3&gt;
  
  
  Step 3: The Stakeout — same calls, more data
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;reshape_and_cache_flash_kernel&lt;/code&gt; runs 392 times in &lt;em&gt;both&lt;/em&gt; profiles (same count!), but takes 5.15x longer per call in BS=16. More data per call, not more calls.&lt;/p&gt;

&lt;p&gt;Memory: 83% more device-to-device copies (53 vs 29 ops).&lt;/p&gt;

&lt;p&gt;GEMM: BS=16 launches 7 kernel variants (88ms total) that never appear in BS=1, which instead used 7 different variants totaling only 18ms.&lt;/p&gt;
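&lt;p&gt;Clues like the GEMM variant swap fall out of a simple set-diff on the Census output of the two traces. A sketch with placeholder kernel names and times (not the exact variants from the traces):&lt;/p&gt;

```python
def diff_kernels(trace_a, trace_b):
    """Split kernels into those unique to each trace.

    Inputs map kernel name to total ms, as produced by the Census query."""
    only_a = {k: v for k, v in trace_a.items() if k not in trace_b}
    only_b = {k: v for k, v in trace_b.items() if k not in trace_a}
    return only_a, only_b

# Placeholder names and times for illustration.
bs1  = {"nvjet_tst_A": 18.0, "device_kernel": 877.4, "rms_norm_kernel": 12.0}
bs16 = {"nvjet_tst_B": 88.0, "device_kernel": 950.0, "rms_norm_kernel": 18.0}
removed, added = diff_kernels(bs1, bs16)
print(sorted(removed), sorted(added))  # ['nvjet_tst_A'] ['nvjet_tst_B']
```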

&lt;h3&gt;
  
  
  Step 4: The code — root cause
&lt;/h3&gt;

&lt;p&gt;Every clue points to one place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/attention/backends/flash_attn.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;use_cascade_attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;common_prefix_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_lens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="c1"&gt;# Too short prefix — not worth splitting
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;common_prefix_len&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# ← BS=1 exits here
&lt;/span&gt;
    &lt;span class="c1"&gt;# Too few requests — not worth splitting
&lt;/span&gt;    &lt;span class="n"&gt;num_reqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_lens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;num_reqs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# ← BS=1-7 exits here
&lt;/span&gt;
    &lt;span class="c1"&gt;# BS=16 with shared system prompt (&amp;gt;256 tokens):
&lt;/span&gt;    &lt;span class="c1"&gt;# → cascade ON → split prefix/suffix → LSE merge
&lt;/span&gt;    &lt;span class="c1"&gt;# → mathematically equivalent, but FP-different
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# ← determinism breaks here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cascade attention&lt;/strong&gt; splits attention into prefix and suffix passes, then merges with LSE arithmetic. Mathematically equivalent. Floating-point different. That's the determinism break.&lt;/p&gt;
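&lt;p&gt;You can watch that divergence happen with a toy one-dimensional version of the merge (made-up scores and values, nothing from vLLM's actual kernels): run softmax attention in one pass, then as prefix and suffix passes combined with log-sum-exp weights, and compare.&lt;/p&gt;

```python
import math

def attn_1d(scores, values):
    # Single-pass softmax attention over scalar values; returns (output, lse).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    out = sum(e * v for e, v in zip(exps, values)) / denom
    return out, m + math.log(denom)  # lse = log-sum-exp of the scores

scores = [0.3, 2.1, -1.7, 0.9, 1.2, -0.4]
values = [1.0, -2.0, 0.5, 3.0, -1.0, 2.5]

# One pass over everything (the non-cascade path):
full, _ = attn_1d(scores, values)

# Prefix pass + suffix pass, merged with LSE weights (the cascade path):
o_p, lse_p = attn_1d(scores[:3], values[:3])
o_s, lse_s = attn_1d(scores[3:], values[3:])
m = max(lse_p, lse_s)
lse = m + math.log(math.exp(lse_p - m) + math.exp(lse_s - m))
merged = math.exp(lse_p - lse) * o_p + math.exp(lse_s - lse) * o_s

# Same math, different rounding order: agreement to roughly machine
# epsilon, but not guaranteed bit-identical -- the determinism break.
print(full, merged, abs(full - merged))
```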

&lt;p&gt;&lt;strong&gt;The fix: &lt;code&gt;disable_cascade_attn=True&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;The Evidence Board&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clue&lt;/th&gt;
&lt;th&gt;Evidence&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5.15x attention slowdown&lt;/td&gt;
&lt;td&gt;&lt;code&gt;reshape_and_cache_flash_kernel&lt;/code&gt;: 1.20ms → 6.17ms&lt;/td&gt;
&lt;td&gt;More data per call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;83% more D-to-D copies&lt;/td&gt;
&lt;td&gt;29 → 53 ops, 81 → 163 MB&lt;/td&gt;
&lt;td&gt;Internal tensor splitting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7 new GEMM variants&lt;/td&gt;
&lt;td&gt;88ms new vs 18ms removed&lt;/td&gt;
&lt;td&gt;Autotuner adapting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Identical kernel count&lt;/td&gt;
&lt;td&gt;8,955 in both profiles&lt;/td&gt;
&lt;td&gt;Same graph, different PATH&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;All clues →&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CASCADE ATTENTION&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;flash_attn.py:673&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;The cheat sheet&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│  NSYS SQLITE CHEAT SHEET                                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  THE 4 TABLES:                                              │
│    CUPTI_ACTIVITY_KIND_KERNEL  → kernel executions          │
│    CUPTI_ACTIVITY_KIND_MEMCPY  → memory transfers           │
│    CUPTI_ACTIVITY_KIND_MEMSET  → memory fills               │
│    NVTX_EVENTS                 → human-added markers        │
│    + StringIds                 → name lookup                │
│                                                             │
│  THE JOIN (everywhere):                                     │
│    JOIN StringIds s ON k.shortName = s.id                   │
│                                                             │
│  TIMESTAMPS:  nanoseconds                                   │
│    / 1e3 = μs    / 1e6 = ms    / 1e9 = seconds             │
│                                                             │
│  COPYKIND:  1=CPU→GPU  2=GPU→CPU  8=GPU→GPU                │
│                                                             │
│  GRID/BLOCK:                                                │
│    grid × block = total threads                             │
│    grid=132 → all H200 SMs active                           │
│    grid=1 → one SM, 131 idle                                │
│                                                             │
│  DANGER ZONE:                                               │
│    *cublas* → non-deterministic GEMM                        │
│    *splitK* → non-deterministic reduction                   │
│    cascade/merge_attn → FP divergence                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;Your first 10 minutes&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Capture&lt;/span&gt;
nsys profile &lt;span class="nt"&gt;-o&lt;/span&gt; my_trace &lt;span class="nt"&gt;--stats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;python my_script.py

&lt;span class="c"&gt;# 2. Export to SQLite&lt;/span&gt;
nsys &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sqlite my_trace.nsys-rep

&lt;span class="c"&gt;# 3. Find your bottleneck&lt;/span&gt;
python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
import sqlite3
conn = sqlite3.connect('my_trace.sqlite')
cur = conn.cursor()
cur.execute('''
    SELECT s.value, COUNT(*), ROUND(SUM(k.end-k.start)/1e6,2) AS ms
    FROM CUPTI_ACTIVITY_KIND_KERNEL k
    JOIN StringIds s ON k.shortName = s.id
    GROUP BY s.value ORDER BY ms DESC LIMIT 10
''')
for row in cur.fetchall():
    print(f'{row[0]:50s}  calls={row[1]:5d}  total={row[2]:8.2f} ms')
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In 10 minutes you'll know which kernel is your bottleneck. GPU profiling is not a dark art. It's four tables, one JOIN, and four queries.&lt;/p&gt;

&lt;p&gt;The answer is in the trace. Go look.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All data from real H200 GPU traces: 8,955 kernel launches during Qwen3-VL-2B inference with vLLM 0.15.2.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>profiling</category>
      <category>performance</category>
      <category>cuda</category>
    </item>
  </channel>
</rss>
