<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Myoungho Shin</title>
    <description>The latest articles on Forem by Myoungho Shin (@codinginavan).</description>
    <link>https://forem.com/codinginavan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3788007%2F09e840db-b49e-41e5-9eca-f28d4329e1d4.png</url>
      <title>Forem: Myoungho Shin</title>
      <link>https://forem.com/codinginavan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/codinginavan"/>
    <language>en</language>
    <item>
      <title>Memory Coalescing: Same computation, 6x Performance Difference</title>
      <dc:creator>Myoungho Shin</dc:creator>
      <pubDate>Thu, 09 Apr 2026 18:39:47 +0000</pubDate>
      <link>https://forem.com/codinginavan/memory-coalescing-same-computation-6x-performance-difference-339</link>
      <guid>https://forem.com/codinginavan/memory-coalescing-same-computation-6x-performance-difference-339</guid>
      <description>&lt;p&gt;In software engineering, if two approaches are both O(n), that is often good enough for the discussion.&lt;br&gt;
But in low-level or performance engineering, that is not the end of the story. Even when two algorithms have the same time complexity, the actual performance can be very different depending on how they access memory. &lt;/p&gt;

&lt;p&gt;A simple example is iterating through an array versus a linked list. Both are O(n), but arrays are usually much faster in practice because their memory layout is contiguous, which allows the CPU to use caches much more efficiently.&lt;/p&gt;

&lt;p&gt;The same idea applies on GPUs too, but the effect is often much bigger because many threads are accessing memory at the same time.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is Memory Coalescing?
&lt;/h3&gt;

&lt;p&gt;On NVIDIA GPUs, threads execute in groups called warps, which contain 32 threads.&lt;/p&gt;

&lt;p&gt;When those threads access memory in a well-structured way, the GPU can combine their requests into a small number of memory transactions. That is called memory coalescing.&lt;/p&gt;

&lt;p&gt;When the access pattern is poor, the opposite happens. Instead of serving the whole warp efficiently, the GPU ends up issuing many separate memory transactions. That wastes bandwidth and increases latency.&lt;/p&gt;

&lt;p&gt;So the idea is simple: neighboring threads should access neighboring memory whenever possible.&lt;/p&gt;
&lt;h2&gt;
  
  
  Measuring It in Practice
&lt;/h2&gt;

&lt;p&gt;The concept itself is well known, but measuring it in real code is not always convenient.&lt;/p&gt;

&lt;p&gt;Tools like NVIDIA Nsight Compute usually require attaching a profiler and replaying kernels. That is fine for deep analysis, but it is not something you continuously leave on during normal execution.&lt;/p&gt;

&lt;p&gt;With GPUFlight, I wanted to measure this kind of issue continuously during normal runs, without a debugger and without replaying the kernel.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Setup: Two Matmul Kernels
&lt;/h2&gt;

&lt;p&gt;For this example, I used two simple matrix multiplication kernels:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;C = A × B&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Both kernels compute the exact same result. The only difference is how the work is assigned to threads.&lt;/p&gt;
&lt;h3&gt;
  
  
  Row-per-thread
&lt;/h3&gt;

&lt;p&gt;Each thread computes one row of the output matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;matmul_row_per_thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                      &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Col-per-thread&lt;/strong&gt; — Each thread computes one column of the output matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;matmul_col_per_thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                      &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same math. Same number of floating-point operations. The only difference is which dimension maps to &lt;code&gt;threadIdx.x&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That small mapping change turns out to matter a lot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters: How GPUs Read Memory
&lt;/h2&gt;

&lt;p&gt;A GPU does not read one float at a time in the way people often imagine. When a warp executes a load instruction, the hardware tries to combine the addresses from all 32 threads into as few memory transactions as possible.&lt;/p&gt;

&lt;p&gt;In the best case, all 32 threads access consecutive floats, and the warp can be served efficiently.&lt;/p&gt;

&lt;p&gt;In the worst case, each thread touches a different cache line, so the GPU ends up issuing many separate transactions. Most of the fetched data is not even used by that warp.&lt;/p&gt;

&lt;p&gt;That is exactly what happens here.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;matmul_row_per_thread&lt;/code&gt;, adjacent threads (thread 0, 1, 2, ...) are assigned rows 0, 1, 2, .... When they read &lt;code&gt;A[row * K + i]&lt;/code&gt;, thread 0 reads address &lt;code&gt;0*K + i&lt;/code&gt; and thread 1 reads &lt;code&gt;1*K + i&lt;/code&gt; — these are K floats apart. With K=256, that's a stride of 1024 bytes between adjacent threads. Every thread hits a different cache line.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;matmul_col_per_thread&lt;/code&gt;, adjacent threads access columns 0, 1, 2, .... When they read &lt;code&gt;B[i * N + col]&lt;/code&gt;, thread 0 reads &lt;code&gt;i*N + 0&lt;/code&gt; and thread 1 reads &lt;code&gt;i*N + 1&lt;/code&gt; — consecutive addresses. One cache line serves all 32 threads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring with GPUFlight
&lt;/h2&gt;

&lt;p&gt;GPUFlight instruments your CUDA application using CUPTI's SASS metrics and PC sampling APIs. You add a few lines to your code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;"gpufl/gpufl.hpp"&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;InitOptions&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"memory_coalescing_demo"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;profiling_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ProfilingEngine&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;PcSamplingWithSass&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;GFL_SCOPE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"row-per-thread"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;matmul_row_per_thread&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d_B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d_C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;GFL_SCOPE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"col-per-thread"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;matmul_col_per_thread&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d_B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d_C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;generateReport&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPUFlight collects data during normal execution — no debugger, no replay, no kernel serialization.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;Here's the report from an RTX 5060 (Blackwell, sm_120):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  matmul_row_per_thread  (13,268 stall samples)
  ------------------------------------------------------------------
    Stalls:
      Wait                           4,592   34.6%  #######
      Wait (idle)                    4,298   32.4%  ######
      Long Scoreboard                1,441   10.9%  ##
      Long Scoreboard (idle)         1,376   10.4%  ##
      Branch Resolving                 459    3.5%  #
      Selected                         351    2.6%  #
    Instructions:
      Warp Insts:                 12,042,560
      Thread Insts:              385,361,920
      Warp Efficiency:            32.0 / 32 (100.0%)
    Memory:
      Global Sectors:             69,468,160
      Ideal Sectors:              10,518,528
      Memory Efficiency:               15.1%
    Hints:
      * Low memory efficiency (15%) — consider coalesced access
        patterns or shared memory tiling.

  matmul_col_per_thread
  ------------------------------------------------------------------
    Instructions:
      Warp Insts:                 10,428,736
      Thread Insts:              333,719,552
      Warp Efficiency:            32.0 / 32 (100.0%)
    Memory:
      Global Sectors:             10,518,528
      Ideal Sectors:              10,518,528
      Memory Efficiency:              100.0%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Breaking Down the Numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Memory Efficiency: 15% vs 100%
&lt;/h3&gt;

&lt;p&gt;This is the main number to look at. GPUFlight measures two things per kernel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Global Sectors&lt;/strong&gt;: actual 32-byte memory sectors transferred&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ideal Sectors&lt;/strong&gt;: minimum sectors needed if every access were perfectly coalesced&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel&lt;/th&gt;
&lt;th&gt;Actual Sectors&lt;/th&gt;
&lt;th&gt;Ideal Sectors&lt;/th&gt;
&lt;th&gt;Efficiency&lt;/th&gt;
&lt;th&gt;Waste&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Row-per-thread&lt;/td&gt;
&lt;td&gt;69,468,160&lt;/td&gt;
&lt;td&gt;10,518,528&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15.1%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6.6×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Col-per-thread&lt;/td&gt;
&lt;td&gt;10,518,528&lt;/td&gt;
&lt;td&gt;10,518,528&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.0×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The row-per-thread kernel transfers &lt;strong&gt;6.6× more data&lt;/strong&gt; than necessary. For every useful float, the GPU fetches an entire cache line that only one thread uses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stall Analysis: Where the Time Goes
&lt;/h3&gt;

&lt;p&gt;PC sampling tells us what each warp was doing when sampled:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Wait (34.6%) + Wait idle (32.4%) = 67%&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long Scoreboard (10.9%)&lt;/strong&gt; — 
That means a large portion of the time, the warps are not doing useful math. They are mostly waiting for memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the part I like most about seeing the data together: the memory inefficiency is not just an abstract metric. You can see it show up directly in the stall breakdown.&lt;/p&gt;

&lt;p&gt;The col-per-thread kernel has so few stalls that PC sampling barely accumulates much data there. It simply finishes too quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wall-Clock Impact
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Row-per-thread (uncoalesced): 245 ms
Col-per-thread (coalesced):   155 ms
Speedup: 1.6×
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The coalesced version is &lt;strong&gt;1.6× faster&lt;/strong&gt; on this setup. &lt;/p&gt;

&lt;p&gt;That is already a meaningful gain, and this is from a very small change in how work is mapped to threads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Warp Efficiency Can Be Misleading
&lt;/h3&gt;

&lt;p&gt;Both kernels show 100% warp efficiency (32/32 active threads). That means there is no thread divergence here. Every thread in each warp follows the same control flow.&lt;/p&gt;

&lt;p&gt;If you only looked at warp efficiency, both kernels would look healthy.&lt;/p&gt;

&lt;p&gt;But they are not equally healthy. The real problem is memory access, and memory efficiency exposes it immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GPUFlight Collects Under the Hood
&lt;/h2&gt;

&lt;p&gt;GPUFlight uses two CUPTI mechanisms that run during normal execution:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SASS Metrics&lt;/strong&gt; — The GPU binary is patched at load time to count per-instruction execution, thread activity, and memory sector usage. This is how we get the &lt;code&gt;Global Sectors&lt;/code&gt; and &lt;code&gt;Ideal Sectors&lt;/code&gt; numbers. No sampling bias — every instruction is counted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PC Sampling&lt;/strong&gt; — The hardware periodically interrupts each SM and records what every warp is doing: executing, or stalled and why. This gives us the stall reason distribution (Wait, Long Scoreboard, etc.).&lt;/p&gt;

&lt;p&gt;GPUFlight also disassembles the GPU binary (SASS assembly) so you can see exactly which instructions are hot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*0x2a0*/&lt;/span&gt; &lt;span class="n"&gt;LDG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CONSTANT&lt;/span&gt; &lt;span class="n"&gt;R20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;UR12&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;R18&lt;/span&gt;&lt;span class="mf"&gt;.64&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="n"&gt;load&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hot&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="cm"&gt;/*0x2c0*/&lt;/span&gt; &lt;span class="n"&gt;LDG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CONSTANT&lt;/span&gt; &lt;span class="n"&gt;R22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;UR12&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;R16&lt;/span&gt;&lt;span class="mf"&gt;.64&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="n"&gt;load&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hot&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="cm"&gt;/*0x340*/&lt;/span&gt; &lt;span class="n"&gt;FFMA&lt;/span&gt; &lt;span class="n"&gt;R35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R37&lt;/span&gt;                   &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;fused&lt;/span&gt; &lt;span class="n"&gt;multiply&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;LDG.E.CONSTANT&lt;/code&gt; instructions are the global memory loads. In the row-per-thread kernel, these are where 67% of the time is spent waiting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix Is One Line
&lt;/h2&gt;

&lt;p&gt;The entire difference between 15% and 100% memory efficiency comes down to which dimension you assign to &lt;code&gt;threadIdx.x&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- int row = blockIdx.x * blockDim.x + threadIdx.x;  // threads map to rows
&lt;/span&gt;&lt;span class="gi"&gt;+ int col = blockIdx.x * blockDim.x + threadIdx.x;  // threads map to columns
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Same algorithm, same math, same number of operations. Just a different mapping of threads to data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The complete example is available as &lt;code&gt;memory_coalescing_demo.cu&lt;/code&gt; in the &lt;a href="https://github.com/gpu-flight/gpufl-client/blob/main/example/cuda/memory_coalescing_demo.cu" rel="noopener noreferrer"&gt;GPUFlight client repository&lt;/a&gt;. To run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build with GPUFlight&lt;/span&gt;
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DCMAKE_CUDA_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;native
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--target&lt;/span&gt; memory_coalescing_demo

&lt;span class="c"&gt;# Run (admin/root for PC sampling on some platforms)&lt;/span&gt;
./build/example/cuda/memory_coalescing_demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Memory coalescing is one of those concepts that sounds simple when explained in theory, but it becomes much more convincing when you can see the numbers in a real kernel.&lt;/p&gt;

&lt;p&gt;In this example, it is not a tiny optimization. It is the difference between 15% and 100% memory efficiency, 6.6× more memory traffic than necessary, and a 1.6× wall-clock slowdown.&lt;/p&gt;

&lt;p&gt;That is why memory access patterns matter so much on GPUs.&lt;/p&gt;

</description>
      <category>cuda</category>
      <category>gpu</category>
      <category>aiops</category>
      <category>cpp</category>
    </item>
    <item>
      <title>Detecting Thread Divergence with SASS Metrics and GPU Flight</title>
      <dc:creator>Myoungho Shin</dc:creator>
      <pubDate>Tue, 10 Mar 2026 06:54:51 +0000</pubDate>
      <link>https://forem.com/codinginavan/detecting-thread-divergence-with-sass-metrics-and-gpu-flight-kfc</link>
      <guid>https://forem.com/codinginavan/detecting-thread-divergence-with-sass-metrics-and-gpu-flight-kfc</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/codinginavan/profiling-gpu-cuda-getting-started-with-gpu-flights-python-package-1pl8"&gt;previous post&lt;/a&gt; I showed how to set up GPU Flight with Python and read kernel-level profiling data — occupancy, register counts, and resource bottlenecks. That tells you &lt;em&gt;how well&lt;/em&gt; a kernel uses the hardware. But it doesn't tell you what's happening &lt;em&gt;inside&lt;/em&gt; the kernel.&lt;/p&gt;

&lt;p&gt;Today I want to look at one specific problem: &lt;strong&gt;thread divergence&lt;/strong&gt;. When threads within a warp take different code paths, the GPU serializes execution — it runs one branch, then the other, while idle threads wait. If half the threads branch left and half branch right, you're running at 50% efficiency on those instructions.&lt;/p&gt;

&lt;p&gt;GPU Flight's &lt;strong&gt;SASS Metrics engine&lt;/strong&gt; gives you a direct way to measure this. It instruments the GPU at the assembly (SASS) level and reports two key counters per instruction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;smsp__sass_inst_executed&lt;/code&gt;&lt;/strong&gt; — the number of warp-level instruction executions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;smsp__sass_thread_inst_executed&lt;/code&gt;&lt;/strong&gt; — the total number of thread-level instruction executions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ratio &lt;code&gt;thread_executed / (inst_executed × 32)&lt;/code&gt; tells you the average number of active threads per warp at each instruction. If it's 32.0, every thread was active. If it's 16.0, half were diverged. If it's 8.0, only a quarter was doing useful work.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Demo: Five Divergence Patterns
&lt;/h2&gt;

&lt;p&gt;I wrote a small CUDA program with five kernels, each demonstrating a different divergence pattern. The full source is in the GPU Flight repo at &lt;code&gt;example/cuda/sass_divergence_demo.cu&lt;/code&gt;. Here's a summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Expected Active Threads&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;uniformWork&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No divergence (baseline)&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;branchByWarpLane&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;if (threadIdx.x % 2)&lt;/code&gt; — even/odd split&lt;/td&gt;
&lt;td&gt;16 in each branch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;branchByWarpQuad&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;if (threadIdx.x % 4 == 0)&lt;/code&gt; — 1-in-4&lt;/td&gt;
&lt;td&gt;8 in hot path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;earlyExit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Data-dependent early return&lt;/td&gt;
&lt;td&gt;Varies (~16)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;indirectBranch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4-way switch on random data&lt;/td&gt;
&lt;td&gt;Varies (~8)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each kernel is wrapped in a &lt;code&gt;GFL_SCOPE&lt;/code&gt; so GPU Flight can attribute the SASS metrics to the right section.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kernel 1: Uniform Work (Baseline)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;uniformWork&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.01&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every thread does the same thing. No branches inside the loop, no divergence. This is the baseline — you should see &lt;code&gt;thread_executed / inst_executed&lt;/code&gt; close to 32 for the loop body instructions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kernel 2: Even/Odd Divergence
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;branchByWarpLane&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.01&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the classic divergence example. Within every warp, 16 threads go left, 16 go right. The GPU executes both paths sequentially with half the threads masked off each time. The SASS metrics will show ~16 active threads for instructions inside each branch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kernel 3: Quad Divergence
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;branchByWarpQuad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.001&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.0001&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only every 4th thread enters the loop. That's 8 out of 32 threads doing the heavy work while 24 sit idle. Worse than 50/50 — 75% of the warp is wasted during the loop body.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kernel 4: Early Exit
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;earlyExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.01&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.005&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is data-dependent. Threads whose input is below the threshold return early, while the rest do the expensive computation. With random inputs in [0, 1) and a threshold of 0.5, roughly half the threads will exit early. But unlike Kernel 2, the split isn't uniform across warps — some warps might have 20 threads exit, others might have 10.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kernel 5: Data-Dependent Switch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;indirectBranch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.01&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.005&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A 4-way branch driven by random data. On average, each case gets ~8 threads per warp, but the GPU must execute all 4 paths sequentially. This is the worst case — 4x the instruction count for the branch body.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running the Demo
&lt;/h2&gt;

&lt;p&gt;Build and run from the GPU Flight repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/gpu-flight/gpufl-client.git
&lt;span class="nb"&gt;cd &lt;/span&gt;gpufl-client
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--target&lt;/span&gt; sass_divergence_demo
./build/example/cuda/sass_divergence_demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key part is in &lt;code&gt;main()&lt;/code&gt; — initializing GPU Flight with the SASS Metrics engine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;InitOptions&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"sass_divergence_demo"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"sass_divergence"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enable_kernel_details&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sampling_auto_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;profiling_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ProfilingEngine&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SassMetrics&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;profiling_engine&lt;/code&gt; to &lt;code&gt;SassMetrics&lt;/code&gt; tells GPU Flight to instrument every kernel at the SASS level. Each &lt;code&gt;GFL_SCOPE&lt;/code&gt; block then collects per-instruction counters for the kernels launched inside it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results: RTX 3090
&lt;/h2&gt;

&lt;p&gt;Here's what I got running on an NVIDIA GeForce RTX 3090 (Ampere, SM 8.6, 82 SMs) with 1M elements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kernel                    Weighted Avg Active Threads    Instructions
----------------------------------------------------------------------
uniformWork                                      32.0             277
branchByWarpLane                                 16.3             796
branchByWarpQuad                                  8.2             281
earlyExit                                        16.2             280
indirectBranch                                    1.5            1062
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "Weighted Avg Active Threads" is &lt;code&gt;thread_inst_executed / inst_executed&lt;/code&gt; across all SASS instructions in each kernel, weighted by execution count. "Instructions" is the number of unique PC offsets (SASS instructions) instrumented.&lt;/p&gt;

&lt;p&gt;Let's walk through what this tells us:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;uniformWork&lt;/code&gt; — 32.0 active threads.&lt;/strong&gt; Perfect. Every warp runs at full width. This is the expected baseline for a kernel with no divergence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;branchByWarpLane&lt;/code&gt; — 16.3 active threads.&lt;/strong&gt; Very close to the theoretical 16. The slight overshoot comes from instructions outside the branch (the &lt;code&gt;if (idx &amp;lt; n)&lt;/code&gt; guard, loop control, and the final store) where all 32 threads are active. The 796 unique instructions — nearly 3x the baseline — show the cost: the compiler generates separate code for each branch, and both paths must be executed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;branchByWarpQuad&lt;/code&gt; — 8.2 active threads.&lt;/strong&gt; Again close to the theoretical 8 (only 1 in 4 threads enters the loop). Similar instruction count to the baseline since there's only one branch path — but every instruction in the hot loop runs with 75% of threads idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;earlyExit&lt;/code&gt; — 16.2 active threads.&lt;/strong&gt; Matches the expectation for a 50% threshold with random data. Threads that exit early become inactive for the remaining instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;indirectBranch&lt;/code&gt; — 1.5 active threads, 1062 instructions.&lt;/strong&gt; This is the most striking result. A 4-way switch on random data drops the weighted average to just 1.5 active threads per warp — far worse than the other kernels. It also generates the highest instruction count at 1062, nearly 4x the baseline. This is a crucial insight: &lt;strong&gt;divergence doesn't just halve your throughput — multi-way branching on random data can drop you below 5%&lt;/strong&gt; when measured at the instruction level.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means in Practice
&lt;/h2&gt;

&lt;p&gt;Thread divergence is easy to create and hard to notice. Your kernel still produces correct results. But you might be leaving 50-95% of your GPU's compute on the table.&lt;/p&gt;

&lt;p&gt;Here are the common patterns to watch for:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lane-based branching&lt;/strong&gt; — &lt;code&gt;if (threadIdx.x % N)&lt;/code&gt;. This is almost always unintentional. Consider rearranging your data so that threads within a warp take the same path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data-dependent branches&lt;/strong&gt; — like the &lt;code&gt;earlyExit&lt;/code&gt; kernel. If your input distribution is skewed, some warps diverge heavily while others don't. The average might look okay, but the worst warps are bottlenecks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Switch statements on computed values&lt;/strong&gt; — like &lt;code&gt;indirectBranch&lt;/code&gt;. This was the worst offender in our test — each additional case multiplies the predicated instruction overhead.&lt;/p&gt;

&lt;p&gt;The fix depends on the situation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sort or bin your data&lt;/strong&gt; so threads in the same warp hit the same branch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replace branches with predicated arithmetic&lt;/strong&gt; — branchless code runs all threads at full width&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Restructure your algorithm&lt;/strong&gt; so the branch happens at the warp or block level, not the thread level&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>gpu</category>
      <category>cpp</category>
      <category>cuda</category>
      <category>performance</category>
    </item>
    <item>
      <title>Profiling GPU (CUDA) — Getting Started with GPU Flight's Python Package</title>
      <dc:creator>Myoungho Shin</dc:creator>
      <pubDate>Mon, 09 Mar 2026 03:59:53 +0000</pubDate>
      <link>https://forem.com/codinginavan/profiling-gpu-cuda-getting-started-with-gpu-flights-python-package-1pl8</link>
      <guid>https://forem.com/codinginavan/profiling-gpu-cuda-getting-started-with-gpu-flights-python-package-1pl8</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/codinginavan/profiling-gpu-cuda-what-is-actually-limiting-your-kernel-211e"&gt;previous posts&lt;/a&gt; I've been showing how to investigate GPU occupancy utilization and optimize kernels that aren't using the hardware fully. That was just one case — I'll cover more occupancy scenarios in future posts.&lt;/p&gt;

&lt;p&gt;Today, I want to go through how to use GPU Flight in Python, especially with PyTorch. Since GPU Flight is still in active development, the current version is &lt;code&gt;v0.1.0.dev7&lt;/code&gt;. You can install it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;gpufl&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.1.0.dev7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, I highly recommend &lt;strong&gt;building from source inside a CUDA container&lt;/strong&gt;. There are two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prerequisite libraries&lt;/strong&gt; — GPU Flight's backend needs CUPTI, the CUDA runtime, and NVML headers at compile time. Getting these right on a bare system is fiddly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVML support&lt;/strong&gt; — the pre-built PyPI wheel is compiled in a minimal CI environment that doesn't include NVML stubs. This means the wheel works for kernel profiling, but can't collect runtime GPU utilization or VRAM usage. Building from source inside the &lt;code&gt;nvidia/cuda:*-devel&lt;/code&gt; image picks up NVML automatically.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this post, I'll show how to use Docker to set up an environment that's ready to go — with GPU Flight built from source, PyTorch, and Jupyter Lab all pre-installed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dockerfile
&lt;/h2&gt;

&lt;p&gt;Here's the full Dockerfile. It's straightforward — CUDA 13.1 base, PyTorch, GPU Flight, and Jupyter Lab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; nvidia/cuda:13.1.0-devel-ubuntu24.04&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; DEBIAN_FRONTEND=noninteractive&lt;/span&gt;

&lt;span class="c"&gt;# System dependencies (Ubuntu 24.04 ships Python 3.12)&lt;/span&gt;
&lt;span class="c"&gt;# NOTE: cmake/ninja come from pip (build-system.requires needs &amp;gt;=3.31, apt has 3.28)&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    python3 &lt;span class="se"&gt;\
&lt;/span&gt;    python3-venv &lt;span class="se"&gt;\
&lt;/span&gt;    python3-dev &lt;span class="se"&gt;\
&lt;/span&gt;    python3-pip &lt;span class="se"&gt;\
&lt;/span&gt;    git &lt;span class="se"&gt;\
&lt;/span&gt;    curl &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="c"&gt;# Create venv to avoid PEP 668 issues&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv /opt/venv
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="/opt/venv/bin:$PATH"&lt;/span&gt;

&lt;span class="c"&gt;# Upgrade pip&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip

&lt;span class="c"&gt;# Install PyTorch with CUDA 13.1 support&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;torch &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu130

&lt;span class="c"&gt;# Build gpufl from source so it picks up NVML from the CUDA devel image&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; GPUFL_VERSION=main&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 &lt;span class="nt"&gt;--branch&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GPUFL_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;        https://github.com/gpu-flight/gpufl-client.git /tmp/gpufl-client &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;CMAKE_ARGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-DBUILD_TESTING=OFF"&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;       pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"/tmp/gpufl-client[analyzer,viz]"&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /tmp/gpufl-client

&lt;span class="c"&gt;# Install Jupyter&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;jupyterlab

&lt;span class="c"&gt;# Working directory for notebooks&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /workspace&lt;/span&gt;

&lt;span class="c"&gt;# Expose Jupyter port&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8888&lt;/span&gt;

&lt;span class="c"&gt;# Start Jupyter Lab&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", \&lt;/span&gt;
     "--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; — ships Python 3.12 natively, which is what GPU Flight requires. No PPA hacks needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;devel&lt;/code&gt; image&lt;/strong&gt; — we use &lt;code&gt;nvidia/cuda:13.1.0-devel-ubuntu24.04&lt;/code&gt; because the &lt;code&gt;devel&lt;/code&gt; variant includes CUPTI, CUDA headers, and &lt;strong&gt;NVML stubs&lt;/strong&gt; that GPU Flight's backend needs at compile time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building from source&lt;/strong&gt; — we clone the repo and build with &lt;code&gt;pip install&lt;/code&gt; rather than using the pre-built PyPI wheel. This is important: the &lt;code&gt;devel&lt;/code&gt; image has NVML stubs at &lt;code&gt;/usr/local/cuda/lib64/stubs/libnvidia-ml.so&lt;/code&gt;, so CMake detects them and compiles in the NVML collector. The pre-built wheel doesn't have this, which means no GPU utilization or VRAM monitoring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyTorch &lt;code&gt;cu130&lt;/code&gt;&lt;/strong&gt; — at the time of writing, PyTorch doesn't publish a &lt;code&gt;cu131&lt;/code&gt; wheel yet. The &lt;code&gt;cu130&lt;/code&gt; build is forward-compatible with the CUDA 13.1 runtime in the container, so this works fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No token&lt;/strong&gt; — Jupyter starts without authentication. This is fine for local development; don't expose this to the internet.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Building and Running
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;You need two things on your host machine:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; — any recent version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA Container Toolkit&lt;/strong&gt; — this lets Docker containers access your GPU&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Having an NVIDIA driver installed on your host is not enough. Docker doesn't know how to talk to your GPU on its own — you need the NVIDIA Container Toolkit to bridge that gap. Without it, &lt;code&gt;--gpus all&lt;/code&gt; will fail with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check if it's already installed by running &lt;code&gt;nvidia-ctk --version&lt;/code&gt;. If not, here's how to set it up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add the NVIDIA container toolkit repo&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://nvidia.github.io/libnvidia-container/gpgkey &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo &lt;/span&gt;gpg &lt;span class="nt"&gt;--dearmor&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-L&lt;/span&gt; https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/nvidia-container-toolkit.list

&lt;span class="c"&gt;# Install and configure&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nvidia-container-toolkit
&lt;span class="nb"&gt;sudo &lt;/span&gt;nvidia-ctk runtime configure &lt;span class="nt"&gt;--runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last &lt;code&gt;systemctl restart&lt;/code&gt; is easy to forget — Docker needs to be restarted after the runtime is configured, or it won't pick up the new GPU capability.&lt;/p&gt;

&lt;p&gt;You can verify it worked with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--gpus&lt;/span&gt; all nvidia/cuda:13.1.0-base-ubuntu24.04 nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see your GPU listed, you're good to go.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build the Image
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; gpufl-python &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will take a few minutes the first time — mostly downloading PyTorch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run the Container
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="nt"&gt;-p&lt;/span&gt; 8888:8888 &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/notebooks:/workspace gpufl-python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breaking that down:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--gpus all&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Passes all GPUs into the container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-p 8888:8888&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Maps Jupyter's port to your host&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-v $(pwd)/notebooks:/workspace&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mounts a local folder so your notebooks persist&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Connect
&lt;/h3&gt;

&lt;p&gt;Open your browser and go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8888
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll land in Jupyter Lab with GPU Flight, PyTorch, and a CUDA-capable GPU ready to go.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Smoke Test
&lt;/h2&gt;

&lt;p&gt;Create a new notebook and run this to verify everything is working:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;gpufl&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;gpufl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProfilingEngine&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PyTorch: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CUDA available: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Device: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize GPU Flight
&lt;/span&gt;&lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;smoke-test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./smoke_test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;sampling_auto_start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;enable_kernel_details&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;enable_stack_trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;profiling_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ProfilingEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RangeProfiler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run a simple operation
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Scope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RandomGeneration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Scope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a @ b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;gpufl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU Flight logs written!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this, you should see &lt;code&gt;*.log&lt;/code&gt; files in your working directory. These are your GPU Flight recordings — every kernel launch, memory copy, and timing event that happened during that matrix multiply.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9h1s6nst0bym86n8g6bs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9h1s6nst0bym86n8g6bs.png" alt="Running Test"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Analyzing the Results
&lt;/h2&gt;

&lt;p&gt;GPU Flight's Python analyzer can load those logs directly in the notebook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;gpufl.analyzer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GpuFlightSession&lt;/span&gt;

&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GpuFlightSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;smoke_test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;GpuFlightSession&lt;/code&gt; takes two main arguments: the directory where logs live, and the &lt;code&gt;log_prefix&lt;/code&gt; matching your &lt;code&gt;log_path&lt;/code&gt; from init. It automatically finds and loads &lt;code&gt;smoke_test.device.log&lt;/code&gt;, &lt;code&gt;smoke_test.scope.log&lt;/code&gt;, and &lt;code&gt;smoke_test.system.log&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;print_summary()&lt;/code&gt; gives you a quick dashboard — total duration, kernel count, GPU busy time, average utilization, and peak VRAM.&lt;/p&gt;

&lt;p&gt;Now let's look at the kernel hotspots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inspect_hotspots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a Rich-formatted table of your hottest kernels with occupancy, register usage, shared memory, and the per-resource occupancy breakdown showing exactly what's limiting each kernel.&lt;/p&gt;

&lt;p&gt;Here's what that actually looks like — this is real output from the matrix multiply we just ran:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbh2ra05j0eh83xg0dym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbh2ra05j0eh83xg0dym.png" alt="Analyzing Logs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's look at 33.3% occupancy. That sounds bad, right? Let's break it down.&lt;/p&gt;

&lt;p&gt;The kernel is &lt;code&gt;ampere_sgemm_128x64_nn&lt;/code&gt; — cuBLAS's single-precision matrix multiply. It uses &lt;strong&gt;122 registers per thread&lt;/strong&gt;. That's a lot. Let's trace through what happens on an Ampere SM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;128 threads per block&lt;/strong&gt; = 4 warps per block&lt;/li&gt;
&lt;li&gt;122 regs/thread × 32 threads/warp = 3,904 → rounded up to the hardware allocation granularity of 256 → &lt;strong&gt;4,096 regs/warp&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;4 warps × 4,096 = &lt;strong&gt;16,384 registers per block&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;An Ampere SM has 65,536 registers total → 65,536 / 16,384 = &lt;strong&gt;4 blocks max&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;4 blocks × 4 warps = 16 active warps out of 48 max = &lt;strong&gt;33.3%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The breakdown confirms it: &lt;code&gt;reg 33.3%&lt;/code&gt; is the bottleneck, while shared memory (66.7%), warps (100%), and block count (100%) all have headroom.&lt;/p&gt;

&lt;h3&gt;
  
  
  But Is This Actually a Problem?
&lt;/h3&gt;

&lt;p&gt;Not necessarily. If the algorithm itself doesn't require all those registers, high register usage might be a problem — but it could also be by design. This is a good example of why occupancy alone doesn't tell the whole story — you need to understand &lt;em&gt;what's limiting it&lt;/em&gt; and &lt;em&gt;whether that tradeoff makes sense for the workload&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you saw 33% occupancy with &lt;code&gt;limiting_resource: shared_mem&lt;/code&gt; on your own custom kernel, that might be worth investigating.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Now that you have a working environment, you can start profiling your own models. The occupancy breakdown makes it easy to spot which kernels are underutilizing the GPU and — more importantly — &lt;em&gt;why&lt;/em&gt;. Not every low-occupancy kernel is a problem, but when one is, you'll know exactly which resource to optimize.&lt;/p&gt;

&lt;p&gt;In the next post, I'll cover GPU Flight's &lt;strong&gt;profiling engines&lt;/strong&gt; — PC sampling, SASS metrics, and the range profiler — which let you go beyond kernel metadata and collect hardware-level data about what's happening &lt;em&gt;inside&lt;/em&gt; the GPU while your kernels run.&lt;/p&gt;

</description>
      <category>cuda</category>
      <category>cpp</category>
      <category>gpu</category>
      <category>python</category>
    </item>
    <item>
      <title>Profiling GPU (CUDA) — What Is Actually Limiting Your Kernel?</title>
      <dc:creator>Myoungho Shin</dc:creator>
      <pubDate>Mon, 02 Mar 2026 01:19:03 +0000</pubDate>
      <link>https://forem.com/codinginavan/profiling-gpu-cuda-what-is-actually-limiting-your-kernel-211e</link>
      <guid>https://forem.com/codinginavan/profiling-gpu-cuda-what-is-actually-limiting-your-kernel-211e</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/codinginavan/profiling-gpu-cuda-introducing-gpu-flight-4p67"&gt;last post&lt;/a&gt; I introduced &lt;strong&gt;GPU Flight&lt;/strong&gt; — a lightweight CUDA observability tool that acts like a flight recorder for your GPU. We covered what it collects: system metrics, device capabilities, and per-kernel events.&lt;/p&gt;

&lt;p&gt;Today I want to talk about one specific metric that GPU Flight captures: &lt;strong&gt;occupancy&lt;/strong&gt;. It's one of the most important numbers for understanding GPU performance, and also one of the most misunderstood.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Occupancy?
&lt;/h2&gt;

&lt;p&gt;A GPU is organized around &lt;strong&gt;Streaming Multiprocessors (SMs)&lt;/strong&gt;. Each SM can run many threads simultaneously — not by context-switching like a CPU, but by actually running them in parallel. The unit of scheduling on an SM is a &lt;strong&gt;warp&lt;/strong&gt;: a group of 32 threads that execute the same instruction in lockstep.&lt;/p&gt;

&lt;p&gt;An SM has a fixed warp budget — say, 48 warps on a typical Ampere GPU. When you launch a kernel with blocks of 256 threads (8 warps each), the SM can hold up to 6 blocks concurrently to fill those 48 warp slots. If something prevents that — too many registers, too much shared memory — fewer blocks fit, and some warp slots sit idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Occupancy&lt;/strong&gt; measures how well those warp slots are filled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;occupancy = active warps / maximum warps per SM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A value of 1.0 means every slot is in use. A value of 0.5 means half the SM's compute capacity is being wasted while your kernel runs.&lt;/p&gt;




&lt;h2&gt;
  
  
  How GPU Flight Captures It
&lt;/h2&gt;

&lt;p&gt;GPU Flight records occupancy automatically for every kernel launch. No code changes needed — just initialize with &lt;code&gt;enableKernelDetails: true&lt;/code&gt; and it shows up in the log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kernel_event"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"_Z18block_reduce_naivePKfPfi"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.833333&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"num_regs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"static_shared_bytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dyn_shared_bytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"block"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"(256,1,1)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"grid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"(16384,1,1)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_active_blocks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, GPU Flight calls &lt;code&gt;cudaOccupancyMaxActiveBlocksPerMultiprocessor&lt;/code&gt; at kernel launch time to get &lt;code&gt;max_active_blocks&lt;/code&gt;, then divides by the SM's warp budget to compute &lt;code&gt;occupancy&lt;/code&gt;. This happens inside the CUPTI callback — zero overhead to your kernel execution.&lt;/p&gt;

&lt;p&gt;That &lt;code&gt;0.833333&lt;/code&gt; immediately tells you something is off. This kernel only fills 5 out of 6 possible concurrent blocks on each SM. Some compute is being left on the table.&lt;/p&gt;




&lt;h2&gt;
  
  
  But What Is Actually Causing It?
&lt;/h2&gt;

&lt;p&gt;Here's where a single number hits its limit.&lt;/p&gt;

&lt;p&gt;Is it registers? Shared memory? The hardware block count cap? Looking at the log fields, you can make an educated guess — &lt;code&gt;static_shared_bytes: 16384&lt;/code&gt; is 16 KB of shared memory per block, which is pretty large. But you still have to do the math yourself against your specific GPU's properties to confirm.&lt;/p&gt;

&lt;p&gt;That manual detective work is exactly what I wanted to eliminate. So GPU Flight now also computes a &lt;strong&gt;per-resource occupancy breakdown&lt;/strong&gt; and identifies the limiting resource automatically. Let me show what this looks like with a concrete kernel.&lt;/p&gt;

&lt;h3&gt;
  
  
  The kernel
&lt;/h3&gt;

&lt;p&gt;Here's a simple parallel block reduction — it sums an array by having all 256 threads in a block cooperate through shared memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;block_reduce_naive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;__shared__&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// 16 KB — statically reserved&lt;/span&gt;

    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;gid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Load one element per thread into shared memory&lt;/span&gt;
    &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gid&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;__syncthreads&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Reduce in shared memory — each step halves the active threads&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="n"&gt;__syncthreads&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Thread 0 writes the block's result&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Launched with 256 threads per block across 4M elements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;GRID&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// ~16384 blocks&lt;/span&gt;
&lt;span class="n"&gt;block_reduce_naive&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;GRID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d_in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d_out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing unusual here — this is a textbook reduction. But GPU Flight flags a problem immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  What GPU Flight sees
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="mf"&gt;0.833333&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reg_occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"smem_occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;0.833333&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"warp_occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"block_occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"limiting_resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"shared_mem"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;*_occupancy&lt;/code&gt; field answers: &lt;em&gt;"if only this constraint existed, what would occupancy be?"&lt;/em&gt; The &lt;code&gt;limiting_resource&lt;/code&gt; field names the one that's actually binding. Here — &lt;code&gt;smem_occupancy&lt;/code&gt; matches &lt;code&gt;occupancy&lt;/code&gt; and everything else is 1.0 — shared memory is definitively the culprit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why
&lt;/h3&gt;

&lt;p&gt;The problem is &lt;code&gt;__shared__ float smem[4096]&lt;/code&gt;. Static shared memory is sized at compile time and reserved in full for the block's entire lifetime — even if the kernel only uses part of it. With 256 threads per block, this reduction only ever touches &lt;code&gt;smem[0]&lt;/code&gt; through &lt;code&gt;smem[255]&lt;/code&gt;, but all 4096 floats (16 KB) are locked up on the SM regardless. Every block is paying a 16 KB reservation it doesn't actually need, and that prevents the SM from scheduling as many concurrent blocks as the warp budget would otherwise allow.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix
&lt;/h3&gt;

&lt;p&gt;Switch to &lt;strong&gt;dynamic shared memory&lt;/strong&gt;, which is sized at launch time rather than compiled in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="k"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;block_reduce_optimized&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;extern&lt;/span&gt; &lt;span class="k"&gt;__shared__&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="c1"&gt;// size comes from the launch call&lt;/span&gt;

    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threadIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;gid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gid&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;__syncthreads&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blockDim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="n"&gt;__syncthreads&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;blockIdx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smem&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kernel body is completely unchanged. The only differences are &lt;code&gt;extern __shared__&lt;/code&gt; instead of a fixed-size array, and passing the size as the third launch argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cuda"&gt;&lt;code&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;smem_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// 256 × 4 = 1 KB&lt;/span&gt;
&lt;span class="n"&gt;block_reduce_optimized&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;GRID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;smem_bytes&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d_in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d_out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shared memory footprint drops from 16 KB to 1 KB per block — 16× smaller — and now the SM can fit all 6 concurrent blocks instead of 5.&lt;/p&gt;

&lt;p&gt;GPU Flight confirms the fix worked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"occupancy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"limiting_resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warps"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;"warps"&lt;/code&gt; as the limiting resource means full occupancy — every SM warp slot is filled and shared memory is no longer in the way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full Sample Code&lt;/strong&gt;: &lt;a href="https://github.com/gpu-flight/gpufl-client/blob/main/example/cuda/occupancy_demo.cu" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;




</description>
      <category>performance</category>
      <category>cuda</category>
      <category>gpu</category>
      <category>cpp</category>
    </item>
    <item>
      <title>Profiling GPU (CUDA) — Introducing GPU Flight</title>
      <dc:creator>Myoungho Shin</dc:creator>
      <pubDate>Tue, 24 Feb 2026 01:32:26 +0000</pubDate>
      <link>https://forem.com/codinginavan/profiling-gpu-cuda-introducing-gpu-flight-4p67</link>
      <guid>https://forem.com/codinginavan/profiling-gpu-cuda-introducing-gpu-flight-4p67</guid>
      <description>&lt;p&gt;Last year, I took a GPU programming course at Johns Hopkins University as part of my graduate studies, where I learned CUDA programming. For my final project, I built a lightweight GPU monitoring and profiling tool focused on CUDA.&lt;/p&gt;

&lt;p&gt;I enjoyed the process so much that I decided to continue developing it beyond the course.&lt;/p&gt;

&lt;p&gt;In this post, I’d like to briefly introduce the project:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPU Flight&lt;/strong&gt; — a 100% open-source GPU observability tool&lt;br&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/gpu-flight/gpufl-client" rel="noopener noreferrer"&gt;https://github.com/gpu-flight/gpufl-client&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Why I Started GPU Flight
&lt;/h2&gt;

&lt;p&gt;When profiling a CUDA application, you typically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install profiling tools such as &lt;strong&gt;Nsight&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Or manually integrate &lt;strong&gt;CUPTI&lt;/strong&gt; into your application, which often makes the code complex and difficult to manage&lt;/li&gt;
&lt;li&gt;Deal with additional complexity in cloud or containerized environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This workflow can be inconvenient — especially in production systems.&lt;/p&gt;

&lt;p&gt;I wanted something lighter.&lt;br&gt;&lt;br&gt;
Something that works more like a &lt;strong&gt;flight recorder for GPUs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I built GPU Flight.&lt;/p&gt;

&lt;p&gt;Instead of requiring heavy tooling at runtime, GPU Flight writes structured profiling logs directly on the host machine. A separate component (&lt;strong&gt;GPUFL Agent&lt;/strong&gt;) crawls these log files and forwards them to a backend service or other destinations.&lt;/p&gt;

&lt;p&gt;This makes GPU observability more flexible and easier to integrate into distributed systems.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is GPU Flight?
&lt;/h2&gt;

&lt;p&gt;GPU Flight is designed to be &lt;strong&gt;lightweight and modular&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you only need monitoring, the overhead is minimal.&lt;/li&gt;
&lt;li&gt;Enabling deeper profiling provides more detailed metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to expose useful GPU metrics so you can clearly understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How the GPU manages resources&lt;/li&gt;
&lt;li&gt;How your program utilizes GPU resources&lt;/li&gt;
&lt;li&gt;Where performance bottlenecks occur&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;

&lt;p&gt;GPU Flight currently consists of several components:&lt;/p&gt;
&lt;h3&gt;
  
  
  1️⃣ gpufl-client
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/gpu-flight/gpufl-client" rel="noopener noreferrer"&gt;https://github.com/gpu-flight/gpufl-client&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;The client library that users embed into their applications for monitoring and profiling.&lt;/p&gt;


&lt;h3&gt;
  
  
  2️⃣ gpufl-agent
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/gpu-flight/gpufl-agent" rel="noopener noreferrer"&gt;https://github.com/gpu-flight/gpufl-agent&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Despite the name, this is &lt;strong&gt;not an AI agent 🙂&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
It tracks log files and forwards profiling data to the configured destination.&lt;/p&gt;


&lt;h3&gt;
  
  
  3️⃣ gpufl-desktop
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/gpu-flight/gpufl-desktop" rel="noopener noreferrer"&gt;https://github.com/gpu-flight/gpufl-desktop&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Originally, I planned to build a desktop viewer. Due to time constraints and the need for better cross-platform accessibility, I pivoted to a web-based frontend.&lt;br&gt;
I am currently keeping the web frontend and backend repositories private as I develop them into a hosted cloud platform. To ensure the open-source community can still easily parse and utilize the trace logs locally, I am providing a lightweight Python viewer alongside the open-source C++ client.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Metrics Does GPU Flight Support?
&lt;/h2&gt;

&lt;p&gt;GPU Flight captures observability at multiple layers.&lt;/p&gt;
&lt;h3&gt;
  
  
  1️⃣ System &amp;amp; GPU Monitoring (NVML)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Host memory usage&lt;/li&gt;
&lt;li&gt;GPU memory usage (used/free/total)&lt;/li&gt;
&lt;li&gt;GPU utilization&lt;/li&gt;
&lt;li&gt;Memory utilization&lt;/li&gt;
&lt;li&gt;Temperature&lt;/li&gt;
&lt;li&gt;Power consumption&lt;/li&gt;
&lt;li&gt;Clock speeds (GFX / SM / Memory)&lt;/li&gt;
&lt;li&gt;PCIe RX/TX bandwidth&lt;/li&gt;
&lt;li&gt;Power and thermal throttling flags&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example JSON snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system_sample"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"util_gpu"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"temp_c"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"power_mw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;54415&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"clk_sm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1740&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2️⃣ CUDA Device Capabilities
&lt;/h3&gt;

&lt;p&gt;Static architectural information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compute capability&lt;/li&gt;
&lt;li&gt;L2 cache size&lt;/li&gt;
&lt;li&gt;Shared memory per block&lt;/li&gt;
&lt;li&gt;Registers per block&lt;/li&gt;
&lt;li&gt;SM count&lt;/li&gt;
&lt;li&gt;Warp size&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3️⃣ CUDA API &amp;amp; Kernel Events (CUPTI)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;API enter/exit timestamps&lt;/li&gt;
&lt;li&gt;Kernel execution start/end timestamps&lt;/li&gt;
&lt;li&gt;Grid/block dimensions&lt;/li&gt;
&lt;li&gt;Shared memory usage&lt;/li&gt;
&lt;li&gt;Register usage&lt;/li&gt;
&lt;li&gt;Occupancy&lt;/li&gt;
&lt;li&gt;Correlation IDs&lt;/li&gt;
&lt;li&gt;Memory copy events (HtoD, DtoH)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Python Support
&lt;/h2&gt;

&lt;p&gt;GPU Flight is also being extended to support Python applications that use CUDA (e.g., PyTorch).&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/gpu-flight/gpufl-client/blob/main/example/python/03_pytorch_benchmark.py" rel="noopener noreferrer"&gt;https://github.com/gpu-flight/gpufl-client/blob/main/example/python/03_pytorch_benchmark.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This allows profiling GPU-heavy ML workloads without deeply modifying existing code.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;In the next post, I’ll walk through a minimal CUDA example and show how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrate &lt;code&gt;gpufl-client&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run a kernel&lt;/li&gt;
&lt;li&gt;Inspect generated profiling logs&lt;/li&gt;
&lt;li&gt;Interpret stall reasons and metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks for reading — this is just the beginning &lt;/p&gt;

</description>
      <category>cuda</category>
      <category>gpu</category>
      <category>cpp</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
