Forem: Myoungho Shin

Profiling a CUDA Python Program with GPUFlight

Myoungho Shin — Fri, 22 May 2026 05:35:35 +0000

In the previous post, I used a C++ CUDA example to look at memory coalescing and how memory access patterns affect GPU performance.

This time, I wanted to look at a similar performance problem from Python.

I usually write CUDA code in C++, but recently I have been spending more time with Python, especially PyTorch and Numba.

Numba is interesting because it lets you write a real GPU kernel directly in Python. You can decorate a function with @cuda.jit, launch it with kernel[grid, block](...), and Numba compiles it down to GPU machine code that runs on the actual hardware.

The good news is that GPUFlight can profile Python GPU programs as well.

In this post, I’ll profile a simple Numba matrix multiplication kernel with GPUFlight. Then I’ll read the report step by step and show how the report points to a real optimization: shared-memory tiling.

One important note before we start: this example uses GPUFlight’s deeper profiling mode with SASS-level metrics and PC sampling. So the duration numbers in the report should not be treated as clean baseline kernel timing. They include profiling overhead. The main goal here is not to benchmark Numba against an optimized library like cuBLAS. The goal is to show how GPUFlight helps explain what is happening inside the kernel.

Setup

Both GPUFlight and Numba can be installed from PyPI. On a fresh Linux machine:

sudo apt-get install -y python3.12-venv
python3 --version            # expect Python 3.12.x

python3 -m venv ~/gpufl-venv
source ~/gpufl-venv/bin/activate

pip install --upgrade pip
pip install gpufl "numba-cuda[cu13]"

python -c "import gpufl; print('gpufl', gpufl.__version__)"

You should see something like:

gpufl 1.x.x

At the time I am writing this, the version is 1.0.2.

Before using the profiler, it is a good idea to confirm that Numba can find your GPU:

python -c "from numba import cuda; print('cuda available:', cuda.is_available()); cuda.detect()"

Now we are ready to run a Python CUDA application with GPUFlight.

The sample kernel

Here is the sample code I am using:

import gpufl as gfl
from gpufl.report import generate_report
from numba import cuda
import numpy as np
import math
import os

@cuda.jit
def matmul_kernel(A, B, C):
    row, col = cuda.grid(2)

    if row < C.shape[0] and col < C.shape[1]:
        tmp = 0.0

        for k in range(A.shape[1]):
            tmp += A[row, k] * B[k, col]

        C[row, col] = tmp

LOG_PATH = "./gfl_logs"

gfl.init(
    app_name="matmul_sample",
    log_path=LOG_PATH,
    sampling_auto_start=True,
    system_sample_rate_ms=100,
    profiling_engine=gfl.ProfilingEngine.PcSamplingWithSass,
)

try:
    N = 2048

    A = cuda.to_device(np.random.rand(N, N).astype(np.float32))
    B = cuda.to_device(np.random.rand(N, N).astype(np.float32))
    C = cuda.to_device(np.zeros((N, N), dtype=np.float32))

    tpb = (16, 16)
    bpg = (math.ceil(N / tpb[0]), math.ceil(N / tpb[1]))

    with gfl.Scope("matrix_mul_compute", "math"):
        for _ in range(10):
            matmul_kernel[bpg, tpb](A, B, C)

    _ = C.copy_to_host()
    print("[OK] compute finished")

finally:
    gfl.shutdown()

    print(
        generate_report(
            os.path.dirname(LOG_PATH) or ".",
            log_prefix=os.path.basename(LOG_PATH),
            top_n=10,
        )
    )

This is a very simple matrix multiplication kernel.

Each thread computes one output element. For each element, the thread walks through one full row of A and one full column of B.

This is intentionally not optimized. I want to start with a simple kernel, because it makes the profiling report easier to understand.

Let’s run it and see what GPUFlight tells us.

===============================================================================
                           GPU Flight Session Report
                       Generated: 2026-05-22 05:05:33 UTC
===============================================================================

===============================================================================
  Session Summary
===============================================================================
  Application:          matmul_sample
  Session ID:           565d3c32-86cc-415d-8642-9c140f856f2b
  Duration:             17.91 s
  GPU Device:           NVIDIA GeForce RTX 5060 Laptop GPU
    SMs:                26
    Registers/Block:    65536

===============================================================================
  Kernel Execution Summary
===============================================================================
  Total Kernels:        10
  Unique Kernels:       1
  Total GPU Time:       17.40 s
  GPU Busy:             97.2%
  Avg Duration:         1.74 s
  Median Duration:      1.74 s
  Min Duration:         1.71 s
  Max Duration:         1.78 s

===============================================================================
  Top 10 Kernels by Total GPU Time
===============================================================================
  #   Kernel                                   Calls       Total         Avg         Max
  --------------------------------------------------------------------------------------
  1   __main__::matmul_kernel                     10     17.40 s      1.74 s      1.78 s

===============================================================================
  Kernel Details (Top 10)
===============================================================================

  __main__::matmul_kernel
  =======================
    Grid:               (128,128,1)
    Block:              (16,16,1)
    Occupancy:          100.0%
    Reg Occupancy:      100.0%
    SMem Occupancy:     100.0%
    Warp Occupancy:     100.0%
    Block Occupancy:    100.0%
    Limiting Resource:  warps
    Registers/Thread:   40
    Shared Memory:      0 B dyn + 0 B static

===============================================================================
  Memory Transfer Summary
===============================================================================
  Total Transfers:      4
  Total Bytes:          64.0 MB

  Direction      Count     Total Bytes    Avg Throughput
  ------------------------------------------------------
  HtoD               3         48.0 MB        11.68 GB/s
  DtoH               1         16.0 MB         4.40 GB/s

===============================================================================
  System Metrics
===============================================================================
  GPU Metrics:
    Utilization:        avg 96.6%  peak 100%  min 0%
    Temperature:        avg 53.4 C  peak 58 C
    Power:              avg 71.0 W  peak 75.6 W
    VRAM Usage:         peak 1105 MiB
    SM Clock:           avg 2631 MHz  peak 2790 MHz

  Host Metrics:
    CPU Utilization:    avg 8.6%  peak 29.1%
    RAM Usage:          peak 27593 / 32189 MiB (85.7%)

===============================================================================
  Scope Summary
===============================================================================
  Scope Timing:
  Scope                          Calls       Total         Avg         Max
  ------------------------------------------------------------------------
  matrix_mul_compute                 1   195.21 ms   195.21 ms   195.21 ms

  GPU Time by Scope:
  Scope                          Kernels      GPU Time         Avg
  ----------------------------------------------------------------
  matrix_mul_compute                  10       17.40 s      1.74 s

===============================================================================
  Profile / SASS Analysis
===============================================================================

  SASS Metrics Summary:
  Metric                                                   Total
  --------------------------------------------------------------
  smsp__sass_thread_inst_executed                   2235815690240
  smsp__sass_inst_executed                           69869240320
  smsp__sass_sectors_mem_global                      45654999040
  smsp__sass_sectors_mem_global_ideal                13427015680

  Thread Divergence Analysis:
    Warp Instructions:    69869240320
    Thread Instructions:  2235815690240
    Avg Threads/Warp:     32.0 / 32
    Warp Efficiency:      100.0%

Now let’s read the report carefully.

A profiling report is only useful if we can turn it into a decision. So instead of just looking at numbers, I usually ask a few questions.

1. Is the GPU actually busy?

Yes.

The report shows:

GPU Busy:             97.2%
GPU Util avg:         96.6%
Total GPU Time:       17.40 s
Duration:             17.91 s

This means the GPU was working for almost the entire run. Out of 17.91 s of wall-clock time, 17.40 s were spent running GPU kernels.

The SM clock is also boosted, averaging 2631 MHz, and power averages 71.0 W with a 75.6 W peak, which is close to the laptop GPU’s power limit.

So this is not a case where the CPU is too slow, the input data is too small, or the GPU is waiting for work. The GPU is busy.

That means if we want to improve performance, we need to look inside the kernel.

2. How long did each profiled launch take?

The report shows:

Avg Duration:         1.74 s
Median Duration:      1.74 s
Min Duration:         1.71 s
Max Duration:         1.78 s

However, this number needs to be read carefully.

This run includes deeper profiling, including SASS-level metrics and sampling. That means the measured duration includes profiling overhead. So I should not treat 1.74 s as the clean baseline runtime of the kernel.

I would not use this number alone to claim how fast or slow the raw Numba kernel is. But it is still useful as the runtime under this profiling configuration.

3. Is the problem occupancy?

Probably not.

The report shows:

Occupancy:          100.0%
Reg Occupancy:      100.0%
SMem Occupancy:     100.0%
Warp Occupancy:     100.0%
Block Occupancy:    100.0%
Limiting Resource:  warps

This tells us the GPU has enough active warps. The SMs are not sitting empty because we launched too few threads.

Occupancy is not the same thing as performance, but in this case low occupancy does not look like the main problem.

4. Is the problem thread divergence?

Also no.

The report shows:

Avg Threads/Warp:     32.0 / 32
Warp Efficiency:      100.0%

This means every warp is using all 32 threads. There is no meaningful branch divergence here.

That makes sense because the kernel is simple. The 16 x 16 block and 128 x 128 grid map cleanly to the 2048 x 2048 output matrix.
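As a quick sanity check on that mapping, the launch arithmetic can be worked through in plain Python. This is only a sketch of the index math. The variable names mirror the sample script, and nothing here touches the GPU.

```python
import math

N = 2048                 # matrix dimension from the sample
tpb = (16, 16)           # threads per block
bpg = (math.ceil(N / tpb[0]), math.ceil(N / tpb[1]))

threads_per_block = tpb[0] * tpb[1]
warps_per_block = threads_per_block // 32
launched = (bpg[0] * tpb[0], bpg[1] * tpb[1])

# Threads that fall outside the matrix would fail the bounds check
# and sit idle while the rest of their warp runs the loop.
out_of_range = launched[0] * launched[1] - N * N

print(bpg, warps_per_block, out_of_range)   # (128, 128) 8 0
```

With zero out-of-range threads, the bounds check never splits a warp, which is consistent with the 100% warp efficiency in the report.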

So far, the report says:

The GPU is busy.
Occupancy is high.
Warp efficiency is perfect.

So now we need to look at memory behavior.

5. What do the memory sectors say?

This is the most useful part of the report:

SASS Metrics Summary:
Metric                                                   Total
--------------------------------------------------------------
smsp__sass_thread_inst_executed                   2235815690240
smsp__sass_inst_executed                           69869240320
smsp__sass_sectors_mem_global                      45654999040
smsp__sass_sectors_mem_global_ideal                13427015680

The important two numbers are:

smsp__sass_sectors_mem_global          45,654,999,040
smsp__sass_sectors_mem_global_ideal    13,427,015,680

The kernel is accessing about 45.7B global memory sectors, while the ideal number is about 13.4B.

That is roughly:

45.7 / 13.4 ≈ 3.4x

So the kernel is moving about 3.4x more global memory traffic than the ideal case.

Another way to read it:

13.4 / 45.7 ≈ 29%

The memory access efficiency is only around 29%.
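The same two ratios can be computed directly from the counters. The helper below is a minimal sketch; `memory_efficiency` is a name made up for this post, not part of the GPUFlight API.

```python
def memory_efficiency(actual_sectors, ideal_sectors):
    """Return (overhead factor, efficiency) for a pair of sector counters."""
    overhead = actual_sectors / ideal_sectors
    efficiency = ideal_sectors / actual_sectors
    return overhead, efficiency

actual = 45_654_999_040   # smsp__sass_sectors_mem_global
ideal = 13_427_015_680    # smsp__sass_sectors_mem_global_ideal

overhead, efficiency = memory_efficiency(actual, ideal)
print(f"{overhead:.1f}x traffic, {efficiency:.0%} efficiency")
# prints: 3.4x traffic, 29% efficiency
```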

This is the real story.

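The gap between actual and ideal sectors is mainly a coalescing problem. In Numba, cuda.grid(2) returns (x, y), so in this kernel row comes from the x index. Adjacent threads in a warp therefore read A[row, k] from different rows, a full 2048 floats apart, and each of those reads lands in its own 32-byte sector. On top of that, the naive kernel makes each thread re-read values from global memory. Many threads need overlapping data from A and B, but the kernel does not reuse that data. So the same data crosses the memory system again and again.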

The GPU is busy, the warps are full, and the lanes are active. But the memory access pattern is wasteful.
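One way to see where the 3.4x comes from is to count sectors per warp by hand. The sketch below is a back-of-the-envelope model, not something GPUFlight prints. It assumes 32-byte sectors, warps built from 32 consecutive threads along x first, and that Numba's cuda.grid(2) returns (x, y), which puts row on the x index in this kernel. The "ideal" values per access are inferred from the totals rather than taken from documentation.

```python
N = 2048
LAUNCHES = 10
BLOCK_X = 16
WARP, SECTOR_BYTES, FLOAT_BYTES = 32, 32, 4

warps = (N * N) // WARP          # warps per launch, one thread per output
rows_per_warp = BLOCK_X          # a warp covers 16 x values (rows) and 2 y values

# A[row, k]: 16 distinct rows, each a full matrix row apart -> 16 sectors.
a_actual = rows_per_warp
# B[k, col]: two adjacent floats in the same row -> 1 sector.
b_actual = 1
# C[row, col] store: again one sector per distinct row.
c_actual = rows_per_warp

# Assumed ideal: 32 lanes x 4 bytes packed contiguously = 4 sectors.
a_ideal = WARP * FLOAT_BYTES // SECTOR_BYTES
b_ideal = 1
c_ideal = a_ideal

actual = warps * (N * (a_actual + b_actual) + c_actual) * LAUNCHES
ideal = warps * (N * (a_ideal + b_ideal) + c_ideal) * LAUNCHES
print(actual, ideal)   # 45654999040 13427015680
```

Under these assumptions the model reproduces both counters from the report, and the per-iteration ratio is 17 / 5 = 3.4.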

6. The fix: shared-memory tiling

For this kind of matrix multiplication kernel, the classic fix is shared-memory tiling.

Instead of letting each thread repeatedly read everything from global memory, each block cooperatively loads a tile of A and a tile of B into shared memory. Then the threads reuse those values many times before loading the next tile.
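The reuse idea can be sketched on the CPU before looking at the GPU version. The pure-Python example below is only an illustration under simplifying assumptions: the lists sA and sB stand in for shared memory, the loads counter stands in for global memory reads, and the function names are made up for this sketch.

```python
def matmul_naive(A, B):
    """Reference: every output element re-reads a full row and column."""
    n, k, m = len(A), len(B), len(B[0])
    loads = 0
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
                loads += 2            # one element of A, one of B
    return C, loads

def matmul_tiled(A, B, tile):
    """Same product, but inputs are fetched one tile at a time."""
    n, k, m = len(A), len(B), len(B[0])
    loads = 0
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Stand-ins for the shared-memory tiles sA and sB.
                sA = [row[p0:p0 + tile] for row in A[i0:i0 + tile]]
                sB = [row[j0:j0 + tile] for row in B[p0:p0 + tile]]
                loads += sum(len(r) for r in sA) + sum(len(r) for r in sB)
                # Each tile value is reused by a whole row or column of outputs.
                for i, a_row in enumerate(sA):
                    for j in range(len(sB[0])):
                        for p, a in enumerate(a_row):
                            C[i0 + i][j0 + j] += a * sB[p][j]
    return C, loads

n = 32
A = [[(3 * i + j) % 7 for j in range(n)] for i in range(n)]
B = [[(i + 5 * j) % 11 for j in range(n)] for i in range(n)]

C_ref, naive_loads = matmul_naive(A, B)
C_tiled, tiled_loads = matmul_tiled(A, B, tile=16)
print(C_ref == C_tiled, naive_loads // tiled_loads)   # True 16
```

With a 16 x 16 tile, each loaded value is used 16 times, so the number of input reads drops by a factor of 16 in this simplified model.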

Here is the improved kernel:

from numba import cuda, float32

TPB = 16

@cuda.jit
def matmul_kernel_perf(A, B, C):
    sA = cuda.shared.array((TPB, TPB), dtype=float32)
    sB = cuda.shared.array((TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    tmp = float32(0.0)

    n_tiles = (A.shape[1] + TPB - 1) // TPB

    for i in range(n_tiles):
        sA[ty, tx] = 0.0
        sB[ty, tx] = 0.0

        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]

        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        cuda.syncthreads()

        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        cuda.syncthreads()

    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

Now let’s run the same profiling mode again.

===============================================================================
                           GPU Flight Session Report
                       Generated: 2026-05-22 05:20:40 UTC
===============================================================================

===============================================================================
  Session Summary
===============================================================================
  Application:          matmul_sample_perf
  Session ID:           d44e5478-ba19-4cd1-b3cf-f6d31ab8b0ca
  Duration:             2.90 s
  GPU Device:           NVIDIA GeForce RTX 5060 Laptop GPU
    SMs:                26
    Registers/Block:    65536

===============================================================================
  Kernel Execution Summary
===============================================================================
  Total Kernels:        10
  Unique Kernels:       1
  Total GPU Time:       2.22 s
  GPU Busy:             76.4%
  Avg Duration:         221.64 ms
  Median Duration:      216.89 ms
  Min Duration:         215.38 ms
  Max Duration:         250.06 ms

===============================================================================
  Top 10 Kernels by Total GPU Time
===============================================================================
  #   Kernel                                   Calls       Total         Avg         Max
  --------------------------------------------------------------------------------------
  1   __main__::matmul_kernel_perf                10      2.22 s   221.64 ms   250.06 ms

===============================================================================
  Kernel Details (Top 10)
===============================================================================

  __main__::matmul_kernel_perf
  ============================
    Grid:               (128,128,1)
    Block:              (16,16,1)
    Occupancy:          100.0%
    Reg Occupancy:      100.0%
    SMem Occupancy:     100.0%
    Warp Occupancy:     100.0%
    Block Occupancy:    100.0%
    Limiting Resource:  warps
    Registers/Thread:   37
    Shared Memory:      0 B dyn + 2.0 KB static

===============================================================================
  Memory Transfer Summary
===============================================================================
  Total Transfers:      4
  Total Bytes:          64.0 MB

  Direction      Count     Total Bytes    Avg Throughput
  ------------------------------------------------------
  HtoD               3         48.0 MB         9.87 GB/s
  DtoH               1         16.0 MB         4.45 GB/s

===============================================================================
  System Metrics
===============================================================================
  GPU Metrics:
    Utilization:        avg 74.9%  peak 100%  min 0%
    Temperature:        avg 43.0 C  peak 48 C
    Power:              avg 51.0 W  peak 76.1 W
    VRAM Usage:         peak 958 MiB
    SM Clock:           avg 2180 MHz  peak 2812 MHz

  Host Metrics:
    CPU Utilization:    avg 16.0%  peak 46.0%
    RAM Usage:          peak 27019 / 32189 MiB (83.9%)

===============================================================================
  Scope Summary
===============================================================================
  Scope Timing:
  Scope                          Calls       Total         Avg         Max
  ------------------------------------------------------------------------
  matrix_mul_compute_perf            1   330.58 ms   330.58 ms   330.58 ms

  GPU Time by Scope:
  Scope                          Kernels      GPU Time         Avg
  ----------------------------------------------------------------
  matrix_mul_compute_perf             10        2.22 s   221.64 ms

===============================================================================
  Profile / SASS Analysis
===============================================================================

  SASS Metrics Summary:
  Metric                                                   Total
  --------------------------------------------------------------
  smsp__sass_thread_inst_executed                   298005299200
  smsp__sass_inst_executed                            9312665600
  smsp__sass_sectors_mem_global                       1347420160
  smsp__sass_sectors_mem_global_ideal                 1347420160

  Thread Divergence Analysis:
    Warp Instructions:    9312665600
    Thread Instructions:  298005299200
    Avg Threads/Warp:     32.0 / 32
    Warp Efficiency:      100.0%

The result is much better under the same profiling configuration.

The full session duration goes down from 17.91 s to 2.90 s.

Total GPU time goes down from 17.40 s to 2.22 s.

The average profiled kernel duration goes down from 1.74 s to 221.64 ms.

Again, these are still profiled durations, not clean baseline timings. But because both runs use the same deep profiling mode, this comparison is still useful. It tells us the tiled version behaves much better under the same measurement setup.

7. What changed?

The most important change is in the memory-sector metrics.

Naive version:

smsp__sass_sectors_mem_global          45,654,999,040
smsp__sass_sectors_mem_global_ideal    13,427,015,680

Tiled version:

smsp__sass_sectors_mem_global           1,347,420,160
smsp__sass_sectors_mem_global_ideal     1,347,420,160

In the naive kernel, actual global memory sectors were about 3.4x higher than ideal.

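In the tiled kernel, actual and ideal global memory sectors are the same. The tiled kernel takes its column from the x index and its row from the y index, so adjacent threads load adjacent floats and every global access is coalesced.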

That is exactly what we wanted to see.

The optimized kernel also uses shared memory:

Shared Memory:      0 B dyn + 2.0 KB static

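That means each block is now reusing data through shared memory instead of repeatedly pulling the same values from global memory. The reuse shows up in the ideal count too: it fell from 13.4B to 1.35B sectors, about 10x, because each block now loads a tile once and shares it across its threads.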
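The tiled sector count can also be reproduced by hand. This is a rough model, assuming 32-byte sectors and warps formed from 32 consecutive threads along x first, so that one warp covers two rows of a 16 x 16 tile.

```python
N = 2048
TPB = 16
LAUNCHES = 10
WARP, SECTOR_BYTES, FLOAT_BYTES = 32, 32, 4

warps = (N * N) // WARP
n_tiles = N // TPB                 # 128 tiles along the k dimension
rows_per_warp = WARP // TPB        # a warp spans 2 values of ty

# Each row segment is 16 consecutive floats = 64 bytes = 2 sectors.
sectors_per_segment = TPB * FLOAT_BYTES // SECTOR_BYTES
tile_load = rows_per_warp * sectors_per_segment   # 4 sectors, for A and again for B
store = rows_per_warp * sectors_per_segment       # 4 sectors for the C write

total = warps * (n_tiles * 2 * tile_load + store) * LAUNCHES
print(total)   # 1347420160
```

Every access in this model is already contiguous, which is why actual and ideal agree.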

Instruction count also drops a lot:

Naive thread instructions:  2,235,815,690,240
Tiled thread instructions:    298,005,299,200

So the optimized kernel is not only reducing memory traffic. It is also executing about 7.5x fewer thread instructions.
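To put the two reports side by side, the ratios can be computed in a few lines. The numbers are copied from the reports above; the dictionary keys are just labels chosen for this sketch.

```python
naive = {"gpu_time_s": 17.40, "sectors": 45_654_999_040,
         "ideal_sectors": 13_427_015_680, "thread_insts": 2_235_815_690_240}
tiled = {"gpu_time_s": 2.22, "sectors": 1_347_420_160,
         "ideal_sectors": 1_347_420_160, "thread_insts": 298_005_299_200}

# Ratio of naive to tiled for every metric.
ratios = {key: naive[key] / tiled[key] for key in naive}
for key, value in ratios.items():
    print(f"{key:>14}: {value:5.1f}x lower in the tiled run")
```

This works out to roughly 7.8x less profiled GPU time, 33.9x fewer actual sectors, 10.0x fewer ideal sectors, and 7.5x fewer thread instructions.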

Summary

This example is not a full benchmark. I am not comparing Numba against cuBLAS, and I am not claiming these numbers are the raw kernel runtimes. The run uses SASS-level profiling and sampling, so there is overhead.

But the report is still useful because both versions were measured with the same profiling mode. More importantly, the report explains why the naive kernel is slow.

The first version had:

high GPU utilization,
100% occupancy,
100% warp efficiency,
but very inefficient global memory access.

That means the problem was not lack of work or branch divergence. The problem was the memory access pattern.

After changing the kernel to use shared-memory tiling:

total profiled GPU time dropped from 17.40 s to 2.22 s,
average profiled kernel time dropped from 1.74 s to 221.64 ms,
global memory sectors dropped from 45.65B to 1.35B,
and actual global memory sectors matched the ideal number.

So the main takeaway is not just “the optimized kernel is faster.”

The more important takeaway is that GPUFlight helped point to the right fix. The report showed that the naive kernel was wasting memory bandwidth, and the optimized version confirmed that shared-memory tiling reduced that waste.

That is the workflow I want GPUFlight to support:

Run your program normally, collect useful GPU metrics, and turn the report into a concrete optimization decision.

Memory Coalescing: Same computation, 6x Performance Difference

Myoungho Shin — Thu, 09 Apr 2026 18:39:47 +0000

In software engineering, if two approaches are both O(n), that is often good enough for the discussion.
But in low-level or performance engineering, that is not the end of the story. Even when two algorithms have the same time complexity, the actual performance can be very different depending on how they access memory.

A simple example is iterating through an array versus a linked list. Both are O(n), but arrays are usually much faster in practice because their memory layout is contiguous, which allows the CPU to use caches much more efficiently.

The same idea applies on GPUs too, but the effect is often much bigger because many threads are accessing memory at the same time.

What is Memory Coalescing?

On NVIDIA GPUs, threads execute in groups called warps, which contain 32 threads.

When those threads access memory in a well-structured way, the GPU can combine their requests into a small number of memory transactions. That is called memory coalescing.

When the access pattern is poor, the opposite happens. Instead of serving the whole warp efficiently, the GPU ends up issuing many separate memory transactions. That wastes bandwidth and increases latency.

So the idea is simple: neighboring threads should access neighboring memory whenever possible.

Measuring It in Practice

The concept itself is well known, but measuring it in real code is not always convenient.

Tools like NVIDIA Nsight Compute usually require attaching a profiler and replaying kernels. That is fine for deep analysis, but it is not something you continuously leave on during normal execution.

With GPUFlight, I wanted to measure this kind of issue continuously during normal runs, without a debugger and without replaying the kernel.

The Setup: Two Matmul Kernels

For this example, I used two simple matrix multiplication kernels:

C = A × B

Both kernels compute the exact same result. The only difference is how the work is assigned to threads.

Row-per-thread

Each thread computes one row of the output matrix:

__global__ void matmul_row_per_thread(const float* A, const float* B,
                                      float* C, int M, int K, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M) return;

    for (int col = 0; col < N; col++) {
        float sum = 0.0f;
        for (int i = 0; i < K; i++)
            sum += A[row * K + i] * B[i * N + col];
        C[row * N + col] = sum;
    }
}

Col-per-thread

Each thread computes one column of the output matrix:

__global__ void matmul_col_per_thread(const float* A, const float* B,
                                      float* C, int M, int K, int N) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= N) return;

    for (int row = 0; row < M; row++) {
        float sum = 0.0f;
        for (int i = 0; i < K; i++)
            sum += A[row * K + i] * B[i * N + col];
        C[row * N + col] = sum;
    }
}

Same math. Same number of floating-point operations. The only difference is which dimension maps to threadIdx.x.

That small mapping change turns out to matter a lot.

Why It Matters: How GPUs Read Memory

A GPU does not read one float at a time in the way people often imagine. When a warp executes a load instruction, the hardware tries to combine the addresses from all 32 threads into as few memory transactions as possible.

In the best case, all 32 threads access consecutive floats, and the warp can be served efficiently.

In the worst case, each thread touches a different cache line, so the GPU ends up issuing many separate transactions. Most of the fetched data is not even used by that warp.

That is exactly what happens here.

In matmul_row_per_thread, adjacent threads (thread 0, 1, 2, ...) are assigned rows 0, 1, 2, .... When they read A[row * K + i], thread 0 reads address 0*K + i and thread 1 reads 1*K + i — these are K floats apart. With K=256, that's a stride of 1024 bytes between adjacent threads. Every thread hits a different cache line.

In matmul_col_per_thread, adjacent threads access columns 0, 1, 2, .... When they read B[i * N + col], thread 0 reads i*N + 0 and thread 1 reads i*N + 1 — consecutive addresses. One cache line serves all 32 threads.
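The difference between the two access patterns can be checked by counting the 32-byte sectors one warp touches. This is a small sketch, assuming 4-byte floats, sector-aligned base pointers, and N = 256 (only K = 256 is stated above, so N is an assumption here).

```python
SECTOR_BYTES = 32
FLOAT_BYTES = 4
WARP = 32

def sectors_touched(element_indices):
    """Count distinct 32-byte sectors hit by one warp-wide float load."""
    return len({(idx * FLOAT_BYTES) // SECTOR_BYTES for idx in element_indices})

K = N = 256
i = 0   # any loop iteration gives the same count

# matmul_row_per_thread: lane t reads A[t * K + i]
strided = sectors_touched(t * K + i for t in range(WARP))
# matmul_col_per_thread: lane t reads B[i * N + t]
consecutive = sectors_touched(i * N + t for t in range(WARP))

print(strided, consecutive)   # 32 4
```

The strided load needs 32 sectors to deliver 32 floats, while the consecutive load needs only 4, which is one 128-byte cache line.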

Measuring with GPUFlight

GPUFlight instruments your CUDA application using CUPTI's SASS metrics and PC sampling APIs. You add a few lines to your code:

#include "gpufl/gpufl.hpp"

int main() {
    gpufl::InitOptions opts;
    opts.app_name = "memory_coalescing_demo";
    opts.profiling_engine = gpufl::ProfilingEngine::PcSamplingWithSass;
    gpufl::init(opts);

    GFL_SCOPE("row-per-thread") {
        matmul_row_per_thread<<<blocks, threads>>>(d_A, d_B, d_C, M, K, N);
    }

    GFL_SCOPE("col-per-thread") {
        matmul_col_per_thread<<<blocks, threads>>>(d_A, d_B, d_C, M, K, N);
    }

    gpufl::shutdown();
    gpufl::generateReport();
}

GPUFlight collects data during normal execution — no debugger, no replay, no kernel serialization.

The Results

Here's the report from an RTX 5060 (Blackwell, sm_120):

  matmul_row_per_thread  (13,268 stall samples)
  ------------------------------------------------------------------
    Stalls:
      Wait                           4,592   34.6%  #######
      Wait (idle)                    4,298   32.4%  ######
      Long Scoreboard                1,441   10.9%  ##
      Long Scoreboard (idle)         1,376   10.4%  ##
      Branch Resolving                 459    3.5%  #
      Selected                         351    2.6%  #
    Instructions:
      Warp Insts:                 12,042,560
      Thread Insts:              385,361,920
      Warp Efficiency:            32.0 / 32 (100.0%)
    Memory:
      Global Sectors:             69,468,160
      Ideal Sectors:              10,518,528
      Memory Efficiency:               15.1%
    Hints:
      * Low memory efficiency (15%) — consider coalesced access
        patterns or shared memory tiling.

  matmul_col_per_thread
  ------------------------------------------------------------------
    Instructions:
      Warp Insts:                 10,428,736
      Thread Insts:              333,719,552
      Warp Efficiency:            32.0 / 32 (100.0%)
    Memory:
      Global Sectors:             10,518,528
      Ideal Sectors:              10,518,528
      Memory Efficiency:              100.0%

Breaking Down the Numbers

Memory Efficiency: 15% vs 100%

This is the main number to look at. GPUFlight measures two things per kernel:

Global Sectors: actual 32-byte memory sectors transferred
Ideal Sectors: minimum sectors needed if every access were perfectly coalesced

Kernel            Actual Sectors    Ideal Sectors    Efficiency    Waste
------------------------------------------------------------------------
Row-per-thread        69,468,160       10,518,528         15.1%     6.6×
Col-per-thread        10,518,528       10,518,528        100.0%     1.0×

The row-per-thread kernel transfers 6.6× as much data as necessary. On the strided accesses, each thread pulls in its own 32-byte sector to use a single 4-byte float.

Stall Analysis: Where the Time Goes

PC sampling tells us what each warp was doing when sampled:

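Wait (34.6%) + Wait (idle) (32.4%) = 67%
Long Scoreboard (10.9%) + Long Scoreboard (idle) (10.4%) = 21.3%

That means that for a large portion of the samples, the warps are not doing useful math. They are mostly waiting, and the Long Scoreboard samples are the ones tied most directly to outstanding memory loads.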

This is the part I like most about seeing the data together: the memory inefficiency is not just an abstract metric. You can see it show up directly in the stall breakdown.

The col-per-thread kernel has so few stalls that PC sampling accumulates very little data there. It simply finishes too quickly.

Wall-Clock Impact

Row-per-thread (uncoalesced): 245 ms
Col-per-thread (coalesced):   155 ms
Speedup: 1.6×

The coalesced version is 1.6× faster on this setup.

That is already a meaningful gain, and this is from a very small change in how work is mapped to threads.

Warp Efficiency Can Be Misleading

Both kernels show 100% warp efficiency (32/32 active threads). That means there is no thread divergence here. Every thread in each warp follows the same control flow.

If you only looked at warp efficiency, both kernels would look healthy.

But they are not equally healthy. The real problem is memory access, and memory efficiency exposes it immediately.

What GPUFlight Collects Under the Hood

GPUFlight uses two CUPTI mechanisms that run during normal execution:

SASS Metrics — The GPU binary is patched at load time to count per-instruction execution, thread activity, and memory sector usage. This is how we get the Global Sectors and Ideal Sectors numbers. No sampling bias — every instruction is counted.

PC Sampling — The hardware periodically interrupts each SM and records what every warp is doing: executing, or stalled and why. This gives us the stall reason distribution (Wait, Long Scoreboard, etc.).

GPUFlight also disassembles the GPU binary (SASS assembly) so you can see exactly which instructions are hot:

/*0x2a0*/ LDG.E.CONSTANT R20, desc[UR12][R18.64]   ← memory load (hot!)
/*0x2c0*/ LDG.E.CONSTANT R22, desc[UR12][R16.64]   ← memory load (hot!)
/*0x340*/ FFMA R35, R20, R21, R37                   ← fused multiply-add

The LDG.E.CONSTANT instructions are the global memory loads. In the row-per-thread kernel, these are where 67% of the time is spent waiting.

The Fix Is One Line

The entire difference between 15% and 100% memory efficiency comes down to which dimension you assign to threadIdx.x:

- int row = blockIdx.x * blockDim.x + threadIdx.x;  // threads map to rows
+ int col = blockIdx.x * blockDim.x + threadIdx.x;  // threads map to columns

That's it. Same algorithm, same math, same number of operations. Just a different mapping of threads to data.

Try It Yourself

The complete example is available as memory_coalescing_demo.cu in the GPUFlight client repository. To run it:

# Build with GPUFlight
cmake -B build -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build build --target memory_coalescing_demo

# Run (admin/root for PC sampling on some platforms)
./build/example/cuda/memory_coalescing_demo

Final Thought

Memory coalescing is one of those concepts that sounds simple when explained in theory, but it becomes much more convincing when you can see the numbers in a real kernel.

In this example, it is not a tiny optimization. It is the difference between 15% and 100% memory efficiency, 6.6× more memory traffic than necessary, and a 1.6× wall-clock slowdown.

That is why memory access patterns matter so much on GPUs.

Detecting Thread Divergence with SASS Metrics and GPU Flight

Myoungho Shin — Tue, 10 Mar 2026 06:54:51 +0000

In the previous post I showed how to set up GPU Flight with Python and read kernel-level profiling data — occupancy, register counts, and resource bottlenecks. That tells you how well a kernel uses the hardware. But it doesn't tell you what's happening inside the kernel.

Today I want to look at one specific problem: thread divergence. When threads within a warp take different code paths, the GPU serializes execution — it runs one branch, then the other, while idle threads wait. If half the threads branch left and half branch right, you're running at 50% efficiency on those instructions.

GPU Flight's SASS Metrics engine gives you a direct way to measure this. It instruments the GPU at the assembly (SASS) level and reports two key counters per instruction:

smsp__sass_inst_executed — the number of warp-level instruction executions
smsp__sass_thread_inst_executed — the total number of thread-level instruction executions

The ratio thread_inst_executed / inst_executed tells you the average number of active threads per warp at each instruction (divide it by 32 if you want a utilization fraction). If it's 32.0, every thread was active. If it's 16.0, half the lanes were masked off. If it's 8.0, only a quarter were doing useful work.
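
As a minimal sketch of that ratio, here is the per-instruction calculation in Python. The counter values are hypothetical and the helper name is mine, not part of GPU Flight.

```python
WARP_SIZE = 32


def active_threads(thread_inst_executed, inst_executed):
    """Average active threads per warp for one SASS instruction."""
    if inst_executed == 0:
        return 0.0  # the instruction never ran
    return thread_inst_executed / inst_executed


# Hypothetical counters for two instructions, each issued by 100 warps.
full = active_threads(thread_inst_executed=3200, inst_executed=100)
half = active_threads(thread_inst_executed=1600, inst_executed=100)

print(f"full width: {full} threads, utilization {full / WARP_SIZE:.0%}")
print(f"diverged:   {half} threads, utilization {half / WARP_SIZE:.0%}")
```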

The Demo: Five Divergence Patterns

I wrote a small CUDA program with five kernels, each demonstrating a different divergence pattern. The full source is in the GPU Flight repo at example/cuda/sass_divergence_demo.cu. Here's a summary:

Kernel	Pattern	Expected Active Threads
`uniformWork`	No divergence (baseline)	32
`branchByWarpLane`	`if (threadIdx.x % 2)` — even/odd split	16 in each branch
`branchByWarpQuad`	`if (threadIdx.x % 4 == 0)` — 1-in-4	8 in hot path
`earlyExit`	Data-dependent early return	Varies (~16)
`indirectBranch`	4-way switch on random data	Varies (~8)

Each kernel is wrapped in a GFL_SCOPE so GPU Flight can attribute the SASS metrics to the right section.

Kernel 1: Uniform Work (Baseline)

__global__
void uniformWork(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float val = in[idx];
        for (int i = 0; i < 512; ++i) {
            val = val * 1.01f + 0.001f;
        }
        out[idx] = val;
    }
}

Every thread does the same thing. No branches inside the loop, no divergence. This is the baseline: you should see thread_inst_executed / inst_executed close to 32 for the loop body instructions.

Kernel 2: Even/Odd Divergence

__global__
void branchByWarpLane(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float val = in[idx];
        if (threadIdx.x % 2 == 0) {
            for (int i = 0; i < 512; ++i)
                val = val * 1.01f + 0.001f;
        } else {
            for (int i = 0; i < 512; ++i)
                val = val + 0.001f * (float)i;
        }
        out[idx] = val;
    }
}

This is the classic divergence example. Within every warp, 16 threads go left, 16 go right. The GPU executes both paths sequentially with half the threads masked off each time. The SASS metrics will show ~16 active threads for instructions inside each branch.

Kernel 3: Quad Divergence

__global__
void branchByWarpQuad(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float val = in[idx];
        if (threadIdx.x % 4 == 0) {
            for (int i = 0; i < 2048; ++i)
                val = val * 1.001f + 0.0001f;
        }
        out[idx] = val;
    }
}

Only every 4th thread enters the loop. That's 8 out of 32 threads doing the heavy work while 24 sit idle. Worse than 50/50 — 75% of the warp is wasted during the loop body.

Kernel 4: Early Exit

__global__
void earlyExit(float* out, const float* in, float threshold, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float val = in[idx];
        if (val < threshold) {
            out[idx] = val;
            return;
        }
        for (int i = 0; i < 1024; ++i)
            val = val * 1.01f - 0.005f;
        out[idx] = val;
    }
}

This is data-dependent. Threads whose input is below the threshold return early, while the rest do the expensive computation. With random inputs in [0, 1) and a threshold of 0.5, roughly half the threads will exit early. But unlike Kernel 2, the split isn't uniform across warps — some warps might have 20 threads exit, others might have 10.
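
To get a feel for how uneven the split is from warp to warp, here is a small CPU-side simulation. It is only a sketch: it assumes uniform random inputs in [0, 1) and a 0.5 threshold, as described above, and uses Python's random module rather than whatever generator the demo uses.

```python
import random

WARP_SIZE = 32


def survivors_per_warp(num_warps, threshold=0.5, seed=0):
    """Simulate how many threads per warp stay active after the early return."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_warps):
        values = [rng.random() for _ in range(WARP_SIZE)]
        # Threads with val < threshold return early; the rest run the loop.
        counts.append(sum(1 for v in values if v >= threshold))
    return counts


counts = survivors_per_warp(num_warps=10_000)
mean = sum(counts) / len(counts)
print(f"mean active threads in the loop: {mean:.1f}")
print(f"least / most active warp: {min(counts)} / {max(counts)}")
```

The mean lands near 16, but individual warps spread out on both sides of it, which is the difference from the fixed even/odd split in Kernel 2.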

Kernel 5: Data-Dependent Switch

__global__
void indirectBranch(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float val = in[idx];
        int category = (int)(val * 4.0f) % 4;
        switch (category) {
            case 0: for (int i = 0; i < 256; ++i) val = val * 1.01f; break;
            case 1: for (int i = 0; i < 256; ++i) val = val + 0.01f; break;
            case 2: for (int i = 0; i < 256; ++i) val = val - 0.005f; break;
            case 3: for (int i = 0; i < 256; ++i) val = val * 0.99f; break;
        }
        out[idx] = val;
    }
}

A 4-way branch driven by random data. On average, each case gets ~8 threads per warp, but the GPU must execute all 4 paths sequentially. This is the worst case — 4x the instruction count for the branch body.
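
The "~8 threads per case, all 4 paths executed" estimate can be checked with a short simulation. This is a source-level sketch under the same assumption of uniform random inputs in [0, 1). It models only which switch case each lane picks, not the SASS the compiler actually generates.

```python
import random
from collections import Counter

WARP_SIZE = 32


def simulate_switch(num_warps, seed=0):
    """Return (paths taken per warp, threads on each taken path)."""
    rng = random.Random(seed)
    paths_per_warp = []
    threads_per_path = []
    for _ in range(num_warps):
        # Same selector as the kernel: (int)(val * 4.0f) % 4
        cases = Counter(int(rng.random() * 4.0) % 4 for _ in range(WARP_SIZE))
        paths_per_warp.append(len(cases))
        threads_per_path.extend(cases.values())
    return paths_per_warp, threads_per_path


paths, threads = simulate_switch(num_warps=10_000)
avg_paths = sum(paths) / len(paths)
avg_threads = sum(threads) / len(threads)
print(f"paths executed per warp: {avg_paths:.2f}")
print(f"threads active per path: {avg_threads:.2f}")
```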

Running the Demo

Build and run from the GPU Flight repo:

git clone https://github.com/gpu-flight/gpufl-client.git
cd gpufl-client
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target sass_divergence_demo
./build/example/cuda/sass_divergence_demo

The key part is in main() — initializing GPU Flight with the SASS Metrics engine:

gpufl::InitOptions opts;
opts.app_name = "sass_divergence_demo";
opts.log_path = "sass_divergence";
opts.enable_kernel_details = true;
opts.sampling_auto_start = true;
opts.profiling_engine = gpufl::ProfilingEngine::SassMetrics;

gpufl::init(opts);

Setting profiling_engine to SassMetrics tells GPU Flight to instrument every kernel at the SASS level. Each GFL_SCOPE block then collects per-instruction counters for the kernels launched inside it.

Results: RTX 3090

Here's what I got running on an NVIDIA GeForce RTX 3090 (Ampere, SM 8.6, 82 SMs) with 1M elements:

Kernel                    Weighted Avg Active Threads    Instructions
----------------------------------------------------------------------
uniformWork                                      32.0             277
branchByWarpLane                                 16.3             796
branchByWarpQuad                                  8.2             281
earlyExit                                        16.2             280
indirectBranch                                    1.5            1062

The "Weighted Avg Active Threads" is thread_inst_executed / inst_executed across all SASS instructions in each kernel, weighted by execution count. "Instructions" is the number of unique PC offsets (SASS instructions) instrumented.
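
For illustration, the weighted average can be computed like this. The counter values below are hypothetical, chosen to mimic a kernel with two half-width loop bodies and a few full-width instructions around them. They are not taken from the actual report.

```python
def weighted_avg_active_threads(samples):
    """samples: list of (inst_executed, thread_inst_executed), one per SASS instruction.

    Dividing the two totals is the same as averaging the per-instruction
    ratios weighted by how often each instruction executed.
    """
    total_inst = sum(inst for inst, _ in samples)
    total_thread = sum(thread for _, thread in samples)
    return total_thread / total_inst if total_inst else 0.0


samples = [
    (1_000, 32_000),        # index math and bounds check, 32 lanes active
    (512_000, 8_192_000),   # loop body of one branch, 16 lanes active
    (512_000, 8_192_000),   # loop body of the other branch, 16 lanes active
    (1_000, 32_000),        # final store, 32 lanes active
]
print(f"{weighted_avg_active_threads(samples):.2f}")
```

Because the loop bodies dominate the execution count, the result sits just above 16: the few full-width instructions pull it up only slightly.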

Let's walk through what this tells us:

uniformWork — 32.0 active threads. Perfect. Every warp runs at full width. This is the expected baseline for a kernel with no divergence.

branchByWarpLane — 16.3 active threads. Very close to the theoretical 16. The slight overshoot comes from instructions outside the branch (the if (idx < n) guard, loop control, and the final store) where all 32 threads are active. The 796 unique instructions — nearly 3x the baseline — show the cost: the compiler generates separate code for each branch, and both paths must be executed.

branchByWarpQuad — 8.2 active threads. Again close to the theoretical 8 (only 1 in 4 threads enters the loop). Similar instruction count to the baseline since there's only one branch path — but every instruction in the hot loop runs with 75% of threads idle.

earlyExit — 16.2 active threads. Matches the expectation for a 50% threshold with random data. Threads that exit early become inactive for the remaining instructions.

indirectBranch — 1.5 active threads, 1062 instructions. This is the most striking result. A 4-way switch on random data drops the weighted average to just 1.5 active threads per warp — far worse than the other kernels. It also generates the highest instruction count at 1062, nearly 4x the baseline. This is a crucial insight: divergence doesn't just halve your throughput — multi-way branching on random data can drop you below 5% when measured at the instruction level.
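
To put the table in percentage terms, divide each weighted average by the warp width of 32. The numbers below are copied from the results table; the snippet only does the arithmetic.

```python
WARP_SIZE = 32

# Weighted average active threads, from the RTX 3090 results table.
results = {
    "uniformWork": 32.0,
    "branchByWarpLane": 16.3,
    "branchByWarpQuad": 8.2,
    "earlyExit": 16.2,
    "indirectBranch": 1.5,
}

utilization = {name: avg / WARP_SIZE for name, avg in results.items()}
for name, frac in utilization.items():
    print(f"{name:<18} {frac:6.1%}")
```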

What This Means in Practice

Thread divergence is easy to create and hard to notice. Your kernel still produces correct results. But you might be leaving 50-95% of your GPU's compute on the table.

Here are the common patterns to watch for:

Lane-based branching — if (threadIdx.x % N). This is almost always unintentional. Consider rearranging your data so that threads within a warp take the same path.

Data-dependent branches — like the earlyExit kernel. If your input distribution is skewed, some warps diverge heavily while others don't. The average might look okay, but the worst warps are bottlenecks.

Switch statements on computed values — like indirectBranch. This was the worst offender in our test — each additional case multiplies the predicated instruction overhead.

The fix depends on the situation:

Sort or bin your data so threads in the same warp hit the same branch
Replace branches with predicated arithmetic — branchless code runs all threads at full width
Restructure your algorithm so the branch happens at the warp or block level, not the thread level
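
The first fix, binning the data, is easy to sanity-check on the CPU. The sketch below reuses the switch selector from indirectBranch and counts how many distinct cases each group of 32 consecutive elements would take, before and after grouping the input by case. It assumes uniform random inputs and is a model of the idea, not a GPU implementation.

```python
import random

WARP_SIZE = 32


def category(val):
    # Mirrors the switch selector in indirectBranch.
    return int(val * 4.0) % 4


def avg_paths_per_warp(values):
    """Average number of distinct switch cases taken within each warp."""
    warps = [values[i:i + WARP_SIZE] for i in range(0, len(values), WARP_SIZE)]
    return sum(len({category(v) for v in w}) for w in warps) / len(warps)


rng = random.Random(0)
data = [rng.random() for _ in range(WARP_SIZE * 1024)]

unsorted_paths = avg_paths_per_warp(data)
binned_paths = avg_paths_per_warp(sorted(data, key=category))
print(f"paths per warp, original order: {unsorted_paths:.2f}")
print(f"paths per warp, binned by case: {binned_paths:.2f}")
```

After binning, only the few warps that straddle a boundary between cases still diverge. In a real kernel the binning pass has its own cost, and results may need to be written back in the original order.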

Profiling GPU (CUDA) — Getting Started with GPU Flight's Python Package

Myoungho Shin — Mon, 09 Mar 2026 03:59:53 +0000

In the previous posts I've been showing how to investigate GPU occupancy and optimize kernels that aren't fully using the hardware. That was just one case, and I'll cover more occupancy scenarios in future posts.

Today, I want to go through how to use GPU Flight in Python, especially with PyTorch. Since GPU Flight is still in active development, the current version is v0.1.0.dev7. You can install it with:

pip install gpufl==0.1.0.dev7

However, I highly recommend building from source inside a CUDA container. There are two reasons:

Prerequisite libraries — GPU Flight's backend needs CUPTI, the CUDA runtime, and NVML headers at compile time. Getting these right on a bare system is fiddly.
NVML support — the pre-built PyPI wheel is compiled in a minimal CI environment that doesn't include NVML stubs. This means the wheel works for kernel profiling, but can't collect runtime GPU utilization or VRAM usage. Building from source inside the nvidia/cuda:*-devel image picks up NVML automatically.

In this post, I'll show how to use Docker to set up an environment that's ready to go — with GPU Flight built from source, PyTorch, and Jupyter Lab all pre-installed.

The Dockerfile

Here's the full Dockerfile. It's straightforward — CUDA 13.1 base, PyTorch, GPU Flight, and Jupyter Lab:

FROM nvidia/cuda:13.1.0-devel-ubuntu24.04

ENV DEBIAN_FRONTEND=noninteractive

# System dependencies (Ubuntu 24.04 ships Python 3.12)
# NOTE: cmake/ninja come from pip (build-system.requires needs >=3.31, apt has 3.28)
RUN apt-get update && apt-get install -y \
    python3 \
    python3-venv \
    python3-dev \
    python3-pip \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create venv to avoid PEP 668 issues
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Upgrade pip
RUN pip install --upgrade pip

# Install PyTorch from the cu130 wheel index (no cu131 wheel is published yet)
RUN pip install torch --index-url https://download.pytorch.org/whl/cu130

# Build gpufl from source so it picks up NVML from the CUDA devel image
ARG GPUFL_VERSION=main
RUN git clone --depth 1 --branch ${GPUFL_VERSION} \
        https://github.com/gpu-flight/gpufl-client.git /tmp/gpufl-client \
    && CMAKE_ARGS="-DBUILD_TESTING=OFF" \
       pip install -v "/tmp/gpufl-client[analyzer,viz]" \
    && rm -rf /tmp/gpufl-client

# Install Jupyter
RUN pip install jupyterlab

# Working directory for notebooks
WORKDIR /workspace

# Expose Jupyter port
EXPOSE 8888

# Start Jupyter Lab
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", \
     "--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''"]

A few things to note:

Ubuntu 24.04 — ships Python 3.12 natively, which is what GPU Flight requires. No PPA hacks needed.
devel image — we use nvidia/cuda:13.1.0-devel-ubuntu24.04 because the devel variant includes CUPTI, CUDA headers, and NVML stubs that GPU Flight's backend needs at compile time.
Building from source — we clone the repo and build with pip install rather than using the pre-built PyPI wheel. This is important: the devel image has NVML stubs at /usr/local/cuda/lib64/stubs/libnvidia-ml.so, so CMake detects them and compiles in the NVML collector. The pre-built wheel doesn't have this, which means no GPU utilization or VRAM monitoring.
PyTorch cu130 — at the time of writing, PyTorch doesn't publish a cu131 wheel yet. The cu130 build is forward-compatible with the CUDA 13.1 runtime in the container, so this works fine.
No token — Jupyter starts without authentication. This is fine for local development; don't expose this to the internet.

Building and Running

Prerequisites

You need two things on your host machine:

Docker — any recent version
NVIDIA Container Toolkit — this lets Docker containers access your GPU

Important: Having an NVIDIA driver installed on your host is not enough. Docker doesn't know how to talk to your GPU on its own — you need the NVIDIA Container Toolkit to bridge that gap. Without it, --gpus all will fail with:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

You can check if it's already installed by running nvidia-ctk --version. If not, here's how to set it up:

# Add the NVIDIA container toolkit repo
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install and configure
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

That last systemctl restart is easy to forget — Docker needs to be restarted after the runtime is configured, or it won't pick up the new GPU capability.

You can verify it worked with:

docker run --rm --gpus all nvidia/cuda:13.1.0-base-ubuntu24.04 nvidia-smi

If you see your GPU listed, you're good to go.

Build the Image

docker build -t gpufl-python .

This will take a few minutes the first time — mostly downloading PyTorch.

Run the Container

docker run --gpus all -p 8888:8888 -v $(pwd)/notebooks:/workspace gpufl-python

Breaking that down:

Flag	What it does
`--gpus all`	Passes all GPUs into the container
`-p 8888:8888`	Maps Jupyter's port to your host
`-v $(pwd)/notebooks:/workspace`	Mounts a local folder so your notebooks persist

Connect

Open your browser and go to:

http://localhost:8888

You'll land in Jupyter Lab with GPU Flight, PyTorch, and a CUDA-capable GPU ready to go.

Quick Smoke Test

Create a new notebook and run this to verify everything is working:

import torch
import gpufl
from gpufl import ProfilingEngine

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device: {torch.cuda.get_device_name(0)}")

# Initialize GPU Flight
gpufl.init("smoke-test",
           log_path="./smoke_test",
           sampling_auto_start=True,
           enable_kernel_details=True,
           enable_stack_trace=True,
           profiling_engine=ProfilingEngine.RangeProfiler)

# Run a simple operation
with gpufl.Scope("RandomGeneration"):
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
with gpufl.Scope("a @ b"):
    c = a @ b
    torch.cuda.synchronize()

gpufl.shutdown()
print("GPU Flight logs written!")

After running this, you should see *.log files in your working directory. These are your GPU Flight recordings — every kernel launch, memory copy, and timing event that happened during that matrix multiply.

Analyzing the Results

GPU Flight's Python analyzer can load those logs directly in the notebook:

from gpufl.analyzer import GpuFlightSession

session = GpuFlightSession(".", log_prefix="smoke_test")
session.print_summary()

GpuFlightSession takes two main arguments: the directory where logs live, and the log_prefix matching your log_path from init. It automatically finds and loads smoke_test.device.log, smoke_test.scope.log, and smoke_test.system.log.
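
If the session fails to load, it can help to check which of those three files actually exist. The helper below is a hypothetical convenience function, not part of the gpufl package. It only relies on the file names mentioned above.

```python
from pathlib import Path


def find_session_logs(directory, log_prefix):
    """Return {channel: Path or None} for the device, scope and system logs."""
    base = Path(directory)
    found = {}
    for channel in ("device", "scope", "system"):
        path = base / f"{log_prefix}.{channel}.log"
        found[channel] = path if path.is_file() else None
    return found


logs = find_session_logs(".", "smoke_test")
missing = [name for name, path in logs.items() if path is None]
print("missing logs:", ", ".join(missing) if missing else "none")
```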

print_summary() gives you a quick dashboard — total duration, kernel count, GPU busy time, average utilization, and peak VRAM.

Now let's look at the kernel hotspots:

session.inspect_hotspots(top_n=10)

This gives you a Rich-formatted table of your hottest kernels with occupancy, register usage, shared memory, and the per-resource occupancy breakdown showing exactly what's limiting each kernel.

Here's what that actually looks like — this is real output from the matrix multiply we just ran:

The number that stands out is 33.3% occupancy. That sounds bad, right? Let's break it down.

The kernel is ampere_sgemm_128x64_nn — cuBLAS's single-precision matrix multiply. It uses 122 registers per thread. That's a lot. Let's trace through what happens on an Ampere SM:

128 threads per block = 4 warps per block
122 regs/thread × 32 threads/warp = 3,904 → rounded up to the hardware allocation granularity of 256 → 4,096 regs/warp
4 warps × 4,096 = 16,384 registers per block
An Ampere SM has 65,536 registers total → 65,536 / 16,384 = 4 blocks max
4 blocks × 4 warps = 16 active warps out of 48 max = 33.3%
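
The same arithmetic as a small Python function, in case you want to try other register counts. The default limits are the Ampere figures quoted in the steps above; the function name is mine, and this is a sketch of the calculation rather than GPU Flight's implementation.

```python
import math


def register_limited_occupancy(regs_per_thread, threads_per_block,
                               regs_per_sm=65_536, max_warps_per_sm=48,
                               warp_size=32, alloc_granularity=256):
    """Return (blocks per SM, active warps, occupancy) if registers were the only limit."""
    warps_per_block = math.ceil(threads_per_block / warp_size)
    # Registers are allocated per warp, rounded up to the allocation granularity.
    regs_per_warp = (math.ceil(regs_per_thread * warp_size / alloc_granularity)
                     * alloc_granularity)
    regs_per_block = regs_per_warp * warps_per_block
    blocks = regs_per_sm // regs_per_block
    active_warps = min(blocks * warps_per_block, max_warps_per_sm)
    return blocks, active_warps, active_warps / max_warps_per_sm


blocks, warps, occ = register_limited_occupancy(regs_per_thread=122,
                                                threads_per_block=128)
print(f"{blocks} blocks, {warps} warps, {occ:.1%} occupancy")
```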

The breakdown confirms it: reg 33.3% is the bottleneck, while shared memory (66.7%), warps (100%), and block count (100%) all have headroom.

But Is This Actually a Problem?

Not necessarily. High register usage is a problem when the algorithm doesn't actually need all those registers, but it can also be by design: a tuned library kernel may deliberately trade occupancy for more data held in registers per thread. This is a good example of why occupancy alone doesn't tell the whole story. You need to understand what's limiting it and whether that tradeoff makes sense for the workload.

If you saw 33% occupancy with limiting_resource: shared_mem on your own custom kernel, that might be worth investigating.

What's Next

Now that you have a working environment, you can start profiling your own models. The occupancy breakdown makes it easy to spot which kernels are underutilizing the GPU and — more importantly — why. Not every low-occupancy kernel is a problem, but when one is, you'll know exactly which resource to optimize.

In the next post, I'll cover GPU Flight's profiling engines — PC sampling, SASS metrics, and the range profiler — which let you go beyond kernel metadata and collect hardware-level data about what's happening inside the GPU while your kernels run.

Profiling GPU (CUDA) — What Is Actually Limiting Your Kernel?

Myoungho Shin — Mon, 02 Mar 2026 01:19:03 +0000

In my last post I introduced GPU Flight — a lightweight CUDA observability tool that acts like a flight recorder for your GPU. We covered what it collects: system metrics, device capabilities, and per-kernel events.

Today I want to talk about one specific metric that GPU Flight captures: occupancy. It's one of the most important numbers for understanding GPU performance, and also one of the most misunderstood.

What Is Occupancy?

A GPU is organized around Streaming Multiprocessors (SMs). Each SM can run many threads simultaneously — not by context-switching like a CPU, but by actually running them in parallel. The unit of scheduling on an SM is a warp: a group of 32 threads that execute the same instruction in lockstep.

An SM has a fixed warp budget — say, 48 warps on a typical Ampere GPU. When you launch a kernel with blocks of 256 threads (8 warps each), the SM can hold up to 6 blocks concurrently to fill those 48 warp slots. If something prevents that — too many registers, too much shared memory — fewer blocks fit, and some warp slots sit idle.

Occupancy measures how well those warp slots are filled:

occupancy = active warps / maximum warps per SM

A value of 1.0 means every slot is in use. A value of 0.5 means half of the SM's warp slots sit empty while your kernel runs, which leaves the scheduler fewer warps to switch to when one stalls.

How GPU Flight Captures It

GPU Flight records occupancy automatically for every kernel launch. No code changes needed — just initialize with enableKernelDetails: true and it shows up in the log:

{
  "type": "kernel_event",
  "name": "_Z18block_reduce_naivePKfPfi",
  "occupancy": 0.833333,
  "num_regs": 16,
  "static_shared_bytes": 16384,
  "dyn_shared_bytes": 0,
  "block": "(256,1,1)",
  "grid": "(16384,1,1)",
  "max_active_blocks": 5,
...
}

Under the hood, GPU Flight calls cudaOccupancyMaxActiveBlocksPerMultiprocessor at kernel launch time to get max_active_blocks, multiplies it by the number of warps per block, then divides by the SM's warp budget to compute occupancy. This happens inside the CUPTI callback, so it adds nothing to the kernel's own execution time.

That 0.833333 immediately tells you something is off. This kernel only fills 5 out of 6 possible concurrent blocks on each SM. Some compute is being left on the table.
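
You can reproduce that number from the logged fields alone. The sketch below parses the block dimensions and max_active_blocks from the event shown above and assumes the 48-warp Ampere budget mentioned earlier. The helper name is hypothetical.

```python
import math

WARP_SIZE = 32
MAX_WARPS_PER_SM = 48  # Ampere warp budget quoted earlier in this post

# Fields copied from the kernel_event shown above.
event = {"block": "(256,1,1)", "max_active_blocks": 5}


def occupancy_from_event(event, max_warps_per_sm=MAX_WARPS_PER_SM):
    """occupancy = max_active_blocks * warps_per_block / max warps per SM."""
    dims = [int(d) for d in event["block"].strip("()").split(",")]
    threads_per_block = math.prod(dims)
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    active_warps = event["max_active_blocks"] * warps_per_block
    return active_warps / max_warps_per_sm


print(round(occupancy_from_event(event), 6))
```

Five blocks of 8 warps give 40 active warps out of 48, which is the 0.833333 in the log.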

But What Is Actually Causing It?

Here's where a single number hits its limit.

Is it registers? Shared memory? The hardware block count cap? Looking at the log fields, you can make an educated guess — static_shared_bytes: 16384 is 16 KB of shared memory per block, which is pretty large. But you still have to do the math yourself against your specific GPU's properties to confirm.

That manual detective work is exactly what I wanted to eliminate. So GPU Flight now also computes a per-resource occupancy breakdown and identifies the limiting resource automatically. Let me show what this looks like with a concrete kernel.

The kernel

Here's a simple parallel block reduction — it sums an array by having all 256 threads in a block cooperate through shared memory:

__global__ void block_reduce_naive(const float* in, float* out, int n) {
    __shared__ float smem[4096]; // 16 KB — statically reserved

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Load one element per thread into shared memory
    smem[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    // Reduce in shared memory — each step halves the active threads
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }

    // Thread 0 writes the block's result
    if (tid == 0) out[blockIdx.x] = smem[0];
}

Launched with 256 threads per block across 4M elements:

const int BLOCK = 256;
const int GRID  = (N + BLOCK - 1) / BLOCK; // ~16384 blocks
block_reduce_naive<<<GRID, BLOCK>>>(d_in, d_out, N);

Nothing unusual here — this is a textbook reduction. But GPU Flight flags a problem immediately.

What GPU Flight sees

{
...,
  "occupancy":         0.833333,
  "reg_occupancy":     1.0,
  "smem_occupancy":    0.833333,
  "warp_occupancy":    1.0,
  "block_occupancy":   1.0,
  "limiting_resource": "shared_mem"
}

Each *_occupancy field answers: "if only this constraint existed, what would occupancy be?" The limiting_resource field names the one that's actually binding. Here — smem_occupancy matches occupancy and everything else is 1.0 — shared memory is definitively the culprit.
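
To make the "if only this constraint existed" idea concrete, here is a sketch of how such a breakdown could be computed by hand. It is not GPU Flight's actual implementation. The per-SM limits are my assumptions for a compute capability 8.6 GPU (48 warps, 16 blocks, 65,536 registers, 100 KB of shared memory with 1 KB reserved per block), and the labels "registers" and "blocks" are names I picked.

```python
import math

# Assumed per-SM limits for compute capability 8.6; check your own GPU.
LIMITS = {
    "max_warps": 48,
    "max_blocks": 16,
    "regs": 65_536,
    "smem_bytes": 100 * 1024,
    "smem_reserved_per_block": 1024,
    "reg_alloc_granularity": 256,
}


def occupancy_breakdown(threads_per_block, regs_per_thread, smem_per_block,
                        limits=LIMITS):
    warps_per_block = math.ceil(threads_per_block / 32)
    gran = limits["reg_alloc_granularity"]
    regs_per_warp = math.ceil(regs_per_thread * 32 / gran) * gran
    smem_cost = smem_per_block + limits["smem_reserved_per_block"]

    # Blocks per SM if each constraint were the only one.
    blocks = {
        "warps": limits["max_warps"] // warps_per_block,
        "registers": limits["regs"] // (regs_per_warp * warps_per_block),
        "shared_mem": limits["smem_bytes"] // smem_cost,
        "blocks": limits["max_blocks"],
    }
    occupancy = {
        name: min(count * warps_per_block, limits["max_warps"]) / limits["max_warps"]
        for name, count in blocks.items()
    }
    # The binding constraint is the one that allows the fewest blocks.
    limiting = min(blocks, key=blocks.get)
    return occupancy, limiting


occ, limiting = occupancy_breakdown(threads_per_block=256, regs_per_thread=16,
                                    smem_per_block=16_384)
print(limiting, {name: round(value, 6) for name, value in occ.items()})
```

With the kernel's logged values (256 threads, 16 registers, 16,384 bytes of static shared memory), shared memory allows 5 blocks while every other constraint allows at least 6.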

Why

The problem is __shared__ float smem[4096]. Static shared memory is sized at compile time and reserved in full for the block's entire lifetime — even if the kernel only uses part of it. With 256 threads per block, this reduction only ever touches smem[0] through smem[255], but all 4096 floats (16 KB) are locked up on the SM regardless. Every block is paying a 16 KB reservation it doesn't actually need, and that prevents the SM from scheduling as many concurrent blocks as the warp budget would otherwise allow.

The fix

Switch to dynamic shared memory, which is sized at launch time rather than compiled in:

__global__ void block_reduce_optimized(const float* in, float* out, int n) {
    extern __shared__ float smem[]; // size comes from the launch call

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    smem[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = smem[0];
}

The kernel body is completely unchanged. The only differences are extern __shared__ instead of a fixed-size array, and passing the size as the third launch argument:

size_t smem_bytes = BLOCK * sizeof(float); // 256 × 4 = 1 KB
block_reduce_optimized<<<GRID, BLOCK, smem_bytes>>>(d_in, d_out, N);

The shared memory footprint drops from 16 KB to 1 KB per block — 16× smaller — and now the SM can fit all 6 concurrent blocks instead of 5.
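
The before and after block counts can be checked with a few lines of arithmetic. The shared memory capacity (100 KB per SM) and the 1 KB per-block reservation are my assumptions for a compute capability 8.6 GPU, so treat this as a sketch and substitute your own device's limits.

```python
SMEM_PER_SM = 100 * 1024       # assumed shared memory per SM (compute capability 8.6)
RESERVED_PER_BLOCK = 1024      # assumed per-block system reservation
WARP_LIMIT_BLOCKS = 48 // 8    # 48 warp slots / 8 warps per 256-thread block


def resident_blocks(smem_per_block):
    """Blocks per SM allowed by shared memory, capped by the warp budget."""
    by_smem = SMEM_PER_SM // (smem_per_block + RESERVED_PER_BLOCK)
    return min(by_smem, WARP_LIMIT_BLOCKS)


static_bytes = 4096 * 4   # __shared__ float smem[4096]
dynamic_bytes = 256 * 4   # BLOCK * sizeof(float)

print("footprint ratio:", static_bytes // dynamic_bytes)
print("blocks per SM:", resident_blocks(static_bytes), "->",
      resident_blocks(dynamic_bytes))
```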

GPU Flight confirms the fix worked:

{
  "occupancy":         1.0,
  "limiting_resource": "warps"
}

"warps" as the limiting resource means full occupancy — every SM warp slot is filled and shared memory is no longer in the way.

Full Sample Code: GitHub Repo

Profiling GPU (CUDA) — Introducing GPU Flight

Myoungho Shin — Tue, 24 Feb 2026 01:32:26 +0000

Last year, I took a GPU programming course at Johns Hopkins University as part of my graduate studies, where I learned CUDA programming. For my final project, I built a lightweight GPU monitoring and profiling tool focused on CUDA.

I enjoyed the process so much that I decided to continue developing it beyond the course.

GPU Flight is a GPU profiling and monitoring tool for CUDA and ROCm workloads.

In this post, I’d like to briefly introduce the project:

Open-source client: https://github.com/gpu-flight/gpufl-client

Why I Started GPU Flight

When profiling a CUDA application, you typically:

Install profiling tools such as Nsight
Or manually integrate CUPTI into your application, which often makes the code complex and difficult to manage
Deal with additional complexity in cloud or containerized environments

This workflow can be inconvenient — especially in production systems.

I wanted something lighter.

Something that works more like a flight recorder for GPUs.

So I built GPU Flight.

Instead of requiring heavy tooling at runtime, GPU Flight writes structured profiling logs directly on the host machine. A separate component (GPUFL Agent) crawls these log files and forwards them to a backend service or other destinations.

This makes GPU observability more flexible and easier to integrate into distributed systems.

What is GPU Flight?

GPU Flight is designed to be lightweight and modular.

If you only need monitoring, the overhead is minimal.
Enabling deeper profiling provides more detailed metrics.

The goal is to expose useful GPU metrics so you can clearly understand:

How the GPU manages resources
How your program utilizes GPU resources
Where performance bottlenecks occur

Project Structure

GPU Flight currently consists of several components:

1️⃣ gpufl-client

https://github.com/gpu-flight/gpufl-client

The client library that users embed into their applications for monitoring and profiling.

2️⃣ gpufl-agent

https://github.com/gpu-flight/gpufl-agent

Despite the name, this is not an AI agent 🙂

It tracks log files and forwards profiling data to the configured destination.

3️⃣ gpufl-desktop

https://github.com/gpu-flight/gpufl-desktop

Originally, I planned to build a desktop viewer. Due to time constraints and the need for better cross-platform accessibility, I pivoted to a web-based frontend.
I am currently keeping the web frontend and backend repositories private as I develop them into a hosted cloud platform. To ensure the open-source community can still easily parse and utilize the trace logs locally, I am providing a lightweight Python viewer alongside the open-source C++ client.

What Metrics Does GPU Flight Support?

GPU Flight captures observability at multiple layers.

1️⃣ System & GPU Monitoring (NVML)

Host memory usage
GPU memory usage (used/free/total)
GPU utilization
Memory utilization
Temperature
Power consumption
Clock speeds (GFX / SM / Memory)
PCIe RX/TX bandwidth
Power and thermal throttling flags

Example JSON snippet:

{
  "type": "system_sample",
  "util_gpu": 57,
  "temp_c": 39,
  "power_mw": 54415,
  "clk_sm": 1740
}
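
Because the samples are plain JSON, they are easy to post-process. The sketch below parses the record shown above and converts it into friendlier units. It assumes power_mw is in milliwatts and clk_sm is in MHz, which the field names and values suggest but the snippet does not state. The function name and output keys are hypothetical.

```python
import json

# The example record from above, written as a single line of text.
line = ('{"type": "system_sample", "util_gpu": 57, "temp_c": 39, '
        '"power_mw": 54415, "clk_sm": 1740}')


def summarize_system_sample(raw):
    """Parse one record and return a small summary, or None for other types."""
    record = json.loads(raw)
    if record.get("type") != "system_sample":
        return None
    return {
        "gpu_util_pct": record["util_gpu"],
        "temp_c": record["temp_c"],
        "power_w": record["power_mw"] / 1000,  # assumed milliwatts -> watts
        "sm_clock_mhz": record["clk_sm"],      # assumed MHz
    }


summary = summarize_system_sample(line)
print(summary)
```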

2️⃣ CUDA Device Capabilities

Static architectural information:

Compute capability
L2 cache size
Shared memory per block
Registers per block
SM count
Warp size

3️⃣ CUDA API & Kernel Events (CUPTI)

API enter/exit timestamps
Kernel execution start/end timestamps
Grid/block dimensions
Shared memory usage
Register usage
Occupancy
Correlation IDs
Memory copy events (HtoD, DtoH)

Python Support

GPU Flight is also being extended to support Python applications that use CUDA (e.g., PyTorch).

Example:

https://github.com/gpu-flight/gpufl-client/blob/main/example/python/03_pytorch_benchmark.py

This allows profiling GPU-heavy ML workloads without deeply modifying existing code.

What’s Next?

In the next post, I’ll walk through a minimal CUDA example and show how to:

Integrate gpufl-client
Run a kernel
Inspect generated profiling logs
Interpret stall reasons and metrics

Thanks for reading. This is just the beginning.