<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shah Fahad</title>
    <description>The latest articles on Forem by Shah Fahad (@sfahad).</description>
    <link>https://forem.com/sfahad</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3783516%2F5f595b19-bafe-4d58-85e5-0d83a323d253.jpg</url>
      <title>Forem: Shah Fahad</title>
      <link>https://forem.com/sfahad</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sfahad"/>
    <language>en</language>
    <item>
      <title>CUDA Graphs in LLM Inference: Deep Dive</title>
      <dc:creator>Shah Fahad</dc:creator>
      <pubDate>Sat, 21 Feb 2026 07:09:21 +0000</pubDate>
      <link>https://forem.com/sfahad/cuda-graphs-in-llm-inference-deep-dive-36pb</link>
      <guid>https://forem.com/sfahad/cuda-graphs-in-llm-inference-deep-dive-36pb</guid>
      <description>&lt;h2&gt;
  
  
  Why CUDA Graphs Matter for LLM Inference
&lt;/h2&gt;

&lt;p&gt;LLM inference -- especially the token generation (decode) phase -- is &lt;strong&gt;often dominated by CPU overhead rather than GPU compute&lt;/strong&gt;. Each decode step generates a single token per sequence: the actual GPU work (small matmuls, attention over one query) can finish in microseconds, but the CPU can spend tens of microseconds &lt;em&gt;per kernel launch&lt;/em&gt; on launch bookkeeping, driver calls, and synchronization. With hundreds of kernel launches per transformer forward pass, this CPU overhead can become the bottleneck (though at higher batch sizes or with heavier kernels, decode can still become GPU-bound).&lt;/p&gt;

&lt;p&gt;Making matters worse, the CPU isn't just launching kernels -- it's also preparing data for the next batch: updating token IDs, managing the KV cache block table, running the scheduler, and handling request arrivals/completions. All of this competes for CPU time with kernel launches, amplifying the bottleneck. The GPU ends up sitting idle between launches, throughput drops, latency rises, and expensive GPU cycles are wasted on nothing.&lt;/p&gt;

&lt;p&gt;CUDA graphs solve this by &lt;strong&gt;recording the entire kernel sequence once&lt;/strong&gt; and &lt;strong&gt;replaying it with a single CPU call&lt;/strong&gt;. The driver overhead is paid once at capture time; every subsequent replay amortizes hundreds of per-kernel launches into a single replay launch, largely avoiding the repeated per-kernel launch bookkeeping. For decode-heavy workloads, this can eliminate the majority of per-step overhead.&lt;/p&gt;

&lt;p&gt;This post walks through how CUDA graphs work in the context of LLM serving -- why decode is a natural fit, why context/mixed batches are harder, and how TensorRT-LLM (TRT-LLM) implements both monolithic and piecewise CUDA graph strategies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;1. CUDA Graphs Fundamentals&lt;/li&gt;
&lt;li&gt;2. Generation (Decode) CUDA Graphs&lt;/li&gt;
&lt;li&gt;3. KV Cache with Static Addresses&lt;/li&gt;
&lt;li&gt;4. Why Context &amp;amp; Mixed Batches Are Hard&lt;/li&gt;
&lt;li&gt;5. Piecewise CUDA Graphs (torch.compile)&lt;/li&gt;
&lt;li&gt;6. Configuration Guide&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. CUDA Graphs Fundamentals
&lt;/h2&gt;

&lt;p&gt;A CUDA graph captures a sequence of GPU operations (kernel launches, memory copies) into a single replayable unit.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Gets Captured (Fixed)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------------------+
| CUDA Graph Recording                                               |
|                                                                    |
| +----------+      +----------+      +----------+      +----------+ |
| | Kernel A |      | Kernel B |      | Kernel C |      | Kernel D | |
| |grid(4,1) |-----&amp;gt;|grid(8,1) |-----&amp;gt;|grid(4,1) |-----&amp;gt;|grid(2,1) | |
| |@0x100 -&amp;gt; |      |@0x200 -&amp;gt; |      |@0x300 -&amp;gt; |      |@0x400 -&amp;gt; | |
| |  0x200   |      |  0x300   |      |  0x400   |      |  0x500   | |
| +----------+      +----------+      +----------+      +----------+ |
+--------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Baked into the graph:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which kernels to launch, in what order&lt;/li&gt;
&lt;li&gt;Memory addresses (pointers) each kernel reads/writes&lt;/li&gt;
&lt;li&gt;Kernel launch parameters (grid dims, block dims, shared memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NOT baked (can change between replays):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The actual data at those addresses&lt;/li&gt;
&lt;li&gt;Data-dependent control flow inside kernels (loops, branches)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Replay Contract
&lt;/h3&gt;

&lt;p&gt;On replay, the entire sequence launches with minimal CPU overhead. The user's responsibility is to place correct data at the captured addresses before each replay.&lt;/p&gt;
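&lt;p&gt;This contract is easy to model in plain Python -- a toy sketch, not the CUDA API: dict keys stand in for fixed device addresses, and the "graph" is just a recorded list of operations over those buffers.&lt;/p&gt;

```python
# Toy model of the CUDA graph replay contract (pure Python, not the CUDA API).
# "Buffers" stand in for fixed device addresses; the graph records which
# operation reads/writes which buffer, never the data itself.

buffers = {"in": 0, "tmp": 0, "out": 0}  # fixed "addresses" (dict keys)

def kernel_a():      # reads @in, writes @tmp
    buffers["tmp"] = buffers["in"] * 2

def kernel_b():      # reads @tmp, writes @out
    buffers["out"] = buffers["tmp"] + 1

graph = [kernel_a, kernel_b]   # "captured" once: ops + addresses, no data

def replay():
    for k in graph:            # one CPU call replays the whole chain
        k()

buffers["in"] = 10             # place fresh data at the captured address
replay()
assert buffers["out"] == 21

buffers["in"] = 7              # same graph, new data, no re-capture
replay()
assert buffers["out"] == 15
```

&lt;p&gt;The graph never changed between the two replays; only the data at the captured "addresses" did.&lt;/p&gt;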

&lt;h3&gt;
  
  
  Why It's Fast
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------------+
| Without CUDA Graph (eager) |
|                            |
| CPU -- launch --&amp;gt; Kernel A |
| CPU &amp;lt;-- wait ----+         |
| CPU -- launch --&amp;gt; Kernel B |
| CPU &amp;lt;-- wait ----+         |
| CPU -- launch --&amp;gt; Kernel C |
| CPU &amp;lt;-- wait ----+         |
| CPU -- launch --&amp;gt; Kernel D |
|                            |
| = 4x CPU round-trips       |
+----------------------------+

+------------------------------------------+
| With CUDA Graph                          |
|                                          |
| CPU -- replay --&amp;gt; [ Kernel A, B, C, D ]  |
|                                          |
| = 1 launch, entire chain executes on GPU |
+------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2. Generation (Decode) CUDA Graphs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Decode Is Well-Suited
&lt;/h3&gt;

&lt;p&gt;In decode, each sequence contributes exactly &lt;strong&gt;1 new token&lt;/strong&gt; per step. Total tokens = batch size. This makes the input shape predictable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------------------------------------------------------------+
| Decode step N                                                 |
|                                                               |
| seq0: 1 token  \                                              |
| seq1: 1 token   \                                             |
|                   &amp;gt;-- batch_size = 4, shape = [4, hidden_dim] |
| seq2: 1 token   /                                             |
| seq3: 1 token  /                                              |
+---------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pre-allocated Static Buffers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------------------+
| Input token IDs buffer (pre-allocated, max_batch_size = 4096)   |
|                                                                 |
| [ token_0 ][ token_1 ][ token_2 ][ token_3 ] ... [ token_4095 ] |
|   @addr_0    @addr_1    @addr_2    @addr_3          @addr_4095  |
|                                                                 |
|   fixed addresses -- same every replay                          |
+-----------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multiple Graphs for Different Batch Sizes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Captured graphs (one per supported batch size, typically powers of two):

  batch_size   grid size     reads
  ----------   ---------     -----
       1  --&amp;gt;  (1, ...)  --&amp;gt; addr_0
       2  --&amp;gt;  (2, ...)  --&amp;gt; addr_0..1
       4  --&amp;gt;  (4, ...)  --&amp;gt; addr_0..3
       8  --&amp;gt;  (8, ...)  --&amp;gt; addr_0..7
       :
    4096  --&amp;gt;  (4096,..) --&amp;gt; addr_0..4095
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At runtime, 5 active sequences map to the batch_size=8 graph, padded with 3 dummy sequences.&lt;/p&gt;
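&lt;p&gt;Graph selection is then a small lookup -- a sketch assuming the captured sizes are the powers of two listed above:&lt;/p&gt;

```python
import bisect

# Captured batch sizes, sorted ascending (powers of two up to a max).
CAPTURED_SIZES = [1, 2, 4, 8, 16, 32, 64]

def select_graph(num_active):
    """Smallest captured batch size that fits num_active sequences."""
    idx = bisect.bisect_left(CAPTURED_SIZES, num_active)
    if idx == len(CAPTURED_SIZES):
        raise ValueError("batch exceeds largest captured graph")
    size = CAPTURED_SIZES[idx]
    return size, size - num_active   # (graph to replay, dummy sequences)

assert select_graph(5) == (8, 3)     # 5 active: batch_size=8 graph, pad 3
assert select_graph(8) == (8, 0)     # exact fit, no padding
```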

&lt;h3&gt;
  
  
  Intermediate Activations Have Stable Addresses
&lt;/h3&gt;

&lt;p&gt;During capture, intermediate tensors are allocated from a graph-private memory pool, giving them stable device addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------------------------------------------+
| Transformer layer (captured; all addresses fixed)        |
|                                                          |
| [QKV Projection] ----&amp;gt; [Attention] ----&amp;gt; [Output Proj]   |
|  in @A, out @B          in @B, out @C    in @C, out @D   |
|                                               |          |
|                                               v          |
| [FFN Layer 1] --------&amp;gt; [FFN Layer 2] ----&amp;gt; (next layer) |
|  in @D, out @E           in @E, out @F                   |
+----------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On replay, the same chain executes at the same addresses. Intermediate buffers are never freed between replays -- they persist in the graph's memory pool. This is why &lt;strong&gt;each captured batch size has its own set of stable-address buffers&lt;/strong&gt;, and capturing many batch sizes consumes significant GPU memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Runtime Updates Before Each Replay
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------+
| 1. input_token_ids[0:B]  &amp;lt;-- new token IDs          |
| 2. position_ids[0:B]     &amp;lt;-- new positions          |
| 3. sequence_lengths[0:B] += 1                       |
| 4. block_table           &amp;lt;-- update if new KV block |
+-----------------------------------------------------+
| 5. &amp;gt;&amp;gt;&amp;gt; REPLAY GRAPH &amp;lt;&amp;lt;&amp;lt;                             |
+-----------------------------------------------------+
| 6. new_logits &amp;lt;-- output_buffer[0:B]                |
+-----------------------------------------------------+
| B = batch_size                                      |
+-----------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
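&lt;p&gt;Put together, one decode step can be sketched in Python. Buffer names and the FakeGraph stand-in are illustrative, not TRT-LLM's API; the batch is assumed to already be padded to a captured size:&lt;/p&gt;

```python
# Schematic decode step (buffer and graph names are illustrative).
def decode_step(graphs, state, new_tokens, new_positions):
    B = len(new_tokens)
    # 1-4: refresh data at the captured addresses
    state["input_token_ids"][0:B] = new_tokens
    state["position_ids"][0:B] = new_positions
    for i in range(B):
        state["sequence_lengths"][i] += 1
    # (block_table updated here only if a sequence crossed a block boundary)
    # 5: one CPU call replays the whole forward pass
    graphs[B].replay()
    # 6: read logits from the fixed output buffer
    return state["output_buffer"][0:B]

class FakeGraph:
    """Stand-in for a captured forward pass: reads and writes fixed buffers."""
    def __init__(self, state):
        self.state = state
    def replay(self):
        ids = self.state["input_token_ids"]
        self.state["output_buffer"] = [t * 10 for t in ids]

state = {"input_token_ids": [0] * 4, "position_ids": [0] * 4,
         "sequence_lengths": [5, 5, 5, 5], "output_buffer": [0] * 4}
graphs = {2: FakeGraph(state)}

out = decode_step(graphs, state, [3, 4], [6, 7])
assert out == [30, 40]
assert state["sequence_lengths"] == [6, 6, 5, 5]
```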






&lt;h2&gt;
  
  
  3. KV Cache with Static Addresses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Apparent Contradiction
&lt;/h3&gt;

&lt;p&gt;KV cache grows every step (new K,V written for each token), yet CUDA graphs require fixed addresses. The solution: &lt;strong&gt;paged/block-based KV cache&lt;/strong&gt; with an &lt;strong&gt;indirection table&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Block-Based KV Cache Pool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------------------------------------------+
| KV cache pool (pre-allocated; addresses never change)       |
|                                                             |
| [ Block 0 ][ Block 1 ][ Block 2 ][ Block 3 ][ Block 4 ] ... |
|   @blk_0     @blk_1     @blk_2     @blk_3     @blk_4        |
|  32 slots   32 slots   32 slots   32 slots   32 slots       |
|                                                             |
| each block holds K,V for a fixed number of tokens (e.g. 32) |
+-------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Block Table (Indirection)
&lt;/h3&gt;

&lt;p&gt;Each sequence has a block table mapping logical positions to physical blocks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Logical positions&lt;/th&gt;
&lt;th&gt;Physical block&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tokens 0–31&lt;/td&gt;
&lt;td&gt;Block 7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tokens 32–63&lt;/td&gt;
&lt;td&gt;Block 12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tokens 64–95&lt;/td&gt;
&lt;td&gt;Block 3 (partially filled, e.g. up to 82)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Sequence 0's block table at fixed address @tbl_0&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Attention Kernel Uses Indirection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inside the attention kernel (pseudo-code):
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;past&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sequence_length&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seq_id&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;block_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block_table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seq_id&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    &lt;span class="c1"&gt;# read from @tbl_0
&lt;/span&gt;    &lt;span class="n"&gt;offset&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;
    &lt;span class="n"&gt;K_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kv_cache_pool&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;block_idx&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;              &lt;span class="c1"&gt;# indirect lookup into pool
&lt;/span&gt;    &lt;span class="n"&gt;V_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kv_cache_pool&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;block_idx&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K_i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step-by-Step: How KV Cache Grows Within CUDA Graph
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Buffer&lt;/th&gt;
&lt;th&gt;Step N&lt;/th&gt;
&lt;th&gt;Step N+1&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;block_table&lt;/code&gt; @tbl_0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[7, 12, 3]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[7, 12, 3]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same address, same indices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;seq_length&lt;/code&gt; @len_0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;82&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;83&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same address, incremented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kv_pool Block 3, slot 18&lt;/td&gt;
&lt;td&gt;K,V for token 82&lt;/td&gt;
&lt;td&gt;K,V for token 82&lt;/td&gt;
&lt;td&gt;Unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kv_pool Block 3, slot 19&lt;/td&gt;
&lt;td&gt;&lt;em&gt;(empty)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;K,V for token 83&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;NEW&lt;/strong&gt; — written by kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The kernel wrote to a different slot because &lt;code&gt;sequence_length&lt;/code&gt; told it to. All addresses remain fixed -- only the data changes.&lt;/p&gt;
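&lt;p&gt;The write-slot arithmetic is the same indirection in reverse -- a sketch with block_size = 32, matching the example above:&lt;/p&gt;

```python
BLOCK_SIZE = 32

def kv_write_target(block_table, seq_len):
    """Physical (block, slot) where K,V for the token at position seq_len go."""
    block_idx = block_table[seq_len // BLOCK_SIZE]
    slot = seq_len % BLOCK_SIZE
    return block_idx, slot

table = [7, 12, 3]                             # sequence 0's block table
assert kv_write_target(table, 82) == (3, 18)   # step N:   Block 3, slot 18
assert kv_write_target(table, 83) == (3, 19)   # step N+1: Block 3, slot 19
```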

&lt;h3&gt;
  
  
  Why This Doesn't Violate CUDA Graph Rules
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What's fixed (baked in graph)&lt;/th&gt;
&lt;th&gt;What changes (data at fixed addrs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;kv_cache_pool&lt;/code&gt; base address&lt;/td&gt;
&lt;td&gt;Which blocks are assigned (block_table data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;block_table&lt;/code&gt; buffer address&lt;/td&gt;
&lt;td&gt;The integer block indices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;sequence_length&lt;/code&gt; buffer address&lt;/td&gt;
&lt;td&gt;The actual length values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel grid dimensions&lt;/td&gt;
&lt;td&gt;Data-dependent loops inside kernel iterate more/fewer times&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4. Why Context &amp;amp; Mixed Batches Are Hard
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem: Variable Total Token Count
&lt;/h3&gt;

&lt;p&gt;In decode, total tokens = batch size (each sequence = 1 token). In context/mixed, total tokens varies wildly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Batch type&lt;/th&gt;
&lt;th&gt;Sequences&lt;/th&gt;
&lt;th&gt;Total tokens&lt;/th&gt;
&lt;th&gt;Predictable?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decode&lt;/td&gt;
&lt;td&gt;&lt;code&gt;seq₀(1) + seq₁(1) + seq₂(1)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Yes — always = batch_size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context&lt;/td&gt;
&lt;td&gt;&lt;code&gt;seq₀(137) + seq₁(2048)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2185&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixed&lt;/td&gt;
&lt;td&gt;&lt;code&gt;seq₀(512 prefill) + seq₁(1 decode)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;513&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Problem 1: Kernel Grid Dimensions Depend on Total Tokens
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Kernel launch -- grid dims are a function of input shape&lt;/span&gt;
&lt;span class="n"&gt;dim3&lt;/span&gt; &lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TILE_M&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;TILE_M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hidden_dim&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TILE_N&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;TILE_N&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;matmul_kernel&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_dim&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;total_tokens&lt;/th&gt;
&lt;th&gt;grid size&lt;/th&gt;
&lt;th&gt;Implication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(4, …)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 blocks — one graph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3072&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(24, …)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;24 blocks — &lt;strong&gt;different&lt;/strong&gt; graph required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The grid is baked at capture time. Different total tokens = different grid = different graph.&lt;/p&gt;
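&lt;p&gt;The same ceil-division in Python shows why 512 and 3072 tokens cannot share a graph. TILE_M = 128 is an assumed tile size, chosen to be consistent with the table above:&lt;/p&gt;

```python
TILE_M = 128   # assumed tile size; matches the table (512 tokens = 4 blocks)

def grid_x(total_tokens):
    return (total_tokens + TILE_M - 1) // TILE_M   # ceil division

assert grid_x(512) == 4
assert grid_x(3072) == 24
assert grid_x(513) == 5   # one extra token already forces a different grid
```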

&lt;h3&gt;
  
  
  Problem 2: Attention Grid Depends on Max Context Seq Length and Num Context Requests
&lt;/h3&gt;

&lt;p&gt;For MLP, every token is independent: &lt;code&gt;output[i] = MLP(input[i])&lt;/code&gt;. Fix total_tokens and you're done.&lt;/p&gt;

&lt;p&gt;For attention, the kernel grid depends on &lt;strong&gt;two per-iteration variables&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------------+
| TRT-LLM attention grid (simplified call chain)               |
|                                                              |
| Python (trtllm.py)                                           |
|   max_ctx_seq_len = seq_lens[:num_contexts].max()            |
|                             |                                |
|                             v                                |
| C++ (fmhaRunner / fused_multihead_attention_v2)              |
|   |                   |                   |                  |
|   v                   v                   v                  |
|   grid.x              grid.y              grid.z             |
|   ceil(s/unroll)      num_heads           num_ctx_requests   |
|   [VARIES]            [FIXED]             [VARIES]           |
|                                                              |
|   --&amp;gt; grid = ( ceil(s/unroll), num_heads, num_ctx_requests ) |
+--------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Grid = &lt;code&gt;(ceil(max_ctx_seq_len / unroll_step), num_heads, num_context_requests)&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TRT-LLM uses a &lt;strong&gt;padded tiling strategy&lt;/strong&gt;: the grid is sized for the longest context request, and shorter requests have their extra tiles skip computation (the kernel checks &lt;code&gt;cu_seqlens&lt;/code&gt; internally):&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Padded tiling: 3 context requests, &lt;code&gt;seq_lens = [64, 128, 256]&lt;/code&gt;, &lt;code&gt;unroll_step = 64&lt;/code&gt;.&lt;br&gt;
Grid = &lt;code&gt;(4, num_heads, 3)&lt;/code&gt; — sized for longest request (256).&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Tile 0&lt;/th&gt;
&lt;th&gt;Tile 1&lt;/th&gt;
&lt;th&gt;Tile 2&lt;/th&gt;
&lt;th&gt;Tile 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Req 0&lt;/strong&gt; (64 tokens)&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Req 1&lt;/strong&gt; (128 tokens)&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Req 2&lt;/strong&gt; (256 tokens)&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
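&lt;p&gt;The tile map above can be computed directly -- a sketch of the padded tiling with unroll_step = 64:&lt;/p&gt;

```python
UNROLL_STEP = 64

def padded_tiling(seq_lens):
    """Grid.x sized for the longest request; shorter requests skip extra tiles."""
    def ceil_div(a, b):
        return (a + b - 1) // b
    grid_x = ceil_div(max(seq_lens), UNROLL_STEP)
    tile_map = []
    for s in seq_lens:
        active = ceil_div(s, UNROLL_STEP)
        tile_map.append(["compute"] * active + ["skip"] * (grid_x - active))
    return grid_x, tile_map

gx, tiles = padded_tiling([64, 128, 256])
assert gx == 4                                    # grid = (4, num_heads, 3)
assert tiles[0] == ["compute", "skip", "skip", "skip"]
assert tiles[2] == ["compute"] * 4
```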

&lt;p&gt;Even with this padded approach, the grid changes per iteration because &lt;strong&gt;both &lt;code&gt;max_ctx_seq_len&lt;/code&gt; and &lt;code&gt;num_context_requests&lt;/code&gt;&lt;/strong&gt; change depending on which requests the scheduler assigns to the context phase:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Iteration&lt;/th&gt;
&lt;th&gt;Context requests&lt;/th&gt;
&lt;th&gt;max_len&lt;/th&gt;
&lt;th&gt;grid&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(2, heads, 32)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(2, heads, 1)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;grid.z&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(4, heads, 2)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;grid.x and z&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Different iterations produce different grids/launch parameters — the combination space explodes across multiple variables (e.g., &lt;code&gt;max_ctx_seq_len&lt;/code&gt;, &lt;code&gt;num_context_requests&lt;/code&gt;, and sequence-length distributions), making “one reusable CUDA graph” impractical.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A CUDA graph captured with one grid would produce &lt;strong&gt;incorrect results&lt;/strong&gt; if replayed with a different grid/launch configuration (missing tiles = unprocessed tokens; extra tiles = out-of-bounds/garbage work). To make this safe, you’d need to capture graphs for many combinations or pad/standardize to a fixed worst-case launch shape.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why Decode Attention Doesn't Have This Problem
&lt;/h3&gt;

&lt;p&gt;In decode, every sequence has exactly &lt;strong&gt;1 query token&lt;/strong&gt;, and decode attention uses a different kernel path with &lt;code&gt;grid = (batch_size, num_heads)&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;batch_size&lt;/code&gt; is fixed per captured graph (one graph per supported batch size)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;num_heads&lt;/code&gt; is a model constant&lt;/li&gt;
&lt;li&gt;Variable KV cache lengths are handled by data-dependent loops &lt;strong&gt;inside&lt;/strong&gt; the kernel (loop over &lt;code&gt;sequence_length[i]&lt;/code&gt;) -- the grid doesn't change&lt;/li&gt;
&lt;/ul&gt;
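&lt;p&gt;The contrast is easy to state in code (num_heads = 32 and unroll_step = 64 are assumed values): the decode grid is a pure function of the captured batch size, while the context grid moves with whatever the scheduler picked:&lt;/p&gt;

```python
NUM_HEADS = 32
UNROLL_STEP = 64

def decode_grid(batch_size):
    return (batch_size, NUM_HEADS)            # no per-step variables

def context_grid(ctx_seq_lens):
    x = (max(ctx_seq_lens) + UNROLL_STEP - 1) // UNROLL_STEP
    return (x, NUM_HEADS, len(ctx_seq_lens))  # varies every iteration

# Decode: identical grid every step, however long the KV caches grow.
assert decode_grid(8) == (8, 32)

# Context: the three iterations from the table above all differ.
assert context_grid([128] * 32) == (2, 32, 32)
assert context_grid([128]) == (2, 32, 1)
assert context_grid([256, 100]) == (4, 32, 2)
```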
&lt;h3&gt;
  
  
  Where Each Layer Type Falls
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Shape&lt;/th&gt;
&lt;th&gt;Capturable?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Layer norm&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q, K, V projections&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat matmuls&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Fused attention&lt;/strong&gt; (Q@K^T, softmax, scores@V)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;per-sequence, variable tiles&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;No&lt;/strong&gt; — grid varies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output projection&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat matmul&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLP&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat matmuls&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  5. Piecewise CUDA Graphs (torch.compile)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Two Separate CUDA Graph Systems
&lt;/h3&gt;

&lt;p&gt;TRT-LLM uses &lt;strong&gt;two independent&lt;/strong&gt; CUDA graph systems -- understanding this distinction is critical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  Python model forward()
                          |
            +-------------+-------------+
            |                           |
            v                           v
+-------------------------+ +-------------------------+
| torch.compile           | | Native CUDA Graph       |
| (Dynamo tracing)        | | (stream capture)        |
+-------------------------+ +-------------------------+
| Traces Python -&amp;gt; FX     | | Records GPU kernels     |
| Decomposes to ATen ops  | | on the CUDA stream      |
| Custom ops -&amp;gt; split pt  | | Captures everything     |
+-------------------------+ +-------------------------+
| Result: Pieces          | | Result: One monolithic  |
| [graph][eager][graph]...| | graph of full fwd pass  |
+-------------------------+ +-------------------------+
            |                           |
            v                           v
  Used for: mixed/context    Used for: decode-only
  (attn grid varies)         (attn grid fixed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Generation-only (decode)&lt;/strong&gt;: Uses &lt;strong&gt;native &lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt;&lt;/strong&gt; capture. This records every kernel launch on the CUDA stream at the driver level -- including FlashAttention. It doesn't need to "understand" the kernels; it just records them. This works because decode attention's grid depends only on &lt;code&gt;batch_size&lt;/code&gt; (fixed per capture).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Piecewise (mixed/context)&lt;/strong&gt;: Uses &lt;strong&gt;torch.compile&lt;/strong&gt; to trace the model into an FX graph, then TRT-LLM's custom backend splits at attention boundaries and captures each non-attention piece as a CUDA graph. Attention runs eagerly.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Piecewise Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------+
| CUDA GRAPH -- piece 1                     [captured]   |
|   layer_norm -&amp;gt; qkv_projection                         |
|   pre-allocates output buffer @ addr_X                 |
+--------------------------------------------------------+
|                         |                              |
|                         v                              |
+--------------------------------------------------------+
| EAGER -- not graphed                 [runs every time] |
|   flash_attention(q, k, v, cu_seqlens, ...)            |
|   writes result IN-PLACE to addr_X                     |
+--------------------------------------------------------+
|                         |                              |
|                         v                              |
+--------------------------------------------------------+
| CUDA GRAPH -- piece 2                     [captured]   |
|   reads from addr_X                                    |
|   output_proj -&amp;gt; layer_norm -&amp;gt; mlp_up -&amp;gt;               |
|   activation -&amp;gt; mlp_down -&amp;gt; residual_add               |
+--------------------------------------------------------+
|                         |                              |
|                         v                              |
|                 ... next layer ...                     |
+--------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The in-place attention design is critical: attention writes into a buffer pre-allocated by piece 1, ensuring piece 2's captured graph reads from the correct fixed address.&lt;/p&gt;
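&lt;p&gt;To make the fixed-address contract tangible, here is a hypothetical, GPU-free Python sketch: plain lists stand in for the device buffers whose addresses a CUDA graph bakes in at capture time. None of these names come from TRT-LLM; the point is only the dataflow discipline the diagram shows.&lt;/p&gt;

```python
# Hypothetical, GPU-free sketch of the fixed-address contract: plain
# lists stand in for the device buffers whose addresses a CUDA graph
# bakes in at capture time. None of these names are TRT-LLM APIs.
HIDDEN = 4
buf_qkv = [0.0] * HIDDEN   # piece 1's pre-allocated output, "addr_X"
buf_out = [0.0] * HIDDEN   # piece 2's static output buffer

def piece1_replay(x):
    # Captured piece: always writes to the same buffer (fixed address).
    for i in range(HIDDEN):
        buf_qkv[i] = x[i] * 2.0

def attention_eager():
    # Eager step: must mutate addr_X in place; returning a fresh tensor
    # would leave piece 2's captured reads pointing at stale memory.
    for i in range(HIDDEN):
        buf_qkv[i] += 1.0

def piece2_replay():
    # Captured piece: reads addr_X, writes its own static buffer.
    for i in range(HIDDEN):
        buf_out[i] = buf_qkv[i] + 10.0

piece1_replay([1.0] * HIDDEN)   # buf_qkv becomes [2.0, ...]
attention_eager()               # buf_qkv becomes [3.0, ...] in place
piece2_replay()                 # buf_out becomes [13.0, ...]
```

&lt;p&gt;Swap &lt;code&gt;attention_eager&lt;/code&gt; for one that returns a new list instead of mutating &lt;code&gt;buf_qkv&lt;/code&gt;, and &lt;code&gt;piece2_replay&lt;/code&gt; would silently read stale data: that is exactly the failure mode the in-place design prevents.&lt;/p&gt;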

&lt;h3&gt;
  
  
  Why Attention Is Excluded
&lt;/h3&gt;

&lt;p&gt;Attention is excluded from CUDA graph capture for a &lt;strong&gt;correctness&lt;/strong&gt; reason, not a tracing limitation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tracing works fine.&lt;/strong&gt; TRT-LLM registers a FakeTensor implementation for the attention custom op, so &lt;code&gt;torch.compile&lt;/code&gt; in fullgraph mode traces the entire forward pass into one FX graph without graph breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The exclusion is a deliberate choice.&lt;/strong&gt; TRT-LLM's &lt;code&gt;piecewise_optimizer.py&lt;/code&gt; explicitly identifies attention ops and excludes them from CUDA graph pieces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tensorrt_llm/_torch/compilation/piecewise_optimizer.py
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_call_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trtllm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attn_custom_op_inplace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trtllm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mla_custom_op_inplace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;exclude_modules_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ← excluded from CUDA graph capture
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The reason: replay correctness.&lt;/strong&gt; If attention were captured in a CUDA graph, the kernel's grid dimensions would be baked in. But attention's grid depends on the per-sequence query distribution, not just total tokens:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel source&lt;/th&gt;
&lt;th&gt;grid.x&lt;/th&gt;
&lt;th&gt;grid.y&lt;/th&gt;
&lt;th&gt;grid.z&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fused_multihead_attention_v2.cpp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ceil(params.s / mUnrollStep)&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;params.h&lt;/code&gt; (heads) — fixed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;params.b&lt;/code&gt; (batch) — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;triton_attention.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;num_prefill&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;n_heads&lt;/code&gt; — fixed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ceil(max(seq_len) / SEQ_BLOCK)&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unfusedAttentionKernels.cu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ceil(q_length / 32.0f)&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the same &lt;code&gt;total_tokens=4096&lt;/code&gt;, different sequence distributions can produce different grids/launch metadata. A captured graph replays the capture-time launch configuration; unless you pad/standardize to that same configuration, replaying on a different distribution would be incorrect. MLP doesn't have this problem because its grid depends primarily on &lt;code&gt;total_tokens&lt;/code&gt;.&lt;/p&gt;
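&lt;p&gt;A toy reproduction of the &lt;code&gt;triton_attention.py&lt;/code&gt; grid formula from the table makes this concrete (the &lt;code&gt;SEQ_BLOCK&lt;/code&gt; of 64 and the head count are assumed values, not taken from the source):&lt;/p&gt;

```python
# Toy reproduction of the triton_attention.py grid formula from the
# table above. SEQ_BLOCK = 64 and N_HEADS = 32 are assumed values.
SEQ_BLOCK = 64
N_HEADS = 32

def ceil_div(a, b):
    return -(-a // b)

def attn_grid(seq_lens):
    # grid = (num_prefill, n_heads, ceil(max(seq_len) / SEQ_BLOCK))
    return (len(seq_lens), N_HEADS, ceil_div(max(seq_lens), SEQ_BLOCK))

# Three batches with identical total_tokens = 4096 but different
# per-sequence distributions produce three different grids:
grid_a = attn_grid([1024, 1024, 1024, 1024])   # (4, 32, 16)
grid_b = attn_grid([4000, 32, 32, 32])         # (4, 32, 63)
grid_c = attn_grid([2048, 2048])               # (2, 32, 32)
```

&lt;p&gt;A graph captured with &lt;code&gt;grid_a&lt;/code&gt; would replay with &lt;code&gt;grid.z = 16&lt;/code&gt; even when the live batch needs 63 blocks, which is why attention stays eager.&lt;/p&gt;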

&lt;h3&gt;
  
  
  What &lt;code&gt;capture_num_tokens&lt;/code&gt; Controls
&lt;/h3&gt;

&lt;p&gt;This option pre-captures piecewise graphs at specific total token counts. At runtime, the actual token count is padded &lt;strong&gt;up&lt;/strong&gt; to the next captured value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capture_num_tokens: [1, 2, 4, 8, ..., 8192]

Runtime: 4160 total tokens → pad up to the next captured value (e.g., 5120)
  - Waste: (5120 - 4160) / 5120 = 18.75% extra compute
  - Benefit: CUDA graph replay for MLP pieces (zero launch overhead)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
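&lt;p&gt;The pad-up lookup is a simple sorted-list search. Here is a sketch of the logic (the capture list below is hypothetical, chosen to include 5120 so it matches the example above):&lt;/p&gt;

```python
import bisect

def pad_to_capture(total_tokens, capture_num_tokens):
    """Pad up to the next captured size; None means fall back to eager.
    capture_num_tokens must be sorted ascending. Illustrative sketch only."""
    idx = bisect.bisect_left(capture_num_tokens, total_tokens)
    if idx == len(capture_num_tokens):
        return None
    return capture_num_tokens[idx]

# Hypothetical sorted capture list that includes 5120, as in the example:
captures = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
            1024, 2048, 4096, 5120, 8192]
padded = pad_to_capture(4160, captures)   # 5120
waste = (padded - 4160) / padded          # 0.1875, i.e. 18.75% padding
```

&lt;p&gt;Anything beyond the largest capture point (here, 8192) returns &lt;code&gt;None&lt;/code&gt; and runs eagerly, which is why the capture list must cover the runtime maximum.&lt;/p&gt;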



&lt;h3&gt;
  
  
  Graph Type Summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Graph Type&lt;/th&gt;
&lt;th&gt;Capture Mechanism&lt;/th&gt;
&lt;th&gt;What It Captures&lt;/th&gt;
&lt;th&gt;When Used&lt;/th&gt;
&lt;th&gt;Key Parameter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation-only&lt;/td&gt;
&lt;td&gt;Native &lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Full forward pass (including attention)&lt;/td&gt;
&lt;td&gt;Pure decode iterations&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cuda_graph_config.batch_sizes&lt;/code&gt; or &lt;code&gt;max_batch_size&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Piecewise&lt;/td&gt;
&lt;td&gt;&lt;code&gt;torch.compile&lt;/code&gt; + native capture per piece&lt;/td&gt;
&lt;td&gt;All non-attention ops (attention runs eager)&lt;/td&gt;
&lt;td&gt;Mixed/context iterations&lt;/td&gt;
&lt;td&gt;&lt;code&gt;torch_compile_config.capture_num_tokens&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Memory vs. Coverage Trade-off
&lt;/h3&gt;

&lt;p&gt;Each piecewise capture at token count N pre-allocates intermediate buffers of size &lt;code&gt;[N, hidden_dim]&lt;/code&gt; per piece per layer. Capturing at large N (e.g., 8192) can consume enough GPU memory to shrink KV cache capacity below usable levels. In some setups, pushing &lt;code&gt;capture_num_tokens&lt;/code&gt; too high (e.g., up to 8192) with aggressive &lt;code&gt;kv_cache_free_gpu_mem_fraction&lt;/code&gt; can shrink the KV cache max length enough to cause warmup failures.&lt;/p&gt;
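&lt;p&gt;A back-of-envelope sizing sketch shows why this matters. All numbers below are assumed for illustration (32 layers, 2 graph pieces per layer, one &lt;code&gt;[N, hidden_dim]&lt;/code&gt; fp16 buffer per piece); real captures may share a memory pool, so treat this as an upper-bound estimate, not TRT-LLM's actual accounting:&lt;/p&gt;

```python
# Back-of-envelope sizing only. Hypothetical shape: 32 layers, 2 graph
# pieces per layer, one [N, hidden_dim] fp16 buffer per piece. Real
# captures may share a memory pool, so this is an upper-bound sketch.
BYTES_FP16 = 2

def capture_buffer_bytes(capture_tokens, hidden_dim=4096,
                         layers=32, pieces_per_layer=2):
    per_piece = capture_tokens * hidden_dim * BYTES_FP16
    return per_piece * pieces_per_layer * layers

one_big = capture_buffer_bytes(8192)   # 4 GiB for the 8192-token capture
total = sum(capture_buffer_bytes(n) for n in (1024, 2048, 4096, 8192))
```

&lt;p&gt;Under these assumed numbers, the single 8192-token capture alone claims 4 GiB of buffers that are no longer available to the KV cache, which is exactly how an aggressive &lt;code&gt;kv_cache_free_gpu_mem_fraction&lt;/code&gt; plus large capture points can push warmup into failure.&lt;/p&gt;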




&lt;h2&gt;
  
  
  6. Configuration Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  TensorRT-LLM &lt;code&gt;llm_api_options_yaml&lt;/code&gt; Settings
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generation-only CUDA graphs (decode phase)&lt;/span&gt;
&lt;span class="na"&gt;cuda_graph_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enable_padding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;max_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;    &lt;span class="c1"&gt;# or explicit batch_sizes list&lt;/span&gt;

&lt;span class="c1"&gt;# Piecewise CUDA graphs (context/mixed phases, requires torch.compile)&lt;/span&gt;
&lt;span class="na"&gt;torch_compile_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enable_piecewise_cuda_graph&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;capture_num_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;4&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# Must cover runtime max_num_tokens!&lt;/span&gt;
  &lt;span class="na"&gt;enable_userbuffers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;             &lt;span class="c1"&gt;# Default is true; disable if needed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Principles for &lt;code&gt;capture_num_tokens&lt;/code&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Must cover &lt;code&gt;max_num_tokens&lt;/code&gt;&lt;/strong&gt;: If the runtime scheduler can produce up to N total tokens, the largest capture point must be &amp;gt;= N. Otherwise, iterations exceeding the max fall back to eager.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dense where iterations cluster&lt;/strong&gt;: Use iteration logs to find the hot zone. Pack capture points there to minimize padding waste.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sparse where few iterations land&lt;/strong&gt;: Ramp-up and transition regions need minimal captures (powers of 2 suffice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fewer captures = less memory&lt;/strong&gt;: Each capture pre-allocates intermediate buffers sized &lt;code&gt;[capture_tokens, hidden_dim]&lt;/code&gt; per piece. On memory-constrained systems, fewer large captures may be preferable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
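&lt;p&gt;The four principles above can be combined into a simple list-building heuristic. This is a hypothetical sketch, not a TRT-LLM API: a sparse power-of-two backbone for coverage (principles 1 and 3), plus quantiles of the observed iteration token counts for density in the hot zone (principle 2), with &lt;code&gt;hot_points&lt;/code&gt; capping the list size (principle 4):&lt;/p&gt;

```python
import statistics

def build_capture_list(iter_token_counts, max_num_tokens, hot_points=8):
    """Sketch of the four principles above; not a TRT-LLM API."""
    sizes = set()
    # Principles 1 + 3: sparse power-of-two backbone, ending exactly at
    # max_num_tokens so no iteration falls back to eager.
    for e in range(max_num_tokens.bit_length()):
        sizes.add(2 ** e)
    sizes.add(max_num_tokens)
    # Principle 2: dense points where observed iterations cluster
    # (quantiles of the iteration log's token counts, rounded up by one).
    for q in statistics.quantiles(iter_token_counts, n=hot_points + 1):
        sizes.add(min(int(q) + 1, max_num_tokens))
    # Principle 4: hot_points bounds how many extra captures (and thus
    # how much pre-allocated buffer memory) the hot zone adds.
    return sorted(sizes)

observed = [3000] * 10 + [3100] * 10    # hypothetical iteration log
captures = build_capture_list(observed, 8192)
```

&lt;p&gt;Feed it real iteration-log token counts instead of the toy &lt;code&gt;observed&lt;/code&gt; list to get capture points packed where your workload actually lands.&lt;/p&gt;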

&lt;h3&gt;
  
  
  TorchCompileConfig Defaults (TensorRT-LLM)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;torch_compile_config&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;None&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Torch compile completely off unless explicitly set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_piecewise_cuda_graph&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Must opt-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;capture_num_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;None&lt;/code&gt; (auto: max 3072)&lt;/td&gt;
&lt;td&gt;Auto-generated: &lt;code&gt;[1,2,4,...,128,256,512,...,3072]&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_userbuffers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enabled by default when torch compile is on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_fullgraph&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full graph compilation in torch.compile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_inductor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inductor backend disabled by default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Checking Coverage at Runtime
&lt;/h3&gt;

&lt;p&gt;Parse the iteration log and compute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total_tokens_per_iter = numCtxTokens + numGenRequests

For each iteration:
  - If numCtxTokens == 0: uses generation-only CUDA graph (match on numGenRequests)
  - If numCtxTokens &amp;gt; 0:  uses piecewise CUDA graph (match on total_tokens)

Hit rate = iterations with total_tokens &amp;lt;= max(capture_num_tokens) / total iterations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Target: &lt;strong&gt;&amp;gt;95% hit rate&lt;/strong&gt; on piecewise graphs for meaningful benefit.&lt;/p&gt;
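&lt;p&gt;The classification and hit-rate rules above can be sketched in a few lines of Python. The iteration tuples and capture lists here are made up for illustration; only the matching logic mirrors the scheme described:&lt;/p&gt;

```python
import bisect

def cuda_graph_hit_rate(iters, capture_batch_sizes, capture_num_tokens):
    """iters: (numCtxTokens, numGenRequests) pairs parsed from the log.
    Both capture lists must be sorted ascending. Illustrative sketch."""
    hits = 0
    for ctx_tokens, gen_requests in iters:
        if ctx_tokens == 0:
            # Pure decode: generation-only graph, matched on batch size.
            pool, key = capture_batch_sizes, gen_requests
        else:
            # Mixed/context: piecewise graph, matched on total tokens.
            pool, key = capture_num_tokens, ctx_tokens + gen_requests
        # Hit if some captured value is at least `key` (pad-up semantics).
        if bisect.bisect_left(pool, key) != len(pool):
            hits += 1
    return hits / len(iters)

rate = cuda_graph_hit_rate(
    [(0, 8), (512, 4), (9000, 0)],         # three example iterations
    capture_batch_sizes=[1, 2, 4, 8, 16],
    capture_num_tokens=[256, 512, 1024],
)                                          # third iteration misses: 2/3
```

&lt;p&gt;Run this over a full production log; if the result sits below the 95% target, widen or densify the capture lists where the misses cluster.&lt;/p&gt;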

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>llm</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
