<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mayank Ketkar</title>
    <description>The latest articles on Forem by Mayank Ketkar (@mketkar).</description>
    <link>https://forem.com/mketkar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3760623%2F9be3928f-38db-49e7-9b2c-c79c3ef5cd70.png</url>
      <title>Forem: Mayank Ketkar</title>
      <link>https://forem.com/mketkar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mketkar"/>
    <language>en</language>
    <item>
      <title>One Week in Ray: 21 Bugs Between Us and a Production ML Pipeline</title>
      <dc:creator>Mayank Ketkar</dc:creator>
      <pubDate>Sun, 15 Feb 2026 17:48:09 +0000</pubDate>
      <link>https://forem.com/mketkar/one-week-in-ray-21-bugs-between-us-and-a-production-ml-pipeline-1g5c</link>
      <guid>https://forem.com/mketkar/one-week-in-ray-21-bugs-between-us-and-a-production-ml-pipeline-1g5c</guid>
      <description>&lt;p&gt;The pipeline finished. All four checkpoints evaluated. Metrics written to cloud storage. Zero errors in the logs. And 68% of the results were empty strings.&lt;/p&gt;

&lt;p&gt;Not errors. Not exceptions. Not null values with a helpful stack trace. Empty strings. The pipeline had done everything it was supposed to do -- read images, distribute them to GPU workers, run inference, collect results, write output -- and two-thirds of the results were blank. The metrics dashboard showed accuracy numbers. They were plausible. They were also computed on one-third of the actual data.&lt;/p&gt;

&lt;p&gt;We had built a system that was confidently wrong.&lt;/p&gt;

&lt;p&gt;This is the story of how we found and fixed 21 bugs in 9 days, turning a silently broken GPU inference pipeline into one that survives actor crashes, CUDA deadlocks, and stalled workers with zero data loss. The pipeline evaluates driving scenes through a fine-tuned vision-language model -- classifying road conditions and predicting vehicle waypoints. Safety-critical decisions. Full dataset coverage is not optional.&lt;/p&gt;

&lt;p&gt;The tools were fine. Ray Data, vLLM, Kubernetes, experiment tracking -- none of them were broken. Every single one of the 21 bugs was in how they were integrated.&lt;/p&gt;




&lt;h2&gt;Act I: "The Pipeline Works!"&lt;/h2&gt;

&lt;h3&gt;The False Sense of Security&lt;/h3&gt;

&lt;p&gt;The pipeline reported success because it was configured to tolerate failure. Somewhere early in development, someone had set &lt;code&gt;max_errored_blocks&lt;/code&gt; to a generous value. The reasoning was sound: if a handful of samples fail due to transient cloud storage errors, don't crash the entire 8,600-sample evaluation. Keep going. Report what you have.&lt;/p&gt;

&lt;p&gt;The problem is that "keep going" and "silently discard 68% of your data" are the same instruction when the errors are systemic rather than transient.&lt;/p&gt;
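
&lt;p&gt;For reference, the knob in question lives on Ray Data's &lt;code&gt;DataContext&lt;/code&gt;. A minimal sketch of the "generous" configuration -- the exact value here is hypothetical, not our config:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ray

ctx = ray.data.DataContext.get_current()
# Tolerate up to 1,000 failed blocks before the run errors out.
# Reasonable for transient blips; catastrophic for systemic failures,
# because every dropped block silently shrinks the output.
ctx.max_errored_blocks = 1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;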

&lt;p&gt;The first clue was a boolean argument that wasn't being parsed correctly. The evaluation config had a flag -- something like &lt;code&gt;--use_detailed_metrics&lt;/code&gt; -- that was being read as a string. In Python, &lt;code&gt;bool("False")&lt;/code&gt; is &lt;code&gt;True&lt;/code&gt;. Every string except the empty string is truthy. So the flag was always on, regardless of what you passed. This wasn't causing the 68% empty results directly, but it was a symptom of a larger disease: nobody was validating that configurations were doing what they claimed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The bug: bool("False") == True
&lt;/span&gt;&lt;span class="n"&gt;use_detailed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_detailed_metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Always True
&lt;/span&gt;
&lt;span class="c1"&gt;# The fix: explicit parsing
&lt;/span&gt;&lt;span class="n"&gt;use_detailed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_detailed_metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;TypeError&lt;/code&gt; guard in the inference wrapper was catching exceptions from malformed inputs and returning empty dictionaries. Silently. No log line. No counter. Just an empty result that looked indistinguishable from a successful inference that happened to produce no output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The silent swallower
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# Looks fine. Is not fine.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the student-turning-in-blank-pages problem. The teacher collects 30 papers, sees 30 papers in the stack, and assumes everyone answered the questions. It isn't until you actually read them that you realize 20 are blank. Our pipeline was counting papers, not reading them.&lt;/p&gt;

&lt;p&gt;The core insight: &lt;strong&gt;error tolerance without error visibility is just data loss with extra steps.&lt;/strong&gt; The pipeline was designed to be resilient. It was resilient to the wrong things. It tolerated failures that should have been hard crashes, and it reported success when it should have reported catastrophe.&lt;/p&gt;
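
&lt;p&gt;The cheapest antidote is to actually read the stack of papers before grading it: a hard coverage check at the end of every run. A minimal sketch, assuming results land in a pandas DataFrame with a &lt;code&gt;generated_text&lt;/code&gt; column (names hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

def assert_coverage(results: pd.DataFrame, expected_rows: int,
                    max_empty_frac: float = 0.01) -&gt; None:
    """Fail loudly if rows were dropped or outputs are suspiciously empty."""
    if len(results) != expected_rows:
        raise RuntimeError(f"Row count mismatch: {len(results)} != {expected_rows}")
    empty_frac = (results["generated_text"].fillna("") == "").mean()
    if empty_frac &gt; max_empty_frac:
        raise RuntimeError(
            f"{empty_frac:.1%} of outputs are empty (threshold {max_empty_frac:.1%})"
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;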




&lt;h2&gt;Act II: "The Machine Runs Out of Room"&lt;/h2&gt;

&lt;h3&gt;Three Places Data Piles Up&lt;/h3&gt;

&lt;p&gt;In a Ray pipeline, data accumulates in three places: the driver node (the coordinator), the workers (the GPU actors), and the object store (shared memory between them). Understanding which one is drowning is the difference between a targeted fix and a week of thrashing.&lt;/p&gt;
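
&lt;p&gt;Standard Ray introspection tells you which of the three is under pressure before you start guessing. Nothing here is project-specific:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ray

ray.init(address="auto")
print(ray.cluster_resources())    # totals, including object_store_memory
print(ray.available_resources())  # what's left right now

# From a shell, `ray memory` lists every object in the store and
# which task or actor is keeping it alive -- the quickest way to see
# whether the object store (rather than a process heap) is drowning.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;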

&lt;p&gt;Our pipeline had the architectural equivalent of an hourglass. All data flowed through a single point -- the driver -- before fanning out to workers. Here's what that looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cloud Storage (8,600 samples x 1-2 MB = ~17 GB)
         │
         ▼
  ┌──────────────┐
  │  Ray Driver   │ ← deserializes ALL images here
  │  (1 node)     │ ← holds ~17 GB in Python heap
  └──────┬───────┘
         │ ray.data.from_items() serializes to object store
    ┌────┴────┬────────┐
    ▼         ▼        ▼
 ┌──────┐ ┌──────┐ ┌──────┐
 │ GPU  │ │ GPU  │ │ GPU  │   6 actors
 │ W-1  │ │ W-2  │ │ W-3  │   (VLM inference)
 └──┬───┘ └──┬───┘ └──┬───┘
    │         │        │
    └────┬────┴────────┘
         ▼
  ┌──────────────┐
  │  Results      │ ← take_all() pulls everything back to driver
  └──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The driver was doing all the heavy lifting: reading every sample from cloud storage, decoding every base64 image into memory, building Python dictionaries with ~2 MB of image data each, then handing the entire collection to &lt;code&gt;ray.data.from_items()&lt;/code&gt;. That call serializes everything into the object store. For 8,600 samples at 2 MB each, that's roughly 17 GB of Python objects materialized on a single machine before any GPU touches a single pixel.&lt;/p&gt;

&lt;h3&gt;The Before: Driver-Side Materialization&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;mds_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StreamingDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mds_dataset&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mds_dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;prompt_blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_images_to_base64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_blocks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt_blocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ~2 MB per sample
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="c1"&gt;# All 17 GB lives in driver memory at this point
&lt;/span&gt;    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_items&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 8,600 x 2 MB = OOM
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under load, the driver ran out of memory. Python's garbage collector fought with Ray's object store for the same physical RAM. Workers received empty batches -- not because they failed, but because the objects they were trying to read had been evicted from the object store to make room for newer ones. No error. No exception. Just empty data.&lt;/p&gt;

&lt;h3&gt;The After: Worker-Side Shard Reading&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_datasource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;MDSShardedDatasource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;parallelism&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MDSShardedDatasource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Datasource&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Each worker reads its own shard directly from cloud storage.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_read_tasks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parallelism&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Driver only reads index.json (~1 KB)
&lt;/span&gt;        &lt;span class="n"&gt;shard_paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_compute_shard_assignments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parallelism&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;shard&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;shard_paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nc"&gt;ReadTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;read_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shard&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_read_mds_shard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;BlockMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;size_bytes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;estimated_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The driver now reads a single index file -- roughly 1 KB -- and tells each of 64 workers which shard to pull from cloud storage. Each worker deserializes only its own portion. Memory usage on the driver drops from 17 GB to effectively nothing.&lt;/p&gt;

&lt;p&gt;The same antipattern existed in reverse for output. &lt;code&gt;results.materialize()&lt;/code&gt; followed by &lt;code&gt;take_all()&lt;/code&gt; pulled every result back to the driver before writing to disk. We replaced it with &lt;code&gt;write_parquet()&lt;/code&gt;, which streams results directly from workers to cloud storage. The driver never touches the result data.&lt;/p&gt;
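
&lt;p&gt;The shape of that change, sketched (the output path is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Before: every result row returns to the driver before hitting disk
rows = processed.materialize().take_all()  # driver holds the full result set

# After: each worker streams its own output blocks straight to storage
processed.write_parquet("s3://bucket/eval-results/run-001/")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;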

&lt;h3&gt;The Object Store Spill Cascade&lt;/h3&gt;

&lt;p&gt;Even after fixing the driver bottleneck, we hit a second memory wall when running multiple evaluation pipelines concurrently. Four pipelines, each preprocessing 8,600 samples with PIL Images and PyTorch tensors (not Arrow-safe, so Ray falls back to pickle serialization -- full copy per consumer):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;4 pipelines x 8,600 samples x ~25 MB preprocessed = ~860 GB object store pressure&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our node had 154 GB of object store (30% of 512 GB RAM). The cascade:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Object store hits 100%&lt;/li&gt;
&lt;li&gt;Ray spills to &lt;code&gt;/tmp&lt;/code&gt; (10-50x slower)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/tmp&lt;/code&gt; fills up&lt;/li&gt;
&lt;li&gt;Ray evicts unconsumed objects&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_errored_blocks&lt;/code&gt; absorbs the errors&lt;/li&gt;
&lt;li&gt;Pipeline "succeeds" with missing data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fix was architectural: preprocess once, fan out to N inference pools via &lt;code&gt;ray.put()&lt;/code&gt;. One copy in the object store, N readers. Object store usage dropped from hundreds of GB to single-digit GB.&lt;/p&gt;
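
&lt;p&gt;A minimal sketch of that fan-out, with hypothetical helper names: preprocess once, &lt;code&gt;ray.put()&lt;/code&gt; the result, and hand the same &lt;code&gt;ObjectRef&lt;/code&gt; to every inference pool.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ray

preprocessed = preprocess_all(samples)  # hypothetical: runs exactly once
shared_ref = ray.put(preprocessed)      # one copy in the object store

@ray.remote(num_gpus=1)
def run_inference_pool(data_ref, checkpoint_path):
    data = ray.get(data_ref)            # N readers, still one stored copy
    return evaluate(checkpoint_path, data)  # hypothetical inference helper

futures = [run_inference_pool.remote(shared_ref, ckpt) for ckpt in checkpoints]
results = ray.get(futures)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;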




&lt;h2&gt;Act III: "Everything Fails All the Time" -- The Eight-Fix Monday&lt;/h2&gt;

&lt;p&gt;This was the centerpiece. Eight fixes shipped in a single day, each addressing a different failure mode. Together, they form a defense-in-depth pattern -- six layers of protection between a failure and data loss.&lt;/p&gt;

&lt;p&gt;Think of it like a substitute teacher managing a classroom of 32 students (GPU workers). Any individual student might fall asleep (stall), have a meltdown (crash), or turn in a blank worksheet (empty result). The substitute can't prevent any of these. But they can notice them, respond to them, and make sure the class still finishes the assignment.&lt;/p&gt;

&lt;h3&gt;Layer 1: Error Containment -- The _error Column&lt;/h3&gt;

&lt;p&gt;The first layer is the simplest. When any stage encounters an error, it tags the row instead of crashing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preprocess: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;  &lt;span class="c1"&gt;# Row survives with error tag
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;postprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generated_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inference: empty response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_normal_postprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once &lt;code&gt;_error&lt;/code&gt; is set, every downstream stage checks for it and passes the row through untouched. The row shows up in the final output with a clear failure reason. No data loss. No silent swallowing. You can grep the output for &lt;code&gt;_error&lt;/code&gt; and see exactly which samples failed and why.&lt;/p&gt;
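
&lt;p&gt;Auditing a finished run then becomes a few lines, assuming the Parquet output from Act II (path hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

df = pd.read_parquet("eval-results/run-001/")
failures = df[df["_error"].notna()]
print(failures["_error"].value_counts())  # failure reasons, ranked
print(f"{len(failures)}/{len(df)} rows failed")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;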

&lt;h3&gt;Layer 2: Per-Sample Fallback&lt;/h3&gt;

&lt;p&gt;Batch inference is fast -- 64 samples at once through the GPU. But one corrupted image in a batch of 64 can crash the entire batch. Without isolation, that one bad apple kills 63 good results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_generate_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="c1"&gt;# Fast path: full batch
&lt;/span&gt;    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;batch_err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Batch of %d failed (%s), retrying per-sample&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;batch_err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])):&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_generate_single&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sample_err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generated_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inference: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sample_err&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_merge_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The happy path tries the full batch. On failure, it falls back to per-sample inference. Only the broken sample gets an error tag. The other 63 flow through normally. The cost is one extra inference call per bad sample. The benefit is 63 results that would otherwise be lost.&lt;/p&gt;

&lt;h3&gt;Layer 3: Stall Watchdog&lt;/h3&gt;

&lt;p&gt;This is the one that saved us at 98% progress. A vLLM actor hit a multimodal cache bug and froze -- not crashed, not errored, &lt;em&gt;frozen&lt;/em&gt;. A &lt;code&gt;futex&lt;/code&gt; deadlock in the CUDA runtime. No error signal. No exit code. Just silence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VLMInferenceEngine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_last_progress&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_watchdog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_watchdog_loop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_watchdog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_watchdog_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Dead man&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s switch: if no batch completes in 120s, force-kill.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Check every 30 seconds
&lt;/span&gt;            &lt;span class="n"&gt;stall_seconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_last_progress&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stall_seconds&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WATCHDOG: No progress for %ds. Force-killing actor.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;stall_seconds&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Not sys.exit(). Not an exception. Hard kill.
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_last_progress&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Pet the watchdog
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_format_outputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why &lt;code&gt;os._exit(1)&lt;/code&gt; instead of &lt;code&gt;sys.exit()&lt;/code&gt; or raising an exception? Because when you're in a CUDA deadlock, polite shutdown is not an option. &lt;code&gt;sys.exit()&lt;/code&gt; raises &lt;code&gt;SystemExit&lt;/code&gt;, which Python's exit handlers try to process. Those handlers can themselves deadlock on the same &lt;code&gt;futex&lt;/code&gt;. &lt;code&gt;os._exit()&lt;/code&gt; calls the C runtime's &lt;code&gt;_exit()&lt;/code&gt; directly -- no handlers, no cleanup, no chance to hang. The process dies. Ray detects the death. Ray restarts the actor on a fresh node.&lt;/p&gt;

&lt;p&gt;A train's dead man's switch requires the engineer to press a button every 30 seconds. If the engineer is incapacitated, the switch is released and the train brakes. Our watchdog requires the actor to complete a batch every 120 seconds. If it doesn't, the actor is killed and restarted. The system doesn't diagnose &lt;em&gt;why&lt;/em&gt; the actor stopped responding. It just knows that silence is dangerous.&lt;/p&gt;

&lt;h3&gt;Layer 4: Engine Retry with Exponential Backoff&lt;/h3&gt;

&lt;p&gt;Initializing a GPU inference engine can fail for transient reasons: stale CUDA state from a previously crashed process, fragmented GPU memory, a race condition during model loading. Without retry logic, a single init failure means a permanently broken actor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_initialize_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retry engine init with exponential backoff and GPU cleanup.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;gpu_memory_utilization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 10% headroom for restarts
&lt;/span&gt;            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_initialized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Engine initialized on attempt %d/3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Init attempt %d/3 failed: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# Aggressive GPU cleanup between attempts
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty_cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# 10s, 20s, 40s
&lt;/span&gt;
    &lt;span class="c1"&gt;# All retries exhausted: mark permanently dead
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_init_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Engine init failed after 3 attempts. Marking actor as dead.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;gpu_memory_utilization=0.85&lt;/code&gt; is deliberate. We dropped it from 0.95 to leave 10% headroom for cold starts after a crash. That 10% sounds wasteful until you realize it's the difference between a restart succeeding and entering a death spiral.&lt;/p&gt;

&lt;h3&gt;Layer 5: Dead Engine Fast-Path (Circuit Breaker)&lt;/h3&gt;

&lt;p&gt;Once an engine is permanently dead, every batch that arrives should be immediately returned as an error, not retried.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_init_error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Don't waste time. Don't waste GPU. Fast-return errors.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generated_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engine_init_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_initialized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_initialize_engine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_init_error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Will hit fast-path above
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_run_inference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a circuit breaker pattern. Once the circuit opens (engine dead), all requests fast-fail. Ray detects that the actor is producing only errors and stops routing new work to it. Without this, the dead actor would keep accepting batches, running through the retry logic on every single one, and returning empty results after a 70-second delay each time.&lt;/p&gt;

&lt;h3&gt;Layer 6: Pipeline-Level Fault Tolerance&lt;/h3&gt;

&lt;p&gt;All five layers above protect individual actors and samples. The sixth layer protects the pipeline itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_current&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_errored_blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;         &lt;span class="c1"&gt;# Tolerate up to 10 failed blocks
&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actor_task_retry_on_errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actor_init_retry_on_errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="n"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;VLMInferenceEngine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_gpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ActorPoolStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tasks_in_flight_per_actor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;max_restarts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Actor dies 3 times -&amp;gt; permanent failure
&lt;/span&gt;    &lt;span class="n"&gt;max_task_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Block retried 3 times -&amp;gt; marked errored
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;max_restarts=3&lt;/code&gt; is actor-level: a GPU that crashes three times is considered permanently unhealthy. &lt;code&gt;max_task_retries=3&lt;/code&gt; is block-level: a batch of data that fails on three different actors is considered poison. Together, they create a two-tier retry policy: a transiently sick GPU gets three chances to recover, and a poison batch gets three tries across different actors.&lt;/p&gt;

&lt;p&gt;The combination of these six layers meant that when our actor froze at 98% progress, the watchdog killed it, Ray restarted it on a fresh node, the remaining 178 samples were processed, and the final output contained every single one of the 8,600 input samples. Zero dropped.&lt;/p&gt;




&lt;h2&gt;Act IV: "The Fleet"&lt;/h2&gt;

&lt;p&gt;With the pipeline itself hardened, we turned to the operational layer -- running dozens of evaluations across multiple checkpoints on a shared cluster. This introduced a new category of problems: not "does the pipeline work?" but "does the fleet of pipelines play nice together?"&lt;/p&gt;

&lt;h3&gt;The Metrics Deadlock&lt;/h3&gt;

&lt;p&gt;Our experiment tracking SDK had a method -- &lt;code&gt;experiment.end()&lt;/code&gt; -- that blocks until its internal upload queue drains. Under normal conditions, this takes a few seconds. Under throttled conditions (the SDK rate-limits itself when you log too many metrics too fast), the queue never drains. The call blocks forever.&lt;/p&gt;

&lt;p&gt;This was happening inside Ray actors. A deadlocked actor holds its GPU allocation, accepts no new work, and never exits. We were losing GPUs to a metrics library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: blocks forever under throttling
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Can hang for hours
&lt;/span&gt;
&lt;span class="c1"&gt;# After: bounded timeout in a daemon thread
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_flush_metrics_queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_alive&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Experiment end() did not complete in %ds. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Server will auto-close the orphaned experiment.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The daemon thread ensures that if &lt;code&gt;end()&lt;/code&gt; hangs, the actor can still exit cleanly. The tracking server auto-closes orphaned experiments after a timeout, so no data is permanently lost. We also batched all &lt;code&gt;log_metrics()&lt;/code&gt; calls into a single call per batch (down from 10 separate network round-trips), disabled auto-logging of environment info and git state, and set explicit timeouts of 60 seconds instead of the default 3,600.&lt;/p&gt;
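
&lt;p&gt;The batching change itself is small. A sketch with generic method names -- most tracking SDKs expose both a per-metric call and a bulk call, so treat these signatures as assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Before: one network round-trip per metric, ~10 per batch
for name, value in batch_metrics.items():
    experiment.log_metric(name, value)

# After: one round-trip per batch
experiment.log_metrics(batch_metrics, step=batch_index)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;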

&lt;h3&gt;
  
  
  Job Queuing: The Bouncer Pattern
&lt;/h3&gt;

&lt;p&gt;When you submit 80 evaluation jobs to a shared Ray cluster, you need to control how many run concurrently. Without resource gating, all 80 try to start simultaneously, each requesting 6 GPUs. The cluster has 24 GPUs. Four jobs fit. The other 76 sit in PENDING state, holding driver-node resources (CPU, memory for the coordinator process) while waiting for GPUs that won't be free for hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# entrypoint_resources acts as a bouncer:
# each job "holds" 6 GPUs worth of admission tickets
&lt;/span&gt;&lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_submission&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JobSubmissionClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cluster_address&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;submit_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python run_eval.py --checkpoint $CKPT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;entrypoint_resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# Job won't start until 6 GPUs free
&lt;/span&gt;    &lt;span class="n"&gt;runtime_env&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env_vars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CHECKPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ckpt_path&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;entrypoint_resources&lt;/code&gt; is the bouncer at the door. Each job declares upfront how many GPUs it needs. Ray's scheduler won't start the job until those GPUs are available. Jobs queue cleanly instead of stampeding. Four jobs run concurrently on 24 GPUs. As each finishes, the next one starts. No manual orchestration.&lt;/p&gt;
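
&lt;p&gt;The submission side then stays simple: fire all 80 jobs up front and let the scheduler gate admission. A sketch, where &lt;code&gt;checkpoints&lt;/code&gt; and &lt;code&gt;cluster_address&lt;/code&gt; are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient(address=cluster_address)
for ckpt_path in checkpoints:  # all 80, submitted up front
    client.submit_job(
        entrypoint="python run_eval.py --checkpoint $CHECKPOINT",
        entrypoint_resources={"GPU": 6},
        runtime_env={"env_vars": {"CHECKPOINT": ckpt_path}},
    )
# With 24 GPUs and 6 per job, Ray admits 4 at a time; the rest queue.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;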

&lt;h3&gt;
  
  
  The Version Mismatch
&lt;/h3&gt;

&lt;p&gt;We lost half a day to a Ray version mismatch. The cluster was running Ray 2.38. Our container image had Ray 2.35. The job submitted successfully, the actors started, and then inference produced subtly wrong results. No crash. No version check error. Just a different code path in the data serialization layer that handled edge cases differently.&lt;/p&gt;

&lt;p&gt;The fix was pinning the Ray version in the container image to match the cluster exactly. But the real lesson was adding a version check to the job entrypoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;
&lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.38.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ray version mismatch: expected &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, got &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Update your container image.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Force-Stopping Stuck Jobs
&lt;/h3&gt;

&lt;p&gt;Even with all the resilience logic, some jobs get stuck in PENDING state and never start -- usually because a previous job didn't release its resources cleanly. The Ray dashboard shows the job as PENDING, but there's no built-in "force stop" button that works reliably for jobs that haven't started yet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ray.job_submission&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JobSubmissionClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;JobSubmissionClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cluster_address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;job_info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_jobs&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;job_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PENDING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RUNNING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;job_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;7200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# 2 hours without progress
&lt;/span&gt;            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Force-stopping stuck job %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We wrapped this in a cron job that runs every 30 minutes. Aggressive, but the alternative was waking up to find 6 GPUs held hostage by a zombie job.&lt;/p&gt;




&lt;h2&gt;
  
  
  Act V: "The Result"
&lt;/h2&gt;

&lt;p&gt;Here is where we ended up, 9 days after finding the 68% empty results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Day 1 (broken)&lt;/th&gt;
&lt;th&gt;Day 9 (fixed)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Results present&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2,752 / 8,600 (32%)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8,600 / 8,600 (100%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Results empty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5,848 (68%)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Crash recovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None (manual restart)&lt;/td&gt;
&lt;td&gt;Automatic (watchdog + Ray retry)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Actor stall detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;120s watchdog, &lt;code&gt;os._exit(1)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per-sample isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None (one bad sample kills batch)&lt;/td&gt;
&lt;td&gt;Batch fallback to per-sample&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrent checkpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 (manual)&lt;/td&gt;
&lt;td&gt;4 concurrent, 80+ queued&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Job control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cannot kill stuck jobs&lt;/td&gt;
&lt;td&gt;REST API + cron reaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metrics deadlocks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Frequent (hours-long hangs)&lt;/td&gt;
&lt;td&gt;Zero (60s timeout)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We ran 80+ checkpoint evaluations unattended over the following week. Every single one produced a complete result set. Two actors crashed during that period -- one from a CUDA OOM, the other from the same multimodal cache bug. Both were detected by the watchdog within 150 seconds, restarted by Ray, and the affected batches were reprocessed. Zero data loss in both cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  What We Learned
&lt;/h3&gt;

&lt;p&gt;The 21 fixes fell into a clear pattern. They weren't individually difficult. Most were 5-20 lines of code. The hard part was finding them, because the pipeline's error-tolerance features were actively hiding the problems.&lt;/p&gt;

&lt;p&gt;Three principles emerged:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reconcile relentlessly.&lt;/strong&gt; Input count must equal output count. If they don't match, the run failed, regardless of what the logs say. We added a hard assertion at the end of every pipeline run: &lt;code&gt;assert output_count == input_count&lt;/code&gt;.&lt;/p&gt;
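
&lt;p&gt;A minimal sketch of that gate, with illustrative field names -- the important part is that it runs at the end of &lt;em&gt;every&lt;/em&gt; run, not just in tests:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def reconcile(input_count, results):
    """Hard-fail if anything was dropped or silently blanked."""
    output_count = len(results)
    empty = sum(1 for r in results if not r.get("prediction"))
    assert output_count == input_count, (
        f"count mismatch: {output_count} outputs, {input_count} inputs"
    )
    assert empty == 0, f"{empty} empty predictions slipped through"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;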

&lt;p&gt;&lt;strong&gt;Fail loud, recover quiet.&lt;/strong&gt; Every failure should produce a visible signal -- a log line, an error tag, a metric increment. Recovery should be automatic and silent. The worst failure mode is one that is both invisible and unrecoverable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defense in depth.&lt;/strong&gt; No single layer prevents all failures. The watchdog doesn't prevent CUDA deadlocks -- it detects and recovers from them. The per-sample fallback doesn't prevent corrupted images -- it isolates them. The circuit breaker doesn't prevent init failures -- it stops wasting time on dead actors. Each layer handles the failures that slip past the layer above it.&lt;/p&gt;




&lt;p&gt;Production is not a feature you ship. It is a place your code lives. The weather there is harsh and unpredictable. You don't make code "production-ready" by adding a feature flag. You make it production-ready by assuming everything will fail, and building the systems to notice, contain, and recover -- automatically, silently, completely.&lt;/p&gt;

&lt;p&gt;Our pipeline doesn't crash less than it used to. It crashes exactly as often. It just recovers now.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All code examples are simplified for clarity. The actual implementations include additional error handling, logging, and configuration.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ray</category>
      <category>python</category>
      <category>mlops</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>The Ghost in the Batch: How vLLM Silently Switches Algorithms</title>
      <dc:creator>Mayank Ketkar</dc:creator>
      <pubDate>Sun, 15 Feb 2026 17:47:55 +0000</pubDate>
      <link>https://forem.com/mketkar/the-ghost-in-the-batch-how-vllm-silently-switches-algorithms-4bi2</link>
      <guid>https://forem.com/mketkar/the-ghost-in-the-batch-how-vllm-silently-switches-algorithms-4bi2</guid>
      <description>&lt;p&gt;You run Qwen3-VL on a single prompt. You record the output logprobs to full precision. Then you run the exact same prompt again, batched with 15 others. Same model, same weights, same GPU, same code. The logprobs are &lt;strong&gt;different&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not catastrophically -- the top token usually agrees -- but the numbers have shifted at the seventh decimal place, and in an autoregressive loop, that hairline fracture propagates. By output position 8, &lt;strong&gt;29% of your tokens have diverged.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You are not going crazy. vLLM silently changed the algorithm.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Crime Scene
&lt;/h2&gt;

&lt;p&gt;Setup: Qwen3-VL 2B on an NVIDIA H200. Identical prompts at BS=1 (one at a time) vs BS=16 (sixteen at once). &lt;code&gt;VLLM_BATCH_INVARIANT=1&lt;/code&gt; is enabled -- all GEMMs are deterministic via persistent Triton kernels. Yet:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;BS=1 vs BS=16&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bitwise logprob match&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;6.1%&lt;/strong&gt; (30/490)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Top-1 token match&lt;/td&gt;
&lt;td&gt;78.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic agreement&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The profiler gives the first clue: flash attention takes &lt;strong&gt;5.15x longer&lt;/strong&gt; per call in BS=16 (6.17ms vs 1.20ms). Same kernel name. Same call count (392). If it were the same algorithm processing more data, you'd expect it to scale with tokens -- not blow up 5x per call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is not a scaling problem. This is a different recipe.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Attention Does (60 Seconds)
&lt;/h2&gt;

&lt;p&gt;For each new token, attention looks back at every previous token, computes a relevance score for each, normalizes them (softmax), and takes a weighted average:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output = softmax(Q @ K^T / sqrt(d)) @ V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical observation: &lt;strong&gt;softmax involves summing over all previous tokens.&lt;/strong&gt; If you have 640 tokens, that is 640 numbers being added. The &lt;strong&gt;order&lt;/strong&gt; of that summation will matter shortly.&lt;/p&gt;
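
&lt;p&gt;A toy NumPy sketch makes those summations explicit (shapes and data are arbitrary -- this is the math above, not vLLM's kernel):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def attention(q, K, V):
    scores = K @ q / np.sqrt(q.shape[0])  # one score per previous token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax: a sum over all tokens
    return weights @ V                    # weighted average: another sum

rng = np.random.default_rng(0)
d, n = 64, 640
q = rng.standard_normal(d).astype(np.float32)
K = rng.standard_normal((n, d)).astype(np.float32)
V = rng.standard_normal((n, d)).astype(np.float32)

out = attention(q, K, V)  # 640 terms folded into every output element
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;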

&lt;h2&gt;
  
  
  The Optimization That Changes Everything
&lt;/h2&gt;

&lt;p&gt;Imagine 16 requests sharing the same 400-token system prompt. Without optimization, each request scans those 400 tokens independently -- &lt;strong&gt;6,400 KV reads of the same prefix&lt;/strong&gt;, 80% of the 8,000 total reads.&lt;/p&gt;

&lt;p&gt;vLLM's &lt;strong&gt;cascade attention&lt;/strong&gt; splits the work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prefix (done ONCE):&lt;/strong&gt; Flash attention over the shared 400 tokens for all 16 queries. Produces partial output + Log-Sum-Exp (LSE) statistic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suffix (per-request):&lt;/strong&gt; Flash attention over each request's unique tokens (~100 each). Produces partial output + LSE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge:&lt;/strong&gt; Combine using LSE-weighted rebalancing.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# flash_attn.py:1040 -- Cascade Implementation
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cascade_attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: Process shared prefix ONCE
&lt;/span&gt;    &lt;span class="n"&gt;prefix_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix_lse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;flash_attn_varlen_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;value_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_seqlen_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;common_prefix_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;block_table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;block_table&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;causal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;return_softmax_lse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Process each request's unique suffix
&lt;/span&gt;    &lt;span class="n"&gt;suffix_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix_lse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;flash_attn_varlen_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;value_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_seqlen_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_kv_len&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;common_prefix_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;block_table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;block_table&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;num_common_kv_blocks&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt;
        &lt;span class="n"&gt;causal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;return_softmax_lse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: Merge -- THIS is where determinism breaks
&lt;/span&gt;    &lt;span class="nf"&gt;merge_attn_states&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix_lse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;suffix_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix_lse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total KV reads: 400 + 16x100 = &lt;strong&gt;2,000&lt;/strong&gt; (vs 8,000). A &lt;strong&gt;4x bandwidth reduction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Mathematically, this produces the identical result. But we don't live in the world of exact arithmetic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why 1 + 2 + 3 Does Not Equal 3 + 2 + 1
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is the section that explains everything.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;IEEE 754 floating-point addition is &lt;strong&gt;not associative:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;    &lt;span class="c1"&gt;# 1.1920928955078125e-07
&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# 1e-07
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same inputs. Same operations. Different answers. This is the IEEE 754 spec -- finite precision means rounding depends on order.&lt;/p&gt;
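
&lt;p&gt;You can reproduce the effect at attention scale with nothing but NumPy -- sum the same 640 float32 values once straight through, then as a 512-term prefix plus a 128-term suffix:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(640).astype(np.float32)  # stand-in for 640 softmax terms

single = x.sum()                         # one summation over all 640 terms
chunked = x[:512].sum() + x[512:].sum()  # prefix sum + suffix sum, then merge

print(single == chunked)               # frequently False
print(float(single) - float(chunked))  # a small nonzero residual
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;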

&lt;p&gt;Connect this to attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-pass (BS=1):&lt;/strong&gt; &lt;code&gt;softmax([t1, t2, ..., t640]) @ V&lt;/code&gt; -- one summation, one rounding chain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cascade (BS=16):&lt;/strong&gt; &lt;code&gt;merge(softmax([t1,...,t512]) @ V_prefix, softmax([t513,...,t640]) @ V_suffix)&lt;/code&gt; -- two summations + LSE merge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The merge math:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified merge_attn_states logic
&lt;/span&gt;&lt;span class="n"&gt;max_lse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix_lse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix_lse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# numerical stability
&lt;/span&gt;
&lt;span class="n"&gt;prefix_weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix_lse&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;max_lse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;suffix_weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suffix_lse&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;max_lse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix_weight&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;prefix_output&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;suffix_weight&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;suffix_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix_weight&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;suffix_weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Math: IDENTICAL to single-pass
# IEEE 754: DIFFERENT by ~1e-7 (FP32) or ~1e-3 (FP16)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;~1e-7 per element sounds negligible. But autoregressive generation feeds each output back as input through ~28 transformer layers. That 1e-7 compounds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Position 5: &lt;strong&gt;17% token divergence&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Position 8: &lt;strong&gt;29% token divergence&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Three Gates
&lt;/h2&gt;

&lt;p&gt;When does vLLM activate cascade? &lt;strong&gt;Silently, based on three runtime conditions:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# flash_attn.py:962
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;use_cascade_attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;common_prefix_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_lens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="c1"&gt;# Gate 1: Is shared prefix long enough?
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;common_prefix_len&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# "Not worth the overhead"
&lt;/span&gt;
    &lt;span class="c1"&gt;# Gate 2: Are there enough requests?
&lt;/span&gt;    &lt;span class="n"&gt;num_reqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_lens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;num_reqs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# "Too few to benefit"
&lt;/span&gt;
    &lt;span class="c1"&gt;# Gate 3: Performance heuristic
&lt;/span&gt;    &lt;span class="n"&gt;cascade_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cascade_waves&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;num_prefix_tiles&lt;/span&gt;
    &lt;span class="n"&gt;flash_decoding_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cdiv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flash_decoding_ctas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_sms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cascade_time&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;flash_decoding_time&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;BS=1 always fails Gate 2.&lt;/strong&gt; It uses single-pass attention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BS=16 with a system prompt passes all three.&lt;/strong&gt; It uses cascade attention.&lt;/p&gt;

&lt;p&gt;Your BS=1 benchmark is literally running a different algorithm than your BS=16 production system.&lt;/p&gt;
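
&lt;p&gt;You can check this on your own model with vLLM's offline API. A repro sketch -- the model name and prompt are placeholders, and cascade only engages when the run clears the gates above (256+ shared prefix tokens, 8+ concurrent requests):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import LLM, SamplingParams

llm = LLM(model="your-model")  # placeholder
params = SamplingParams(temperature=0.0, max_tokens=16, logprobs=1)
prompt = system_prompt + question  # placeholders; prefix must be 256+ tokens

# BS=1: fails Gate 2, so single-pass attention
out1 = llm.generate([prompt], params)[0].outputs[0]

# BS=16: can pass all three gates, so cascade attention
out16 = llm.generate([prompt] * 16, params)[0].outputs[0]

for i, (t1, t16) in enumerate(zip(out1.token_ids, out16.token_ids)):
    lp1 = out1.logprobs[i][t1].logprob
    lp16 = out16.logprobs[i][t16].logprob
    print(i, t1 == t16, lp1 - lp16)  # bitwise-equal logprobs are the exception
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;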

&lt;h2&gt;
  
  
  The Smoking Gun: 30 Matching Samples
&lt;/h2&gt;

&lt;p&gt;In our 490-pair experiment, exactly 30 matched bitwise -- always the &lt;strong&gt;last 3&lt;/strong&gt; per batch. As a batch of 50 processes, requests finish and leave. When fewer than 8 remain, Gate 2 closes. The last requests revert to single-pass and match BS=1 perfectly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Batch Position&lt;/th&gt;
&lt;th&gt;Active Requests&lt;/th&gt;
&lt;th&gt;Cascade?&lt;/th&gt;
&lt;th&gt;Matches BS=1?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1-47&lt;/td&gt;
&lt;td&gt;50 down to 8&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Fork in the Code
&lt;/h2&gt;

&lt;p&gt;The branch at &lt;code&gt;flash_attn.py:673&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;attn_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_cascade&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# SINGLE PASS: one call over all KV tokens
&lt;/span&gt;    &lt;span class="nf"&gt;flash_attn_varlen_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;num_actual_tokens&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;value_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="n"&gt;num_splits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;attn_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_num_splits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;

&lt;span class="c1"&gt;# CASCADE: two calls + merge
&lt;/span&gt;&lt;span class="nf"&gt;cascade_attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;num_actual_tokens&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;num_actual_tokens&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;key_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;common_prefix_len&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;attn_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;common_prefix_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same function signature. Completely different execution.&lt;/p&gt;

&lt;p&gt;vLLM already knows these conflict. When &lt;code&gt;VLLM_BATCH_INVARIANT=1&lt;/code&gt; is set, it auto-disables cascade:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/config/vllm.py:994
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;vllm_is_batch_invariant&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disable_cascade_attn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disable_cascade_attn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning_once&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Disabling cascade attention when VLLM_BATCH_INVARIANT is enabled.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Fix: Three Options
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1: Surgical (Recommended)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;disable_cascade_attn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Forces single-pass for all batch sizes. 5-15% throughput loss for shared-prefix workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: Remove Prefix Detection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enable_prefix_caching&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gate 1 never opens -- but this also disables every other benefit of prefix caching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 3: Full Determinism
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VLLM_BATCH_INVARIANT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replaces ALL cuBLAS GEMMs + auto-disables cascade. ~2.4x performance cost.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Cascade&lt;/th&gt;
&lt;th&gt;GEMM Determinism&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Default&lt;/td&gt;
&lt;td&gt;Enabled&lt;/td&gt;
&lt;td&gt;Non-deterministic&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disable_cascade_attn=True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Disabled&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Non-deterministic&lt;/td&gt;
&lt;td&gt;~5-15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_prefix_caching=False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Disabled&lt;/td&gt;
&lt;td&gt;Non-deterministic&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;VLLM_BATCH_INVARIANT=1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Auto-disabled&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Deterministic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~2.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Broader Lesson
&lt;/h2&gt;

&lt;p&gt;Cascade attention is not a bug. It is a well-engineered bandwidth optimization. The issue is the &lt;strong&gt;silence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This pattern recurs across GPU inference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flash Decoding&lt;/strong&gt; splits attention across thread blocks -- same associativity issue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cuBLAS GEMM&lt;/strong&gt; selects different tile sizes by matrix shape -- same op, different rounding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;torch.compile&lt;/strong&gt; fuses differently between eager/compiled -- same model, different graph&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Every time a framework says "mathematically equivalent," ask: &lt;strong&gt;equivalent in the reals, or in IEEE 754?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ghost in the batch is not malicious. It is an optimization doing its job. But now you know it is there, you know when it activates, and you know how to control it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;BS=1 and BS&amp;gt;=8 run different attention algorithms in vLLM.&lt;/strong&gt; Single-pass vs cascade, by design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cascade saves 4x memory bandwidth&lt;/strong&gt; by processing shared prefixes once. The merge step introduces FP divergence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three silent gates&lt;/strong&gt; control activation: prefix &amp;gt;= 256 tokens, num_reqs &amp;gt;= 8, perf heuristic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One flag fixes it:&lt;/strong&gt; &lt;code&gt;disable_cascade_attn=True&lt;/code&gt; or &lt;code&gt;VLLM_BATCH_INVARIANT=1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Mathematically equivalent" != "numerically identical."&lt;/strong&gt; This applies across all GPU ML.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Key files:&lt;/strong&gt; &lt;code&gt;flash_attn.py:673&lt;/code&gt; (fork), &lt;code&gt;flash_attn.py:962&lt;/code&gt; (gates), &lt;code&gt;flash_attn.py:1040&lt;/code&gt; (cascade), &lt;code&gt;merge_attn_states.py&lt;/code&gt; (merge), &lt;code&gt;vllm/config/vllm.py:994&lt;/code&gt; (auto-disable)&lt;/p&gt;

</description>
      <category>vllm</category>
      <category>machinelearning</category>
      <category>gpu</category>
      <category>determinism</category>
    </item>
    <item>
      <title>How to Read GPU Profiling Logs: A Ground-Up Guide</title>
      <dc:creator>Mayank Ketkar</dc:creator>
      <pubDate>Sun, 15 Feb 2026 17:47:38 +0000</pubDate>
      <link>https://forem.com/mketkar/how-to-read-gpu-profiling-logs-a-ground-up-guide-3akl</link>
      <guid>https://forem.com/mketkar/how-to-read-gpu-profiling-logs-a-ground-up-guide-3akl</guid>
      <description>&lt;p&gt;You ran &lt;code&gt;nsys profile&lt;/code&gt;, got a 2GB &lt;code&gt;.nsys-rep&lt;/code&gt; file, exported it to SQLite, and found yourself staring at 88 tables with names like &lt;code&gt;CUPTI_ACTIVITY_KIND_KERNEL&lt;/code&gt; and &lt;code&gt;ENUM_WDDM_PAGING_QUEUE_TYPE&lt;/code&gt;. The kernel names are integer IDs. The timestamps are in nanoseconds. Nothing is human-readable. You closed the file and went back to guessing.&lt;/p&gt;

&lt;p&gt;This post exists so you never have to guess again.&lt;/p&gt;

&lt;p&gt;I'm going to teach you to read any nsys trace in under 10 minutes — using four tables, one SQL join pattern, and four queries. Then we'll use those tools to solve a real mystery: why does a model give different results at batch size 1 vs batch size 16, even though both traces show exactly 8,955 kernel launches?&lt;/p&gt;




&lt;h2&gt;
  
  
  What nsys actually records
&lt;/h2&gt;

&lt;p&gt;Imagine you're standing in a factory watching an assembly line. You have a stopwatch, and your job is to write down: &lt;em&gt;when&lt;/em&gt; each machine started, &lt;em&gt;when&lt;/em&gt; it stopped, &lt;em&gt;which&lt;/em&gt; machine it was, and &lt;em&gt;what&lt;/em&gt; it was building.&lt;/p&gt;

&lt;p&gt;That's what NVIDIA Nsight Systems does for your GPU. It records every kernel launch, every memory copy, every synchronization event — with nanosecond timestamps.&lt;/p&gt;

&lt;p&gt;The output is two files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;profile.nsys-rep     ← visual report (open in the Nsight GUI)
profile.sqlite       ← raw data in a database (query with SQL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;.sqlite&lt;/code&gt; file is the gold mine. Everything in the &lt;code&gt;.nsys-rep&lt;/code&gt; is derived from it.&lt;/p&gt;




&lt;h2&gt;
  
  
  88 tables, but only 4 matter
&lt;/h2&gt;

&lt;p&gt;Open any nsys SQLite export and you'll find 88 tables. Most are enum lookup tables or metadata. You need these four:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Table&lt;/th&gt;
&lt;th&gt;What it records&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CUPTI_ACTIVITY_KIND_KERNEL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Every GPU kernel execution: start, end, grid, block, name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CUPTI_ACTIVITY_KIND_MEMCPY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Every memory transfer: CPU→GPU, GPU→CPU, GPU→GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CUPTI_ACTIVITY_KIND_MEMSET&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Every memory fill (zeroing buffers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NVTX_EVENTS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Human-readable markers programmers add to their code&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plus one helper table: &lt;strong&gt;&lt;code&gt;StringIds&lt;/code&gt;&lt;/strong&gt; — the Rosetta Stone that maps integer IDs to actual names.&lt;/p&gt;

&lt;p&gt;Here's how to discover them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;
&lt;span class="c1"&gt;# nsys export --type=sqlite profile.nsys-rep  -&amp;gt; produces profile.sqlite
&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;profile.sqlite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT name FROM sqlite_master WHERE type=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tables&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total tables: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 88 tables!
# But only 4 matter:
#   CUPTI_ACTIVITY_KIND_KERNEL  ← GPU kernel executions
#   CUPTI_ACTIVITY_KIND_MEMCPY  ← memory transfers
#   CUPTI_ACTIVITY_KIND_MEMSET  ← memory fills
#   NVTX_EVENTS                 ← human-readable markers
#   + StringIds                 ← name lookup table
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The #1 gotcha: names are integers
&lt;/h2&gt;

&lt;p&gt;This is the single most confusing thing when you first look at nsys data. Kernel names are &lt;strong&gt;not&lt;/strong&gt; stored as strings. They're stored as integer foreign keys into the &lt;code&gt;StringIds&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;When you query the kernel table, you'll see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;demangledName = 58
shortName     = 59
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are pointers. To get the actual name, you JOIN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;StringIds&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shortName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And suddenly &lt;code&gt;59&lt;/code&gt; becomes &lt;code&gt;vectorized_elementwise_kernel&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every query you write will include this JOIN. It becomes muscle memory after your second trace.&lt;/p&gt;
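
&lt;p&gt;Here is the smallest complete version of that pattern, reusing the &lt;code&gt;conn&lt;/code&gt; from the discovery snippet above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;rows = conn.execute("""
    SELECT s.value AS kernel_name,
           (k.end - k.start) AS duration_ns
    FROM CUPTI_ACTIVITY_KIND_KERNEL k
    JOIN StringIds s ON k.shortName = s.id
    ORDER BY duration_ns DESC
    LIMIT 5
""").fetchall()

for name, ns in rows:
    print(f"{ns / 1e3:10.1f} us  {name}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;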




&lt;h2&gt;
  
  
  Decoding kernel names: the Rosetta Stone
&lt;/h2&gt;

&lt;p&gt;Once you run that JOIN, you'll see names like &lt;code&gt;nvjet_tst_192x192_64x4_2x1_v_bz_coopB_TNN&lt;/code&gt;. This isn't gibberish — every character means something:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvjet_tst _ 192x192 _ 64x4 _ 2x1 _ v _ bz _ coopB _ TNN
   │          │        │      │     │    │     │       │
   │          │        │      │     │    │     │       └─ transpose: T=yes, N=no
   │          │        │      │     │    │     │          TNN = A^T × B → C
   │          │        │      │     │    │     │
   │          │        │      │     │    │     └─ cooperative mode
   │          │        │      │     │    │        coopA/coopB = how SMs share work
   │          │        │      │     │    │
   │          │        │      │     │    └─ block-zero init strategy
   │          │        │      │     │
   │          │        │      │     └─ layout: v=vertical, h=horizontal
   │          │        │      │        (how tiles map to SMs)
   │          │        │      │
   │          │        │      └─ warp tiling: 2 warps in M, 1 in N
   │          │        │
   │          │        └─ block tile: 64×4 threads per block
   │          │
   │          └─ output tile: 192×192 chunk per SM
   │
   └─ nvjet_tst = NVIDIA JIT persistent kernel (deterministic!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Think of it like a car's VIN. Once you know the format, you can read any GPU kernel name at a glance.&lt;/p&gt;

&lt;p&gt;Besides &lt;code&gt;nvjet_tst_*&lt;/code&gt;, you'll encounter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;device_kernel&lt;/code&gt;&lt;/strong&gt; — output of &lt;code&gt;torch.compile&lt;/code&gt;. Opaque, but often 70%+ of GPU time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;vectorized_elementwise_kernel&lt;/code&gt;&lt;/strong&gt; — PyTorch's generic ops (add, multiply, cast).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;rms_norm_kernel&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;act_and_mul_kernel&lt;/code&gt;&lt;/strong&gt; — normalization and activation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;*_splitK_*&lt;/code&gt;&lt;/strong&gt; — Split-K GEMM with atomic reduction. &lt;em&gt;Potential non-determinism source&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Grid and block: mapping kernels to hardware
&lt;/h2&gt;

&lt;p&gt;The kernel table has &lt;code&gt;gridX/Y/Z&lt;/code&gt; and &lt;code&gt;blockX/Y/Z&lt;/code&gt; columns. These map to physical GPU hardware.&lt;/p&gt;

&lt;p&gt;An H200 has 132 Streaming Multiprocessors (SMs) — 132 independent assembly lines. Each can process one thread block at a time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  GPU with 132 SMs:
  ┌─────┬─────┬─────┬─────┬─── ── ──┬─────┐
  │SM 0 │SM 1 │SM 2 │SM 3 │  ...    │SM131│
  │blk 0│blk 1│blk 2│blk 3│         │blk  │
  │128  │128  │128  │128  │         │131  │
  │thrds│thrds│thrds│thrds│         │thrds│
  └─────┴─────┴─────┴─────┴─── ── ──┴─────┘

  grid=132 × block=128 = 16,896 total threads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gridX=1&lt;/strong&gt;: One SM active, 131 idle. Tiny work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gridX=132&lt;/strong&gt;: Every SM busy. What persistent GEMM targets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gridX=64,000&lt;/strong&gt;: Blocks queue in waves. GPU stays saturated.&lt;/li&gt;
&lt;/ul&gt;
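
&lt;p&gt;The same JOIN pattern pulls the launch geometry out of the trace -- again reusing the &lt;code&gt;conn&lt;/code&gt; from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;rows = conn.execute("""
    SELECT s.value, k.gridX, k.blockX, COUNT(*) AS launches
    FROM CUPTI_ACTIVITY_KIND_KERNEL k
    JOIN StringIds s ON k.shortName = s.id
    GROUP BY s.value, k.gridX, k.blockX
    ORDER BY launches DESC
    LIMIT 10
""").fetchall()

for name, grid_x, block_x, n in rows:
    # grid_x near 132 means a persistent kernel sized to the H200's SMs
    print(f"{n:6d}x  grid={grid_x}  block={block_x}  {name}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;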




&lt;h2&gt;
  
  
  Four SQL queries that answer every question
&lt;/h2&gt;

&lt;p&gt;Every GPU investigation follows the same four-step pattern: &lt;strong&gt;The Census&lt;/strong&gt;, &lt;strong&gt;The Lineup&lt;/strong&gt;, &lt;strong&gt;The Stakeout&lt;/strong&gt;, and &lt;strong&gt;The Timeline&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1: The Census — "What's slow?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Where is the GPU spending its time?&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;kernel_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;call_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_us&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;CUPTI_ACTIVITY_KIND_KERNEL&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;StringIds&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shortName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_ms&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On our H200 running vLLM inference:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;th&gt;Total (ms)&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;device_kernel (torch.compile)&lt;/td&gt;
&lt;td&gt;748&lt;/td&gt;
&lt;td&gt;877.38&lt;/td&gt;
&lt;td&gt;70.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vectorized_elementwise_kernel&lt;/td&gt;
&lt;td&gt;883&lt;/td&gt;
&lt;td&gt;60.98&lt;/td&gt;
&lt;td&gt;4.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nvjet_tst_192x192_64x4_2x1...&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;51.94&lt;/td&gt;
&lt;td&gt;4.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;70% of GPU time in one kernel type — the compiled vision encoder.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 2: The Lineup — "What category?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;CASE&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%nvjet%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Persistent GEMM'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%cublas%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'cuBLAS (DANGER)'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%device_kernel%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'torch.compile'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%flash%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Flash Attention'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%norm%'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Normalization'&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'Other'&lt;/span&gt;
  &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_ms&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;CUPTI_ACTIVITY_KIND_KERNEL&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;StringIds&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shortName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_ms&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results: 17.3% Persistent GEMM, 7.7% Elementwise, 2% Normalization, and... &lt;strong&gt;0.1% Flash Attention&lt;/strong&gt;. Just 1.2ms. Barely a blip.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Remember that 0.1%. It becomes the 5.15x smoking gun.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 3: The Stakeout — "What changed?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Drill into every launch of a specific kernel&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dur_us&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gridX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gridY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gridZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blockX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;registersPerThread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dynamicSharedMemory&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;CUPTI_ACTIVITY_KIND_KERNEL&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;StringIds&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shortName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'reshape_and_cache_flash_kernel'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Level 4: The Timeline — "What happened when?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Unified GPU timeline: kernels + memory transfers&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'KERNEL'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dur_us&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;CUPTI_ACTIVITY_KIND_KERNEL&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;StringIds&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shortName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'MEMCPY'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="s1"&gt;'copyKind='&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copyKind&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;CUPTI_ACTIVITY_KIND_MEMCPY&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;start&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Case study: the cascade attention smoking gun
&lt;/h2&gt;

&lt;p&gt;Here's the mystery: We're running vLLM inference with batch-invariant mode, which &lt;em&gt;guarantees&lt;/em&gt; bitwise-identical results regardless of batch size. BS=1 works perfectly. BS=16 gives different results. Both traces: exactly 8,955 kernel launches. Where's the bug?&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: The Census — nothing obvious
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;BS=1&lt;/th&gt;
&lt;th&gt;BS=16&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total kernels&lt;/td&gt;
&lt;td&gt;8,955&lt;/td&gt;
&lt;td&gt;8,955&lt;/td&gt;
&lt;td&gt;1.00x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total GPU time&lt;/td&gt;
&lt;td&gt;1.25s&lt;/td&gt;
&lt;td&gt;1.37s&lt;/td&gt;
&lt;td&gt;1.10x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memcpy ops&lt;/td&gt;
&lt;td&gt;1,099&lt;/td&gt;
&lt;td&gt;1,147&lt;/td&gt;
&lt;td&gt;1.04x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memset ops&lt;/td&gt;
&lt;td&gt;345&lt;/td&gt;
&lt;td&gt;429&lt;/td&gt;
&lt;td&gt;1.24x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 2: The Lineup — the outlier appears
&lt;/h3&gt;

&lt;p&gt;Most categories grow modestly. Persistent GEMM +35%, Normalization +52%. Expected.&lt;/p&gt;

&lt;p&gt;But Flash Attention: &lt;strong&gt;1.20ms → 6.17ms — 5.15x increase&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In absolute terms it's 6ms. But the &lt;em&gt;ratio&lt;/em&gt; is an extreme statistical outlier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: The Stakeout — same calls, more data
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;reshape_and_cache_flash_kernel&lt;/code&gt; runs 392 times in &lt;em&gt;both&lt;/em&gt; profiles (same count!), but takes 5.15x longer per call in BS=16. More data per call, not more calls.&lt;/p&gt;

&lt;p&gt;Memory: 83% more device-to-device copies (53 vs 29 ops).&lt;/p&gt;

&lt;p&gt;GEMM: BS=16 launches 7 kernel variants (88ms total) that never appear in BS=1, which used 7 different variants totaling only 18ms.&lt;/p&gt;
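
&lt;p&gt;This kind of two-trace diff is easy to automate. Here's a minimal sketch -- the two file names are assumptions, but the table and JOIN are exactly the ones used throughout this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# compare_traces.py -- hypothetical helper for diffing two nsys SQLite exports
import sqlite3

QUERY = """
    SELECT s.value, COUNT(*), SUM(k.end - k.start) / 1e6
    FROM CUPTI_ACTIVITY_KIND_KERNEL k
    JOIN StringIds s ON k.shortName = s.id
    GROUP BY s.value
"""

def kernel_totals(path):
    """Return {kernel_name: (calls, total_ms)} for one exported trace."""
    conn = sqlite3.connect(path)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return {name: (calls, ms) for name, calls, ms in rows}

baseline = kernel_totals("bs1.sqlite")   # assumed file names
suspect = kernel_totals("bs16.sqlite")

for name in sorted(set(baseline) &amp;amp; set(suspect)):
    calls_a, ms_a = baseline[name]
    calls_b, ms_b = suspect[name]
    if ms_a == 0 or calls_a != calls_b:
        continue  # only flag "same calls, more time" outliers
    ratio = ms_b / ms_a
    if ratio &amp;gt; 2.0:
        print(f"{name}: {calls_a} calls in both, {ms_a:.2f} -&amp;gt; {ms_b:.2f} ms ({ratio:.2f}x)")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;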

&lt;h3&gt;
  
  
  Step 4: The code — root cause
&lt;/h3&gt;

&lt;p&gt;Every clue points to one place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm/v1/attention/backends/flash_attn.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;use_cascade_attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;common_prefix_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_lens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="c1"&gt;# Too short prefix — not worth splitting
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;common_prefix_len&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# ← BS=1 exits here
&lt;/span&gt;
    &lt;span class="c1"&gt;# Too few requests — not worth splitting
&lt;/span&gt;    &lt;span class="n"&gt;num_reqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_lens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;num_reqs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# ← BS=1-7 exits here
&lt;/span&gt;
    &lt;span class="c1"&gt;# BS=16 with shared system prompt (&amp;gt;256 tokens):
&lt;/span&gt;    &lt;span class="c1"&gt;# → cascade ON → split prefix/suffix → LSE merge
&lt;/span&gt;    &lt;span class="c1"&gt;# → mathematically equivalent, but FP-different
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# ← determinism breaks here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cascade attention&lt;/strong&gt; splits attention into prefix and suffix passes, then merges with LSE arithmetic. Mathematically equivalent. Floating-point different. That's the determinism break.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix: &lt;code&gt;disable_cascade_attn=True&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;
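
&lt;p&gt;In recent vLLM versions this is exposed as an engine argument. A hedged sketch of how you'd pass it -- the model id is illustrative, and you should check your version's engine args before relying on the name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-VL-2B-Instruct",  # illustrative model id
    disable_cascade_attn=True,          # force the single-pass attention path
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;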

&lt;h3&gt;
  
  
  The Evidence Board
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clue&lt;/th&gt;
&lt;th&gt;Evidence&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5.15x attention slowdown&lt;/td&gt;
&lt;td&gt;reshape_and_cache_flash_kernel: 1.20ms → 6.17ms&lt;/td&gt;
&lt;td&gt;More data per call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;83% more D-to-D copies&lt;/td&gt;
&lt;td&gt;29 → 53 ops, 81 → 163 MB&lt;/td&gt;
&lt;td&gt;Internal tensor splitting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7 new GEMM variants&lt;/td&gt;
&lt;td&gt;88ms new vs 18ms removed&lt;/td&gt;
&lt;td&gt;Autotuner adapting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Identical kernel count&lt;/td&gt;
&lt;td&gt;8,955 in both profiles&lt;/td&gt;
&lt;td&gt;Same graph, different PATH&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;All clues →&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CASCADE ATTENTION&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;flash_attn.py:673&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The cheat sheet
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│  NSYS SQLITE CHEAT SHEET                                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  THE 4 TABLES:                                              │
│    CUPTI_ACTIVITY_KIND_KERNEL  → kernel executions          │
│    CUPTI_ACTIVITY_KIND_MEMCPY  → memory transfers           │
│    CUPTI_ACTIVITY_KIND_MEMSET  → memory fills               │
│    NVTX_EVENTS                 → human-added markers        │
│    + StringIds                 → name lookup                │
│                                                             │
│  THE JOIN (everywhere):                                     │
│    JOIN StringIds s ON k.shortName = s.id                   │
│                                                             │
│  TIMESTAMPS:  nanoseconds                                   │
│    / 1e3 = μs    / 1e6 = ms    / 1e9 = seconds             │
│                                                             │
│  COPYKIND:  1=CPU→GPU  2=GPU→CPU  8=GPU→GPU                │
│                                                             │
│  GRID/BLOCK:                                                │
│    grid × block = total threads                             │
│    grid=132 → all H200 SMs active                           │
│    grid=1 → one SM, 131 idle                                │
│                                                             │
│  DANGER ZONE:                                               │
│    *cublas* → non-deterministic GEMM                        │
│    *splitK* → non-deterministic reduction                   │
│    cascade/merge_attn → FP divergence                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Your first 10 minutes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Capture&lt;/span&gt;
nsys profile &lt;span class="nt"&gt;-o&lt;/span&gt; my_trace &lt;span class="nt"&gt;--stats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;python my_script.py

&lt;span class="c"&gt;# 2. Export to SQLite&lt;/span&gt;
nsys &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sqlite my_trace.nsys-rep

&lt;span class="c"&gt;# 3. Find your bottleneck&lt;/span&gt;
python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
import sqlite3
conn = sqlite3.connect('my_trace.sqlite')
cur = conn.cursor()
cur.execute('''
    SELECT s.value, COUNT(*), ROUND(SUM(k.end-k.start)/1e6,2) AS ms
    FROM CUPTI_ACTIVITY_KIND_KERNEL k
    JOIN StringIds s ON k.shortName = s.id
    GROUP BY s.value ORDER BY ms DESC LIMIT 10
''')
for row in cur.fetchall():
    print(f'{row[0]:50s}  calls={row[1]:5d}  total={row[2]:8.2f} ms')
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In 10 minutes you'll know which kernel is your bottleneck. GPU profiling is not a dark art. It's four tables, one JOIN, and four queries.&lt;/p&gt;

&lt;p&gt;The answer is in the trace. Go look.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All data from real H200 GPU traces: 8,955 kernel launches during Qwen3-VL-2B inference with vLLM 0.15.2.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>profiling</category>
      <category>performance</category>
      <category>cuda</category>
    </item>
    <item>
      <title>Two Ways to Move Tensors Without Stopping: Inside vLLM's Async GPU Transfer Patterns</title>
      <dc:creator>Mayank Ketkar</dc:creator>
      <pubDate>Wed, 11 Feb 2026 21:12:16 +0000</pubDate>
      <link>https://forem.com/mketkar/two-ways-to-move-tensors-without-stopping-inside-vllms-async-gpu-transfer-patterns-dk7</link>
      <guid>https://forem.com/mketkar/two-ways-to-move-tensors-without-stopping-inside-vllms-async-gpu-transfer-patterns-dk7</guid>
      <description>&lt;p&gt;A single &lt;code&gt;torch.cuda.synchronize()&lt;/code&gt; in the wrong place can erase every optimization you spent weeks building. Your GPU sits idle, your pipeline stalls, and your inference latency doubles. In vLLM's distributed serving stack, tensors move between GPUs constantly: billions of parameters shuffled during weight updates, and key-value cache blocks shipped between nodes during live inference. The codebase solves these two problems with two radically different async patterns -- and studying them side-by-side reveals a masterclass in GPU concurrency.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Is Waiting
&lt;/h2&gt;

&lt;p&gt;Imagine you need to get 14GB of model weights (a 7B parameter model in BF16) from a trainer process onto your inference GPU. The naive approach: transfer everything, wait, then resume inference.&lt;/p&gt;

&lt;p&gt;At NVLink bandwidth (~900 GB/s), that's ~15ms. Not terrible. But at PCIe 5.0 (~40 GB/s effective), it's 350ms of dead time. And during RLHF training, this happens every few hundred steps.&lt;/p&gt;

&lt;p&gt;vLLM faces this problem in two distinct contexts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Weight distribution&lt;/strong&gt;: Getting updated model weights from a trainer to inference workers (bulk, periodic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV cache migration&lt;/strong&gt;: Shipping key-value cache blocks between disaggregated prefill and decode nodes (streaming, continuous)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each context demands a different async pattern. Let's build the understanding from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  CUDA Streams: Separate Lanes on the Same Highway
&lt;/h2&gt;

&lt;p&gt;Before we can overlap anything, we need the mechanism that makes overlap possible.&lt;/p&gt;

&lt;p&gt;Think of a factory with multiple conveyor belts feeding the same set of assembly stations. Each belt keeps its items in order. The stations (GPU compute units) pull work from any belt that has items ready:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stream 0 (default):  [kernel A] → [kernel B] → [kernel C]
Stream 1 (transfer):  [NCCL broadcast chunk 0] → [NCCL broadcast chunk 1]

A and the broadcast can run AT THE SAME TIME on the GPU.
There is NO ordering between streams unless you add one.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A CUDA stream is just an ordered queue of GPU operations. Operations within a stream execute sequentially. Operations across streams can overlap. That's the entire mental model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Everything here goes on the new stream, not the default one
&lt;/span&gt;    &lt;span class="nf"&gt;do_transfer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Back on the default stream -- transfer and default work can overlap
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The key primitive&lt;/strong&gt;: &lt;code&gt;stream.synchronize()&lt;/code&gt; blocks the CPU until all operations on &lt;strong&gt;that one stream&lt;/strong&gt; finish. It does NOT wait for other streams. This surgical waiting is what makes pipelining possible.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick recap&lt;/strong&gt;: A CUDA stream is an ordered queue. Two streams run independently. &lt;code&gt;stream.synchronize()&lt;/code&gt; waits for just one. Everything that follows builds on this single idea.&lt;/p&gt;
&lt;/blockquote&gt;
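
&lt;p&gt;A minimal standalone sketch to see the surgical waiting for yourself -- the sizes and the matmul are arbitrary; the point is only that syncing one stream does not wait for the other:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

assert torch.cuda.is_available()

compute = torch.cuda.Stream()
transfer = torch.cuda.Stream()

x = torch.randn(8192, 8192, device="cuda")
host = torch.empty(x.shape, dtype=x.dtype, pin_memory=True)
torch.cuda.synchronize()  # make sure x is materialized before other streams read it

with torch.cuda.stream(compute):
    y = x @ x  # long-running kernel, queued on the compute stream

with torch.cuda.stream(transfer):
    host.copy_(x, non_blocking=True)  # D2H copy, queued on the transfer stream

transfer.synchronize()  # waits ONLY for the copy; the matmul may still be running
compute.synchronize()   # now wait for the compute stream too
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;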

&lt;h2&gt;
  
  
  The uint8 Trick: One Blob to Rule Them All
&lt;/h2&gt;

&lt;p&gt;Before we can pipeline transfers, we need to solve a packaging problem. A model has weights in different dtypes -- bfloat16, float32, maybe int8 quantization scales. NCCL broadcasts a single contiguous buffer. You can't &lt;code&gt;torch.cat&lt;/code&gt; tensors of different dtypes.&lt;/p&gt;

&lt;p&gt;The solution: &lt;strong&gt;everything is just bytes in memory&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# packed_tensor.py, lines 62-69
# Get weight tensor (any dtype: bf16, fp32, int8...)
&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;post_iter_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iterator&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contiguous&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;           &lt;span class="c1"&gt;# Ensure contiguous memory
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# Reinterpret as raw bytes — NO COPY
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# Flatten to 1D for heterogeneous cat
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;packing_tensor_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;packing_tensor_sizes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;numel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;.view(torch.uint8)&lt;/code&gt; is NOT a cast or a copy. It reinterprets the same memory as raw bytes -- zero cost. A bfloat16 tensor of shape &lt;code&gt;[4096, 4096]&lt;/code&gt; becomes 33,554,432 uint8 values. Now every weight looks the same, and you can &lt;code&gt;torch.cat&lt;/code&gt; them into one contiguous blob.&lt;/p&gt;

&lt;p&gt;The consumer reverses the process using stored metadata (name, shape, dtype):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Consumer unpacks: raw bytes → original dtype → original shape
&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contiguous&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both sides iterate in the &lt;strong&gt;same order&lt;/strong&gt;. If the order mismatches, you get silent corruption -- more on this in the antipatterns section.&lt;/p&gt;
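
&lt;p&gt;Here's a standalone round-trip you can run on CPU to convince yourself the byte reinterpretation is lossless -- shapes and dtypes are arbitrary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

# Two weights with different dtypes -- torch.cat can't mix them directly
w1 = torch.randn(4, 4, dtype=torch.bfloat16)
w2 = torch.randn(3, dtype=torch.float32)
meta = [(w.shape, w.dtype) for w in (w1, w2)]

# Pack: reinterpret each as raw bytes (no copy), flatten, concatenate
packed = torch.cat([w.contiguous().view(torch.uint8).view(-1) for w in (w1, w2)])

# Unpack in the SAME order using the stored metadata
out, offset = [], 0
for shape, dtype in meta:
    nbytes = dtype.itemsize * shape.numel()
    # Note: each offset must be aligned to the target element size, as here
    chunk = packed[offset : offset + nbytes]
    out.append(chunk.view(dtype).view(shape))
    offset += nbytes

assert torch.equal(out[0], w1) and torch.equal(out[1], w2)  # bitwise identical
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;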

&lt;h2&gt;
  
  
  Double-Buffered Weight Transfer: The Assembly Line
&lt;/h2&gt;

&lt;p&gt;This is the centerpiece. vLLM's &lt;code&gt;packed_tensor.py&lt;/code&gt; defines two constants:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# packed_tensor.py, lines 13-14
&lt;/span&gt;&lt;span class="n"&gt;DEFAULT_PACKED_BUFFER_SIZE_BYTES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;  &lt;span class="c1"&gt;# 1GB
&lt;/span&gt;&lt;span class="n"&gt;DEFAULT_PACKED_NUM_BUFFERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1GB per buffer. Two buffers. The model weights get chunked into ~1GB pieces, and the two buffers alternate roles: one is being packed with fresh weights while the other's broadcast is still in flight.&lt;/p&gt;

&lt;p&gt;Here's the actual producer loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# packed_tensor.py, lines 39-86 (simplified)
&lt;/span&gt;&lt;span class="n"&gt;streams&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_buffers&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;buffer_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# GATE: wait for buffer
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;packing_tensor_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="n"&gt;packing_tensor_sizes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;post_iter_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iterator&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contiguous&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;packing_tensor_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;packing_tensor_sizes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;numel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;packing_tensor_sizes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;target_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="n"&gt;packed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;packing_tensor_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;packed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ASYNC on GPU!
&lt;/span&gt;            &lt;span class="n"&gt;buffer_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;num_buffers&lt;/span&gt;  &lt;span class="c1"&gt;# ROTATE
&lt;/span&gt;        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;StopIteration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Flush final partial buffer
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;packing_tensor_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;packed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;packing_tensor_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;buffer_idx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;packed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The timeline shows the overlap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Time ──────────────────────────────────────────────────────►

Stream 0:  ┌──pack──┐┌──broadcast──┐          ┌──pack──┐┌──broadcast──┐
           │ buf[0] ││   buf[0]    │          │ buf[0] ││   buf[0]    │
           └────────┘└─────────────┘          └────────┘└─────────────┘
                                    ▲ sync                              ▲ sync
Stream 1:            ┌──pack──┐┌──broadcast──┐          ┌──pack──┐
                     │ buf[1] ││   buf[1]    │          │ buf[1] │
                     └────────┘└─────────────┘          └────────┘
                                              ▲ sync
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Iteration 0&lt;/strong&gt; (buffer 0, Stream 0): Pack ~1GB of weights, broadcast via NCCL. The broadcast is GPU-async -- returns to CPU immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration 1&lt;/strong&gt; (buffer 1, Stream 1): &lt;code&gt;streams[1].synchronize()&lt;/code&gt; returns instantly (empty stream). Pack into buffer 1, broadcast. Stream 0's broadcast may still be running. &lt;strong&gt;This is the overlap.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration 2&lt;/strong&gt; (buffer 0, Stream 0): &lt;code&gt;streams[0].synchronize()&lt;/code&gt; -- &lt;strong&gt;the critical call&lt;/strong&gt;. We're about to reuse buffer 0. Must wait for Stream 0's previous broadcast to finish. Then safely overwrite.&lt;/p&gt;
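
&lt;p&gt;Stripped of the packing details, the skeleton of the pattern is small. This is a sketch, not vLLM's API -- &lt;code&gt;send_fn&lt;/code&gt; stands in for any GPU-async launch such as an NCCL broadcast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

def double_buffered_send(chunks, send_fn, num_buffers=2):
    """Overlap preparing chunk N+1 with sending chunk N."""
    streams = [torch.cuda.Stream() for _ in range(num_buffers)]
    buffers = [None] * num_buffers  # keep refs so memory isn't freed mid-flight
    idx = 0
    for chunk in chunks:
        streams[idx].synchronize()  # GATE: safe to reuse this buffer now
        with torch.cuda.stream(streams[idx]):
            buffers[idx] = chunk.contiguous()  # "pack" into buffer idx
            send_fn(buffers[idx])              # async: returns to CPU immediately
        idx = (idx + 1) % num_buffers          # ROTATE to the other buffer
    for s in streams:
        s.synchronize()  # drain all in-flight sends before returning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;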

&lt;h3&gt;
  
  
  Why Does the Broadcast Return Instantly?
&lt;/h3&gt;

&lt;p&gt;Because NCCL operations are asynchronous on the GPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pynccl.py, lines 342-366
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;current_stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sendbuff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buffer_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;data_ptr&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;recvbuff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buffer_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;data_ptr&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c1"&gt;# Required by NCCL
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sendbuff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buffer_type&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;        &lt;span class="c1"&gt;# Null — receiver doesn't send
&lt;/span&gt;        &lt;span class="n"&gt;recvbuff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buffer_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;data_ptr&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nccl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ncclBroadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;sendbuff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recvbuff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;numel&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;ncclDataTypeEnum&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_torch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;comm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nf"&gt;cudaStream_t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda_stream&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# No synchronize()! Returns to CPU immediately.
&lt;/span&gt;    &lt;span class="c1"&gt;# Caller synchronizes at TOP of next iteration.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;synchronize()&lt;/code&gt; after the call. The synchronization happens at the &lt;strong&gt;top of the next iteration&lt;/strong&gt;, when we need to reuse the buffer.&lt;/p&gt;

&lt;p&gt;Also critical: NCCL broadcast is a &lt;strong&gt;collective&lt;/strong&gt;. Every process must call it the same number of times, in the same order. Mismatch = deadlock.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Checkpoint&lt;/strong&gt;: We've covered streams (independent lanes), uint8 packing (uniform bytes), and double-buffering (overlap via alternating buffers). This pattern handles bulk, structured transfers -- but still blocks the caller. What if we need to not block at all?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  KV Connector Background Threads: Transfer Without Stopping Inference
&lt;/h2&gt;

&lt;p&gt;In disaggregated serving, prefill nodes compute KV caches and ship them to decode nodes. This must happen &lt;strong&gt;during live inference&lt;/strong&gt; without stalling the decode engine.&lt;/p&gt;

&lt;p&gt;vLLM's &lt;code&gt;p2p_nccl_engine.py&lt;/code&gt; takes a fundamentally different approach: &lt;strong&gt;background Python threads with producer-consumer queues&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The main inference thread does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# p2p_nccl_engine.py, lines 254-258
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PUT_ASYNC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_queue_cv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_queue_cv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;notify&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Wake background thread
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# Returns IMMEDIATELY to inference loop
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five lines. Append to a deque, notify a condition variable, return. Zero blocking.&lt;/p&gt;

&lt;p&gt;Meanwhile, a daemon thread drains the queue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# p2p_nccl_engine.py, lines 476-484
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Background daemon thread — drains send queue.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_queue_cv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_queue_cv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# Releases GIL, sleeps
&lt;/span&gt;            &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_queue_cv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;notify&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Signal: queue drained
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# NCCL send + stream.synchronize()
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The background thread sleeps on &lt;code&gt;send_queue_cv.wait()&lt;/code&gt; (zero CPU cost), wakes when notified, and performs the NCCL P2P send on a &lt;strong&gt;dedicated CUDA stream&lt;/strong&gt; (&lt;code&gt;self.send_stream&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Inference Thread                    Background Threads
 ────────────────                    ──────────────────
   decode step                       ┌───────────────────┐
     send_tensor()                   │   Send Thread      │
       .append() + .notify() ──────►│   cv.wait()        │
     return True (instantly!)        │   ncclSend(item)   │
   next decode step                  └───────────────────┘
     recv_tensor()                   ┌───────────────────┐
       cv.wait() ◄──────────────────│   Listener Thread  │
       tensor = recv_store[id]       │   zmq.poll()       │
   continue inference                │   ncclRecv(tensor)  │
                                     │   cv.notify() ─────┘
                                     └───────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The GIL Question
&lt;/h3&gt;

&lt;p&gt;Python's GIL means only one thread runs Python at a time. How does this overlap?&lt;/p&gt;

&lt;p&gt;NCCL calls are C library calls that &lt;strong&gt;release the GIL&lt;/strong&gt;. While the send thread is inside &lt;code&gt;ncclSend()&lt;/code&gt;, the inference thread is free to run. The threads work because they spend 99% of their time in GIL-releasing C calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Main thread:  [Python][Python][Python][Python]
Send thread:  [cv.wait][ncclSend...........][cv.wait]
               GIL-free  GIL-free (C lib)    GIL-free
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implicit vs. Explicit Coordination
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Weight Transfer&lt;/th&gt;
&lt;th&gt;KV Connector&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Coordination&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Implicit (NCCL collective)&lt;/td&gt;
&lt;td&gt;Explicit (ZMQ + queues)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overlap target&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pack ↔ Broadcast&lt;/td&gt;
&lt;td&gt;Transfer ↔ Inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Blocks inference?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2 CUDA streams&lt;/td&gt;
&lt;td&gt;OS threads + dedicated streams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hang (collective mismatch)&lt;/td&gt;
&lt;td&gt;Timeout (missed ZMQ)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bulk, periodic&lt;/td&gt;
&lt;td&gt;Streaming, continuous&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Weight transfer coordinates implicitly through collectives -- simple but rigid. KV connector coordinates explicitly through ZMQ messages -- more machinery but fully async.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eight Antipatterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Synchronization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Syncing the wrong stream.&lt;/strong&gt; &lt;code&gt;streams[1].synchronize()&lt;/code&gt; when you needed &lt;code&gt;streams[0]&lt;/code&gt;. Buffer 0 gets corrupted mid-broadcast. Silent wrong answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Missing synchronize entirely.&lt;/strong&gt; Works under light load, race condition under heavy load. Nondeterministic, passes tests, fails in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;torch.cuda.synchronize()&lt;/code&gt; instead of &lt;code&gt;stream.synchronize()&lt;/code&gt;.&lt;/strong&gt; Waits for ALL streams. Correctness maintained, performance destroyed (2-5x slower).&lt;/p&gt;

&lt;h3&gt;
  
  
  Buffers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;4. One buffer (num_buffers=1).&lt;/strong&gt; Must sync after every broadcast. All complexity, zero overlap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Too many buffers (num_buffers=8).&lt;/strong&gt; 8GB of VRAM wasted. Near-zero benefit over 2 buffers -- bandwidth is the bottleneck, not packing speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coordination
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;6. Mismatched iteration order.&lt;/strong&gt; Producer: [layer0, layer1, layer2]. Consumer: [layer2, layer0, layer1]. Every collective matches. No hang. Wrong bytes in wrong layers. Silent catastrophe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Forgetting NCCL is a collective.&lt;/strong&gt; One process skips a broadcast via early-exit. Others block forever. Deadlock, no error message.&lt;/p&gt;
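
&lt;p&gt;The shape of that bug, as a sketch -- &lt;code&gt;torch.distributed&lt;/code&gt; here, but any collective library behaves the same, and the names are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.distributed as dist

def maybe_broadcast(tensor: torch.Tensor, local_queue_empty: bool) -&amp;gt; None:
    # BUG: `local_queue_empty` is per-process state, so ranks can disagree.
    if local_queue_empty:
        return                     # this rank silently skips the collective
    dist.broadcast(tensor, src=0)  # every rank that didn't skip hangs here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;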

&lt;p&gt;&lt;strong&gt;8. Default stream from background thread.&lt;/strong&gt; Send thread doesn't specify a stream. NCCL work goes on the default stream. Serializes with inference. Zero overlap despite threading complexity. Fix: &lt;code&gt;self.send_stream = torch.cuda.Stream()&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus: The GIL Illusion
&lt;/h3&gt;

&lt;p&gt;Python threads only provide parallelism when threads spend time in C extensions that release the GIL. If your background thread does tensor slicing in Python between NCCL calls, it holds the GIL and blocks inference. Keep the thread body a tight loop of C calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;These two patterns form a reusable vocabulary for GPU concurrency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Double-buffered CUDA streams&lt;/strong&gt;: Overlap data packing with transfer. Use for bulk, periodic transfers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background threads + producer-consumer queues&lt;/strong&gt;: Decouple transfer from the critical path. Use for streaming, continuous transfers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The meta-lesson: the hardest part of GPU concurrency is not making things fast -- it's making things correct while fast. &lt;code&gt;stream.synchronize()&lt;/code&gt; is one line of code that separates "works" from "silently corrupts." The antipatterns exist because the failure mode of a missing sync is not a crash but wrong answers.&lt;/p&gt;

&lt;p&gt;The next time you see a &lt;code&gt;stream.synchronize()&lt;/code&gt; call in GPU code, don't skip over it. That one line is the load-bearing wall.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Code from vLLM's main branch: &lt;code&gt;vllm/distributed/weight_transfer/packed_tensor.py&lt;/code&gt; and &lt;code&gt;vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_engine.py&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>cuda</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Your Ray Data Pipeline Works at 10K Samples. Here's Why It Crashes at 1M.</title>
      <dc:creator>Mayank Ketkar</dc:creator>
      <pubDate>Mon, 09 Feb 2026 14:25:51 +0000</pubDate>
      <link>https://forem.com/mketkar/your-ray-data-pipeline-works-at-10k-samples-heres-why-it-crashes-at-1m-2g7k</link>
      <guid>https://forem.com/mketkar/your-ray-data-pipeline-works-at-10k-samples-heres-why-it-crashes-at-1m-2g7k</guid>
      <description>&lt;p&gt;There's a moment every ML infrastructure engineer knows: the evaluation pipeline that worked perfectly on 10,000 samples crashes catastrophically when you point it at a million.&lt;/p&gt;

&lt;p&gt;The model didn't change. The GPUs are fine. The inference code is identical. The &lt;em&gt;data pipeline&lt;/em&gt; is the bottleneck — and it fails in ways that are completely invisible at small scale.&lt;/p&gt;

&lt;p&gt;I spent a week scaling a Ray Data pipeline from 8,600 samples to 965,000 multi-image samples for a vision-language model (Qwen3-VL). Every sample contained 5-10 video frames — so the real data volume was closer to 5-10 million images flowing through the system. Along the way, I hit five distinct distributed systems problems, each of which required a different fix.&lt;/p&gt;

&lt;p&gt;This is the field guide.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3 (MDS shards) → Read → Preprocess (CPU) → Inference (GPU) → Results
   966 shards       download,    decode images,     Qwen3-VL        CSV +
   ~266MB each      base64       resize, format     via vLLM        metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage has workers. Ray Data orchestrates passing items between stages automatically — like conveyor belts between factory stations. The key insight: this is a &lt;em&gt;streaming&lt;/em&gt; pipeline. Data should flow through continuously, not accumulate at any stage.&lt;/p&gt;

&lt;p&gt;At 10K samples, everything fits in memory. At &amp;gt;1M samples with multi-image inputs, nothing does.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem 1: The AllToAll Barrier That Defeats Streaming
&lt;/h2&gt;

&lt;p&gt;The original pipeline had a &lt;code&gt;repartition()&lt;/code&gt; call between data loading and preprocessing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: Load dataset (streaming from S3)
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset_handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset_s3_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Repartition for parallel preprocessing  &amp;lt;-- THE BARRIER
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repartition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_blocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# AllToAll!
# Nothing downstream starts until ALL of ReadMDS finishes
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 3: Parallel preprocessing
&lt;/span&gt;&lt;span class="n"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;160&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: GPU inference
&lt;/span&gt;&lt;span class="n"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VLMEngine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_gpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repartition is an &lt;strong&gt;AllToAll barrier&lt;/strong&gt;. Think of it like a quality checkpoint at a car factory where &lt;em&gt;every single chassis&lt;/em&gt; must arrive before &lt;em&gt;any&lt;/em&gt; chassis can move to the paint shop. Even if the first 100 are ready, they sit idle while chassis #965,000 is still being welded.&lt;/p&gt;

&lt;p&gt;At 10K samples, the repartition was instant. At 1M samples being streamed from S3, it meant: download all 256GB first, hold it in memory, &lt;em&gt;then&lt;/em&gt; start processing. GPUs idle for the entire download.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Remove it. With a streaming datasource producing multiple blocks, preprocessing can pull rows directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: Stream data from S3 (no full materialization)
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset_handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset_s3_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Parallel preprocessing (CPU-bound)
&lt;/span&gt;&lt;span class="n"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SceneIQPreprocessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fn_constructor_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ActorPoolStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_preprocessing_workers&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: GPU inference — data flows here immediately
&lt;/span&gt;&lt;span class="n"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;VLMEngine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_gpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_vllm_engines&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Postprocess and collect results
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postprocess&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;materialize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Problem 2: Ray Thinks Each Task Uses 500MB. It's Actually 39GB.
&lt;/h2&gt;

&lt;p&gt;The custom datasource creates &lt;code&gt;ReadTask&lt;/code&gt; objects that download MDS shards from S3. Each task reports expected memory via &lt;code&gt;BlockMetadata.size_bytes&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The original code set this to the raw shard size on S3 (~266MB). But in memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;On S3 (compressed):     266 MB
After base64 encoding:  364 MB  (1.37x expansion)
In PyArrow table:       ~500 MB (column overhead)
Total per-task (16 tasks, ~60 shards each): ~39 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Tell Ray the truth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: Ray thought each task used ~266MB (raw shard size)
# After:  Tell Ray the actual in-memory size
&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;shard_info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;task_shards&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;task_bytes&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;shard_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Base64 expands ~1.37x, plus PyArrow/dict overhead.
# Use 4x raw bytes as conservative in-memory estimate.
&lt;/span&gt;&lt;span class="n"&gt;estimated_mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task_bytes&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

&lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BlockMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;size_bytes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;estimated_mem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# was just task_bytes
&lt;/span&gt;    &lt;span class="n"&gt;input_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;input_files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;exec_stats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single line was the difference between "workers crash every run" and "steady-state for 12 hours."&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem 3: 16 Tasks for 966 Shards = 64GB Per Task
&lt;/h2&gt;

&lt;p&gt;Even with correct memory estimation, 16 tasks for 966 shards was far too few; the code comment below records the tuning history:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This constant changed 4 times. Each wrong value
# crashed the cluster.
#
# 16  -&amp;gt; ~60 shards/task -&amp;gt; 64 GB -&amp;gt; OOM
# 48  -&amp;gt; ~20 shards/task -&amp;gt; 21 GB -&amp;gt; OOM
# 128 -&amp;gt; ~8 shards/task  -&amp;gt; 8.5 GB -&amp;gt; OOM (barely)
# 512 -&amp;gt; ~2 shards/task  -&amp;gt; 2 GB   -&amp;gt; Stable
#
# More tasks = less memory per task.
# Ray schedules them across CPUs, not all at once.
&lt;/span&gt;&lt;span class="n"&gt;_DEFAULT_MAX_TASKS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having 512 tasks doesn't mean 512 simultaneous downloads. It means 512 small, schedulable units. Ray runs a handful at a time based on available CPU.&lt;/p&gt;
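&lt;p&gt;A back-of-envelope way to see why (our intuition, not Ray's actual scheduling algorithm):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# With honest size_bytes, Ray only admits read tasks whose declared
# memory fits in the budget. All numbers are illustrative.
object_store_budget = 30e9   # ~30 GB across the cluster
per_task = 2 * 266e6 * 4     # 2 shards x 266 MB x 4x expansion
max_in_flight = int(object_store_budget // per_task)
print(max_in_flight)         # ~14 tasks at once, out of 512 total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;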




&lt;h2&gt;
  
  
  Problem 4: CPU Oversubscription
&lt;/h2&gt;

&lt;p&gt;The original config was designed for a large cluster (280 CPU). On a smaller cluster (56 CPU):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BEFORE: CPU oversubscription (crashed)&lt;/span&gt;
&lt;span class="c1"&gt;# vLLM:         20 engines x 4 CPU = 80 CPU&lt;/span&gt;
&lt;span class="c1"&gt;# Preprocessing: 160 workers x 1 CPU = 160 CPU&lt;/span&gt;
&lt;span class="c1"&gt;# Total:                               240 CPU&lt;/span&gt;
&lt;span class="c1"&gt;# Available:                            56 CPU&lt;/span&gt;
&lt;span class="c1"&gt;# Headroom:                              0 CPU  &amp;lt;-- CRASH&lt;/span&gt;

&lt;span class="c1"&gt;# AFTER: Right-sized for actual hardware&lt;/span&gt;
&lt;span class="na"&gt;num_vllm_engines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;         &lt;span class="c1"&gt;# 6 x 4 CPU = 24 CPU&lt;/span&gt;
&lt;span class="na"&gt;num_preprocessing_workers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;  &lt;span class="c1"&gt;# 16 x 1 CPU = 16 CPU&lt;/span&gt;
&lt;span class="c1"&gt;# Total:                               40 CPU&lt;/span&gt;
&lt;span class="c1"&gt;# Available:                           56 CPU&lt;/span&gt;
&lt;span class="c1"&gt;# Headroom:                            16 CPU  &amp;lt;-- Safe&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Preprocessing is fast — 16 workers can easily keep 6 GPUs fed at ~22 samples/s.&lt;/p&gt;
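&lt;p&gt;This is simple enough to check before launch. A sketch of the guardrail we wish we had on day one (the helper and its name are ours):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cpu_headroom(num_engines: int, cpus_per_engine: int,
                 num_preproc_workers: int, cpus_available: int) -&amp;gt; int:
    # Fail fast at launch instead of OOM-ing hours into a run.
    used = num_engines * cpus_per_engine + num_preproc_workers
    headroom = cpus_available - used
    assert headroom &amp;gt; 0, f"oversubscribed by {-headroom} CPUs"
    return headroom

print(cpu_headroom(6, 4, 16, 56))    # 16 -- safe
print(cpu_headroom(20, 4, 160, 56))  # AssertionError: oversubscribed by 184 CPUs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;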




&lt;h2&gt;
  
  
  Problem 5: Two Engines on a 64GB Pod
&lt;/h2&gt;

&lt;p&gt;8 vLLM engines across 6 worker nodes means some pods get 2 engines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Worker pod: 64 GB RAM
2 vLLM engines: ~30-40 GB (model + KV cache)
+ Object store: ~10-15 GB
+ ReadMDS data: ~5-10 GB
= ~55-65 GB --&amp;gt; OOM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; 1 engine per worker node. Physical constraint, not a tuning parameter.&lt;/p&gt;
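&lt;p&gt;We enforced this by setting the engine count equal to the node count. If you want Ray's help, one knob worth knowing about (hedged: &lt;code&gt;SPREAD&lt;/code&gt; is best-effort spreading, not a hard one-per-node guarantee, and you should verify your Ray version forwards it to the actors) is a scheduling strategy on &lt;code&gt;map_batches&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;processed = processed.map_batches(
    VLMEngine,
    batch_size=config.batch_size,
    num_gpus=1,
    concurrency=config.num_vllm_engines,
    scheduling_strategy="SPREAD",  # passed through as a Ray remote arg
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;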




&lt;h2&gt;
  
  
  Under the Hood: The Custom Datasource
&lt;/h2&gt;

&lt;p&gt;The fix for Problems 2 and 3 lives in a custom Ray &lt;code&gt;Datasource&lt;/code&gt;. Here's the actual code.&lt;/p&gt;

&lt;p&gt;The architecture: the driver reads a lightweight &lt;code&gt;index.json&lt;/code&gt; from S3 (under 1KB), groups shards into bounded ReadTasks, and each task independently downloads and decodes its shards on a Ray worker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Maximum number of ReadTasks to create.
# More tasks = less memory per task (avoids OOM),
# but too many adds scheduling overhead.
#
# With datasets up to ~1000 shards at ~266MB each,
# use a high limit so each task gets 1-2 shards
# (~1GB in memory), preventing worker OOM.
# Ray Data will schedule them across available CPUs,
# so only a few run concurrently.
&lt;/span&gt;&lt;span class="n"&gt;_DEFAULT_MAX_TASKS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;get_read_tasks()&lt;/code&gt; method is where memory estimation happens. This is the method Ray Data calls to plan its work — and where the 4x multiplier prevents OOM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_read_tasks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parallelism&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;per_task_row_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Collect shards up to effective_samples limit
&lt;/span&gt;    &lt;span class="n"&gt;active_shards&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_effective_samples&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;shard&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_shard_infos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shard&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;active_shards&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;shard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;

    &lt;span class="n"&gt;num_shards&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;active_shards&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;num_tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_shards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_max_tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Distribute shards across tasks (contiguous groups)
&lt;/span&gt;    &lt;span class="n"&gt;shards_per_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_shards&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;num_tasks&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;num_tasks&lt;/span&gt;

    &lt;span class="n"&gt;read_tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;shard_groups&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;task_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Base64 expands ~1.37x, plus PyArrow/dict overhead.
&lt;/span&gt;        &lt;span class="c1"&gt;# Use 4x raw bytes as conservative in-memory estimate
&lt;/span&gt;        &lt;span class="c1"&gt;# so Ray can schedule tasks without OOM-killing workers.
&lt;/span&gt;        &lt;span class="n"&gt;estimated_mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task_bytes&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

        &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BlockMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;size_bytes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;estimated_mem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# was just task_bytes!
&lt;/span&gt;            &lt;span class="n"&gt;input_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;input_files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;exec_stats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;read_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ReadTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;read_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;_make_read_fn&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;read_tasks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each ReadTask runs this function on a Ray worker — it downloads 1-2 shards, decodes samples, and returns a PyArrow table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_read_mds_shards&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remote_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shard_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_samples&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Download MDS shards from S3 and extract samples.
    Runs on Ray workers — each task handles 1-2 shards.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;shard_info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;shard_group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Download shard from S3 to temp directory
&lt;/span&gt;        &lt;span class="n"&gt;basename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shard_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;basename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_parse_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;remote_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;basename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Decode samples via MDSReader
&lt;/span&gt;        &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MDSReader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dirname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tmp_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shard_info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shard_samples&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;raw_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_extract_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_sample&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Return as PyArrow table for Ray Data streaming
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_blocks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_blocks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;})]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design choices: each task is self-contained (downloads its own shards, no shared state), the memory estimate is conservative (4x raw bytes), and the PyArrow table output integrates directly with Ray Data's streaming execution.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Expanding Suitcase
&lt;/h2&gt;

&lt;p&gt;Here's what most people miss. Each sample isn't text — it's 5-10 video frames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Per sample in ReadMDS:      ~5-20 MB  (base64 strings)
Per sample in preprocessing: ~50-100 MB (PIL Images, uncompressed!)
  A 1 MB JPEG -&amp;gt; ~10 MB as a PIL Image
  x 10 frames = 100 MB per sample
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the "expanding suitcase" problem. You pack vacuum-sealed clothes (compressed images, ~266MB per shard). At the destination, you unseal them and they expand 4-10x.&lt;/p&gt;

&lt;p&gt;The saving grace: &lt;strong&gt;vLLM is the bottleneck.&lt;/strong&gt; At ~22 samples/s, it's slow enough that data doesn't pile up. Natural backpressure keeps the pipeline stable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Original&lt;/th&gt;
&lt;th&gt;Final&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Repartition&lt;/td&gt;
&lt;td&gt;AllToAll barrier&lt;/td&gt;
&lt;td&gt;Removed&lt;/td&gt;
&lt;td&gt;Eliminated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory estimation&lt;/td&gt;
&lt;td&gt;1x (raw bytes)&lt;/td&gt;
&lt;td&gt;4x multiplier&lt;/td&gt;
&lt;td&gt;Realistic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ReadMDS max_tasks&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;+3100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vLLM engines&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;6 (1/node)&lt;/td&gt;
&lt;td&gt;-70%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preprocessing workers&lt;/td&gt;
&lt;td&gt;160&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;-90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU utilization&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;td&gt;Headroom&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;1M multi-image samples. ~22 samples/s. 12+ hours. Zero OOM crashes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model code didn't change. Every fix was in how data gets to the GPUs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repartition kills streaming.&lt;/strong&gt; It's an AllToAll barrier that forces full materialization. Remove it if your datasource already produces multiple blocks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;BlockMetadata.size_bytes&lt;/code&gt; is your memory contract with Ray.&lt;/strong&gt; For image/video data, in-memory size can be 4-10x on-disk size. Set it explicitly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;More tasks = less memory per task.&lt;/strong&gt; The simplest OOM fix. 512 tasks doesn't mean 512 simultaneous downloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Right-size for physical topology.&lt;/strong&gt; Count GPUs, CPUs, and RAM &lt;em&gt;per node&lt;/em&gt;, not just totals. One engine per node avoids hidden RAM contention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The data pipeline is always the bottleneck at scale.&lt;/strong&gt; At 10K, GPU is the constraint. At 1M, the plumbing is. The inference code doesn't need to change.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Ray Data custom datasources: &lt;a href="https://docs.ray.io/en/latest/data/api/datasource.html" rel="noopener noreferrer"&gt;docs.ray.io&lt;/a&gt; | Performance tips: &lt;a href="https://docs.ray.io/en/latest/data/performance-tips.html" rel="noopener noreferrer"&gt;docs.ray.io&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>machinelearning</category>
      <category>performance</category>
    </item>
    <item>
      <title>Compiling the Vision Encoder: Squeezing 3% More Throughput from Qwen3-VL on Hopper GPUs</title>
      <dc:creator>Mayank Ketkar</dc:creator>
      <pubDate>Mon, 09 Feb 2026 02:08:38 +0000</pubDate>
      <link>https://forem.com/mketkar/compiling-the-vision-encoder-squeezing-3-more-throughput-from-qwen3-vl-on-hopper-gpus-24ma</link>
      <guid>https://forem.com/mketkar/compiling-the-vision-encoder-squeezing-3-more-throughput-from-qwen3-vl-on-hopper-gpus-24ma</guid>
      <description>&lt;p&gt;When you run a vision-language model through vLLM, the framework does something clever: it compiles the LLM decoder with &lt;code&gt;torch.compile&lt;/code&gt;, fuses operators, and captures CUDA graphs for maximum throughput. But there is a component it quietly leaves behind -- the Vision Transformer (ViT) encoder that processes your images. It runs in plain eager mode, every single time.&lt;/p&gt;

&lt;p&gt;We changed that for Qwen3-VL. The result: &lt;strong&gt;3.4% higher throughput&lt;/strong&gt; on an NVIDIA H200, three previously unknown bugs discovered and fixed, and a one-flag change that any vLLM user can enable today.&lt;/p&gt;

&lt;p&gt;This post walks through the engineering story -- why the encoder was left behind, how we ported compilation support from a sibling model, what broke along the way, and what the profiler actually says about where the time goes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Does the Encoder Run Eager?
&lt;/h2&gt;

&lt;p&gt;vLLM's compilation infrastructure is built around the LLM decoder. When you launch an inference server, the startup sequence compiles the decoder's forward pass with &lt;code&gt;torch.compile&lt;/code&gt;, traces its graph, and captures CUDA graphs at various batch sizes. This eliminates Python overhead and enables kernel fusion across attention, LayerNorm, and MLP layers.&lt;/p&gt;

&lt;p&gt;The multimodal encoder -- the ViT that converts raw image pixels into embedding vectors -- gets none of this treatment. The reason is a single boolean flag in vLLM's compilation config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;compile_mm_encoder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Whether or not to compile the multimodal encoder.
Currently, this only works for Qwen2_5_vl and mLLaMa4
models on selected platforms. Disabled by default until
more models are supported/tested to work.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default is &lt;code&gt;False&lt;/code&gt;, and for good reason. Vision encoders face a fundamental tension with compilation: &lt;strong&gt;variable input shapes&lt;/strong&gt;. Different requests can carry images at different resolutions, producing different numbers of patches. CUDA graphs require fixed tensor shapes at capture time. A general-purpose serving framework cannot assume that every image will be the same size.&lt;/p&gt;

&lt;p&gt;But for batch inference workloads with fixed-size images -- which is common in production pipelines processing standardized camera frames, satellite tiles, or document pages -- this conservatism leaves performance on the table. If your images are all the same resolution, the encoder always receives identically shaped tensors, and &lt;code&gt;torch.compile&lt;/code&gt; can fully specialize.&lt;/p&gt;
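&lt;p&gt;Opting in is the one-flag change mentioned at the top. A hedged sketch of what that looks like offline (recent vLLM builds accept a compilation config dict on the &lt;code&gt;LLM&lt;/code&gt; entrypoint; the checkpoint name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-VL-8B-Instruct",  # illustrative checkpoint name
    compilation_config={"compile_mm_encoder": True},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;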

&lt;p&gt;There was a second, more specific problem: &lt;strong&gt;Qwen3-VL simply lacked the compilation decorators.&lt;/strong&gt; Its sibling model, Qwen2.5-VL, already had full &lt;code&gt;torch.compile&lt;/code&gt; support for its encoder. Qwen3-VL shared much of the same architecture (including the identical attention implementation), but the compilation wiring was never ported over.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pattern: Porting from Qwen2.5-VL
&lt;/h2&gt;

&lt;p&gt;vLLM uses a decorator-based system for selective compilation. Rather than compiling an entire model's forward pass (which would break on Python control flow, NumPy calls, and dynamic branching), it compiles individual submodules whose &lt;code&gt;forward()&lt;/code&gt; methods contain only clean tensor operations.&lt;/p&gt;

&lt;p&gt;Qwen2.5-VL already had this wired up for three encoder submodules: &lt;code&gt;VisionPatchEmbed&lt;/code&gt;, &lt;code&gt;VisionBlock&lt;/code&gt;, and &lt;code&gt;VisionPatchMerger&lt;/code&gt;. Our task was to replicate the exact same pattern in Qwen3-VL.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Decorator
&lt;/h3&gt;

&lt;p&gt;Each compilable submodule gets a &lt;code&gt;@support_torch_compile&lt;/code&gt; decorator that declares which tensor dimensions are dynamic and provides a gating function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@support_torch_compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dynamic_arg_dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;enable_if&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;should_torch_compile_mm_vit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Qwen3_VisionPatchEmbed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;dynamic_arg_dims={"x": 0}&lt;/code&gt; tells &lt;code&gt;torch.compile&lt;/code&gt; that dimension 0 of the input tensor &lt;code&gt;x&lt;/code&gt; can vary between calls (different numbers of patches), so it should not bake that shape into the compiled graph. The &lt;code&gt;enable_if&lt;/code&gt; callback is a one-liner that checks whether the user opted in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_torch_compile_mm_vit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;VllmConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compilation_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile_mm_encoder&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;compile_mm_encoder&lt;/code&gt; is &lt;code&gt;False&lt;/code&gt; (the default), the decorator sets &lt;code&gt;self.do_not_compile = True&lt;/code&gt; and the forward pass runs in eager mode -- zero overhead, zero behavior change. When it is &lt;code&gt;True&lt;/code&gt;, the decorator wraps the module in &lt;code&gt;torch.compile&lt;/code&gt; on first call and uses compiled execution from then on.&lt;/p&gt;
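&lt;p&gt;The decorator itself is beyond this post, but the gating idea fits in a few lines (a simplified sketch, not vLLM's implementation, which additionally marks dynamic dims and hooks into CUDA graph capture):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn as nn

class MaybeCompiled(nn.Module):
    """Wraps a submodule; compiles it only when the gate says so."""

    def __init__(self, inner: nn.Module, enabled: bool):
        super().__init__()
        self.do_not_compile = not enabled
        # Disabled: plain eager calls, zero overhead, zero behavior change.
        # Enabled: torch.compile specializes on first call, reuses after.
        self._fn = inner if self.do_not_compile else torch.compile(inner)

    def forward(self, x: torch.Tensor) -&amp;gt; torch.Tensor:
        return self._fn(x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;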

&lt;h3&gt;
  
  
  The Model Tags
&lt;/h3&gt;

&lt;p&gt;The second piece of wiring is &lt;code&gt;set_model_tag&lt;/code&gt;, a context manager that tells the compilation backend to use separate caches for encoder versus decoder components. Without tags, the encoder and decoder would share a single compile cache, causing shape mismatches when the compiler tries to reuse a graph compiled for decoder weight shapes on encoder weights.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;Qwen3_VisionTransformer.__init__()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# DO NOT MOVE THIS IMPORT
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vllm.compilation.backends&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;set_model_tag&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;set_model_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3_VisionPatchEmbed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_encoder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patch_embed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Qwen3_VisionPatchEmbed&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;set_model_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3_VisionPatchMerger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_encoder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Qwen3_VisionPatchMerger&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;

&lt;span class="c1"&gt;# Deepstack mergers need a separate tag (different weight shapes!)
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;set_model_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3_VisionPatchMerger_deepstack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_encoder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deepstack_merger_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModuleList&lt;/span&gt;&lt;span class="p"&gt;([...])&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;set_model_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3_VisionBlock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_encoder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModuleList&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nc"&gt;Qwen3_VisionBlock&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That comment about &lt;code&gt;DO NOT MOVE THIS IMPORT&lt;/code&gt; is not a joke -- it matches the exact pattern in Qwen2.5-VL and relates to import ordering constraints with the compilation backend (see &lt;a href="https://github.com/vllm-project/vllm/issues/27044" rel="noopener noreferrer"&gt;vllm#27044&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Notice the deepstack mergers get their own tag, separate from the main merger. This was not in the original plan. It was the fix for Bug #2, which we will get to shortly.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Gets Compiled
&lt;/h3&gt;

&lt;p&gt;The Qwen3-VL vision encoder has three distinct compilable submodules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Submodule&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Dynamic Dims&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Qwen3_VisionPatchEmbed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reshape + Conv3D + reshape (pixels to patch embeddings)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;x: dim 0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Qwen3_VisionBlock&lt;/code&gt; (x24)&lt;/td&gt;
&lt;td&gt;LayerNorm -&amp;gt; Attention -&amp;gt; Residual -&amp;gt; LayerNorm -&amp;gt; MLP -&amp;gt; Residual&lt;/td&gt;
&lt;td&gt;&lt;code&gt;x, cu_seqlens, cos, sin: dim 0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Qwen3_VisionPatchMerger&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LayerNorm -&amp;gt; Linear -&amp;gt; GELU -&amp;gt; Linear (merge spatial patches)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;x: dim 0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The outer &lt;code&gt;VisionTransformer.forward()&lt;/code&gt; -- which orchestrates these submodules -- is deliberately &lt;strong&gt;not&lt;/strong&gt; compiled. It contains NumPy operations (&lt;code&gt;np.array&lt;/code&gt;, &lt;code&gt;np.cumsum&lt;/code&gt;), Python control flow (&lt;code&gt;isinstance&lt;/code&gt;, list comprehensions), and &lt;code&gt;.tolist()&lt;/code&gt; calls that would cause graph breaks. The per-submodule pattern avoids all of this.&lt;/p&gt;
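&lt;p&gt;You can watch this happen with TorchDynamo's explain utility (PyTorch 2.x; the toy function below is ours, not the ViT code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

def mixed_forward(x: torch.Tensor) -&amp;gt; torch.Tensor:
    # Python-side extraction, like the outer forward's .tolist() calls
    lengths = x.sum(dim=-1).tolist()  # escapes the traced graph
    return x * float(len(lengths))

# Reports the graph break at .tolist(); the compiled submodules
# contain no such calls, hence zero breaks.
print(torch._dynamo.explain(mixed_forward)(torch.ones(4, 8)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;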




&lt;h2&gt;
  
  
  Zero Graph Breaks
&lt;/h2&gt;

&lt;p&gt;The first compile attempt was the moment of truth. We enabled &lt;code&gt;TORCH_LOGS=+dynamo&lt;/code&gt; and &lt;code&gt;TORCH_COMPILE_DEBUG=1&lt;/code&gt;, loaded a handful of test images, and watched TorchDynamo trace through the encoder.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;zero graph breaks&lt;/strong&gt;. The single &lt;code&gt;COMPILING GRAPH&lt;/code&gt; event reported:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COMPILING GRAPH due to GraphCompileReason(
    reason='return_value',
    user_stack=[&amp;lt;FrameSummary file qwen3_vl.py, line 1169 in forward&amp;gt;],
    graph_break=False
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was expected but still satisfying. The per-submodule compilation pattern is specifically designed to isolate clean tensor operations from Python control flow. Each compiled forward method contains nothing but &lt;code&gt;torch&lt;/code&gt; operations -- reshapes, linear projections, attention, LayerNorm, residual additions. No data-dependent control flow, no Python-side data structures, no calls that escape the Dynamo graph.&lt;/p&gt;

&lt;p&gt;The key insight: if you tried to compile the entire &lt;code&gt;VisionTransformer.forward()&lt;/code&gt; as one graph, you would hit graph breaks immediately on the NumPy calls that compute positional embeddings and cumulative sequence lengths. By compiling only the inner submodules, you get all the fusion benefits with none of the graph break headaches.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Bugs Found and Fixed
&lt;/h2&gt;

&lt;p&gt;Zero graph breaks did not mean zero problems. The first full run crashed. Then it crashed differently. Then it crashed a third way. Here is what we found.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug 1: &lt;code&gt;AssertionError: Forward context is not set&lt;/code&gt; in &lt;code&gt;profile_run()&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The crash:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AssertionError: Forward context is not set.
Please use `set_forward_context` to set the forward context.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; When vLLM starts up, it runs a profiling pass (&lt;code&gt;profile_run()&lt;/code&gt;) to determine memory usage. This calls &lt;code&gt;self.model.embed_multimodal()&lt;/code&gt; to profile the encoder. In eager mode, this works fine -- the encoder's forward methods are just regular PyTorch calls.&lt;/p&gt;

&lt;p&gt;But with &lt;code&gt;@support_torch_compile&lt;/code&gt;, the compilation backend wraps each submodule in a &lt;code&gt;CUDAGraphWrapper&lt;/code&gt;. The wrapper's &lt;code&gt;__call__&lt;/code&gt; method reads &lt;code&gt;forward_context.cudagraph_runtime_mode&lt;/code&gt; to decide whether to execute via CUDA graph or fall through to eager. Without a forward context set, it crashes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Wrap the profiling call in &lt;code&gt;set_forward_context&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;set_forward_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attn_metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vllm_config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dummy_encoder_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_multimodal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;batched_dummy_mm_inputs&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since &lt;code&gt;attn_metadata=None&lt;/code&gt;, the wrapper sees &lt;code&gt;CUDAGraphMode.NONE&lt;/code&gt; and falls through to eager execution -- exactly the behavior we want during profiling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug 2: &lt;code&gt;AssertionError: expected size 1024==4096&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The crash:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AssertionError: expected size 1024==4096, stride 1==1 at dim=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Qwen3-VL has two types of patch mergers. The main merger has a LayerNorm over &lt;code&gt;context_dim=1024&lt;/code&gt; (the per-patch hidden size before spatial merging). The deepstack mergers have a LayerNorm over &lt;code&gt;hidden_size=4096&lt;/code&gt; (the full hidden size, via &lt;code&gt;use_postshuffle_norm=True&lt;/code&gt;). Both use the &lt;code&gt;Qwen3_VisionPatchMerger&lt;/code&gt; class.&lt;/p&gt;

&lt;p&gt;In our initial implementation, both mergers shared the same &lt;code&gt;set_model_tag("Qwen3_VisionPatchMerger")&lt;/code&gt; context. This meant they shared a single compiled graph cache. When &lt;code&gt;torch.compile&lt;/code&gt; traced through the main merger (norm weight shape &lt;code&gt;(1024,)&lt;/code&gt;), it cached a graph with that shape baked in. When the deepstack merger tried to reuse the same cached graph with its &lt;code&gt;(4096,)&lt;/code&gt; weights -- crash.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Separate model tags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;set_model_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3_VisionPatchMerger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_encoder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;          &lt;span class="c1"&gt;# LayerNorm over 1024
&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;set_model_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen3_VisionPatchMerger_deepstack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_encoder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deepstack_merger_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# LayerNorm over 4096
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same Python class, different compile caches. The tag system was designed exactly for this -- but you have to remember to use it when two instances of the same class have different weight shapes.&lt;/p&gt;
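&lt;p&gt;The underlying mechanism is easy to see with plain &lt;code&gt;torch.compile&lt;/code&gt;, which guards on tensor shapes and recompiles rather than crashing; the shared-tag cache effectively skipped that guard. A standalone sketch, not vLLM code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn.functional as F

@torch.compile
def merger_norm(x, weight, bias):
    # normalized_shape comes from the weight, as in the shared merger class
    return F.layer_norm(x, weight.shape, weight, bias)

w1, b1 = torch.randn(1024), torch.randn(1024)
w4, b4 = torch.randn(4096), torch.randn(4096)

merger_norm(torch.randn(8, 1024), w1, b1)  # traces a (1024,)-shaped graph
# Plain torch.compile notices the guard failure and recompiles here; the
# shared set_model_tag cache instead replayed the stale (1024,) graph.
merger_norm(torch.randn(8, 4096), w4, b4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;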

&lt;h3&gt;
  
  
  Bug 3: Same as Bug 1, but in &lt;code&gt;_execute_mm_encoder()&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The profiling fix (Bug 1) resolved the startup crash, but the same &lt;code&gt;AssertionError&lt;/code&gt; appeared during actual inference. The encoder execution path in &lt;code&gt;_execute_mm_encoder()&lt;/code&gt; also called &lt;code&gt;embed_multimodal()&lt;/code&gt; without setting forward context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Same pattern -- wrap the encoder execution loop in &lt;code&gt;set_forward_context(attn_metadata=None, ...)&lt;/code&gt;.&lt;/p&gt;
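&lt;p&gt;For symmetry with Bug 1, the shape of the fix (the variable names here are stand-ins for the real batching code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;with set_forward_context(attn_metadata=None, vllm_config=self.vllm_config):
    # mm_kwargs: the batched multimodal inputs for this scheduling step
    encoder_outputs = self.model.embed_multimodal(**mm_kwargs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;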

&lt;h3&gt;
  
  
  Defense-in-Depth
&lt;/h3&gt;

&lt;p&gt;After fixing both call sites, we added a belt-and-suspenders guard in &lt;code&gt;CUDAGraphWrapper.__call__&lt;/code&gt; itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;is_forward_context_available&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;runnable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Eager fallback
&lt;/span&gt;    &lt;span class="n"&gt;forward_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_forward_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If any future code path calls a compiled encoder submodule without setting forward context, it gracefully falls through to eager execution instead of crashing. This is defense-in-depth -- the primary fix is ensuring all call sites set the context, but the guard protects against regressions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Profiling: Where the Time Goes
&lt;/h2&gt;

&lt;p&gt;With compilation working, we instrumented the encoder with &lt;code&gt;torch.cuda.Event&lt;/code&gt; timing to measure exactly how much each component contributes and how much compilation helps.&lt;/p&gt;
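&lt;p&gt;For reference, the measurement pattern is the standard CUDA-event one -- events are recorded on the GPU stream, so they time device work without forcing a sync around every call. The &lt;code&gt;module&lt;/code&gt; and &lt;code&gt;x&lt;/code&gt; below are placeholders, not our actual harness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

def time_gpu_ms(module, x, iters=50):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()                # drain pending work first
    start.record()
    for _ in range(iters):
        module(x)
    end.record()
    torch.cuda.synchronize()                # wait for both events to resolve
    return start.elapsed_time(end) / iters  # milliseconds per call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;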

&lt;h3&gt;
  
  
  The Encoder Is Only 13.5% of Total Inference Time
&lt;/h3&gt;

&lt;p&gt;For Qwen3-VL-2B on our workload, the ViT encoder processes each image once to produce embedding tokens, then the LLM decoder generates the output sequence. The decoder dominates.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Baseline (ms)&lt;/th&gt;
&lt;th&gt;Compiled (ms)&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PatchEmbed&lt;/td&gt;
&lt;td&gt;5.2&lt;/td&gt;
&lt;td&gt;6.2&lt;/td&gt;
&lt;td&gt;-19%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VisionBlocks (24)&lt;/td&gt;
&lt;td&gt;352.5&lt;/td&gt;
&lt;td&gt;330.2&lt;/td&gt;
&lt;td&gt;+6.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PatchMerger&lt;/td&gt;
&lt;td&gt;3.8&lt;/td&gt;
&lt;td&gt;5.3&lt;/td&gt;
&lt;td&gt;-39%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Encoder&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;450.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;430.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  VisionBlocks Win, Small Ops Lose
&lt;/h3&gt;

&lt;p&gt;The 24 VisionBlocks are where compilation shines. Each block runs LayerNorm -&amp;gt; Attention -&amp;gt; Residual -&amp;gt; LayerNorm -&amp;gt; MLP -&amp;gt; Residual. The Inductor backend fuses these into fewer, more efficient kernels. Blocks 1-23 show a consistent &lt;strong&gt;7-8% per-block speedup&lt;/strong&gt;, accumulating to a 22.3ms reduction.&lt;/p&gt;

&lt;p&gt;PatchEmbed and PatchMerger show the opposite: &lt;strong&gt;compilation makes them slower&lt;/strong&gt;. These are tiny operations (~0.3ms per call). The &lt;code&gt;@support_torch_compile&lt;/code&gt; decorator adds Python dispatch overhead on every call, and at this scale, the overhead exceeds the fusion benefit. It is a classic tradeoff -- compilation has a per-call dispatch cost that only pays off when the compiled operation is large enough.&lt;/p&gt;

&lt;p&gt;A pragmatic optimization would be to remove the &lt;code&gt;@support_torch_compile&lt;/code&gt; decorators from PatchEmbed and PatchMerger, compiling only VisionBlocks. The net encoder speedup would actually be &lt;em&gt;slightly higher&lt;/em&gt; without the small-op regressions. But the dispatch overhead is small in absolute terms (a few milliseconds total), and having all submodules wired for compilation maintains consistency with the Qwen2.5-VL pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why 4.4% Encoder Speedup Becomes 3.4% End-to-End
&lt;/h3&gt;

&lt;p&gt;With the encoder representing 13.5% of total inference time, Amdahl's Law says a 4.4% encoder speedup should shave only about 0.6% off total wall time. The measured end-to-end improvement is larger than that simple bound, likely because compilation also trims Python dispatch overhead and improves memory access patterns in ways that benefit the surrounding orchestration code.&lt;/p&gt;
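&lt;p&gt;The arithmetic behind that bound, as a quick sanity check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;encoder_share = 0.135    # encoder fraction of end-to-end time
encoder_speedup = 0.044  # measured 4.4% encoder improvement

# Amdahl's Law: the whole-pipeline gain is capped by the share of
# time actually being accelerated.
expected = encoder_share * encoder_speedup
print(f"{expected:.2%}")  # ~0.59% -- the measured 3.4% clearly exceeds it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;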




&lt;h2&gt;
  
  
  End-to-End Benchmark
&lt;/h2&gt;

&lt;p&gt;We ran a full A/B comparison over ~8,000 samples on an NVIDIA H200, with 10-sample warmup excluded from measurements.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Compiled&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;32.33 samp/s&lt;/td&gt;
&lt;td&gt;33.42 samp/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generate time&lt;/td&gt;
&lt;td&gt;266.1s&lt;/td&gt;
&lt;td&gt;257.4s&lt;/td&gt;
&lt;td&gt;-8.7s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-sample latency&lt;/td&gt;
&lt;td&gt;30.93ms&lt;/td&gt;
&lt;td&gt;29.92ms&lt;/td&gt;
&lt;td&gt;-1.0ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model load time&lt;/td&gt;
&lt;td&gt;37.3s&lt;/td&gt;
&lt;td&gt;50.2s&lt;/td&gt;
&lt;td&gt;+12.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;3.4% throughput improvement&lt;/strong&gt; held at the full-dataset scale. A 100-sample run showed a smaller, noisier gain (+0.9%), which is expected when one-time overheads weigh more heavily on a short measurement.&lt;/p&gt;

&lt;p&gt;The model load time increase (+12.9s) is a one-time cost for Dynamo bytecode transforms and Inductor codegen on the encoder submodules. On subsequent runs, the compilation cache (&lt;code&gt;~/.cache/vllm/torch_compile_cache/&lt;/code&gt;) eliminates recompilation entirely -- subsequent startups are only marginally slower than baseline. In a production serving context, this compilation happens once at server startup and all subsequent inference benefits from the speedup.&lt;/p&gt;

&lt;h3&gt;
  
  
  Break-Even Analysis
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One-time compilation overhead&lt;/td&gt;
&lt;td&gt;12.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-sample time saving&lt;/td&gt;
&lt;td&gt;~1.0ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Break-even point&lt;/td&gt;
&lt;td&gt;~12,900 samples&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the first-ever run (cold compilation cache), you need to process approximately 13,000 samples before the compilation overhead is amortized. For any subsequent run with a warm cache, the benefit is immediate.&lt;/p&gt;
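&lt;p&gt;Checking the table's arithmetic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;overhead_s = 12.9            # one-time cold-cache compilation cost
saved_per_sample_s = 0.001   # ~1.0 ms shaved off each sample
print(overhead_s / saved_per_sample_s)  # 12900.0 samples to break even
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;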




&lt;h2&gt;
  
  
  Output Correctness
&lt;/h2&gt;

&lt;p&gt;An important caveat: compiled and baseline modes produce &lt;strong&gt;slightly different outputs&lt;/strong&gt; on some inputs. This is expected behavior from &lt;code&gt;torch.compile&lt;/code&gt; -- the Inductor backend may apply different operator fusion, reduction ordering, and kernel implementations that change floating-point rounding at the bit level. These tiny differences in intermediate activations can cascade through the encoder, shift logits by small amounts, and occasionally flip the argmax for borderline tokens during autoregressive decoding.&lt;/p&gt;

&lt;p&gt;Concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both modes are &lt;strong&gt;individually deterministic&lt;/strong&gt; -- the same mode always produces the same output for the same input, run after run.&lt;/li&gt;
&lt;li&gt;They are &lt;strong&gt;not identical across modes&lt;/strong&gt; -- baseline and compiled outputs may differ on some samples.&lt;/li&gt;
&lt;li&gt;The differences are small in magnitude and affect only a fraction of samples.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a property of &lt;code&gt;torch.compile&lt;/code&gt; itself, not of our changes. If your application requires bitwise reproducibility between compiled and non-compiled modes, this is worth knowing. If you only need consistency within a single mode (the more common requirement), both modes deliver it.&lt;/p&gt;
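&lt;p&gt;A cheap way to quantify this on your own workload is to dump one output string per line per run and diff the files. A sketch -- the filenames are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def diff_rate(path_a, path_b):
    """Fraction of lines that differ between two output dumps."""
    with open(path_a) as fa, open(path_b) as fb:
        pairs = list(zip(fa, fb))
    return sum(a != b for a, b in pairs) / len(pairs)

# Within-mode determinism: expect exactly 0.0 for both of these.
print(diff_rate("baseline_run1.txt", "baseline_run2.txt"))
print(diff_rate("compiled_run1.txt", "compiled_run2.txt"))
# Across modes: expect a small but nonzero fraction.
print(diff_rate("baseline_run1.txt", "compiled_run1.txt"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;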




&lt;h2&gt;
  
  
  When Would This Matter More?
&lt;/h2&gt;

&lt;p&gt;A 3.4% throughput improvement is real and free (once the cache is warm), but it is bounded by the encoder's share of total inference time. For Qwen3-VL-2B, the ViT encoder is small relative to the LLM decoder. Several scenarios would amplify the benefit:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Larger ViT encoders.&lt;/strong&gt; Qwen3-VL-72B has a proportionally larger vision encoder. The same 7-8% per-block VisionBlock speedup applied to more expensive encoder blocks would yield a larger end-to-end improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Video workloads.&lt;/strong&gt; Video inputs require processing many frames, multiplying encoder invocations per request. The encoder's share of total time increases, and the compilation benefit compounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High-concurrency serving.&lt;/strong&gt; When many requests arrive simultaneously, encoder latency adds up across the batch. Shaving 4.4% off each encoder call reduces queuing delay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bandwidth-bound GPUs.&lt;/strong&gt; The H200 is a compute-rich Hopper GPU. On more bandwidth-constrained hardware like the L40S, the operator fusion from &lt;code&gt;torch.compile&lt;/code&gt; (which reduces memory traffic by eliminating intermediate tensor materializations) would likely yield a larger percentage improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Higher-resolution images.&lt;/strong&gt; More patches per image means more work in the VisionBlocks, which are the primary beneficiaries of compilation.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Enable It
&lt;/h2&gt;

&lt;p&gt;One flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vllm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3-VL-2B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;compilation_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compile_mm_encoder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# ... other settings
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or via the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve Qwen/Qwen3-VL-2B-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--compilation-config&lt;/span&gt; &lt;span class="s1"&gt;'{"compile_mm_encoder": true}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. No model changes, no custom code, no configuration gymnastics. The flag tells vLLM to apply &lt;code&gt;torch.compile&lt;/code&gt; to the ViT encoder submodules during model initialization. The first inference call that includes images will trigger compilation (or load from cache), and all subsequent calls use the compiled kernels.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Run vs. Subsequent Runs
&lt;/h3&gt;

&lt;p&gt;On the very first run with a new model or new vLLM version, you will see a longer model load time (~13s extra) as TorchDynamo traces and Inductor generates code for the encoder submodules. These artifacts are cached to &lt;code&gt;~/.cache/vllm/torch_compile_cache/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On all subsequent runs, the cached artifacts load in seconds, and the throughput benefit is immediate.&lt;/p&gt;
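&lt;p&gt;To confirm the cache is being reused -- or to force a cold start for benchmarking -- the artifacts are ordinary files on disk. A small sketch, using the path noted above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path
import shutil

cache = Path.home() / ".cache/vllm/torch_compile_cache"

if cache.exists():
    # A non-empty directory after the first run means artifacts were written.
    size = sum(f.stat().st_size for f in cache.rglob("*") if f.is_file())
    print(f"{size / 1e6:.1f} MB of compile artifacts cached")
    # shutil.rmtree(cache)  # uncomment to reproduce the cold-start cost
else:
    print("no compile cache yet -- the next run pays the one-time cost")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;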




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This was a small change -- six modifications across two files for the core enablement, plus bug fixes in two more files (summarized below). The pattern was already established by Qwen2.5-VL; we just ported it to Qwen3-VL. But small changes can have disproportionate engineering value when they uncover latent bugs.&lt;/p&gt;

&lt;p&gt;The three bugs we found -- missing &lt;code&gt;set_forward_context&lt;/code&gt; in two encoder execution paths, and shared compile caches for mergers with different weight shapes -- are not specific to Qwen3-VL. They would affect any model that enables &lt;code&gt;compile_mm_encoder&lt;/code&gt;. The fixes (including the defense-in-depth guard in &lt;code&gt;CUDAGraphWrapper&lt;/code&gt;) benefit the entire vLLM multimodal compilation infrastructure.&lt;/p&gt;

&lt;p&gt;The profiling results tell an honest story: the ViT encoder is a small fraction of end-to-end time for a 2B parameter model, so even a solid 4.4% encoder speedup translates to a modest 3.4% end-to-end gain. But it is a free 3.4% -- one flag, cached after the first run, no accuracy impact within a single mode. For larger models, video workloads, or bandwidth-constrained hardware, the benefit would be larger.&lt;/p&gt;

&lt;p&gt;Sometimes the most useful engineering work is not building something new, but noticing that a capability already exists in the codebase and was never wired up for your model.&lt;/p&gt;




&lt;h3&gt;
  
  
  Summary of Changes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm/model_executor/models/qwen3_vl.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@support_torch_compile&lt;/code&gt; decorators on 3 encoder submodules + &lt;code&gt;set_model_tag&lt;/code&gt; wiring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm/config/compilation.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Updated &lt;code&gt;compile_mm_encoder&lt;/code&gt; docstring to include Qwen3-VL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm/v1/worker/gpu_model_runner.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;set_forward_context&lt;/code&gt; wrapper in &lt;code&gt;_execute_mm_encoder()&lt;/code&gt; and &lt;code&gt;profile_run()&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm/compilation/cuda_graph.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;is_forward_context_available()&lt;/code&gt; guard in &lt;code&gt;CUDAGraphWrapper.__call__&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Hardware and Software
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: NVIDIA H200 (141 GB HBM3e)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt;: 0.15.x (main branch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyTorch&lt;/strong&gt;: 2.9.x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: Qwen3-VL-2B-Instruct (fine-tuned checkpoint)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload&lt;/strong&gt;: ~8,000 fixed-resolution images, single GPU, &lt;code&gt;temperature=0.0&lt;/code&gt;, &lt;code&gt;max_tokens=128&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>vllm</category>
      <category>pytorch</category>
      <category>gpu</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
