<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Elise Moreau</title>
    <description>The latest articles on Forem by Elise Moreau (@elise_moreau).</description>
    <link>https://forem.com/elise_moreau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864909%2F72833c18-30db-4456-82ee-e7d2016cc38f.jpg</url>
      <title>Forem: Elise Moreau</title>
      <link>https://forem.com/elise_moreau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/elise_moreau"/>
    <language>en</language>
    <item>
      <title>Why Your Diffusion Model Is Slow at Inference (And It's Not the UNet)</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Mon, 27 Apr 2026 04:48:07 +0000</pubDate>
      <link>https://forem.com/elise_moreau/why-your-diffusion-model-is-slow-at-inference-and-its-not-the-unet-3m0h</link>
      <guid>https://forem.com/elise_moreau/why-your-diffusion-model-is-slow-at-inference-and-its-not-the-unet-3m0h</guid>
<description>&lt;p&gt;&lt;strong&gt;TL;DR: Most inference bottlenecks in diffusion pipelines are not in the UNet denoising loop. They are in the VAE decoder, the text encoder on first call, and CPU-GPU synchronization between steps. Profile before you optimize. In practice, a 30% speedup often comes from fixing the 5% of the code nobody looks at.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent three weeks last month trying to make a Stable Diffusion XL variant run faster on A10G. The model was trained in-house for product photography. Inference was around 4.2 seconds per image at 1024x1024, 30 steps. Target was under 2 seconds.&lt;/p&gt;

&lt;p&gt;My first instinct was wrong. I went straight to the UNet. Compiled it with &lt;code&gt;torch.compile&lt;/code&gt;, tried different attention implementations, looked at FlashAttention-3. I got it from 3.1s to 2.7s on the UNet alone. Nice. But total pipeline time barely moved.&lt;/p&gt;

&lt;p&gt;Then I actually profiled.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the profile showed
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.profiler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;record_shapes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;prof&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_inference_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prof&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;key_averages&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sort_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda_time_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The breakdown was not what I expected:&lt;/p&gt;

&lt;p&gt;| Component | Time (ms) | % of pipeline |&lt;br&gt;
|---|---|---|&lt;br&gt;
| UNet forward (30 steps) | 2700 | 64% |&lt;br&gt;
| VAE decoder | 890 | 21% |&lt;br&gt;
| Text encoder (first call) | 340 | 8% |&lt;br&gt;
| Scheduler + CPU ops | 270 | 6% |&lt;/p&gt;

&lt;p&gt;The VAE decoder, which runs once at the end, was taking almost a quarter of total latency. The text encoders, which I assumed were negligible, were non-trivial on the first call because of kernel compilation.&lt;/p&gt;

&lt;p&gt;The nuance here is that people optimize what they read about. Every blog post is about UNet attention. Almost nobody writes about the VAE.&lt;/p&gt;
&lt;h2&gt;
  
  
  Fixing the VAE
&lt;/h2&gt;

&lt;p&gt;SDXL's VAE decoder processes a 128x128x4 latent into a 1024x1024x3 image. The default implementation in diffusers runs in fp32 for numerical stability. The tiled decoder, which splits the latent into patches, is even slower but uses less memory.&lt;/p&gt;

&lt;p&gt;Three things helped:&lt;/p&gt;

&lt;p&gt;First, cast the VAE to bf16. The numerical argument for fp32 is weak on modern GPUs. I ran a small eval on 500 prompts, comparing LPIPS and a CLIP-based aesthetic score between fp32 and bf16 outputs. Differences were within noise. For background, the SDXL technical report touches on this, but the TAESD work from madebyollin is where the practical tricks live.&lt;/p&gt;
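&lt;p&gt;The cast itself is one line on the pipeline object (&lt;code&gt;pipe.vae.to(dtype=torch.bfloat16)&lt;/code&gt;). A minimal CPU-runnable sketch of the dtype pattern, with a toy convolution standing in for the real decoder:&lt;/p&gt;

```python
import torch
import torch.nn as nn

# Toy stand-in for the VAE decoder; the real one is pipe.vae from
# diffusers. Only the dtype-cast pattern here carries over.
decoder = nn.Conv2d(4, 3, kernel_size=3, padding=1).eval()
decoder = decoder.to(dtype=torch.bfloat16)

# Latents handed to a bf16 decoder must be cast to match its dtype.
latent = torch.randn(1, 4, 128, 128).to(torch.bfloat16)
with torch.no_grad():
    image = decoder(latent)
```

&lt;p&gt;The same rule applies on the real pipeline: whatever you pass to the decoder must match the dtype you cast it to.&lt;/p&gt;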

&lt;p&gt;Second, use &lt;code&gt;channels_last&lt;/code&gt; memory format for the VAE. This one is documented but rarely applied:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channels_last&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reduce-overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fullgraph&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Third, if you do not need full 1024x1024 decoding quality, swap in TAESD (Tiny AutoEncoder). It is a distilled VAE that decodes 8x faster. Quality is worse for fine details but fine for thumbnails and previews. We use the full VAE for final renders and TAESD for the interactive preview in the product UI.&lt;/p&gt;
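&lt;p&gt;A sketch of the swap, using the &lt;code&gt;AutoencoderTiny&lt;/code&gt; class from diffusers and the taesdxl checkpoint. The helper name and the bf16 choice are mine, not from the TAESD repo:&lt;/p&gt;

```python
def use_taesd_for_previews(pipe):
    """Swap the full SDXL VAE for TAESD on a diffusers pipeline (sketch).

    Assumes the madebyollin/taesdxl checkpoint and a diffusers version
    that ships AutoencoderTiny. Downloads weights on first call, so run
    it at startup, not per request.
    """
    import torch
    from diffusers import AutoencoderTiny

    pipe.vae = AutoencoderTiny.from_pretrained(
        "madebyollin/taesdxl", torch_dtype=torch.bfloat16
    )
    return pipe
```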

&lt;p&gt;Combined, VAE time dropped from 890ms to 210ms.&lt;/p&gt;

&lt;h2&gt;
  
  
  The text encoder trap
&lt;/h2&gt;

&lt;p&gt;On the first pipeline call, the text encoders compile their kernels. If you are benchmarking with a single prompt, you pay this cost once and it looks small. In production, if you have cold starts on autoscaled GPUs, every new replica eats that 300-400ms on the first request.&lt;/p&gt;

&lt;p&gt;The solution is unglamorous: warm up the encoders at startup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;warmup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dummy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a product on a white background&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dummy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this during container startup, not on first user request.&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU sync between steps
&lt;/h2&gt;

&lt;p&gt;This one took me a while to find. In the scheduler step, there are small tensor operations that implicitly synchronize GPU and CPU. On A10G with a well-tuned UNet, these become visible. You see it in the profiler as gaps between CUDA kernel launches.&lt;/p&gt;

&lt;p&gt;The fix is either a custom scheduler that keeps everything on GPU, or using &lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt; to capture the full denoising loop. Graphs are fragile: they break if any input shape changes. But for a fixed-resolution product they are worth it. I got another 8% off pipeline time this way.&lt;/p&gt;
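&lt;p&gt;A sketch of the graph-capture pattern, following the &lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt; recipe from the PyTorch docs. Here &lt;code&gt;unet_step&lt;/code&gt; is a placeholder for one scheduler+UNet iteration and is assumed shape-stable with no CPU-side branching; off-GPU it just falls back to eager execution:&lt;/p&gt;

```python
import torch

def make_graphed_step(unet_step, example_latents):
    """Capture one fixed-shape denoising step in a CUDA graph (sketch).

    unet_step stands in for a single scheduler+UNet iteration; it must
    be shape-stable and free of CPU-side control flow.
    """
    if not torch.cuda.is_available():
        return unet_step  # eager fallback for CPU-only environments

    static_in = example_latents.clone()
    # Warm up on a side stream before capture, per the PyTorch docs.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        static_out = unet_step(static_in)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = unet_step(static_in)

    def replay(latents):
        # Inputs must match example_latents in shape and dtype.
        static_in.copy_(latents)
        g.replay()
        return static_out

    return replay
```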

&lt;p&gt;If you route through a gateway that fronts multiple model backends (an internal Triton server, Replicate, fal), the gateway itself adds 20-80ms depending on implementation. Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;), LiteLLM, and Portkey sit in this space. Measure your gateway overhead before you blame the model. We saw 35ms of unnecessary latency from a naive proxy before we switched.&lt;/p&gt;
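&lt;p&gt;Measuring that overhead needs no special tooling: time the same request directly against the backend and through the gateway, then compare medians. A minimal sketch (the helper name is mine):&lt;/p&gt;

```python
import time

def p50_latency_ms(call, n=20):
    """Median wall-clock latency of call() in milliseconds over n tries.

    Pass a zero-arg function: once hitting the backend directly, once
    making the same request through the gateway. The difference in
    medians is your gateway overhead.
    """
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        call()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[n // 2]
```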

&lt;h2&gt;
  
  
  Final numbers
&lt;/h2&gt;

&lt;p&gt;After all the above:&lt;/p&gt;

&lt;p&gt;| Stage | Before (ms) | After (ms) |&lt;br&gt;
|---|---|---|&lt;br&gt;
| Text encode | 340 | 12 (warmed) |&lt;br&gt;
| UNet 30 steps | 2700 | 2100 |&lt;br&gt;
| VAE decode | 890 | 210 |&lt;br&gt;
| Scheduler/sync | 270 | 90 |&lt;br&gt;
| &lt;strong&gt;Total&lt;/strong&gt; | &lt;strong&gt;4200&lt;/strong&gt; | &lt;strong&gt;2410&lt;/strong&gt; |&lt;/p&gt;

&lt;p&gt;Still above target. To hit 2s we dropped to 24 steps with a DPM++ 2M Karras scheduler. Acceptable quality trade-off for our use case.&lt;/p&gt;
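&lt;p&gt;In diffusers, that scheduler swap can be built from the pipeline's existing scheduler config; a sketch, assuming a standard SDXL pipeline (the helper is illustrative):&lt;/p&gt;

```python
def use_dpmpp_2m_karras(pipe, steps=24):
    """Switch a diffusers pipeline to DPM++ 2M Karras (sketch).

    Built from the pipeline's own scheduler config. `steps` is the
    reduced step count we settled on; tune it for your quality bar.
    """
    from diffusers import DPMSolverMultistepScheduler

    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config,
        algorithm_type="dpmsolver++",
        use_karras_sigmas=True,
    )
    return pipe, steps
```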

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Casting the VAE to bf16 is fine for photographic content. For pixel art or content with hard edges, fp32 can preserve small structures better. Test on your data.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;torch.compile&lt;/code&gt; in reduce-overhead mode uses CUDA graphs internally. It is strict about input shapes. Dynamic batch sizes or resolutions will trigger recompilation, which costs seconds. Pin your shapes or expect volatility.&lt;/p&gt;

&lt;p&gt;TAESD is not a free lunch. Look at outputs manually before shipping. It is a lossy compression of the VAE, and the losses are not always perceptually small.&lt;/p&gt;

&lt;p&gt;CUDA graph capture can hide memory leaks. If you see OOM on long-running workers, disable graphs and re-profile before assuming the model is the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SDXL technical report: &lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2307.01952&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TAESD repository by madebyollin: &lt;a href="https://github.com/madebyollin/taesd" rel="noopener noreferrer"&gt;https://github.com/madebyollin/taesd&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyTorch 2 compile notes on memory formats: &lt;a href="https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html" rel="noopener noreferrer"&gt;https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;NVIDIA Nsight Systems for GPU profiling: &lt;a href="https://developer.nvidia.com/nsight-systems" rel="noopener noreferrer"&gt;https://developer.nvidia.com/nsight-systems&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Diffusers optimization guide: &lt;a href="https://huggingface.co/docs/diffusers/optimization/fp16" rel="noopener noreferrer"&gt;https://huggingface.co/docs/diffusers/optimization/fp16&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>pytorch</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Kimi K2.6 Is a Legit Opus 4.7 Replacement</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Mon, 27 Apr 2026 04:39:40 +0000</pubDate>
      <link>https://forem.com/elise_moreau/kimi-k26-is-a-legit-opus-47-replacement-1fci</link>
      <guid>https://forem.com/elise_moreau/kimi-k26-is-a-legit-opus-47-replacement-1fci</guid>
<description>&lt;p&gt;For a long time, Opus 4.7 has been the default recommendation when someone asks for a top-tier model. It has been reliable, capable, and strong across a wide range of tasks.&lt;/p&gt;

&lt;p&gt;After spending real time with Kimi K2.6 and gathering feedback from customers using it in production workflows, I have started to change my mind. It is the first model I feel comfortable recommending as a practical replacement for Opus 4.7.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not better, but close enough&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kimi K2.6 is not outright better than Opus 4.7. If you are comparing raw performance on difficult reasoning or edge case tasks, Opus still wins.&lt;/p&gt;

&lt;p&gt;What matters more in practice is coverage. Kimi K2.6 can handle around 85 percent of the tasks that Opus can, and it does so at a quality level that is good enough for real work. That gap sounds large on paper, but in day-to-day usage it is surprisingly small.&lt;/p&gt;

&lt;p&gt;Most users are not constantly pushing models to their limits. They need something that works consistently across writing, coding, research, and general problem solving. In that context, Kimi K2.6 holds up very well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strong features that actually matter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two areas where Kimi K2.6 stands out are vision and browser use.&lt;/p&gt;

&lt;p&gt;Vision is not just a checkbox feature here. It is genuinely useful for workflows that involve screenshots, documents, or UI-level debugging. Being able to mix text and visual context smoothly removes a lot of friction.&lt;/p&gt;

&lt;p&gt;Browser use is another big win. It handles multi-step information gathering better than expected, especially for longer tasks where the model needs to plan, search, and refine results over time.&lt;/p&gt;

&lt;p&gt;These features are not always the headline benchmarks, but they have a real impact on productivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surprisingly good at long-horizon tasks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the more unexpected strengths of Kimi K2.6 is how well it handles longer-horizon work.&lt;/p&gt;

&lt;p&gt;I have been slowly replacing parts of my personal workflows with it, including tasks that require multiple steps, iteration, and context retention. It performs more reliably than I expected, and it does not fall apart as quickly over extended interactions.&lt;/p&gt;

&lt;p&gt;This makes it useful for things like research threads, content pipelines, and multi-step coding tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The size question and what it signals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kimi K2.6 is a very large model. There is no getting around that.&lt;/p&gt;

&lt;p&gt;But its performance raises an interesting point. Frontier models like Opus 4.7 are not necessarily introducing completely new capabilities. Instead, we are seeing strong alternatives that can replicate most of that value.&lt;/p&gt;

&lt;p&gt;If a model can deliver 80 to 90 percent of the experience, the remaining gap starts to matter less, especially when other factors come into play.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limits, cost, and the shift to local&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the biggest complaints around models like Opus 4.7 is usage limits. As demand increases, constraints become more noticeable.&lt;/p&gt;

&lt;p&gt;This is where models like Kimi K2.6 become more attractive. There is growing interest in running models locally or in more controlled environments, where limits are less of a concern.&lt;/p&gt;

&lt;p&gt;It feels like the conversation is starting to shift. Instead of chasing the absolute best model, people are looking for models that are good enough, flexible, and easier to integrate into their own systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kimi K2.6 is not a perfect replacement for Opus 4.7. If you need the absolute best performance on every task, Opus is still ahead.&lt;/p&gt;

&lt;p&gt;But for most real-world use cases, Kimi K2.6 gets you very close. Close enough that the trade-offs start to make sense.&lt;/p&gt;

&lt;p&gt;That is what makes it interesting. Not that it beats Opus, but that it makes you question whether you still need Opus at all.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>automation</category>
    </item>
    <item>
      <title>Diffusion Model Inference in Production: What the Benchmarks Leave Out</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:31:27 +0000</pubDate>
      <link>https://forem.com/elise_moreau/diffusion-model-inference-in-production-what-the-benchmarks-leave-out-2669</link>
      <guid>https://forem.com/elise_moreau/diffusion-model-inference-in-production-what-the-benchmarks-leave-out-2669</guid>
<description>

&lt;p&gt;The routing overhead caught us off guard. We were running caption generation through a larger model for every input when 70% of them only needed a fast small model. Adding a gateway with cost-aware routing (we landed on Bifrost for this, though LiteLLM and Portkey do the same thing: &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) cut LLM spend in our vision pipeline by 38% without touching the heavy-model cases.&lt;/p&gt;

</description>
      <category>pytorch</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
