<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Elise Moreau</title>
    <description>The latest articles on Forem by Elise Moreau (@elise_moreau).</description>
    <link>https://forem.com/elise_moreau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864909%2F72833c18-30db-4456-82ee-e7d2016cc38f.jpg</url>
      <title>Forem: Elise Moreau</title>
      <link>https://forem.com/elise_moreau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/elise_moreau"/>
    <language>en</language>
    <item>
      <title>Cost accounting for diffusion image generation at $0.0008 per render</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Mon, 25 May 2026 14:53:09 +0000</pubDate>
      <link>https://forem.com/elise_moreau/cost-accounting-for-diffusion-image-generation-at-00008-per-render-8j</link>
      <guid>https://forem.com/elise_moreau/cost-accounting-for-diffusion-image-generation-at-00008-per-render-8j</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Per-image cost on our SDXL-based product photography pipeline at Photoroom dropped from $0.0031 to $0.0008 over six months. Most of the win came from boring infrastructure work, not model tricks. An AI gateway in front of our text-conditioning calls saved more than I expected.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent most of Q1 staring at a Grafana panel labelled &lt;code&gt;cost_per_render_eur&lt;/code&gt;. Our diffusion pipeline generates background-replaced product images at volume. When marketing asks for a million renders, the per-image number matters.&lt;/p&gt;

&lt;p&gt;To be precise: the cost I track is GPU-seconds on A100/H100 SXM nodes plus any external API calls plus storage IO. Not amortised salaries, not the office espresso machine. Just the marginal cost of one more render.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the money actually goes
&lt;/h2&gt;

&lt;p&gt;Before I started measuring properly, I assumed the UNet denoising loop was 80%+ of the cost. It wasn't.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;% of wall time&lt;/th&gt;
&lt;th&gt;% of cost&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Text encoder (CLIP + T5)&lt;/td&gt;
&lt;td&gt;4%&lt;/td&gt;
&lt;td&gt;11%&lt;/td&gt;
&lt;td&gt;T5-XXL is expensive on H100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM caption rewriting&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;td&gt;22%&lt;/td&gt;
&lt;td&gt;External API, GPT-4o-mini initially&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNet denoising (25 steps)&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;td&gt;48%&lt;/td&gt;
&lt;td&gt;DPM++ 2M Karras&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAE decode&lt;/td&gt;
&lt;td&gt;9%&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;td&gt;fp16, no tricks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage IO + image post&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;td&gt;S3 multipart, sharpen, resize&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The caption-rewriting step shocked me. We use an LLM to take a customer prompt like "white sneaker on beach" and expand it into a diffusion-friendly description with lighting, framing, camera details. That single API call was 22% of cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Killing the bill in three places
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — UNet quantisation to int8.&lt;/strong&gt; Used torchao + a small calibration set of 512 product images. Quality drop measured by CLIP-similarity on a held-out set: 0.847 to 0.841. Negligible. Throughput went from 14 renders/sec to 23 renders/sec on an H100. That's a 39% cost drop on the dominant stage.&lt;/p&gt;
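&lt;p&gt;The 39% figure follows directly from the throughput numbers, assuming the price per GPU-second is fixed so that cost per render scales inversely with throughput. A quick check of the arithmetic (the function name is illustrative only):&lt;/p&gt;

```python
def cost_drop_from_throughput(before_rps: float, after_rps: float) -> float:
    """Fractional drop in cost per render when throughput rises.

    Assumes a fixed price per GPU-second, so cost per render is
    proportional to 1 / throughput.
    """
    return 1.0 - before_rps / after_rps


drop = cost_drop_from_throughput(14.0, 23.0)
print(f"{drop:.1%}")  # 39.1%
```

&lt;p&gt;Note that the throughput gain itself is about 64%; the cost drop is smaller because it is measured against the new, higher throughput.&lt;/p&gt;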

&lt;p&gt;&lt;strong&gt;Step 2 — Caching the text-encoder outputs.&lt;/strong&gt; For our product taxonomy, only about 4,000 unique caption stems exist (variations on "minimalist white background", "studio lighting from upper-left", etc.). T5-XXL embeddings for these are 14KB each. I cached them in Redis with a 30-day TTL. Hit rate after two weeks: 91%. Text-encoder cost dropped from 11% to 1.2%.&lt;/p&gt;
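&lt;p&gt;A minimal sketch of the caching pattern, using an in-memory dictionary as a stand-in for Redis so it runs anywhere. The class, the &lt;code&gt;fake_encode&lt;/code&gt; function, and the example stems are hypothetical; the real pipeline would call the CLIP + T5 encoders and store the serialized embeddings in Redis with the same TTL.&lt;/p&gt;

```python
import time


class EmbeddingCache:
    """In-memory stand-in for the Redis cache (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.store = {}  # caption stem -> (expiry time, embedding)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, stem: str, encode):
        now = self.clock()
        entry = self.store.get(stem)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        embedding = encode(stem)  # the expensive text-encoder call
        self.store[stem] = (now + self.ttl_seconds, embedding)
        return embedding

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_encode(stem: str):
    # Placeholder for the real text encoder; returns a dummy vector.
    return [float(len(stem))]


cache = EmbeddingCache(ttl_seconds=30 * 24 * 3600)  # 30-day TTL
for stem in ["studio lighting", "white background", "studio lighting"]:
    cache.get_or_compute(stem, fake_encode)
print(cache.hits, cache.misses)  # 1 2
```

&lt;p&gt;At roughly 4,000 stems of 14KB each, the full working set is about 56MB, so memory is not the constraint here; the hit rate is.&lt;/p&gt;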

&lt;p&gt;&lt;strong&gt;Step 3 — The gateway problem.&lt;/strong&gt; This is where it got interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LLM caption step was the messy one
&lt;/h2&gt;

&lt;p&gt;The caption-rewriting calls were originally direct OpenAI API hits from our Python ranking service. When OpenAI had a partial outage in late January (the one that affected &lt;code&gt;gpt-4o-mini&lt;/code&gt; specifically for ~40 minutes), we lost 280k renders. The cost of those failed renders, billed but not delivered, was around €890.&lt;/p&gt;

&lt;p&gt;I put Bifrost in front. The choice was between LiteLLM, Portkey, and Bifrost. I'll be honest about the comparison.&lt;/p&gt;

&lt;p&gt;LiteLLM has wider provider coverage in the Python ecosystem and a more mature semantic-cache integration with langchain-style apps. If your stack is pure Python and you live inside LangChain, it's a more natural fit.&lt;/p&gt;

&lt;p&gt;Portkey's UI for prompt management is genuinely nicer than what Bifrost ships, and their guardrail catalog has more pre-built rules.&lt;/p&gt;

&lt;p&gt;I picked Bifrost because (a) it's a Go binary with a single HTTP endpoint and our caption service is Go, (b) the &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;automatic fallbacks&lt;/a&gt; between providers work without me writing routing logic, and (c) the &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; layer sits at the gateway so my Python preprocessing service and Go caption service share the cache.&lt;/p&gt;

&lt;p&gt;This config replaced about 140 lines of fallback logic in our caption service:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY_PRIMARY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY_BACKUP&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_KEY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;

&lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;
    &lt;span class="na"&gt;secondary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-haiku-4-5&lt;/span&gt;
    &lt;span class="na"&gt;tertiary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;

&lt;span class="na"&gt;semantic_cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;similarity_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.94&lt;/span&gt;
  &lt;span class="na"&gt;ttl_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;604800&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 0.94 similarity threshold matters. We tested 0.90, 0.92, 0.94, and 0.96 on 10,000 caption pairs and measured downstream image quality. Below 0.94, the cached caption sometimes mismatched the product category enough to confuse the UNet. At 0.96, hit rate dropped under 30% and the cost win disappeared.&lt;/p&gt;
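&lt;p&gt;To make the threshold concrete, here is a rough sketch of how a similarity-threshold cache lookup behaves in general. This is not Bifrost's implementation; the function names and the toy 2-D vectors are assumptions for illustration, and real embeddings would come from an embedding model.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def lookup(query_vec, cache_entries, threshold=0.94):
    """Return the cached caption most similar to the query, or None.

    cache_entries is a list of (embedding, caption) pairs.
    """
    best_caption, best_score = None, -1.0
    for vec, caption in cache_entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_caption, best_score = caption, score
    return best_caption if best_score >= threshold else None


entries = [([1.0, 0.0], "sneaker, beach, golden hour")]
print(lookup([0.95, 0.05], entries))  # close enough: returns the cached caption
print(lookup([0.6, 0.8], entries))    # similarity 0.6: returns None
```

&lt;p&gt;Raising the threshold shrinks the set of queries that count as "close enough", which is why hit rate and mismatch risk move in opposite directions.&lt;/p&gt;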

&lt;p&gt;Current numbers after one month with the gateway in place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caption API spend: down 61% (semantic cache hit rate of 47%)&lt;/li&gt;
&lt;li&gt;Caption-step latency p95: 340ms to 110ms on cache hits&lt;/li&gt;
&lt;li&gt;Failed render rate from upstream LLM issues: 0.31% to 0.04%&lt;/li&gt;
&lt;li&gt;Caption share of total cost: 22% to 8.2%&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Quantisation to int8 cost me about three weekends of calibration tuning. For very high-end fashion shoots where we render at 2048x2048, the quality drop becomes visible in fine fabric weave. We keep an fp16 path for those.&lt;/p&gt;

&lt;p&gt;The semantic cache occasionally returns a "close enough" caption that doesn't match a niche product category. For our long-tail (about 4% of requests), I disable the cache via a header per-call. The gateway supports this through request metadata.&lt;/p&gt;

&lt;p&gt;Bifrost's clustering features are gated to enterprise, which is fine for our scale, but if I were running this across three regions I'd want to evaluate that cost honestly. Portkey's pricing for similar features came in lower for the team-collaboration tier.&lt;/p&gt;

&lt;p&gt;I haven't routed the image-generation calls themselves through the gateway. The UNet runs on our own GPUs, not behind an LLM API, so the gateway adds no value there. Don't put infrastructure in places it doesn't earn its keep.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Bifrost semantic caching docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Bifrost retries and fallbacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pytorch/ao" rel="noopener noreferrer"&gt;torchao quantisation recipes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;SDXL paper, Podell et al.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2211.01095" rel="noopener noreferrer"&gt;DPM-Solver++ paper, Lu et al.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>infrastructure</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why your diffusion model is slow at batch size 1 (and what actually helps)</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Fri, 22 May 2026 05:37:23 +0000</pubDate>
      <link>https://forem.com/elise_moreau/why-your-diffusion-model-is-slow-at-batch-size-1-and-what-actually-helps-16e0</link>
      <guid>https://forem.com/elise_moreau/why-your-diffusion-model-is-slow-at-batch-size-1-and-what-actually-helps-16e0</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Single-image diffusion inference is bottlenecked by kernel launch overhead and attention memory traffic, not raw FLOPs. torch.compile with mode="reduce-overhead", a fused attention backend, and CFG batching get you most of the way before you reach for distillation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spend a lot of time looking at flame graphs from production diffusion pipelines. The pattern is almost always the same. The team profiles their model, sees 50 steps of a UNet or DiT, and assumes the path to lower latency is fewer steps. So they try LCM, then TCD, then some flavor of consistency distillation, and the quality drops in ways the product team notices.&lt;/p&gt;

&lt;p&gt;The nuance here is that at batch size 1, your GPU is mostly idle. You are not compute-bound. You are launch-bound and memory-bound. Distillation helps eventually, but only after you have fixed the boring things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the profiler actually shows
&lt;/h2&gt;

&lt;p&gt;Run a vanilla SDXL or a 1B-parameter DiT at 1024x1024, batch 1, on an H100. Capture a trace with &lt;code&gt;torch.profiler&lt;/code&gt; and zoom into a single denoising step.&lt;/p&gt;

&lt;p&gt;You will see something like this, roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~30-40% of wall time inside attention kernels&lt;/li&gt;
&lt;li&gt;~20-25% inside conv and linear layers&lt;/li&gt;
&lt;li&gt;~15-20% in layernorm, GELU, residual adds&lt;/li&gt;
&lt;li&gt;The rest: kernel launch gaps, host-to-device syncs, Python overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last bucket is the embarrassing one. On an H100 a kernel launch costs ~5 microseconds. A UNet step fires hundreds of kernels. A 50-step sample fires tens of thousands. You are paying for the privilege of dispatching work, not for the work itself.&lt;/p&gt;
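&lt;p&gt;A back-of-the-envelope version of that bill. The kernel count per step is an assumption (the text only says "hundreds"), and the result is a rough upper bound, since launches can overlap with kernels that are still running:&lt;/p&gt;

```python
LAUNCH_COST_US = 5.0     # per-kernel launch cost quoted above
KERNELS_PER_STEP = 600   # assumed; measure this in your own trace
STEPS = 50

launches = KERNELS_PER_STEP * STEPS
overhead_ms = launches * LAUNCH_COST_US / 1000.0
print(launches, overhead_ms)  # 30000 150.0
```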

&lt;p&gt;To be precise: the same model at batch 8 often runs in less than 2x the wall time of batch 1. That gap is your overhead bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step one: torch.compile, but the right mode
&lt;/h2&gt;

&lt;p&gt;The default &lt;code&gt;torch.compile(model)&lt;/code&gt; call uses &lt;code&gt;mode="default"&lt;/code&gt;, which optimizes for compile time and flexibility. For inference you want:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;unet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;unet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reduce-overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fullgraph&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dynamic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;reduce-overhead&lt;/code&gt; enables CUDA graphs, which replay a captured sequence of kernels in one launch. This is the single largest win for batch 1 diffusion on modern GPUs. In my measurements on PyTorch 2.3, this alone takes a 1024x1024 SDXL UNet step from ~42ms to ~28ms on H100. No quality change, no architecture change.&lt;/p&gt;

&lt;p&gt;The catch: &lt;code&gt;fullgraph=True&lt;/code&gt; will yell at you about any graph break. CFG implementations that branch on &lt;code&gt;guidance_scale&lt;/code&gt; need rewriting. Custom samplers that touch &lt;code&gt;.item()&lt;/code&gt; between steps will break CUDA graph capture. Plan for a day of fighting this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step two: pick an attention backend on purpose
&lt;/h2&gt;

&lt;p&gt;PyTorch's &lt;code&gt;scaled_dot_product_attention&lt;/code&gt; dispatches to one of several backends. The defaults are not always right for high-resolution diffusion.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FlashAttention-2&lt;/td&gt;
&lt;td&gt;Long sequences, H100/A100&lt;/td&gt;
&lt;td&gt;Default on most setups, good general choice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FlashAttention-3&lt;/td&gt;
&lt;td&gt;H100 only&lt;/td&gt;
&lt;td&gt;~1.5x faster than FA2 on Hopper, requires manual install&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;xFormers memory-efficient&lt;/td&gt;
&lt;td&gt;Older GPUs (V100, T4)&lt;/td&gt;
&lt;td&gt;Lower memory, slower than Flash on modern hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Math (fallback)&lt;/td&gt;
&lt;td&gt;Debugging only&lt;/td&gt;
&lt;td&gt;Never ship this&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For DiT-style models at 2K resolution the sequence length per attention block hits 16K+ tokens. FA3 on H100 is a real difference there. I have seen 18% end-to-end latency drop on a 2B DiT just from switching FA2 to FA3 via &lt;code&gt;torch.nn.attention.sdpa_kernel&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step three: batch your CFG
&lt;/h2&gt;

&lt;p&gt;Classifier-free guidance runs the model twice per step, once conditional and once unconditional. Most reference implementations call the UNet twice sequentially. Do not do this.&lt;/p&gt;

&lt;p&gt;Concatenate the two prompts into one batch of 2, run one forward pass, split the output. On batch 1 this nearly halves your per-step latency because you were leaving the GPU idle anyway. The memory cost is negligible at typical inference resolutions.&lt;/p&gt;

&lt;p&gt;This is a 3-line change, yet the sequential double call still lives in maybe 60% of the codebases I review.&lt;/p&gt;
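&lt;p&gt;A toy sketch of the batched form. Plain floats stand in for tensors and &lt;code&gt;ToyModel&lt;/code&gt; is a made-up stand-in for the UNet so the example runs without a GPU; with real tensors the same shape of code would use &lt;code&gt;torch.cat&lt;/code&gt; to build the batch of 2 and &lt;code&gt;chunk(2)&lt;/code&gt; to split the output.&lt;/p&gt;

```python
class ToyModel:
    """Stand-in for a UNet: counts forward passes, returns fake noise."""

    def __init__(self):
        self.calls = 0

    def __call__(self, latents, contexts):
        self.calls += 1
        # One "noise prediction" per batch element.
        return [latent * context for latent, context in zip(latents, contexts)]


def cfg_batched(model, latent, cond, uncond, guidance_scale):
    # One forward pass over a batch of 2 instead of two sequential calls.
    noise_uncond, noise_cond = model([latent, latent], [uncond, cond])
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)


model = ToyModel()
result = cfg_batched(model, latent=2.0, cond=3.0, uncond=1.0, guidance_scale=7.5)
print(model.calls, result)  # 1 32.0
```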

&lt;h2&gt;
  
  
  Step four, only now: think about steps
&lt;/h2&gt;

&lt;p&gt;After the above, a 50-step SDXL sample on H100 is in the 1.2-1.5 second range. If your product needs sub-second, then yes, look at LCM, Hyper-SD, or DMD2. But evaluate quality on your own data, not on the curated examples in the paper. Distilled models lose the most quality on the long tail of prompts your users actually send, particularly text rendering and fine compositional structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;CUDA graphs hate dynamic shapes. If your service accepts arbitrary aspect ratios you will recompile constantly. Either bucket aspect ratios into a small set of fixed shapes, or accept the warmup cost on cold paths.&lt;/p&gt;
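&lt;p&gt;One way to do the bucketing, as a sketch: snap each requested size to the fixed shape with the closest aspect ratio, compared in log space so that wide and tall ratios are treated symmetrically. The bucket list here is hypothetical; pick shapes your model handles well and warm each one up once.&lt;/p&gt;

```python
import math

# Hypothetical bucket set of (width, height) shapes.
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1344, 768), (768, 1344)]


def nearest_bucket(width: int, height: int):
    """Snap a requested size to the bucket with the closest aspect ratio."""
    target = math.log(width / height)
    return min(BUCKETS, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


print(nearest_bucket(1920, 1080))  # (1344, 768)
print(nearest_bucket(1000, 1000))  # (1024, 1024)
```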

&lt;p&gt;&lt;code&gt;reduce-overhead&lt;/code&gt; mode increases memory usage because it pins workspace buffers. On a 24GB consumer card this can push you over the edge with larger models. Profile before deploying.&lt;/p&gt;

&lt;p&gt;FlashAttention-3 requires building from source against a specific CUDA version. If your deployment runs across mixed GPU generations, the version matrix becomes painful. Pick one backend per deployment target.&lt;/p&gt;

&lt;p&gt;And the obvious one: none of this fixes a slow VAE decode. If you are generating at 2K, the VAE can dominate. Tiled VAE decoding or a distilled decoder like TAESD is a separate fight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/blog/pytorch2-3/" rel="noopener noreferrer"&gt;PyTorch 2.3 release notes on torch.compile inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2407.08608" rel="noopener noreferrer"&gt;FlashAttention-3 paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2405.14867" rel="noopener noreferrer"&gt;DMD2: Improved Distribution Matching Distillation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;For routing across multiple model providers in production pipelines, &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is one option alongside LiteLLM and direct SDK calls.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2404.13686" rel="noopener noreferrer"&gt;Hyper-SD: Trajectory Segmented Consistency Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>Routing diffusion inference traffic across three providers</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Thu, 21 May 2026 14:52:31 +0000</pubDate>
      <link>https://forem.com/elise_moreau/routing-diffusion-inference-traffic-across-three-providers-51h3</link>
      <guid>https://forem.com/elise_moreau/routing-diffusion-inference-traffic-across-three-providers-51h3</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We route a mix of diffusion and LLM traffic across three providers from a single Go-based gateway called Bifrost. The 11 microsecond overhead is real, the failover works, and the part I care about most (weighted routing for cost vs latency tradeoffs) finally stopped being a custom Python service nobody wanted to maintain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I work on diffusion models for product photography. Most of what I write about is training, but the boring truth is that inference traffic management eats more of my week than I would like to admit.&lt;/p&gt;

&lt;p&gt;We have three categories of model calls in production. Hosted diffusion endpoints for fallback when our own GPU pool is saturated. LLM calls for prompt rewriting and caption generation. And a small embedding service for similarity search on reference images. Three providers, three SDKs, three retry policies. It was becoming a mess.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we had before
&lt;/h2&gt;

&lt;p&gt;A Python FastAPI service in front of everything. It worked. It was also slow, and the team had stopped trusting the metrics because the gateway itself was adding 40-80ms of overhead depending on the day.&lt;/p&gt;

&lt;p&gt;The nuance here is that for a diffusion call taking 3 seconds, 60ms of gateway overhead is noise. For a small LLM rewrite that should take 200ms, it is a third of your budget. We were optimizing the wrong axis.&lt;/p&gt;

&lt;p&gt;I spent a weekend evaluating replacements. Kong felt heavy. LiteLLM was the obvious choice for the LLM side but does not really speak the dialect of provider-specific diffusion APIs we need. Then a colleague pointed me at Bifrost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a Go gateway actually matters here
&lt;/h2&gt;

&lt;p&gt;To be precise: the language is not the point. The point is the runtime model. Bifrost runs as a single Go binary, uses goroutines for concurrency, and the published overhead is around 11 microseconds per request. I measured it on our own staging hardware and got numbers in the same ballpark, which is rare enough that I noticed.&lt;/p&gt;

&lt;p&gt;For our embedding service this matters. For diffusion it does not. But having one gateway that does not become the bottleneck for the fast calls is what made the consolidation possible.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY_PRIMARY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_KEY_SECONDARY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
    &lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;retry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;max_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
        &lt;span class="na"&gt;backoff_initial_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_KEY&lt;/span&gt;
  &lt;span class="na"&gt;stability&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.STABILITY_KEY&lt;/span&gt;

&lt;span class="na"&gt;mcp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prompt-rewrite&lt;/span&gt;
    &lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;
    &lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-haiku-4-5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That config replaced about 400 lines of Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  The weighted routing thing
&lt;/h2&gt;

&lt;p&gt;This is the feature I did not know I wanted. We have two OpenAI accounts because of rate limits and billing isolation between research and production workloads. Previously we ran two separate clients with manual round-robin logic that always had off-by-one bugs.&lt;/p&gt;

&lt;p&gt;Weighted routing in the gateway just handles it. 70/30 split, configured declaratively, and when one key hits a 429 the failover kicks in without us writing the retry code ourselves. Virtual keys on top of that let us issue per-team credentials that map to the underlying provider keys, so the research team and the production team see different rate limits and different cost dashboards.&lt;/p&gt;
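&lt;p&gt;For a sense of what the gateway is doing on our behalf, here is a rough sketch of weighted key selection with failover on a 429. It is not Bifrost's code; the exception class, function names, and key labels are all made up for illustration.&lt;/p&gt;

```python
import random


class RateLimited(Exception):
    """Raised by the caller-supplied send function on an HTTP 429."""


def send_with_failover(keys, send, rng=random):
    """Pick a key by weight; on a 429, retry with the remaining keys.

    keys is a list of (name, weight) pairs. send(name) performs the request.
    """
    remaining = list(keys)
    while remaining:
        names = [name for name, _ in remaining]
        weights = [weight for _, weight in remaining]
        chosen = rng.choices(names, weights=weights, k=1)[0]
        try:
            return send(chosen)
        except RateLimited:
            remaining = [(n, w) for n, w in remaining if n != chosen]
    raise RateLimited("all keys are rate limited")


def fake_send(name):
    # Simulate the primary key being throttled.
    if name == "primary":
        raise RateLimited(name)
    return f"ok via {name}"


print(send_with_failover([("primary", 0.7), ("secondary", 0.3)], fake_send))
# ok via secondary
```

&lt;p&gt;Even this small version has enough edge cases (exhausted keys, re-weighting after a failure) to explain why the hand-rolled round-robin kept breaking.&lt;/p&gt;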

&lt;h2&gt;
  
  
  Comparison with what we considered
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Kong&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-request overhead&lt;/td&gt;
&lt;td&gt;~50ms (Python)&lt;/td&gt;
&lt;td&gt;~5ms but heavy footprint&lt;/td&gt;
&lt;td&gt;~11μs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failover across providers&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Plugin required&lt;/td&gt;
&lt;td&gt;Yes, built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weighted key routing&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Custom plugin&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Via plugin&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diffusion provider support&lt;/td&gt;
&lt;td&gt;Weak&lt;/td&gt;
&lt;td&gt;Generic HTTP only&lt;/td&gt;
&lt;td&gt;Provider-aware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational footprint&lt;/td&gt;
&lt;td&gt;Python service&lt;/td&gt;
&lt;td&gt;Lua plugins, DB&lt;/td&gt;
&lt;td&gt;Single Go binary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM remains excellent for pure LLM-only stacks. Kong is the right answer if you already run Kong. For us, the combination of low overhead and provider-aware routing was the deciding factor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Semantic caching on prompt rewrites
&lt;/h2&gt;

&lt;p&gt;About 40% of our prompt-rewrite calls are near-duplicates. Same product, slightly different angle, same desired caption style. We were paying for every one of them.&lt;/p&gt;

&lt;p&gt;Bifrost has semantic caching built in, using embeddings to match similar requests within a configurable threshold. I was skeptical because cache invalidation on semantic similarity is famously a footgun. We set the threshold conservatively (cosine similarity above 0.94) and audit the cache hits weekly. Hit rate is around 22%, cost savings are real, and we have not had a quality complaint yet. The audit is the part nobody talks about, but you need it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;It is a young project. The documentation has gaps, particularly around custom provider plugins. I had to read the source to understand how the streaming response handling works for SSE-heavy diffusion APIs.&lt;/p&gt;

&lt;p&gt;Observability is functional but basic. We forward to our existing Prometheus setup and it works, but if you expect a polished UI for traffic analysis you will be disappointed. We built our own Grafana dashboards.&lt;/p&gt;

&lt;p&gt;Semantic caching is only as good as your embedding model and threshold tuning. If your prompts have high lexical variation but identical intent, you will get false negatives. If your prompts are templated and only the parameters change, you will get false positives. Test on your own traffic before trusting it.&lt;/p&gt;

&lt;p&gt;And one honest note: an 11 microsecond gateway does not make a 3-second diffusion call faster. It just stops being the reason your fast calls are slow. Know which problem you are solving.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost on GitHub: &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LiteLLM proxy documentation: &lt;a href="https://docs.litellm.ai/docs/simple_proxy" rel="noopener noreferrer"&gt;https://docs.litellm.ai/docs/simple_proxy&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kong AI Gateway: &lt;a href="https://konghq.com/products/kong-ai-gateway" rel="noopener noreferrer"&gt;https://konghq.com/products/kong-ai-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;"Inference Without Interference" (Microsoft Research, 2024) on multiplexing inference workloads&lt;/li&gt;
&lt;li&gt;A useful primer on semantic caching trade-offs from Pinecone's engineering blog&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Why your diffusion model is slow at batch size 1 (and what actually helps)</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Tue, 19 May 2026 05:37:02 +0000</pubDate>
      <link>https://forem.com/elise_moreau/why-your-diffusion-model-is-slow-at-batch-size-1-and-what-actually-helps-n15</link>
      <guid>https://forem.com/elise_moreau/why-your-diffusion-model-is-slow-at-batch-size-1-and-what-actually-helps-n15</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Single-image diffusion inference is bottlenecked by kernel launch overhead and attention memory traffic, not raw FLOPs. torch.compile with mode="reduce-overhead", a fused attention backend, and CFG batching get you most of the way before you reach for distillation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spend a lot of time looking at flame graphs from production diffusion pipelines. The pattern is almost always the same. The team profiles their model, sees 50 steps of a UNet or DiT, and assumes the path to lower latency is fewer steps. So they try LCM, then TCD, then some flavor of consistency distillation, and the quality drops in ways the product team notices.&lt;/p&gt;

&lt;p&gt;The nuance here is that at batch size 1, your GPU is mostly idle. You are not compute-bound. You are launch-bound and memory-bound. Distillation helps eventually, but only after you have fixed the boring things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the profiler actually shows
&lt;/h2&gt;

&lt;p&gt;Run a vanilla SDXL or a 1B-parameter DiT at 1024x1024, batch 1, on an H100. Capture a trace with &lt;code&gt;torch.profiler&lt;/code&gt; and zoom into a single denoising step.&lt;/p&gt;

&lt;p&gt;You will see something like this, roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~30-40% of wall time inside attention kernels&lt;/li&gt;
&lt;li&gt;~20-25% inside conv and linear layers&lt;/li&gt;
&lt;li&gt;~15-20% in layernorm, GELU, residual adds&lt;/li&gt;
&lt;li&gt;The rest: kernel launch gaps, host-to-device syncs, Python overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last bucket is the embarrassing one. On an H100 a kernel launch costs ~5 microseconds. A UNet step fires hundreds of kernels. A 50-step sample fires tens of thousands. You are paying for the privilege of dispatching work, not for the work itself.&lt;/p&gt;

&lt;p&gt;To be precise: the same model at batch 8 often runs in less than 2x the wall time it needs at batch 1. That gap is your overhead bill.&lt;/p&gt;
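
&lt;p&gt;A back-of-envelope sketch of that overhead bill, using the ~5 microsecond launch cost quoted above. The kernel count per step (&lt;code&gt;KERNELS_PER_STEP = 500&lt;/code&gt;) is an illustrative assumption standing in for "hundreds", not a measurement:&lt;/p&gt;

```python
# Rough estimate of the time spent purely on dispatching kernels.
# LAUNCH_COST_US comes from the text; KERNELS_PER_STEP is an assumed,
# illustrative value for "hundreds of kernels" per UNet step.
LAUNCH_COST_US = 5
KERNELS_PER_STEP = 500  # assumption, not a measurement
STEPS = 50


def launch_overhead_ms(kernels_per_step, steps, launch_cost_us=LAUNCH_COST_US):
    """Total launch overhead in milliseconds for one sample."""
    total_launches = kernels_per_step * steps
    return total_launches * launch_cost_us / 1000


per_step_ms = launch_overhead_ms(KERNELS_PER_STEP, 1)
per_sample_ms = launch_overhead_ms(KERNELS_PER_STEP, STEPS)
print(f"{per_step_ms:.1f} ms per step, {per_sample_ms:.0f} ms per sample")
```

&lt;p&gt;Launches overlap with GPU execution whenever the kernels are long enough to hide them, so treat the result as a rough upper bound on what removing launch gaps can recover, not as a prediction.&lt;/p&gt;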

&lt;h2&gt;
  
  
  Step one: torch.compile, but the right mode
&lt;/h2&gt;

&lt;p&gt;The default &lt;code&gt;torch.compile(model)&lt;/code&gt; call uses &lt;code&gt;mode="default"&lt;/code&gt;, which balances runtime performance against compile time and memory overhead. For batch 1 inference you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;unet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;unet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reduce-overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fullgraph&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dynamic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;reduce-overhead&lt;/code&gt; enables CUDA graphs, which replay a captured sequence of kernels in one launch. This is the single largest win for batch 1 diffusion on modern GPUs. In my measurements on PyTorch 2.3, this alone takes a 1024x1024 SDXL UNet step from ~42ms to ~28ms on H100. No quality change, no architecture change.&lt;/p&gt;

&lt;p&gt;The catch: &lt;code&gt;fullgraph=True&lt;/code&gt; will yell at you about any graph break. CFG implementations that branch on &lt;code&gt;guidance_scale&lt;/code&gt; need rewriting. Custom samplers that touch &lt;code&gt;.item()&lt;/code&gt; between steps will break CUDA graph capture. Plan for a day of fighting this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step two: pick an attention backend on purpose
&lt;/h2&gt;

&lt;p&gt;PyTorch's &lt;code&gt;scaled_dot_product_attention&lt;/code&gt; dispatches to one of several backends. The defaults are not always right for high-resolution diffusion.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FlashAttention-2&lt;/td&gt;
&lt;td&gt;Long sequences, H100/A100&lt;/td&gt;
&lt;td&gt;Default on most setups, good general choice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FlashAttention-3&lt;/td&gt;
&lt;td&gt;H100 only&lt;/td&gt;
&lt;td&gt;~1.5x faster than FA2 on Hopper, requires manual install&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;xFormers memory-efficient&lt;/td&gt;
&lt;td&gt;Older GPUs (V100, T4)&lt;/td&gt;
&lt;td&gt;Lower memory, slower than Flash on modern hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Math (fallback)&lt;/td&gt;
&lt;td&gt;Debugging only&lt;/td&gt;
&lt;td&gt;Never ship this&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For DiT-style models at 2K resolution the sequence length per attention block hits 16K+ tokens. FA3 on H100 makes a real difference there. I have seen an 18% end-to-end latency drop on a 2B DiT just from switching FA2 to FA3 via &lt;code&gt;torch.nn.attention.sdpa_kernel&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step three: batch your CFG
&lt;/h2&gt;

&lt;p&gt;Classifier-free guidance runs the model twice per step, once conditional and once unconditional. Most reference implementations call the UNet twice sequentially. Do not do this.&lt;/p&gt;

&lt;p&gt;Concatenate the conditional and unconditional inputs into one batch of 2, run one forward pass, and split the output. At batch 1 this nearly halves your per-step latency because you were leaving the GPU idle anyway. The memory cost is negligible at typical inference resolutions.&lt;/p&gt;

&lt;p&gt;This is a 3-line change, and somehow only about 60% of the codebases I review have made it.&lt;/p&gt;
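
&lt;p&gt;A minimal sketch of the batched variant. The &lt;code&gt;model(latent, context)&lt;/code&gt; signature and the toy stand-in model are assumptions made for illustration; a real UNet or DiT also takes a timestep and richer conditioning. The guidance formula has no Python branch, which also keeps it friendly to &lt;code&gt;fullgraph=True&lt;/code&gt;:&lt;/p&gt;

```python
import torch


def cfg_step(model, latent, cond, uncond, guidance_scale):
    """One guided prediction using a single batch-of-2 forward pass."""
    # Duplicate the latent so both branches share one forward pass.
    latent_in = torch.cat([latent, latent], dim=0)
    context = torch.cat([uncond, cond], dim=0)
    noise_uncond, noise_cond = model(latent_in, context).chunk(2, dim=0)
    # Standard classifier-free guidance combination, no branching.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)


# Toy stand-in for a denoiser (hypothetical): scales the latent by a
# per-sample context value so the result is easy to check by hand.
def toy_model(x, ctx):
    return x * ctx.view(-1, 1, 1, 1)


latent = torch.ones(1, 4, 8, 8)
cond = torch.tensor([3.0])
uncond = torch.tensor([1.0])
out = cfg_step(toy_model, latent, cond, uncond, guidance_scale=2.0)
print(out.shape)
```

&lt;p&gt;With the toy model the unconditional branch predicts 1 and the conditional branch predicts 3, so a guidance scale of 2 gives 1 + 2 * (3 - 1) = 5 everywhere, which makes the split-and-combine logic easy to verify.&lt;/p&gt;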

&lt;h2&gt;
  
  
  Step four, only now: think about steps
&lt;/h2&gt;

&lt;p&gt;After the above, a 50-step SDXL sample on H100 is in the 1.2-1.5 second range. If your product needs sub-second, then yes, look at LCM, Hyper-SD, or DMD2. But evaluate quality on your own data, not on the curated examples in the paper. Distilled models lose the most quality on the long tail of prompts your users actually send, particularly text rendering and fine compositional structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;CUDA graphs hate dynamic shapes. If your service accepts arbitrary aspect ratios you will recompile constantly. Either bucket aspect ratios into a small set of fixed shapes, or accept the warmup cost on cold paths.&lt;/p&gt;
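
&lt;p&gt;One way to sketch the bucketing option: snap each request to the closest entry in a small, fixed list of shapes. The bucket list and the helper name &lt;code&gt;nearest_bucket&lt;/code&gt; are hypothetical; pick shapes that match what your model was trained on.&lt;/p&gt;

```python
import math

# Hypothetical bucket list: (width, height) pairs near one megapixel.
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1344, 768), (768, 1344)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Return the bucket whose aspect ratio is closest to the request."""
    target = math.log(width / height)
    # Compare in log space so portrait and landscape are treated alike.
    return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


print(nearest_bucket(1920, 1080))
print(nearest_bucket(800, 800))
```

&lt;p&gt;Each bucket then needs to be compiled and warmed up only once, so the number of buckets is a direct trade between warmup time and how much cropping or padding a request has to tolerate.&lt;/p&gt;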

&lt;p&gt;&lt;code&gt;reduce-overhead&lt;/code&gt; mode increases memory usage because it pins workspace buffers. On a 24GB consumer card this can push you over the edge with larger models. Profile before deploying.&lt;/p&gt;

&lt;p&gt;FlashAttention-3 requires building from source against a specific CUDA version. If your deployment runs across mixed GPU generations, the version matrix becomes painful. Pick one backend per deployment target.&lt;/p&gt;

&lt;p&gt;And the obvious one: none of this fixes a slow VAE decode. If you are generating at 2K, the VAE can dominate. Tiled VAE decoding or a distilled decoder like TAESD is a separate fight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/blog/pytorch2-3/" rel="noopener noreferrer"&gt;PyTorch 2.3 release notes on torch.compile inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2407.08608" rel="noopener noreferrer"&gt;FlashAttention-3 paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2405.14867" rel="noopener noreferrer"&gt;DMD2: Improved Distribution Matching Distillation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;For routing across multiple model providers in production pipelines, &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is one option alongside LiteLLM and direct SDK calls.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2404.13686" rel="noopener noreferrer"&gt;Hyper-SD: Trajectory Segmented Consistency Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>pytorch</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Why Your Diffusion Model Is Slow at Inference (And It's Not the UNet)</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Mon, 27 Apr 2026 10:55:43 +0000</pubDate>
      <link>https://forem.com/elise_moreau/why-your-diffusion-model-is-slow-at-inference-and-its-not-the-unet-1p51</link>
      <guid>https://forem.com/elise_moreau/why-your-diffusion-model-is-slow-at-inference-and-its-not-the-unet-1p51</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Most inference bottlenecks in diffusion pipelines are not in the UNet denoising loop. They are in the VAE decoder, the text encoder on first call, and CPU-GPU synchronization between steps. Profile before you optimize. To be precise, a 30% speedup often comes from fixing the 5% of the code nobody looks at.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent three weeks last month trying to make a Stable Diffusion XL variant run faster on A10G. The model was trained in-house for product photography. Inference was around 4.2 seconds per image at 1024x1024, 30 steps. Target was under 2 seconds.&lt;/p&gt;

&lt;p&gt;My first instinct was wrong. I went straight to the UNet. Compiled it with &lt;code&gt;torch.compile&lt;/code&gt;, tried different attention implementations, looked at FlashAttention-3. I got it from 3.1s to 2.7s on the UNet alone. Nice. But total pipeline time barely moved.&lt;/p&gt;

&lt;p&gt;Then I actually profiled.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the profile showed
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.profiler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;record_shapes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;prof&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_inference_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prof&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;key_averages&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sort_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda_time_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The breakdown was not what I expected:&lt;/p&gt;

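&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Time (ms)&lt;/th&gt;
&lt;th&gt;% of pipeline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UNet forward (30 steps)&lt;/td&gt;
&lt;td&gt;2700&lt;/td&gt;
&lt;td&gt;64%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAE decoder&lt;/td&gt;
&lt;td&gt;890&lt;/td&gt;
&lt;td&gt;21%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text encoder (first call)&lt;/td&gt;
&lt;td&gt;340&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduler + CPU ops&lt;/td&gt;
&lt;td&gt;270&lt;/td&gt;
&lt;td&gt;6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;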

&lt;p&gt;The VAE decoder, which runs once at the end, was taking about a fifth of total latency. The text encoders, which I assumed were negligible, were non-trivial on the first call because of kernel compilation.&lt;/p&gt;

&lt;p&gt;The nuance here is that people optimize what they read about. Every blog post is about UNet attention. Almost nobody writes about the VAE.&lt;/p&gt;
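
&lt;p&gt;A small sketch of why the less-discussed stages matter, using the profile numbers above. The helper &lt;code&gt;pipeline_speedup&lt;/code&gt; is hypothetical, and the two "after" values (210ms for the VAE, 2100ms for the UNet) are the results reported later in this post:&lt;/p&gt;

```python
# Stage timings in milliseconds from the profile breakdown.
BEFORE_MS = {
    "unet": 2700,
    "vae_decode": 890,
    "text_encode": 340,
    "scheduler_cpu": 270,
}


def pipeline_speedup(stages_ms, stage, new_ms):
    """End-to-end speedup from changing one stage and leaving the rest alone."""
    total = sum(stages_ms.values())
    new_total = total - stages_ms[stage] + new_ms
    return total / new_total


vae_gain = pipeline_speedup(BEFORE_MS, "vae_decode", 210)
unet_gain = pipeline_speedup(BEFORE_MS, "unet", 2100)
print(f"VAE fix: {vae_gain:.2f}x, UNet fix: {unet_gain:.2f}x")
```

&lt;p&gt;On these numbers the VAE work buys slightly more end-to-end speedup than the UNet work, even though the VAE accounts for only about a fifth of the pipeline.&lt;/p&gt;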
&lt;h2&gt;
  
  
  Fixing the VAE
&lt;/h2&gt;

&lt;p&gt;SDXL's VAE decoder processes a 128x128x4 latent into a 1024x1024x3 image. The default implementation in diffusers runs in fp32 for numerical stability. The tiled decoder, which splits the latent into patches, is even slower but uses less memory.&lt;/p&gt;

&lt;p&gt;Three things helped:&lt;/p&gt;

&lt;p&gt;First, cast the VAE to bf16. The numerical argument for fp32 is weak on modern GPUs. I ran a small eval on 500 prompts and compared LPIPS and a CLIP-based aesthetic score between fp32 and bf16 outputs. Differences were within noise. For background, the SDXL technical report touches on this, but the TAESD work from madebyollin is where the practical tricks live.&lt;/p&gt;

&lt;p&gt;Second, use &lt;code&gt;channels_last&lt;/code&gt; memory format for the VAE. This one is documented but rarely applied:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# channels_last (NHWC) layout tends to suit convolution-heavy modules like the VAE.&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channels_last&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# The pipeline calls vae.decode(), not vae.forward(), so compile decode directly.&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reduce-overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fullgraph&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Third, if you do not need full 1024x1024 decoding quality, swap in TAESD (Tiny AutoEncoder). It is a distilled VAE that decodes 8x faster. Quality is worse for fine details but fine for thumbnails and previews. We use the full VAE for final renders and TAESD for the interactive preview in the product UI.&lt;/p&gt;

&lt;p&gt;Combined, VAE time dropped from 890ms to 210ms.&lt;/p&gt;

&lt;h2&gt;
  
  
  The text encoder trap
&lt;/h2&gt;

&lt;p&gt;On the first pipeline call, the text encoders compile their kernels. If you are benchmarking with a single prompt, you pay this cost once and it looks small. In production, if you have cold starts on autoscaled GPUs, every new replica eats that 300-400ms on the first request.&lt;/p&gt;

&lt;p&gt;The solution is unglamorous: warm up the encoders at startup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;warmup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dummy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a product on a white background&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dummy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this during container startup, not on first user request.&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU sync between steps
&lt;/h2&gt;

&lt;p&gt;This one took me a while to find. In the scheduler step, there are small tensor operations that implicitly synchronize GPU and CPU. On A10G with a well-tuned UNet, these become visible. You see it in the profiler as gaps between CUDA kernel launches.&lt;/p&gt;

&lt;p&gt;The fix is either a custom scheduler that keeps everything on GPU, or CUDA graphs (&lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt;) to capture the full denoising loop. Graphs are fragile: they break if any input shape changes. For a fixed-resolution product they are worth it. I got another 8% off pipeline time this way.&lt;/p&gt;

&lt;p&gt;If you route through a gateway that fronts multiple model backends (internal Triton, Replicate, fal), the gateway itself adds 20-80ms depending on implementation. Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;), LiteLLM, and Portkey sit in this space. Measure your gateway overhead before you blame the model. We saw 35ms of unnecessary latency from a naive proxy before we switched.&lt;/p&gt;
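
&lt;p&gt;A minimal sketch of that measurement: time the same request on the direct path and through the gateway, then compare medians. The helper names and the sample latencies are hypothetical; in practice the samples would come from &lt;code&gt;time_calls&lt;/code&gt; wrapped around your own client calls.&lt;/p&gt;

```python
import statistics
import time


def time_calls(fn, n=20):
    """Wall-clock latency of fn() in milliseconds, one sample per call."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return samples


def gateway_overhead_ms(direct_ms, gateway_ms):
    """Median added latency of the gateway path over the direct path."""
    return statistics.median(gateway_ms) - statistics.median(direct_ms)


# Hypothetical samples in milliseconds, standing in for
# time_calls(call_backend_directly) and time_calls(call_via_gateway).
direct = [410, 402, 398, 405, 400]
via_gateway = [436, 440, 433, 438, 437]
print(gateway_overhead_ms(direct, via_gateway))
```

&lt;p&gt;Medians are used because a single cold connection or retry can dominate a mean over a small sample.&lt;/p&gt;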

&lt;h2&gt;
  
  
  Final numbers
&lt;/h2&gt;

&lt;p&gt;After all the above:&lt;/p&gt;

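&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Before (ms)&lt;/th&gt;
&lt;th&gt;After (ms)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Text encode&lt;/td&gt;
&lt;td&gt;340&lt;/td&gt;
&lt;td&gt;12 (warmed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNet 30 steps&lt;/td&gt;
&lt;td&gt;2700&lt;/td&gt;
&lt;td&gt;2100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAE decode&lt;/td&gt;
&lt;td&gt;890&lt;/td&gt;
&lt;td&gt;210&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduler/sync&lt;/td&gt;
&lt;td&gt;270&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4200&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2410&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;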

&lt;p&gt;Still above target. To hit 2s we dropped to 24 steps with a DPM++ 2M Karras scheduler. Acceptable quality trade-off for our use case.&lt;/p&gt;
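
&lt;p&gt;The step budget can be sketched from the "After" column. This assumes UNet time scales linearly with step count and that the other stages stay fixed when the scheduler changes, which is an approximation; the helper name &lt;code&gt;max_steps_within&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
# Stage timings in milliseconds, taken from the "After" column.
AFTER_MS = {
    "text_encode": 12,
    "unet": 2100,
    "vae_decode": 210,
    "scheduler_sync": 90,
}
UNET_STEPS = 30
TARGET_MS = 2000


def max_steps_within(target_ms, stages_ms, unet_steps):
    """Largest step count that keeps the pipeline at or under target_ms."""
    per_step = stages_ms["unet"] / unet_steps
    fixed = sum(ms for name, ms in stages_ms.items() if name != "unet")
    return int((target_ms - fixed) // per_step)


steps = max_steps_within(TARGET_MS, AFTER_MS, UNET_STEPS)
print(steps)
```

&lt;p&gt;At 70ms per step and 312ms of fixed cost, 24 steps lands at roughly 1992ms, just under the 2-second target.&lt;/p&gt;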

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Casting the VAE to bf16 is fine for photographic content. For pixel art or content with hard edges, fp32 can preserve small structures better. Test on your data.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;torch.compile&lt;/code&gt; in reduce-overhead mode uses CUDA graphs internally. It is strict about input shapes. Dynamic batch sizes or resolutions will trigger recompilation, which costs seconds. Pin your shapes or expect volatility.&lt;/p&gt;

&lt;p&gt;TAESD is not a free lunch. Look at outputs manually before shipping. It is a lossy compression of the VAE, and the losses are not always perceptually small.&lt;/p&gt;

&lt;p&gt;CUDA graph capture can hide memory leaks. If you see OOM on long-running workers, disable graphs and re-profile before assuming the model is the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SDXL technical report: &lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2307.01952&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TAESD repository by madebyollin: &lt;a href="https://github.com/madebyollin/taesd" rel="noopener noreferrer"&gt;https://github.com/madebyollin/taesd&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyTorch 2 compile notes on memory formats: &lt;a href="https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html" rel="noopener noreferrer"&gt;https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;NVIDIA Nsight Systems for GPU profiling: &lt;a href="https://developer.nvidia.com/nsight-systems" rel="noopener noreferrer"&gt;https://developer.nvidia.com/nsight-systems&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Diffusers optimization guide: &lt;a href="https://huggingface.co/docs/diffusers/optimization/fp16" rel="noopener noreferrer"&gt;https://huggingface.co/docs/diffusers/optimization/fp16&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>pytorch</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Kimi K2.6 Is a Legit Opus 4.7 Replacement</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Mon, 27 Apr 2026 04:39:40 +0000</pubDate>
      <link>https://forem.com/elise_moreau/kimi-k26-is-a-legit-opus-47-replacement-1fci</link>
      <guid>https://forem.com/elise_moreau/kimi-k26-is-a-legit-opus-47-replacement-1fci</guid>
      <description>&lt;p&gt;For a long time, Opus 4.7 has been the default recommendation when someone asks for a top-tier model. It has been reliable, capable, and strong across a wide range of tasks.&lt;/p&gt;

&lt;p&gt;After spending real time with Kimi K2.6 and gathering feedback from customers using it in production workflows, I have started to change my mind. It is the first model I feel comfortable recommending as a practical replacement for Opus 4.7.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not better, but close enough&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kimi K2.6 is not outright better than Opus 4.7. If you are comparing raw performance on difficult reasoning or edge-case tasks, Opus still wins.&lt;/p&gt;

&lt;p&gt;What matters more in practice is coverage. Kimi K2.6 can handle around 85 percent of the tasks that Opus can, and it does so at a quality level that is good enough for real work. That gap sounds large on paper, but in day-to-day usage it is surprisingly small.&lt;/p&gt;

&lt;p&gt;Most users are not constantly pushing models to their limits. They need something that works consistently across writing, coding, research, and general problem solving. In that context, Kimi K2.6 holds up very well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strong features that actually matter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two areas where Kimi K2.6 stands out are vision and browser use.&lt;/p&gt;

&lt;p&gt;Vision is not just a checkbox feature here. It is genuinely useful for workflows that involve screenshots, documents, or UI-level debugging. Being able to mix text and visual context smoothly removes a lot of friction.&lt;/p&gt;

&lt;p&gt;Browser use is another big win. It handles multi-step information gathering better than expected, especially for longer tasks where the model needs to plan, search, and refine results over time.&lt;/p&gt;

&lt;p&gt;These features are not always the headline benchmarks, but they have a real impact on productivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surprisingly good at long-horizon tasks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the more unexpected strengths of Kimi K2.6 is how well it handles longer-horizon work.&lt;/p&gt;

&lt;p&gt;I have been slowly replacing parts of my personal workflows with it, including tasks that require multiple steps, iteration, and context retention. It performs more reliably than I expected, and it does not fall apart as quickly over extended interactions.&lt;/p&gt;

&lt;p&gt;This makes it useful for things like research threads, content pipelines, and multi-step coding tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The size question and what it signals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kimi K2.6 is a very large model. There is no getting around that.&lt;/p&gt;

&lt;p&gt;But its performance raises an interesting point. Frontier models like Opus 4.7 are not necessarily introducing completely new capabilities. Instead, we are seeing strong alternatives that can replicate most of that value.&lt;/p&gt;

&lt;p&gt;If a model can deliver 80 to 90 percent of the experience, the remaining gap starts to matter less, especially when other factors come into play.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limits, cost, and the shift to local&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the biggest complaints around models like Opus 4.7 is usage limits. As demand increases, constraints become more noticeable.&lt;/p&gt;

&lt;p&gt;This is where models like Kimi K2.6 become more attractive. There is growing interest in running models locally or in more controlled environments, where limits are less of a concern.&lt;/p&gt;

&lt;p&gt;It feels like the conversation is starting to shift. Instead of chasing the absolute best model, people are looking for models that are good enough, flexible, and easier to integrate into their own systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kimi K2.6 is not a perfect replacement for Opus 4.7. If you need the absolute best performance on every task, Opus is still ahead.&lt;/p&gt;

&lt;p&gt;But for most real-world use cases, Kimi K2.6 gets you very close. Close enough that the tradeoffs start to make sense.&lt;/p&gt;

&lt;p&gt;That is what makes it interesting. Not that it beats Opus, but that it makes you question whether you still need Opus at all.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>automation</category>
    </item>
    <item>
      <title>Diffusion Model Inference in Production: What the Benchmarks Leave Out</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:31:27 +0000</pubDate>
      <link>https://forem.com/elise_moreau/diffusion-model-inference-in-production-what-the-benchmarks-leave-out-2669</link>
      <guid>https://forem.com/elise_moreau/diffusion-model-inference-in-production-what-the-benchmarks-leave-out-2669</guid>
      <description>

&lt;p&gt;The routing overhead caught us off guard. We were running caption generation through a larger model for every input when 70% of them only needed a fast small model. Adding a gateway with cost-aware routing (we landed on Bifrost for this, though LiteLLM and Portkey do the same thing: &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) cut LLM spend in our vision pipeline by 38% without touching the heavy-model cases.&lt;/p&gt;

</description>
      <category>pytorch</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
