<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Elise Moreau</title>
    <description>The latest articles on Forem by Elise Moreau (@elise_moreau).</description>
    <link>https://forem.com/elise_moreau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864909%2F72833c18-30db-4456-82ee-e7d2016cc38f.jpg</url>
      <title>Forem: Elise Moreau</title>
      <link>https://forem.com/elise_moreau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/elise_moreau"/>
    <language>en</language>
    <item>
      <title>Why Your Diffusion Model Is Slow at Inference (And It's Not the UNet)</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Mon, 27 Apr 2026 04:48:07 +0000</pubDate>
      <link>https://forem.com/elise_moreau/why-your-diffusion-model-is-slow-at-inference-and-its-not-the-unet-3m0h</link>
      <guid>https://forem.com/elise_moreau/why-your-diffusion-model-is-slow-at-inference-and-its-not-the-unet-3m0h</guid>
<description>&lt;p&gt;&lt;strong&gt;TL;DR: Most inference bottlenecks in diffusion pipelines are not in the UNet denoising loop. They are in the VAE decoder, the text encoder on first call, and CPU-GPU synchronization between steps. Profile before you optimize. In practice, a 30% speedup often comes from fixing the 5% of the code nobody looks at.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent three weeks last month trying to make a Stable Diffusion XL variant run faster on A10G. The model was trained in-house for product photography. Inference was around 4.2 seconds per image at 1024x1024, 30 steps. Target was under 2 seconds.&lt;/p&gt;

&lt;p&gt;My first instinct was wrong. I went straight to the UNet. Compiled it with &lt;code&gt;torch.compile&lt;/code&gt;, tried different attention implementations, looked at FlashAttention-3. I got it from 3.1s to 2.7s on the UNet alone. Nice. But total pipeline time barely moved.&lt;/p&gt;

&lt;p&gt;Then I actually profiled.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the profile showed
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.profiler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;record_shapes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;prof&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_inference_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prof&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;key_averages&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sort_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda_time_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The breakdown was not what I expected:&lt;/p&gt;

&lt;p&gt;| Component | Time (ms) | % of pipeline |&lt;br&gt;
|---|---|---|&lt;br&gt;
| UNet forward (30 steps) | 2700 | 64% |&lt;br&gt;
| VAE decoder | 890 | 21% |&lt;br&gt;
| Text encoder (first call) | 340 | 8% |&lt;br&gt;
| Scheduler + CPU ops | 270 | 6% |&lt;/p&gt;

&lt;p&gt;The VAE decoder, which runs once at the end, was taking almost a quarter of total latency. The text encoders, which I assumed were negligible, were non-trivial on the first call because of kernel compilation.&lt;/p&gt;

&lt;p&gt;The nuance here is that people optimize what they read about. Every blog post is about UNet attention. Almost nobody writes about the VAE.&lt;/p&gt;
&lt;h2&gt;
  
  
  Fixing the VAE
&lt;/h2&gt;

&lt;p&gt;SDXL's VAE decoder processes a 128x128x4 latent into a 1024x1024x3 image. The default implementation in diffusers runs in fp32 for numerical stability. The tiled decoder, which splits the latent into patches, is even slower but uses less memory.&lt;/p&gt;

&lt;p&gt;Three things helped:&lt;/p&gt;

&lt;p&gt;First, cast the VAE to bf16. The numerical argument for fp32 is weak on modern GPUs. I ran a small eval on 500 prompts, comparing LPIPS and a CLIP-based aesthetic score between fp32 and bf16 outputs. Differences were within noise. For background, the SDXL technical report touches on this, but the TAESD work from madebyollin is where the practical tricks live.&lt;/p&gt;
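&lt;p&gt;The cast itself is one line on the pipeline object (&lt;code&gt;pipe.vae.to(dtype=torch.bfloat16)&lt;/code&gt;). A minimal CPU-runnable sketch of the dtype pattern, with a toy convolution standing in for the real decoder:&lt;/p&gt;

```python
import torch
import torch.nn as nn

# Toy stand-in for the VAE decoder; the real one is pipe.vae from
# diffusers. Only the dtype-cast pattern here carries over.
decoder = nn.Conv2d(4, 3, kernel_size=3, padding=1).eval()
decoder = decoder.to(dtype=torch.bfloat16)

# Latents handed to a bf16 decoder must be cast to match its dtype.
latent = torch.randn(1, 4, 128, 128).to(torch.bfloat16)
with torch.no_grad():
    image = decoder(latent)
```

&lt;p&gt;The same rule applies on the real pipeline: whatever you pass to the decoder must match the dtype you cast it to.&lt;/p&gt;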

&lt;p&gt;Second, use &lt;code&gt;channels_last&lt;/code&gt; memory format for the VAE. This one is documented but rarely applied:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channels_last&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reduce-overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fullgraph&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Third, if you do not need full 1024x1024 decoding quality, swap in TAESD (Tiny AutoEncoder). It is a distilled VAE that decodes 8x faster. Quality is worse for fine details but fine for thumbnails and previews. We use the full VAE for final renders and TAESD for the interactive preview in the product UI.&lt;/p&gt;
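&lt;p&gt;A sketch of the swap, using the &lt;code&gt;AutoencoderTiny&lt;/code&gt; class from diffusers and the taesdxl checkpoint. The helper name and the bf16 choice are mine, not from the TAESD repo:&lt;/p&gt;

```python
def use_taesd_for_previews(pipe):
    """Swap the full SDXL VAE for TAESD on a diffusers pipeline (sketch).

    Assumes the madebyollin/taesdxl checkpoint and a diffusers version
    that ships AutoencoderTiny. Downloads weights on first call, so run
    it at startup, not per request.
    """
    import torch
    from diffusers import AutoencoderTiny

    pipe.vae = AutoencoderTiny.from_pretrained(
        "madebyollin/taesdxl", torch_dtype=torch.bfloat16
    )
    return pipe
```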

&lt;p&gt;Combined, VAE time dropped from 890ms to 210ms.&lt;/p&gt;

&lt;h2&gt;
  
  
  The text encoder trap
&lt;/h2&gt;

&lt;p&gt;On the first pipeline call, the text encoders compile their kernels. If you are benchmarking with a single prompt, you pay this cost once and it looks small. In production, if you have cold starts on autoscaled GPUs, every new replica eats that 300-400ms on the first request.&lt;/p&gt;

&lt;p&gt;The solution is unglamorous: warm up the encoders at startup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;warmup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dummy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a product on a white background&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dummy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this during container startup, not on first user request.&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU sync between steps
&lt;/h2&gt;

&lt;p&gt;This one took me a while to find. In the scheduler step, there are small tensor operations that implicitly synchronize GPU and CPU. On A10G with a well-tuned UNet, these become visible. You see it in the profiler as gaps between CUDA kernel launches.&lt;/p&gt;

&lt;p&gt;The fix is either a custom scheduler that keeps everything on GPU, or using &lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt; to capture the full denoising loop. Graphs are fragile: they break if any input shape changes. But for a fixed-resolution product they are worth it. I got another 8% off pipeline time this way.&lt;/p&gt;
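&lt;p&gt;A sketch of the graph-capture pattern, following the &lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt; recipe from the PyTorch docs. Here &lt;code&gt;unet_step&lt;/code&gt; is a placeholder for one scheduler+UNet iteration and is assumed shape-stable with no CPU-side branching; off-GPU it just falls back to eager execution:&lt;/p&gt;

```python
import torch

def make_graphed_step(unet_step, example_latents):
    """Capture one fixed-shape denoising step in a CUDA graph (sketch).

    unet_step stands in for a single scheduler+UNet iteration; it must
    be shape-stable and free of CPU-side control flow.
    """
    if not torch.cuda.is_available():
        return unet_step  # eager fallback for CPU-only environments

    static_in = example_latents.clone()
    # Warm up on a side stream before capture, per the PyTorch docs.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        static_out = unet_step(static_in)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = unet_step(static_in)

    def replay(latents):
        # Inputs must match example_latents in shape and dtype.
        static_in.copy_(latents)
        g.replay()
        return static_out

    return replay
```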

&lt;p&gt;If you route through a gateway that fronts multiple model backends (an internal Triton server, Replicate, fal), the gateway itself adds 20-80ms depending on implementation. Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;), LiteLLM, and Portkey sit in this space. Measure your gateway overhead before you blame the model. We saw 35ms of unnecessary latency from a naive proxy before we switched.&lt;/p&gt;
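&lt;p&gt;Measuring that overhead needs no special tooling: time the same request directly against the backend and through the gateway, then compare medians. A minimal sketch (the helper name is mine):&lt;/p&gt;

```python
import time

def p50_latency_ms(call, n=20):
    """Median wall-clock latency of call() in milliseconds over n tries.

    Pass a zero-arg function: once hitting the backend directly, once
    making the same request through the gateway. The difference in
    medians is your gateway overhead.
    """
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        call()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[n // 2]
```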

&lt;h2&gt;
  
  
  Final numbers
&lt;/h2&gt;

&lt;p&gt;After all the above:&lt;/p&gt;

&lt;p&gt;| Stage | Before (ms) | After (ms) |&lt;br&gt;
|---|---|---|&lt;br&gt;
| Text encode | 340 | 12 (warmed) |&lt;br&gt;
| UNet 30 steps | 2700 | 2100 |&lt;br&gt;
| VAE decode | 890 | 210 |&lt;br&gt;
| Scheduler/sync | 270 | 90 |&lt;br&gt;
| &lt;strong&gt;Total&lt;/strong&gt; | &lt;strong&gt;4200&lt;/strong&gt; | &lt;strong&gt;2410&lt;/strong&gt; |&lt;/p&gt;

&lt;p&gt;Still above target. To hit 2s we dropped to 24 steps with a DPM++ 2M Karras scheduler. Acceptable quality trade-off for our use case.&lt;/p&gt;
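&lt;p&gt;In diffusers, that scheduler swap can be built from the pipeline's existing scheduler config; a sketch, assuming a standard SDXL pipeline (the helper is illustrative):&lt;/p&gt;

```python
def use_dpmpp_2m_karras(pipe, steps=24):
    """Switch a diffusers pipeline to DPM++ 2M Karras (sketch).

    Built from the pipeline's own scheduler config. `steps` is the
    reduced step count we settled on; tune it for your quality bar.
    """
    from diffusers import DPMSolverMultistepScheduler

    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config,
        algorithm_type="dpmsolver++",
        use_karras_sigmas=True,
    )
    return pipe, steps
```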

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Casting the VAE to bf16 is fine for photographic content. For pixel art or content with hard edges, fp32 can preserve small structures better. Test on your data.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;torch.compile&lt;/code&gt; in reduce-overhead mode uses CUDA graphs internally. It is strict about input shapes. Dynamic batch sizes or resolutions will trigger recompilation, which costs seconds. Pin your shapes or expect volatility.&lt;/p&gt;

&lt;p&gt;TAESD is not a free lunch. Look at outputs manually before shipping. It is a lossy compression of the VAE, and the losses are not always perceptually small.&lt;/p&gt;

&lt;p&gt;CUDA graph capture can hide memory leaks. If you see OOM on long-running workers, disable graphs and re-profile before assuming the model is the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SDXL technical report: &lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2307.01952&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TAESD repository by madebyollin: &lt;a href="https://github.com/madebyollin/taesd" rel="noopener noreferrer"&gt;https://github.com/madebyollin/taesd&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyTorch 2 compile notes on memory formats: &lt;a href="https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html" rel="noopener noreferrer"&gt;https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;NVIDIA Nsight Systems for GPU profiling: &lt;a href="https://developer.nvidia.com/nsight-systems" rel="noopener noreferrer"&gt;https://developer.nvidia.com/nsight-systems&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Diffusers optimization guide: &lt;a href="https://huggingface.co/docs/diffusers/optimization/fp16" rel="noopener noreferrer"&gt;https://huggingface.co/docs/diffusers/optimization/fp16&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>pytorch</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Kimi K2.6 Is a Legit Opus 4.7 Replacement</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Mon, 27 Apr 2026 04:39:40 +0000</pubDate>
      <link>https://forem.com/elise_moreau/kimi-k26-is-a-legit-opus-47-replacement-1fci</link>
      <guid>https://forem.com/elise_moreau/kimi-k26-is-a-legit-opus-47-replacement-1fci</guid>
<description>&lt;p&gt;For a long time, Opus 4.7 has been the default recommendation when someone asks for a top-tier model. It has been reliable, capable, and strong across a wide range of tasks.&lt;/p&gt;

&lt;p&gt;After spending real time with Kimi K2.6 and gathering feedback from customers using it in production workflows, I have started to change my mind. It is the first model I feel comfortable recommending as a practical replacement for Opus 4.7.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not better, but close enough&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kimi K2.6 is not outright better than Opus 4.7. If you are comparing raw performance on difficult reasoning or edge case tasks, Opus still wins.&lt;/p&gt;

&lt;p&gt;What matters more in practice is coverage. Kimi K2.6 can handle around 85 percent of the tasks that Opus can, and it does so at a quality level that is good enough for real work. That gap sounds large on paper, but in day-to-day usage it is surprisingly small.&lt;/p&gt;

&lt;p&gt;Most users are not constantly pushing models to their limits. They need something that works consistently across writing, coding, research, and general problem solving. In that context, Kimi K2.6 holds up very well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strong features that actually matter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two areas where Kimi K2.6 stands out are vision and browser use.&lt;/p&gt;

&lt;p&gt;Vision is not just a checkbox feature here. It is genuinely useful for workflows that involve screenshots, documents, or UI-level debugging. Being able to mix text and visual context smoothly removes a lot of friction.&lt;/p&gt;

&lt;p&gt;Browser use is another big win. It handles multi-step information gathering better than expected, especially for longer tasks where the model needs to plan, search, and refine results over time.&lt;/p&gt;

&lt;p&gt;These features are not always the headline benchmarks, but they have a real impact on productivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surprisingly good at long-horizon tasks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the more unexpected strengths of Kimi K2.6 is how well it handles longer-horizon work.&lt;/p&gt;

&lt;p&gt;I have been slowly replacing parts of my personal workflows with it, including tasks that require multiple steps, iteration, and context retention. It performs more reliably than I expected, and it does not fall apart as quickly over extended interactions.&lt;/p&gt;

&lt;p&gt;This makes it useful for things like research threads, content pipelines, and multi-step coding tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The size question and what it signals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kimi K2.6 is a very large model. There is no getting around that.&lt;/p&gt;

&lt;p&gt;But its performance raises an interesting point. Frontier models like Opus 4.7 are not necessarily introducing completely new capabilities. Instead, we are seeing strong alternatives that can replicate most of that value.&lt;/p&gt;

&lt;p&gt;If a model can deliver 80 to 90 percent of the experience, the remaining gap starts to matter less, especially when other factors come into play.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limits, cost, and the shift to local&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the biggest complaints around models like Opus 4.7 is usage limits. As demand increases, constraints become more noticeable.&lt;/p&gt;

&lt;p&gt;This is where models like Kimi K2.6 become more attractive. There is growing interest in running models locally or in more controlled environments, where limits are less of a concern.&lt;/p&gt;

&lt;p&gt;It feels like the conversation is starting to shift. Instead of chasing the absolute best model, people are looking for models that are good enough, flexible, and easier to integrate into their own systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kimi K2.6 is not a perfect replacement for Opus 4.7. If you need the absolute best performance on every task, Opus is still ahead.&lt;/p&gt;

&lt;p&gt;But for most real-world use cases, Kimi K2.6 gets you very close. Close enough that the trade-offs start to make sense.&lt;/p&gt;

&lt;p&gt;That is what makes it interesting. Not that it beats Opus, but that it makes you question whether you still need Opus at all.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>automation</category>
    </item>
    <item>
      <title>Diffusion Model Inference in Production: What the Benchmarks Leave Out</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:31:27 +0000</pubDate>
      <link>https://forem.com/elise_moreau/diffusion-model-inference-in-production-what-the-benchmarks-leave-out-2669</link>
      <guid>https://forem.com/elise_moreau/diffusion-model-inference-in-production-what-the-benchmarks-leave-out-2669</guid>
<description>

&lt;p&gt;The routing overhead caught us off guard. We were running caption generation through a larger model for every input when 70% of them only needed a fast small model. Adding a gateway with cost-aware routing (we landed on Bifrost for this, though LiteLLM and Portkey do the same thing: &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) cut LLM spend in our vision pipeline by 38% without touching the heavy-model cases.&lt;/p&gt;

</description>
      <category>pytorch</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
