<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Thurmon Demich</title>
    <description>The latest articles on Forem by Thurmon Demich (@thurmon_demich).</description>
    <link>https://forem.com/thurmon_demich</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3900489%2F09f665d8-a7ab-491e-a6b5-8fc8f6fc1992.png</url>
      <title>Forem: Thurmon Demich</title>
      <link>https://forem.com/thurmon_demich</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/thurmon_demich"/>
    <language>en</language>
    <item>
      <title>I Built a GPU Dataset for LLM Inference — Here’s What I Learned</title>
      <dc:creator>Thurmon Demich</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:49:49 +0000</pubDate>
      <link>https://forem.com/thurmon_demich/i-built-a-gpu-dataset-for-llm-inference-heres-what-i-learned-2ida</link>
      <guid>https://forem.com/thurmon_demich/i-built-a-gpu-dataset-for-llm-inference-heres-what-i-learned-2ida</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR: Most GPU advice for LLMs is either outdated or too generic. I started collecting real-world data (VRAM, model fit, tokens/sec), and the patterns are surprisingly consistent.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;Why I built this&lt;/h2&gt;

&lt;p&gt;If you’ve tried running LLMs locally (Ollama, llama.cpp, vLLM), you’ve probably hit this problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Can my GPU run this model?”&lt;/li&gt;
&lt;li&gt;“Why does a 13B model barely fit but run so slowly?”&lt;/li&gt;
&lt;li&gt;“Do I really need 24GB VRAM?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The answers online are all over the place.&lt;/p&gt;

&lt;p&gt;So I started putting together a small dataset based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;community benchmarks&lt;/li&gt;
&lt;li&gt;VRAM limits that hold up consistently across reports&lt;/li&gt;
&lt;li&gt;patterns that repeat across different setups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Repo: &lt;a href="https://github.com/airdropkalami/awesome-gpu-for-llm" rel="noopener noreferrer"&gt;https://github.com/airdropkalami/awesome-gpu-for-llm&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;The pattern is simpler than people think&lt;/h2&gt;

&lt;p&gt;After aggregating the data, a few rules kept showing up.&lt;/p&gt;

&lt;h3&gt;🧠 Practical rules (that actually work)&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7B models&lt;/strong&gt; → run on most GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;13B models&lt;/strong&gt; → need ~16GB for comfortable use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;34B models&lt;/strong&gt; → require 24GB-class GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;70B models&lt;/strong&gt; → usually better on cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren’t theoretical — they show up consistently across different frameworks.&lt;/p&gt;
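&lt;p&gt;To make these rules concrete, here’s a minimal sketch that encodes them as a lookup. The thresholds mirror the bullets above (rules of thumb for Q4-quantized models, not hard limits), and the names are mine, purely for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rule-of-thumb minimum VRAM (GB) for comfortable Q4 inference.
# Mirrors the bullets above; real needs vary with context length.
MIN_VRAM_GB = {"7B": 8, "13B": 16, "34B": 24, "70B": 40}  # 70B: usually cloud anyway

def fits(model_size: str, vram_gb: float) -&amp;gt; bool:
    """True if the model should run comfortably on this much VRAM."""
    return vram_gb &amp;gt;= MIN_VRAM_GB[model_size]

print(fits("13B", 24))  # True: a 24GB card handles 13B easily
print(fits("34B", 16))  # False: 34B wants a 24GB-class GPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;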




&lt;h2&gt;VRAM matters more than compute&lt;/h2&gt;

&lt;p&gt;One thing became obvious quickly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If it doesn’t fit in VRAM, it doesn’t run.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can optimize speed, but you can’t “optimize around” missing VRAM.&lt;/p&gt;
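&lt;p&gt;A quick back-of-envelope check makes this tangible: weight memory is roughly parameter count times bytes per parameter (bits / 8), plus some headroom for the KV cache and runtime buffers. The ~20% overhead factor below is my own rough assumption; long contexts push it higher:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def est_vram_gb(params_b: float, bits: int = 4, overhead: float = 1.2) -&amp;gt; float:
    """Very rough estimate: weights (params * bits/8) plus ~20% for
    KV cache and buffers. Treat it as a floor, not a guarantee."""
    return params_b * (bits / 8) * overhead

print(round(est_vram_gb(7), 1))   # 4.2  (GB): fine on an 8GB card
print(round(est_vram_gb(13), 1))  # 7.8  (GB): tight on 8GB, easy on 16GB
print(round(est_vram_gb(34), 1))  # 20.4 (GB): needs a 24GB-class GPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;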




&lt;h2&gt;Real-world dataset (simplified)&lt;/h2&gt;

&lt;p&gt;Here’s a small snapshot from what I collected:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;th&gt;Fit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4060&lt;/td&gt;
&lt;td&gt;8GB&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~35&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4060&lt;/td&gt;
&lt;td&gt;8GB&lt;/td&gt;
&lt;td&gt;13B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~18&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;24GB&lt;/td&gt;
&lt;td&gt;13B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~45&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;24GB&lt;/td&gt;
&lt;td&gt;34B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~25&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090&lt;/td&gt;
&lt;td&gt;24GB&lt;/td&gt;
&lt;td&gt;34B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~22&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(✅ = fits comfortably in VRAM; ⚠️ = tight fit, expect offloading or a reduced context window.)&lt;/p&gt;

&lt;p&gt;👉 Full dataset (updated):&lt;br&gt;
&lt;a href="https://github.com/airdropkalami/awesome-gpu-for-llm/blob/main/benchmark/dataset.md" rel="noopener noreferrer"&gt;https://github.com/airdropkalami/awesome-gpu-for-llm/blob/main/benchmark/dataset.md&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;What surprised me&lt;/h2&gt;
&lt;h3&gt;1. 13B is the real sweet spot&lt;/h3&gt;

&lt;p&gt;Not 7B. Not 70B.&lt;/p&gt;

&lt;p&gt;13B gives the best balance of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;quality&lt;/li&gt;
&lt;li&gt;speed&lt;/li&gt;
&lt;li&gt;hardware cost&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;2. 24GB is a hard ceiling for most users&lt;/h3&gt;

&lt;p&gt;Once you go beyond that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cost explodes&lt;/li&gt;
&lt;li&gt;scaling becomes inefficient&lt;/li&gt;
&lt;li&gt;cloud often makes more sense&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;3. Benchmarks don’t reflect real usage&lt;/h3&gt;

&lt;p&gt;A lot of GPU comparisons focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FLOPS&lt;/li&gt;
&lt;li&gt;synthetic benchmarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But for LLMs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VRAM &amp;gt; everything else
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;When NOT to buy a GPU&lt;/h2&gt;

&lt;p&gt;This is where most people overspend.&lt;/p&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;want to run 70B models&lt;/li&gt;
&lt;li&gt;only experiment occasionally&lt;/li&gt;
&lt;li&gt;don’t need local inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 you’re better off using cloud GPUs.&lt;/p&gt;
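&lt;p&gt;A simple break-even calculation makes the trade-off concrete. The prices below are placeholders I picked for illustration; plug in current numbers for your provider and region:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Break-even: after this many hours of use, buying beats renting.
# Prices are illustrative placeholders, not quotes.
gpu_price = 1600.00   # e.g., a 24GB consumer card
cloud_rate = 0.50     # assumed cloud GPU price per hour

print(gpu_price / cloud_rate)   # 3200.0 hours (~1.5 years at 40h/week)

# Occasional experimenting, say 5 hours a week:
print(5 * 52 * cloud_rate)      # 130.0 dollars per year on cloud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;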




&lt;h2&gt;If you want a deeper breakdown&lt;/h2&gt;

&lt;p&gt;I wrote more detailed guides here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://bestgpuforllm.com/articles/best-gpu-for-ollama/" rel="noopener noreferrer"&gt;https://bestgpuforllm.com/articles/best-gpu-for-ollama/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bestgpuforllm.com/articles/how-much-vram-for-llm/" rel="noopener noreferrer"&gt;https://bestgpuforllm.com/articles/how-much-vram-for-llm/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Why I’m sharing this&lt;/h2&gt;

&lt;p&gt;I’m still expanding the dataset, and I’m trying to keep it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;practical (not theoretical)&lt;/li&gt;
&lt;li&gt;consistent across setups&lt;/li&gt;
&lt;li&gt;easy to use for decision making&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have real benchmark data or setups, feel free to contribute.&lt;/p&gt;




&lt;h2&gt;Final thought&lt;/h2&gt;

&lt;p&gt;The “best GPU” isn’t the fastest.&lt;/p&gt;

&lt;p&gt;It’s the one that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fits your model&lt;/li&gt;
&lt;li&gt;matches your budget&lt;/li&gt;
&lt;li&gt;and actually works in your setup&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you’re building with local LLMs, I’d love to know what GPU + model combo you’re running.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How to Choose the Right GPU for Local LLMs (Without Wasting Money)</title>
      <dc:creator>Thurmon Demich</dc:creator>
      <pubDate>Mon, 27 Apr 2026 13:15:35 +0000</pubDate>
      <link>https://forem.com/thurmon_demich/how-to-choose-the-right-gpu-for-local-llms-without-wasting-money-2c9d</link>
      <guid>https://forem.com/thurmon_demich/how-to-choose-the-right-gpu-for-local-llms-without-wasting-money-2c9d</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR: Most people overspend on GPUs for local LLMs. If you match &lt;strong&gt;model size ↔ VRAM ↔ quantization&lt;/strong&gt;, you can save hundreds (or thousands) and still get great results.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;Why this matters&lt;/h2&gt;

&lt;p&gt;If you’re running local LLMs (Ollama, llama.cpp, vLLM, etc.), the biggest mistake I see is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buying a GPU that’s &lt;strong&gt;too powerful (and too expensive)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Or worse, buying one with &lt;strong&gt;not enough VRAM&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both lead to frustration.&lt;/p&gt;

&lt;p&gt;This guide breaks down how to choose the &lt;strong&gt;right GPU for your actual workload&lt;/strong&gt; — not just benchmarks.&lt;/p&gt;




&lt;h2&gt;Step 1 — Understand what actually limits you&lt;/h2&gt;

&lt;p&gt;For LLM inference, &lt;strong&gt;VRAM matters more than raw compute&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;Rough VRAM requirements&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model Size&lt;/th&gt;
&lt;th&gt;Typical VRAM (quantized)&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;6–8GB&lt;/td&gt;
&lt;td&gt;Entry-level, very easy to run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13B&lt;/td&gt;
&lt;td&gt;10–16GB&lt;/td&gt;
&lt;td&gt;Sweet spot for many users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;34B&lt;/td&gt;
&lt;td&gt;20–24GB&lt;/td&gt;
&lt;td&gt;High-end consumer GPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;70B&lt;/td&gt;
&lt;td&gt;40GB+&lt;/td&gt;
&lt;td&gt;Usually cloud or multi-GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you remember one thing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;VRAM determines what you &lt;em&gt;can&lt;/em&gt; run. Compute determines how fast it runs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
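&lt;p&gt;To see which tier your own machine lands in, you can compare total VRAM against the table. The sketch below assumes an NVIDIA GPU with &lt;code&gt;nvidia-smi&lt;/code&gt; on the PATH, and uses the table’s lower bounds as thresholds:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

# Total VRAM of the first GPU, in MiB (NVIDIA only).
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.total",
     "--format=csv,noheader,nounits"], text=True)
vram_gb = int(out.split()[0]) / 1024

# Lower bounds from the table above (quantized models).
for size, need in [("70B", 40), ("34B", 20), ("13B", 10), ("7B", 6)]:
    if vram_gb &amp;gt;= need:
        print(f"{vram_gb:.0f}GB detected: up to {size} should be feasible")
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;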




&lt;h2&gt;Step 2 — Pick your use case first (not the GPU)&lt;/h2&gt;

&lt;p&gt;Before looking at GPUs, define your goal:&lt;/p&gt;

&lt;h3&gt;1. Lightweight local assistant (7B–13B)&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Coding assistant&lt;/li&gt;
&lt;li&gt;Chatbot&lt;/li&gt;
&lt;li&gt;RAG experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 You don’t need a flagship GPU.&lt;/p&gt;

&lt;h3&gt;2. Serious local inference (13B–34B)&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Better reasoning&lt;/li&gt;
&lt;li&gt;Higher quality outputs&lt;/li&gt;
&lt;li&gt;More stable pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This is where most developers should aim.&lt;/p&gt;

&lt;h3&gt;3. Large models (70B+)&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;High-end research&lt;/li&gt;
&lt;li&gt;Production-level inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Local becomes expensive very quickly.&lt;/p&gt;




&lt;h2&gt;Step 3 — Real GPU recommendations (2026)&lt;/h2&gt;

&lt;p&gt;Here’s a practical breakdown:&lt;/p&gt;

&lt;h3&gt;Best budget option&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RTX 4060 / 4060 Ti (8–16GB)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Good for: 7B–13B models&lt;/li&gt;
&lt;li&gt;Limitation: VRAM ceiling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Best overall value&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RTX 4090 (24GB)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Good for: 13B–34B models&lt;/li&gt;
&lt;li&gt;Why: Enough VRAM + strong performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Used value pick&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RTX 3090 (24GB)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Still extremely relevant for LLMs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;High-end / no-compromise&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RTX 5090-class&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Only if budget is not a concern&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Step 4 — When NOT to buy a GPU&lt;/h2&gt;

&lt;p&gt;This is where most people get it wrong.&lt;/p&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Want to run &lt;strong&gt;70B models&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Don’t need constant local inference&lt;/li&gt;
&lt;li&gt;Are just experimenting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;strong&gt;Use cloud GPUs instead.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s often cheaper and far more flexible.&lt;/p&gt;




&lt;h2&gt;Step 5 — Common mistakes&lt;/h2&gt;

&lt;h3&gt;❌ Mistake 1: Buying for benchmarks&lt;/h3&gt;

&lt;p&gt;Benchmarks ≠ your real workload.&lt;/p&gt;

&lt;h3&gt;❌ Mistake 2: Ignoring VRAM&lt;/h3&gt;

&lt;p&gt;You can’t “optimize around” missing VRAM.&lt;/p&gt;

&lt;h3&gt;❌ Mistake 3: Overbuying&lt;/h3&gt;

&lt;p&gt;A $1600 GPU for a 7B model is overkill.&lt;/p&gt;

&lt;h3&gt;❌ Mistake 4: Forcing everything local&lt;/h3&gt;

&lt;p&gt;Cloud exists for a reason.&lt;/p&gt;




&lt;h2&gt;Step 6 — Simple decision guide&lt;/h2&gt;

&lt;p&gt;If you just want a quick answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Beginner / budget → RTX 4060&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Most users → RTX 4090&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tight budget but want 24GB → used 3090&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Need 70B → go cloud&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
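&lt;p&gt;And the same guide as code, if you prefer your decisions executable. The function and thresholds are mine, a toy version of the list above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def pick_gpu(budget_usd: int, largest_model: str = "13B") -&amp;gt; str:
    """Toy version of the decision list above; thresholds are illustrative."""
    if largest_model == "70B":
        return "go cloud"
    if budget_usd &amp;gt;= 1600:
        return "RTX 4090 (24GB)"
    if largest_model == "34B":
        return "used RTX 3090 (24GB)"  # tight budget, still 24GB
    return "RTX 4060 class (8-16GB)"

print(pick_gpu(500))          # RTX 4060 class (8-16GB)
print(pick_gpu(900, "34B"))   # used RTX 3090 (24GB)
print(pick_gpu(2000, "70B"))  # go cloud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;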




&lt;h2&gt;Want a deeper breakdown?&lt;/h2&gt;

&lt;p&gt;I put together a more detailed guide (including VRAM charts and specific model compatibility):&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://bestgpuforllm.com/articles/best-gpu-for-ollama/" rel="noopener noreferrer"&gt;https://bestgpuforllm.com/articles/best-gpu-for-ollama/&lt;/a&gt;&lt;br&gt;
👉 &lt;a href="https://bestgpuforllm.com/articles/how-much-vram-for-llm/" rel="noopener noreferrer"&gt;https://bestgpuforllm.com/articles/how-much-vram-for-llm/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;Final thought&lt;/h2&gt;

&lt;p&gt;The best GPU isn’t the most expensive one.&lt;/p&gt;

&lt;p&gt;It’s the one that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fits your &lt;strong&gt;model size&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Matches your &lt;strong&gt;budget&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;And doesn’t lock you into unnecessary cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you get those 3 right, you’re already ahead of most people building local AI setups.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Curious what setups others are running? Drop your GPU + model combo below — I’m collecting real-world configs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
