<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Max Vyaznikov</title>
    <description>The latest articles on Forem by Max Vyaznikov (@maxvyaznikov).</description>
    <link>https://forem.com/maxvyaznikov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3819321%2F53229506-3d43-4511-a7f6-bb2f58e84931.png</url>
      <title>Forem: Max Vyaznikov</title>
      <link>https://forem.com/maxvyaznikov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/maxvyaznikov"/>
    <language>en</language>
    <item>
      <title>Running DeepSeek, Llama 3, and Qwen Locally: Complete GPU Requirements Guide</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:09:12 +0000</pubDate>
      <link>https://forem.com/maxvyaznikov/running-deepseek-llama-3-and-qwen-locally-complete-gpu-requirements-guide-6fd</link>
      <guid>https://forem.com/maxvyaznikov/running-deepseek-llama-3-and-qwen-locally-complete-gpu-requirements-guide-6fd</guid>
      <description>&lt;p&gt;Want to run the latest open-source LLMs on your own hardware? Here's exactly what you need for each popular model family.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference: VRAM Requirements
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;FP16&lt;/th&gt;
&lt;th&gt;Q8&lt;/th&gt;
&lt;th&gt;Q4_K_M&lt;/th&gt;
&lt;th&gt;Min GPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 3.1 8B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;8.5 GB&lt;/td&gt;
&lt;td&gt;5 GB&lt;/td&gt;
&lt;td&gt;RTX 3060 12GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 3.1 70B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;140 GB&lt;/td&gt;
&lt;td&gt;70 GB&lt;/td&gt;
&lt;td&gt;40 GB&lt;/td&gt;
&lt;td&gt;2× RTX 3090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 3.1 405B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;810 GB&lt;/td&gt;
&lt;td&gt;405 GB&lt;/td&gt;
&lt;td&gt;228 GB&lt;/td&gt;
&lt;td&gt;8× A100 80GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5 7B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14 GB&lt;/td&gt;
&lt;td&gt;7.5 GB&lt;/td&gt;
&lt;td&gt;4.5 GB&lt;/td&gt;
&lt;td&gt;RTX 3060 8GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5 14B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;28 GB&lt;/td&gt;
&lt;td&gt;14 GB&lt;/td&gt;
&lt;td&gt;8.5 GB&lt;/td&gt;
&lt;td&gt;RTX 4060 Ti 16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5 32B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 GB&lt;/td&gt;
&lt;td&gt;32 GB&lt;/td&gt;
&lt;td&gt;18 GB&lt;/td&gt;
&lt;td&gt;RTX 3090 24GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5 72B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;144 GB&lt;/td&gt;
&lt;td&gt;72 GB&lt;/td&gt;
&lt;td&gt;41 GB&lt;/td&gt;
&lt;td&gt;2× RTX 3090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mistral Small 24B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;48 GB&lt;/td&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;14 GB&lt;/td&gt;
&lt;td&gt;RTX 4080 16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mistral Large 123B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;246 GB&lt;/td&gt;
&lt;td&gt;123 GB&lt;/td&gt;
&lt;td&gt;69 GB&lt;/td&gt;
&lt;td&gt;4× RTX 3090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V3 671B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,340 GB&lt;/td&gt;
&lt;td&gt;670 GB&lt;/td&gt;
&lt;td&gt;376 GB&lt;/td&gt;
&lt;td&gt;5× A100 80GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek R1 671B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,340 GB&lt;/td&gt;
&lt;td&gt;670 GB&lt;/td&gt;
&lt;td&gt;376 GB&lt;/td&gt;
&lt;td&gt;5× A100 80GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phi-3.5 Mini 3.8B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.6 GB&lt;/td&gt;
&lt;td&gt;4 GB&lt;/td&gt;
&lt;td&gt;2.5 GB&lt;/td&gt;
&lt;td&gt;RTX 3060 8GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma 2 27B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;54 GB&lt;/td&gt;
&lt;td&gt;27 GB&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;RTX 4080 16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For any model, you can calculate exact VRAM needs at the &lt;a href="https://gpuark.com/en/vram-calculator/" rel="noopener noreferrer"&gt;VRAM calculator on gpuark.com&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model-by-Model Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Llama 3.1 — The All-Rounder
&lt;/h3&gt;

&lt;p&gt;Meta's Llama 3.1 comes in 8B, 70B, and 405B sizes. The 8B is perfect for getting started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Run Llama 3.1 8B (auto-downloads ~4.7GB)&lt;/span&gt;
ollama run llama3.1

&lt;span class="c"&gt;# Or the 70B if you have the VRAM&lt;/span&gt;
ollama run llama3.1:70b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
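
&lt;p&gt;Ollama also exposes a local REST API on port 11434, so you can script against the model. A minimal sketch using only the Python standard library, assuming you've pulled &lt;code&gt;llama3.1&lt;/code&gt; as above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Query a local Ollama server (default endpoint: http://localhost:11434).
import json
import urllib.request

payload = json.dumps({
    "model": "llama3.1",   # any model you've pulled with `ollama run`
    "prompt": "Explain the KV cache in one sentence.",
    "stream": False,       # one JSON object instead of a token stream
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;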



&lt;p&gt;&lt;strong&gt;8B at Q4_K_M&lt;/strong&gt;: Fits on any 8GB+ GPU. Great for coding, summarization, general chat. Not competitive with GPT-4 on complex reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;70B at Q4_K_M&lt;/strong&gt;: This is where Llama 3.1 really shines — competitive with GPT-4 on many benchmarks. Needs ~40GB VRAM, so two 3090s or a single A100 80GB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;405B&lt;/strong&gt;: Research-grade. Even at Q4 the weights alone are ~228 GB, so realistically 4+ A100 80GB (the table assumes 8× for headroom). Not practical for most individuals.&lt;/p&gt;

&lt;h3&gt;
  
  
  DeepSeek V3 / R1 — The MoE Giants
&lt;/h3&gt;

&lt;p&gt;DeepSeek V3 (671B) uses &lt;strong&gt;Mixture of Experts&lt;/strong&gt; — only ~37B parameters active per token, but all 671B must fit in memory. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At Q4_K_M: ~376 GB VRAM minimum&lt;/li&gt;
&lt;li&gt;Realistic minimum: &lt;strong&gt;5× A100 80GB&lt;/strong&gt; (400 GB total)&lt;/li&gt;
&lt;li&gt;On consumer hardware: not feasible for the full model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;But&lt;/strong&gt;: distilled DeepSeek R1 versions (Qwen- and Llama-based) exist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-7B&lt;/strong&gt;: 4.5 GB at Q4 — runs on any modern GPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-14B&lt;/strong&gt;: 8.5 GB at Q4 — RTX 4060 Ti&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-32B&lt;/strong&gt;: 18 GB at Q4 — RTX 3090&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-70B&lt;/strong&gt;: 40 GB at Q4 — 2× RTX 3090&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The distilled 32B is arguably the best reasoning model you can run on a single consumer GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qwen2.5 — Best for Coding
&lt;/h3&gt;

&lt;p&gt;Alibaba's Qwen2.5 series excels at code generation. The -Coder variants are particularly strong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Qwen2.5-Coder-14B — best coding model for 16GB GPUs&lt;/span&gt;
ollama run qwen2.5-coder:14b

&lt;span class="c"&gt;# Qwen2.5-32B — strong general model for 24GB GPUs&lt;/span&gt;
ollama run qwen2.5:32b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Qwen2.5-Coder-14B&lt;/strong&gt; at Q4_K_M (~8.5 GB) is the sweet spot for developer use. It handles Python, JavaScript, Rust, Go with impressive accuracy and fits on a 12GB card.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistral — Efficient and Fast
&lt;/h3&gt;

&lt;p&gt;Mistral models are known for good quality-to-size ratio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Mistral Small 24B — best quality under 16GB&lt;/span&gt;
ollama run mistral-small

&lt;span class="c"&gt;# Mistral Large 123B — needs serious hardware&lt;/span&gt;
ollama run mistral-large
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mistral Small 24B&lt;/strong&gt; at Q4_K_M (~14 GB) is the best general-purpose model for 16GB GPUs. Solid reasoning, good instruction following, fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Setup Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Beginner Setup (~$400)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: RTX 4060 Ti 16GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Qwen2.5-14B, Mistral-Small-24B (Q4), Llama 3.1 8B (Q8)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt;: Ollama + Open WebUI&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Enthusiast Setup (~$700)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: Used RTX 3090 24GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Qwen2.5-32B, DeepSeek-R1-32B, any 34B model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt;: Ollama or ExLlamaV2 + TabbyAPI&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Power User Setup (~$1,400)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPUs&lt;/strong&gt;: 2× Used RTX 3090 (48GB total)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Llama 3.1 70B, Qwen2.5-72B, Mixtral 8x22B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt;: llama.cpp with &lt;code&gt;--tensor-split 24,24&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prosumer Setup (~$2,000)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: RTX 4090 + used RTX 3090&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Same as above, faster inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt;: ExLlamaV2 with tensor parallelism&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance Tips
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Use the right quantization
&lt;/h3&gt;

&lt;p&gt;Q4_K_M for most models. Go Q5 or Q6 only if VRAM allows — the quality gain is marginal but measurable on reasoning.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Optimize KV cache
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# llama.cpp: limit context to what you need&lt;/span&gt;
llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; model.gguf &lt;span class="nt"&gt;-c&lt;/span&gt; 4096  &lt;span class="c"&gt;# instead of the model's full trained context&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Halving context length saves significant VRAM.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Flash Attention
&lt;/h3&gt;

&lt;p&gt;Requires CC 8.0+ (RTX 30-series or newer). Enabled by default in most frameworks. Reduces attention memory for long contexts from O(n²) to O(n).&lt;/p&gt;

&lt;h3&gt;
  
  
  4. CPU offloading for oversized models
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# llama.cpp: offload only some layers to GPU&lt;/span&gt;
llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; model.gguf &lt;span class="nt"&gt;-ngl&lt;/span&gt; 20  &lt;span class="c"&gt;# 20 layers on GPU, rest on CPU&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Slower but lets you run models that don't fully fit. Expect ~2-5 tok/s for CPU layers vs ~30+ for GPU.&lt;/p&gt;
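
&lt;p&gt;If you're unsure what to pass to &lt;code&gt;-ngl&lt;/code&gt;, here's a back-of-the-envelope sketch. It assumes layers are roughly uniform in size and reserves a couple of GB for KV cache and buffers, so treat the result as a starting point:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough estimate of how many layers fit on the GPU for llama.cpp's -ngl.
def estimate_ngl(model_gb, num_layers, vram_gb, kv_and_overhead_gb=2.0):
    layer_gb = model_gb / num_layers        # assumes roughly uniform layers
    usable = vram_gb - kv_and_overhead_gb   # leave room for KV cache, buffers
    return max(0, min(num_layers, int(usable / layer_gb)))

# Example: Llama 3.1 70B at Q4_K_M (~40 GB, 80 layers) on a 24 GB card
print(estimate_ngl(model_gb=40, num_layers=80, vram_gb=24))  # → 44
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;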

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The local LLM ecosystem has matured enormously. For most developers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with Ollama&lt;/strong&gt; — zero-friction setup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Get at least 16GB VRAM&lt;/strong&gt; — opens up 24B models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;24GB (RTX 3090) is the sweet spot&lt;/strong&gt; — runs everything up to 34B comfortably&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two GPUs if you need 70B+&lt;/strong&gt; — pipeline parallelism just works&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The quality gap between local 32B models and cloud GPT-4 has narrowed significantly, especially for coding and domain-specific tasks. For many workflows, local is now good enough.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your local LLM setup? Drop your GPU + favorite model in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A Developer's Guide to Choosing a GPU for Machine Learning in 2025-2026</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:04:11 +0000</pubDate>
      <link>https://forem.com/maxvyaznikov/a-developers-guide-to-choosing-a-gpu-for-machine-learning-in-2025-2026-5d4f</link>
      <guid>https://forem.com/maxvyaznikov/a-developers-guide-to-choosing-a-gpu-for-machine-learning-in-2025-2026-5d4f</guid>
      <description>&lt;p&gt;Choosing the right GPU for ML is confusing. Marketing specs don't tell you what matters for training and inference. Here's what actually counts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Specs That Matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. VRAM (Most Important)
&lt;/h3&gt;

&lt;p&gt;VRAM determines &lt;strong&gt;what models you can run&lt;/strong&gt;. No amount of compute power helps if your model doesn't fit in memory.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;What Fits (Inference)&lt;/th&gt;
&lt;th&gt;What Fits (Training)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;7B at Q4&lt;/td&gt;
&lt;td&gt;7B QLoRA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12 GB&lt;/td&gt;
&lt;td&gt;13B at Q4&lt;/td&gt;
&lt;td&gt;7B QLoRA comfortably&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;24B at Q4&lt;/td&gt;
&lt;td&gt;13B QLoRA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;34B at Q5&lt;/td&gt;
&lt;td&gt;13B full fine-tune, 34B QLoRA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;48 GB&lt;/td&gt;
&lt;td&gt;70B at Q4&lt;/td&gt;
&lt;td&gt;34B full fine-tune&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;80 GB&lt;/td&gt;
&lt;td&gt;70B at FP16&lt;/td&gt;
&lt;td&gt;70B QLoRA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb&lt;/strong&gt;: buy the most VRAM you can afford. You can't upgrade VRAM later.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Memory Bandwidth
&lt;/h3&gt;

&lt;p&gt;For LLM inference, throughput is limited by how fast you can read model weights from VRAM. This is the &lt;strong&gt;memory bandwidth&lt;/strong&gt; spec.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Bandwidth&lt;/th&gt;
&lt;th&gt;Llama 8B Q4 tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4060&lt;/td&gt;
&lt;td&gt;272 GB/s&lt;/td&gt;
&lt;td&gt;~35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4070&lt;/td&gt;
&lt;td&gt;504 GB/s&lt;/td&gt;
&lt;td&gt;~60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090&lt;/td&gt;
&lt;td&gt;936 GB/s&lt;/td&gt;
&lt;td&gt;~85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;1,008 GB/s&lt;/td&gt;
&lt;td&gt;~105&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A100 80GB&lt;/td&gt;
&lt;td&gt;2,039 GB/s&lt;/td&gt;
&lt;td&gt;~180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H100&lt;/td&gt;
&lt;td&gt;3,350 GB/s&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Higher bandwidth = faster token generation. This is why a 3090 feels faster for LLMs than a 4070 Ti despite being older.&lt;/p&gt;
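
&lt;p&gt;You can sanity-check these numbers with a simple roofline estimate: generating one token reads every weight once, so bandwidth divided by model size gives a theoretical ceiling. Measured throughput typically lands at roughly half to two-thirds of that ceiling:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Roofline ceiling: tokens/sec = bandwidth / bytes read per token.
def ceiling_tok_s(bandwidth_gb_s, model_size_gb):
    return bandwidth_gb_s / model_size_gb

# Llama 3.1 8B at Q4_K_M is ~5 GB on disk:
for gpu, bw in [("RTX 4060", 272), ("RTX 3090", 936), ("RTX 4090", 1008)]:
    print(f"{gpu}: ceiling ~{ceiling_tok_s(bw, 5.0):.0f} tok/s")
# RTX 3090: ceiling ~187 tok/s; the measured ~85 above is ~45% of that
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;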

&lt;h3&gt;
  
  
  3. Tensor Cores
&lt;/h3&gt;

&lt;p&gt;Tensor Cores accelerate matrix multiplication — the core operation in neural networks. They matter most for &lt;strong&gt;training&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;CC&lt;/th&gt;
&lt;th&gt;Supported Precisions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st (Volta)&lt;/td&gt;
&lt;td&gt;7.0&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2nd (Turing)&lt;/td&gt;
&lt;td&gt;7.5&lt;/td&gt;
&lt;td&gt;FP16, INT8, INT4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3rd (Ampere)&lt;/td&gt;
&lt;td&gt;8.x&lt;/td&gt;
&lt;td&gt;FP16, BF16, TF32, INT8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4th (Ada)&lt;/td&gt;
&lt;td&gt;8.9&lt;/td&gt;
&lt;td&gt;FP16, BF16, TF32, FP8, INT8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5th (Blackwell)&lt;/td&gt;
&lt;td&gt;10.0 (B200) / 12.0 (RTX 50)&lt;/td&gt;
&lt;td&gt;All above + FP4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;BF16 support (Ampere+)&lt;/strong&gt; is especially important — it's the default training precision for modern models and avoids the NaN issues that FP16 can cause.&lt;/p&gt;
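
&lt;p&gt;You can verify this on your own card straight from PyTorch; a quick check using the public API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

# True on Ampere (CC 8.0+) and newer, False on Turing/Pascal cards
print(torch.cuda.is_bf16_supported())

# The underlying reason: compute capability major version 8 or higher
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;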

&lt;h3&gt;
  
  
  4. CUDA Compute Capability
&lt;/h3&gt;

&lt;p&gt;CC determines what frameworks and features your GPU supports. As of 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimum CC 5.0&lt;/strong&gt; for PyTorch/TensorFlow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 7.0+&lt;/strong&gt; for Tensor Cores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 8.0+&lt;/strong&gt; for Flash Attention, BF16&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 8.9&lt;/strong&gt; for FP8&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can look up any GPU's compute capability at &lt;a href="https://gpuark.com/en/cuda-compute-capability/" rel="noopener noreferrer"&gt;gpuark.com&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Recommendations by Budget
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Under $400: RTX 4060 Ti 16GB
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;16 GB VRAM — runs 24B models at Q4&lt;/li&gt;
&lt;li&gt;CC 8.9 (Ada Lovelace) — all modern features&lt;/li&gt;
&lt;li&gt;165W TDP — low power&lt;/li&gt;
&lt;li&gt;Limitation: 128-bit bus, 288 GB/s bandwidth (slow for LLMs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  $500-700: Used RTX 3090
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;24 GB VRAM&lt;/strong&gt; — the sweet spot&lt;/li&gt;
&lt;li&gt;CC 8.6 — BF16, Flash Attention, everything you need&lt;/li&gt;
&lt;li&gt;936 GB/s bandwidth — fast LLM inference&lt;/li&gt;
&lt;li&gt;350W TDP — needs a beefy PSU&lt;/li&gt;
&lt;li&gt;Best value in ML GPUs right now&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  $1,500-1,800: RTX 4090
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;24 GB VRAM (same as 3090)&lt;/li&gt;
&lt;li&gt;2× training throughput vs 3090&lt;/li&gt;
&lt;li&gt;Better power efficiency&lt;/li&gt;
&lt;li&gt;CC 8.9 — FP8 support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  $3,000-5,000: Used A100 40GB/80GB
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Professional GPU with ECC memory&lt;/li&gt;
&lt;li&gt;80GB version fits 70B at FP16&lt;/li&gt;
&lt;li&gt;2 TB/s bandwidth&lt;/li&gt;
&lt;li&gt;NVLink support for multi-GPU&lt;/li&gt;
&lt;li&gt;Best for research labs and startups&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  "More CUDA cores = better for ML"
&lt;/h3&gt;

&lt;p&gt;Not always. Compare a 4070 (5,888 cores) with a 3090 (10,496 cores): the 3090 is better for ML despite being a generation older, because VRAM and bandwidth matter more.&lt;/p&gt;

&lt;h3&gt;
  
  
  "I need the latest generation"
&lt;/h3&gt;

&lt;p&gt;The RTX 3090 (2020) is still one of the best ML GPUs in 2026. Unless you specifically need FP8 or newer features, older high-end cards often beat newer mid-range ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Gaming benchmarks predict ML performance"
&lt;/h3&gt;

&lt;p&gt;Gaming uses completely different GPU capabilities. A GPU that's 20% faster in games might be 50% slower for training if it has less VRAM or lower bandwidth.&lt;/p&gt;

&lt;h3&gt;
  
  
  "I'll just use the cloud"
&lt;/h3&gt;

&lt;p&gt;Cloud GPUs cost $1-4/hour. If you train regularly, a $700 used 3090 pays for itself in ~3-6 months compared to cloud rentals.&lt;/p&gt;
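
&lt;p&gt;The break-even arithmetic, as a sketch; the hourly rate and monthly usage are assumptions you should adjust to your situation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Months until a purchased GPU beats renting in the cloud.
def breakeven_months(gpu_price, cloud_rate_per_hour, hours_per_month):
    return gpu_price / (cloud_rate_per_hour * hours_per_month)

# $700 used 3090 vs an assumed $1.50/hr cloud GPU at ~100 hrs/month:
print(f"{breakeven_months(700, 1.50, 100):.1f} months")  # 4.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;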

&lt;h2&gt;
  
  
  Quick Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Best Choice&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max VRAM per $&lt;/td&gt;
&lt;td&gt;Used RTX 3090&lt;/td&gt;
&lt;td&gt;24GB at ~$650&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training speed&lt;/td&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;2× faster than 3090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference tok/s&lt;/td&gt;
&lt;td&gt;RTX 3090 or 4090&lt;/td&gt;
&lt;td&gt;Best bandwidth at consumer price&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM 70B+&lt;/td&gt;
&lt;td&gt;2× Used 3090&lt;/td&gt;
&lt;td&gt;48GB for ~$1,300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Professional&lt;/td&gt;
&lt;td&gt;A100 80GB&lt;/td&gt;
&lt;td&gt;80GB, NVLink, ECC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Building an ML rig? Drop your budget and use case in the comments — happy to help pick components!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>RTX 4090 vs RTX 3090 for AI/ML: Is the Upgrade Worth It?</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:03:04 +0000</pubDate>
      <link>https://forem.com/maxvyaznikov/rtx-4090-vs-rtx-3090-for-aiml-is-the-upgrade-worth-it-c68</link>
      <guid>https://forem.com/maxvyaznikov/rtx-4090-vs-rtx-3090-for-aiml-is-the-upgrade-worth-it-c68</guid>
      <description>&lt;p&gt;The RTX 3090 and RTX 4090 are the two most popular consumer GPUs for AI/ML work. Both have 24GB VRAM, but the price gap is massive. Let's break down when each one makes sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Specs Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;RTX 3090&lt;/th&gt;
&lt;th&gt;RTX 4090&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Ampere (CC 8.6)&lt;/td&gt;
&lt;td&gt;Ada Lovelace (CC 8.9)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VRAM&lt;/td&gt;
&lt;td&gt;24 GB GDDR6X&lt;/td&gt;
&lt;td&gt;24 GB GDDR6X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory Bandwidth&lt;/td&gt;
&lt;td&gt;936 GB/s&lt;/td&gt;
&lt;td&gt;1,008 GB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CUDA Cores&lt;/td&gt;
&lt;td&gt;10,496&lt;/td&gt;
&lt;td&gt;16,384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tensor Cores&lt;/td&gt;
&lt;td&gt;328 (3rd gen)&lt;/td&gt;
&lt;td&gt;512 (4th gen)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TDP&lt;/td&gt;
&lt;td&gt;350W&lt;/td&gt;
&lt;td&gt;450W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP16 Tensor&lt;/td&gt;
&lt;td&gt;142 TFLOPS&lt;/td&gt;
&lt;td&gt;330 TFLOPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New Price (2026)&lt;/td&gt;
&lt;td&gt;Discontinued&lt;/td&gt;
&lt;td&gt;~$1,800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Used Price (2026)&lt;/td&gt;
&lt;td&gt;~$600-700&lt;/td&gt;
&lt;td&gt;~$1,400-1,500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a detailed side-by-side with all specifications, see the &lt;a href="https://gpuark.com/en/gpu/nvidia-geforce-rtx-4090-vs-nvidia-geforce-rtx-3090/" rel="noopener noreferrer"&gt;RTX 4090 vs RTX 3090 comparison page on gpuark.com&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training Performance
&lt;/h2&gt;

&lt;p&gt;The 4090 is roughly &lt;strong&gt;1.7-2× faster&lt;/strong&gt; for training due to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;56% more CUDA cores&lt;/li&gt;
&lt;li&gt;4th gen Tensor Cores (better FP8, BF16 throughput)&lt;/li&gt;
&lt;li&gt;Higher clock speeds&lt;/li&gt;
&lt;li&gt;Better power efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real-world training benchmarks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;RTX 3090&lt;/th&gt;
&lt;th&gt;RTX 4090&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ResNet-50 (BS=64)&lt;/td&gt;
&lt;td&gt;780 img/s&lt;/td&gt;
&lt;td&gt;1,420 img/s&lt;/td&gt;
&lt;td&gt;1.82×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BERT fine-tune (BS=32)&lt;/td&gt;
&lt;td&gt;145 samples/s&lt;/td&gt;
&lt;td&gt;268 samples/s&lt;/td&gt;
&lt;td&gt;1.85×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stable Diffusion training&lt;/td&gt;
&lt;td&gt;2.1 it/s&lt;/td&gt;
&lt;td&gt;3.8 it/s&lt;/td&gt;
&lt;td&gt;1.81×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLaMA 7B LoRA (r=16)&lt;/td&gt;
&lt;td&gt;1.4 it/s&lt;/td&gt;
&lt;td&gt;2.6 it/s&lt;/td&gt;
&lt;td&gt;1.86×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Inference Performance (LLMs)
&lt;/h2&gt;

&lt;p&gt;For LLM inference, the gap narrows because it's &lt;strong&gt;memory-bandwidth bound&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;RTX 3090&lt;/th&gt;
&lt;th&gt;RTX 4090&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B Q4 (tok/s)&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;1.24×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 70B Q4 (tok/s)&lt;/td&gt;
&lt;td&gt;doesn't fit&lt;/td&gt;
&lt;td&gt;doesn't fit&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral 7B Q4 (prompt)&lt;/td&gt;
&lt;td&gt;1,200 tok/s&lt;/td&gt;
&lt;td&gt;1,800 tok/s&lt;/td&gt;
&lt;td&gt;1.50×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Memory bandwidth difference is only 8% (936 vs 1,008 GB/s), so for pure token generation the 4090 advantage is modest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Decision
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Buy a 4090 if:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Training throughput is your bottleneck (research, frequent fine-tuning)&lt;/li&gt;
&lt;li&gt;You need FP8 features (CC 8.9 vs 8.6)&lt;/li&gt;
&lt;li&gt;Power efficiency matters (performance per watt is much better)&lt;/li&gt;
&lt;li&gt;You want one powerful card, not multi-GPU hassle&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Buy a used 3090 (or two) if:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;VRAM is your bottleneck (most LLM use cases)&lt;/li&gt;
&lt;li&gt;Budget matters — two 3090s = 48GB for ~$1,300 vs one 4090 = 24GB for ~$1,500&lt;/li&gt;
&lt;li&gt;You primarily do inference&lt;/li&gt;
&lt;li&gt;You want to run 34B+ models&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The multi-GPU argument
&lt;/h3&gt;

&lt;p&gt;Two used 3090s give you &lt;strong&gt;48GB total VRAM&lt;/strong&gt; for less than one 4090:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can run Llama 3.1 70B at Q4_K_M&lt;/li&gt;
&lt;li&gt;Pipeline parallelism with llama.cpp works out of the box&lt;/li&gt;
&lt;li&gt;Training with FSDP/DeepSpeed ZeRO-3 across both cards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The catch: inter-GPU communication over PCIe is slower than a single card's internal bandwidth. For training, expect ~1.5-1.7× scaling (not 2×). For inference with pipeline parallelism, the latency penalty is minimal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Power Consumption
&lt;/h2&gt;

&lt;p&gt;Often overlooked but significant (figures below assume ~$0.12/kWh):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;TDP&lt;/th&gt;
&lt;th&gt;Annual electricity (24/7)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1× RTX 3090&lt;/td&gt;
&lt;td&gt;350W&lt;/td&gt;
&lt;td&gt;~$370/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1× RTX 4090&lt;/td&gt;
&lt;td&gt;450W&lt;/td&gt;
&lt;td&gt;~$475/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2× RTX 3090&lt;/td&gt;
&lt;td&gt;700W&lt;/td&gt;
&lt;td&gt;~$740/year&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If running 24/7 as an inference server, the 4090's better perf/watt matters. For occasional use, it doesn't.&lt;/p&gt;
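
&lt;p&gt;Here's the arithmetic behind those figures, so you can plug in your own electricity rate and duty cycle:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Annual electricity cost for a GPU at a given average draw.
def annual_cost(watts, rate_per_kwh=0.12, hours_per_day=24):
    kwh_per_year = watts / 1000 * hours_per_day * 365
    return kwh_per_year * rate_per_kwh

for config, watts in [("1× RTX 3090", 350), ("1× RTX 4090", 450), ("2× RTX 3090", 700)]:
    print(f"{config}: ${annual_cost(watts):.0f}/year")
# 350W → $368, 450W → $473, 700W → $736 at $0.12/kWh, matching the table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;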

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;The RTX 3090 at $600-700 used is the &lt;strong&gt;best value proposition in ML hardware&lt;/strong&gt; right now. The 4090 is a better card in every metric except price-per-VRAM-GB, but the 3090 gives you 80% of the capability at 40% of the price.&lt;/p&gt;

&lt;p&gt;If you're VRAM-limited (and you probably are if you're running LLMs), two 3090s beat one 4090 every time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running ML workloads on consumer GPUs? Share your setup in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
    </item>
    <item>
      <title>CUDA Compute Capability: What It Is and Why It Matters for ML Engineers</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 03:45:21 +0000</pubDate>
      <link>https://forem.com/maxvyaznikov/cuda-compute-capability-what-it-is-and-why-it-matters-for-ml-engineers-1mhg</link>
      <guid>https://forem.com/maxvyaznikov/cuda-compute-capability-what-it-is-and-why-it-matters-for-ml-engineers-1mhg</guid>
      <description>&lt;p&gt;If you've ever seen an error like "CUDA error: no kernel image is available for execution on the device" or "minimum required Cuda capability is 3.5" — you've run into &lt;strong&gt;Compute Capability&lt;/strong&gt; issues. Here's everything you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Compute Capability?
&lt;/h2&gt;

&lt;p&gt;CUDA Compute Capability (CC) is a &lt;strong&gt;version number&lt;/strong&gt; assigned to every NVIDIA GPU that identifies its &lt;strong&gt;architecture and supported feature set&lt;/strong&gt;. It's NOT a performance score.&lt;/p&gt;

&lt;p&gt;Format: &lt;code&gt;Major.Minor&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Major&lt;/strong&gt; = GPU architecture generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minor&lt;/strong&gt; = incremental improvements within that generation
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GeForce GTX 1080  → CC 6.1 (Pascal)
GeForce RTX 3090  → CC 8.6 (Ampere)
GeForce RTX 4090  → CC 8.9 (Ada Lovelace)
H100              → CC 9.0 (Hopper)
RTX 5090          → CC 12.0 (Blackwell)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Framework compatibility
&lt;/h3&gt;

&lt;p&gt;Modern ML frameworks have &lt;strong&gt;minimum CC requirements&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Minimum CC&lt;/th&gt;
&lt;th&gt;What's excluded&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PyTorch 2.x&lt;/td&gt;
&lt;td&gt;3.7&lt;/td&gt;
&lt;td&gt;Kepler below 3.7 (K40 and older)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TensorFlow 2.15+&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;Kepler and older&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JAX latest&lt;/td&gt;
&lt;td&gt;5.2&lt;/td&gt;
&lt;td&gt;Kepler, first-gen Maxwell (5.0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flash Attention 2&lt;/td&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;td&gt;Everything before Ampere&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If your GPU's CC is below the minimum, the framework &lt;strong&gt;will not use it&lt;/strong&gt; — you'll silently fall back to CPU or get a hard error.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Feature availability
&lt;/h3&gt;

&lt;p&gt;Each CC level unlocks hardware features:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CC&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Key ML Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5.0-5.2&lt;/td&gt;
&lt;td&gt;Maxwell&lt;/td&gt;
&lt;td&gt;Basic CUDA, cuDNN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6.0-6.1&lt;/td&gt;
&lt;td&gt;Pascal&lt;/td&gt;
&lt;td&gt;FP16 compute, unified memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7.0&lt;/td&gt;
&lt;td&gt;Volta&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Tensor Cores&lt;/strong&gt; (1st gen), WMMA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7.5&lt;/td&gt;
&lt;td&gt;Turing&lt;/td&gt;
&lt;td&gt;INT8/INT4 Tensor Cores, mixed precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;td&gt;Ampere&lt;/td&gt;
&lt;td&gt;3rd gen Tensor Cores, BF16, TF32, sparsity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8.6&lt;/td&gt;
&lt;td&gt;Ampere (consumer)&lt;/td&gt;
&lt;td&gt;Same features, fewer SMs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8.9&lt;/td&gt;
&lt;td&gt;Ada Lovelace&lt;/td&gt;
&lt;td&gt;FP8, 4th gen Tensor Cores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;Hopper&lt;/td&gt;
&lt;td&gt;Transformer Engine, FP8 matmul, DPX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10.0 / 12.0&lt;/td&gt;
&lt;td&gt;Blackwell&lt;/td&gt;
&lt;td&gt;5th gen Tensor Cores, FP4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Compilation targets
&lt;/h3&gt;

&lt;p&gt;When you compile CUDA code (or when PyTorch ships prebuilt binaries), it targets specific CC versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Compile for multiple architectures&lt;/span&gt;
nvcc &lt;span class="nt"&gt;-gencode&lt;/span&gt; &lt;span class="nb"&gt;arch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;compute_80,code&lt;span class="o"&gt;=&lt;/span&gt;sm_80 &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-gencode&lt;/span&gt; &lt;span class="nb"&gt;arch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;compute_86,code&lt;span class="o"&gt;=&lt;/span&gt;sm_86 &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-gencode&lt;/span&gt; &lt;span class="nb"&gt;arch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;compute_89,code&lt;span class="o"&gt;=&lt;/span&gt;sm_89 &lt;span class="se"&gt;\&lt;/span&gt;
     my_kernel.cu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PyTorch wheels on PyPI typically include CC 5.0, 6.0, 7.0, 7.5, 8.0, 8.6, 8.9, 9.0. If your GPU isn't covered, you may need to build from source.&lt;/p&gt;
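
&lt;p&gt;You can check whether your installed wheel actually covers your GPU; a short sketch using PyTorch's public API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

major, minor = torch.cuda.get_device_capability()
my_arch = f"sm_{major}{minor}"          # e.g. "sm_86" on an RTX 3090

compiled = torch.cuda.get_arch_list()   # architectures baked into this wheel
print(f"This wheel targets: {compiled}")
if my_arch not in compiled:
    print(f"{my_arch} not covered: expect kernel errors or a PTX JIT fallback")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;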

&lt;h2&gt;
  
  
  How to Check Your GPU's CC
&lt;/h2&gt;

&lt;h3&gt;
  
  
  nvidia-smi (easiest, no CUDA toolkit needed)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;compute_cap &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader
&lt;span class="c"&gt;# Output: 8.6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python (PyTorch)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="n"&gt;major&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_capability&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compute Capability: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;major&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;minor&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python (TensorFlow)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
&lt;span class="n"&gt;gpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_physical_devices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GPU&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;gpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;details&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compute_capability&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  C++ (CUDA Runtime)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;cudaDeviceProp&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;cudaGetDeviceProperties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CC: %d.%d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;major&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;minor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lookup table
&lt;/h3&gt;

&lt;p&gt;Don't have the GPU installed yet? The &lt;a href="https://gpuark.com/en/cuda-compute-capability/" rel="noopener noreferrer"&gt;CUDA Compute Capability table on gpuark.com&lt;/a&gt; covers every NVIDIA GPU from Kepler to Blackwell.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common CC-Related Errors and Fixes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  "no kernel image is available for execution on the device"
&lt;/h3&gt;

&lt;p&gt;Your PyTorch/TensorFlow binary wasn't compiled for your GPU's CC. Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install PyTorch with the right CUDA version&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;torch &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu124
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or build from source with your CC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;TORCH_CUDA_ARCH_LIST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"8.6"&lt;/span&gt; pip &lt;span class="nb"&gt;install &lt;/span&gt;torch &lt;span class="nt"&gt;--no-binary&lt;/span&gt; torch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  "minimum required Cuda capability is X.X"
&lt;/h3&gt;

&lt;p&gt;Your GPU is too old for the framework version. Options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use an older framework version&lt;/li&gt;
&lt;li&gt;Upgrade your GPU&lt;/li&gt;
&lt;li&gt;Use CPU mode: &lt;code&gt;CUDA_VISIBLE_DEVICES="" python train.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Flash Attention requires CC ≥ 8.0
&lt;/h3&gt;

&lt;p&gt;Flash Attention 2 only works on Ampere (RTX 3000) and newer. For older GPUs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Use xformers instead (supports CC ≥ 6.0)
&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;xformers&lt;/span&gt;
&lt;span class="c1"&gt;# Or use PyTorch's built-in SDPA
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scaled_dot_product_attention&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Practical Advice for GPU Shopping
&lt;/h2&gt;

&lt;p&gt;When buying a GPU for ML:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Minimum CC 7.5&lt;/strong&gt; (Turing) for mixed precision training — gives you Tensor Cores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 8.0+&lt;/strong&gt; (Ampere) strongly recommended — BF16, Flash Attention, much better ML performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 8.9&lt;/strong&gt; (Ada) for bleeding-edge features like FP8 quantization-aware training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM matters more than CC&lt;/strong&gt; in most cases — a 3090 (CC 8.6, 24GB) beats a 4070 (CC 8.9, 12GB) for LLMs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;CC tells you &lt;em&gt;what features your GPU supports&lt;/em&gt;. VRAM tells you &lt;em&gt;how big a model fits&lt;/em&gt;. Both matter, but for LLM inference, VRAM is usually the bottleneck.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What GPU are you running your ML workloads on? Have you hit CC compatibility issues? Let me know in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>nvidia</category>
    </item>
    <item>
      <title>How Much VRAM Do You Actually Need to Run LLMs Locally?</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 03:44:13 +0000</pubDate>
      <link>https://forem.com/maxvyaznikov/how-much-vram-do-you-actually-need-to-run-llms-locally-2604</link>
      <guid>https://forem.com/maxvyaznikov/how-much-vram-do-you-actually-need-to-run-llms-locally-2604</guid>
      <description>&lt;p&gt;Running large language models locally has become increasingly practical — but figuring out exactly how much VRAM you need can be confusing. Here's a concrete breakdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Simple Formula
&lt;/h2&gt;

&lt;p&gt;For &lt;strong&gt;inference&lt;/strong&gt; (running a model, not training):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VRAM ≈ Parameters × Bytes per Weight + KV Cache + Overhead
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where bytes per weight depends on quantization:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Bytes/Param&lt;/th&gt;
&lt;th&gt;Example: 7B model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FP32&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;td&gt;28 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP16/BF16&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;td&gt;14 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT8 (Q8)&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;7 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT4 (Q4_K_M)&lt;/td&gt;
&lt;td&gt;0.56&lt;/td&gt;
&lt;td&gt;~4 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT4 (Q4_0)&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;3.5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Add &lt;strong&gt;10-20% overhead&lt;/strong&gt; for KV cache (more for longer contexts) and runtime buffers.&lt;/p&gt;
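
&lt;p&gt;Putting the formula and the table together, a small estimator; the bytes-per-param values come from the table above and the overhead factor is the same hedged 10-20% guess:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Inference VRAM estimate per the formula above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "q8": 1.0, "q4_k_m": 0.56, "q4_0": 0.5}

def inference_vram_gb(params_billions, precision, overhead=0.15):
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1 + overhead)  # +10-20% for KV cache and buffers

print(f"{inference_vram_gb(7, 'q4_k_m'):.1f} GB")   # 7B at Q4_K_M → ~4.5 GB
print(f"{inference_vram_gb(70, 'q4_k_m'):.1f} GB")  # 70B at Q4_K_M → ~45 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;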

&lt;h2&gt;
  
  
  Practical VRAM Requirements by Model
&lt;/h2&gt;

&lt;p&gt;Here's what you can actually run on common GPUs:&lt;/p&gt;

&lt;h3&gt;
  
  
  8 GB VRAM (RTX 4060, RTX 3070)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 8B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;Qwen2.5 7B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;Mistral 7B at Q5_K_M ✅&lt;/li&gt;
&lt;li&gt;Phi-3.5 Mini (3.8B) at Q8 ✅&lt;/li&gt;
&lt;li&gt;13B models at Q4 ⚠️ (tight, short context only)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  12 GB VRAM (RTX 4070, RTX 3060 12GB)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;13B models at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;Llama 3.1 8B at Q8 ✅&lt;/li&gt;
&lt;li&gt;CodeQwen 14B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;20B models at Q4 ⚠️&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  16 GB VRAM (RTX 4080, RTX 5070 Ti)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Mistral Small 24B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;Qwen2.5-Coder 14B at Q6_K ✅&lt;/li&gt;
&lt;li&gt;20B models at Q5-Q6 ✅&lt;/li&gt;
&lt;li&gt;34B models at Q4 ⚠️&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  24 GB VRAM (RTX 3090, RTX 4090)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 70B at Q4_K_M ⚠️ (with partial offload)&lt;/li&gt;
&lt;li&gt;34B models at Q5-Q6 ✅&lt;/li&gt;
&lt;li&gt;Qwen2.5 32B at Q5_K_M ✅&lt;/li&gt;
&lt;li&gt;DeepSeek-Coder-V2-Lite 16B at FP16 ✅&lt;/li&gt;
&lt;li&gt;Mistral Small 24B at Q8 ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  48 GB VRAM (2× RTX 3090, A6000)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 70B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;DeepSeek V3 671B — not enough, even at Q2&lt;/li&gt;
&lt;li&gt;Mixtral 8x22B at Q4 ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Quantization Sweet Spot
&lt;/h2&gt;

&lt;p&gt;Q4_K_M is the most popular quantization for local inference and for good reason:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality:&lt;/strong&gt; ~1-2% degradation vs FP16 on most benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size:&lt;/strong&gt; ~56% of the Q8 size (0.56 vs 1.0 bytes per param)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; Fastest on most consumer GPUs (memory-bandwidth bound)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Going lower (Q3, Q2) introduces noticeable quality degradation, especially on reasoning tasks. Going higher (Q6, Q8) gives marginal quality improvement but costs significantly more VRAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  What About Training?
&lt;/h2&gt;

&lt;p&gt;Training needs &lt;strong&gt;much more&lt;/strong&gt; memory than inference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Training VRAM ≈ Model weights + Gradients + Optimizer states + Activations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For full fine-tuning with Adam optimizer at FP32:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights: 4 bytes/param&lt;/li&gt;
&lt;li&gt;Gradients: 4 bytes/param&lt;/li&gt;
&lt;li&gt;Adam states: 8 bytes/param&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total: ~16 bytes/param&lt;/strong&gt; (before activations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 7B model needs &lt;strong&gt;~112 GB&lt;/strong&gt; for full FP32 training. That's why techniques like &lt;strong&gt;LoRA&lt;/strong&gt; (which only trains ~1-2% of parameters) and &lt;strong&gt;QLoRA&lt;/strong&gt; (quantized base + LoRA) are so popular:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;QLoRA fine-tuning&lt;/strong&gt; of 7B: ~6-8 GB VRAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QLoRA fine-tuning&lt;/strong&gt; of 13B: ~10-12 GB VRAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QLoRA fine-tuning&lt;/strong&gt; of 70B: ~40-48 GB VRAM&lt;/li&gt;
&lt;/ul&gt;
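
&lt;p&gt;The same arithmetic as a rough sketch: the full fine-tune number follows the 16 bytes/param breakdown above, while the QLoRA figure uses an assumed ~0.6 bytes/param for the quantized base plus a small flat allowance for adapters and optimizer state:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Full fine-tune: weights (4) + gradients (4) + Adam states (8) = 16 bytes/param.
# Activation memory is omitted; it depends on batch size and sequence length.
def full_finetune_gb(params_billions):
    return params_billions * 16

# QLoRA: 4-bit base weights plus a small trainable adapter (assumption: ~0.6
# bytes/param for the base, ~2 GB flat for adapters and optimizer state).
def qlora_rough_gb(params_billions):
    return params_billions * 0.6 + 2

print(full_finetune_gb(7))       # 112 GB, matching the figure above
print(round(qlora_rough_gb(7)))  # ~6 GB, in the 6-8 GB range above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;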

&lt;h2&gt;
  
  
  KV Cache: The Hidden VRAM Consumer
&lt;/h2&gt;

&lt;p&gt;When generating long texts, the KV cache grows with context length:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KV cache ≈ 2 × num_layers × hidden_dim × context_length × bytes_per_element
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Llama 3.1 8B at FP16 with 8K context: ~1 GB&lt;br&gt;
For Llama 3.1 8B at FP16 with 128K context: ~16 GB&lt;/p&gt;

&lt;p&gt;This is why you might load a model fine but run out of memory during long conversations.&lt;/p&gt;
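
&lt;p&gt;Worked numbers for Llama 3.1 8B (32 layers, 8 KV heads via grouped-query attention, head dim 128, FP16 cache) reproduce the figures above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# KV cache size per the formula above. Llama 3.1 8B uses grouped-query
# attention, so the KV width is 8 heads × 128 dims, not the 4096 hidden dim.
def kv_cache_gb(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    elems = 2 * num_layers * num_kv_heads * head_dim * context_len  # K and V
    return elems * bytes_per_elem / 1024**3

print(f"{kv_cache_gb(32, 8, 128, 8_192):.1f} GB")    # ~1 GB at 8K context
print(f"{kv_cache_gb(32, 8, 128, 131_072):.1f} GB")  # ~16 GB at 128K context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;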

&lt;h2&gt;
  
  
  Tools for Estimating
&lt;/h2&gt;

&lt;p&gt;Rather than doing this math by hand every time, there's a &lt;a href="https://gpuark.com/en/vram-calculator/" rel="noopener noreferrer"&gt;VRAM calculator&lt;/a&gt; that estimates memory requirements — plug in the model size, quantization level, and context length to see if it fits your GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Budget&lt;/th&gt;
&lt;th&gt;Best GPU&lt;/th&gt;
&lt;th&gt;What You Can Run&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;~$300&lt;/td&gt;
&lt;td&gt;RTX 4060 8GB&lt;/td&gt;
&lt;td&gt;7-8B models at Q4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~$400&lt;/td&gt;
&lt;td&gt;RTX 4060 Ti 16GB&lt;/td&gt;
&lt;td&gt;Up to 24B at Q4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~$600&lt;/td&gt;
&lt;td&gt;Used RTX 3090 24GB&lt;/td&gt;
&lt;td&gt;Up to 34B at Q5, 70B at Q3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~$1200&lt;/td&gt;
&lt;td&gt;2× Used RTX 3090&lt;/td&gt;
&lt;td&gt;70B at Q4, most models comfortably&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~$1800&lt;/td&gt;
&lt;td&gt;RTX 4090 24GB&lt;/td&gt;
&lt;td&gt;Same as 3090 but ~2× faster training&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most cost-effective option for serious local LLM use in 2025-2026 is still a &lt;strong&gt;used RTX 3090&lt;/strong&gt; — 24 GB of VRAM at a fraction of the 4090 price.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your local LLM setup? Drop a comment with your GPU and favorite model!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
    </item>
  </channel>
</rss>
