<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Max Vyaznikov</title>
    <description>The latest articles on Forem by Max Vyaznikov (@maxvyaznikov).</description>
    <link>https://forem.com/maxvyaznikov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3819321%2F53229506-3d43-4511-a7f6-bb2f58e84931.png</url>
      <title>Forem: Max Vyaznikov</title>
      <link>https://forem.com/maxvyaznikov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/maxvyaznikov"/>
    <language>en</language>
    <item>
      <title>Running DeepSeek, Llama 3, and Qwen Locally: Complete GPU Requirements Guide</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:09:12 +0000</pubDate>
      <link>https://forem.com/maxvyaznikov/running-deepseek-llama-3-and-qwen-locally-complete-gpu-requirements-guide-6fd</link>
      <guid>https://forem.com/maxvyaznikov/running-deepseek-llama-3-and-qwen-locally-complete-gpu-requirements-guide-6fd</guid>
      <description>&lt;p&gt;Want to run the latest open-source LLMs on your own hardware? Here's exactly what you need for each popular model family.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference: VRAM Requirements
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;FP16&lt;/th&gt;
&lt;th&gt;Q8&lt;/th&gt;
&lt;th&gt;Q4_K_M&lt;/th&gt;
&lt;th&gt;Min GPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 3.1 8B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;8.5 GB&lt;/td&gt;
&lt;td&gt;5 GB&lt;/td&gt;
&lt;td&gt;RTX 3060 12GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 3.1 70B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;140 GB&lt;/td&gt;
&lt;td&gt;70 GB&lt;/td&gt;
&lt;td&gt;40 GB&lt;/td&gt;
&lt;td&gt;2× RTX 3090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 3.1 405B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;810 GB&lt;/td&gt;
&lt;td&gt;405 GB&lt;/td&gt;
&lt;td&gt;228 GB&lt;/td&gt;
&lt;td&gt;8× A100 80GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5 7B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14 GB&lt;/td&gt;
&lt;td&gt;7.5 GB&lt;/td&gt;
&lt;td&gt;4.5 GB&lt;/td&gt;
&lt;td&gt;RTX 3060 8GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5 14B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;28 GB&lt;/td&gt;
&lt;td&gt;14 GB&lt;/td&gt;
&lt;td&gt;8.5 GB&lt;/td&gt;
&lt;td&gt;RTX 4060 Ti 16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5 32B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 GB&lt;/td&gt;
&lt;td&gt;32 GB&lt;/td&gt;
&lt;td&gt;18 GB&lt;/td&gt;
&lt;td&gt;RTX 3090 24GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5 72B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;144 GB&lt;/td&gt;
&lt;td&gt;72 GB&lt;/td&gt;
&lt;td&gt;41 GB&lt;/td&gt;
&lt;td&gt;2× RTX 3090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mistral Small 24B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;48 GB&lt;/td&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;14 GB&lt;/td&gt;
&lt;td&gt;RTX 4080 16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mistral Large 123B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;246 GB&lt;/td&gt;
&lt;td&gt;123 GB&lt;/td&gt;
&lt;td&gt;69 GB&lt;/td&gt;
&lt;td&gt;4× RTX 3090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V3 671B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,340 GB&lt;/td&gt;
&lt;td&gt;670 GB&lt;/td&gt;
&lt;td&gt;376 GB&lt;/td&gt;
&lt;td&gt;5× A100 80GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek R1 671B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,340 GB&lt;/td&gt;
&lt;td&gt;670 GB&lt;/td&gt;
&lt;td&gt;376 GB&lt;/td&gt;
&lt;td&gt;5× A100 80GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phi-3.5 Mini 3.8B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.6 GB&lt;/td&gt;
&lt;td&gt;4 GB&lt;/td&gt;
&lt;td&gt;2.5 GB&lt;/td&gt;
&lt;td&gt;RTX 3060 8GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma 2 27B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;54 GB&lt;/td&gt;
&lt;td&gt;27 GB&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;RTX 4080 16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For any model, you can calculate exact VRAM needs at the &lt;a href="https://gpuark.com/en/vram-calculator/" rel="noopener noreferrer"&gt;VRAM calculator on gpuark.com&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model-by-Model Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Llama 3.1 — The All-Rounder
&lt;/h3&gt;

&lt;p&gt;Meta's Llama 3.1 comes in 8B, 70B, and 405B sizes. The 8B is perfect for getting started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Run Llama 3.1 8B (auto-downloads ~4.7GB)&lt;/span&gt;
ollama run llama3.1

&lt;span class="c"&gt;# Or the 70B if you have the VRAM&lt;/span&gt;
ollama run llama3.1:70b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
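
&lt;p&gt;Ollama also exposes a local REST API on port 11434, so you can script against the model. A minimal sketch using only the Python standard library, assuming you've pulled &lt;code&gt;llama3.1&lt;/code&gt; as above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Query a local Ollama server (default endpoint: http://localhost:11434).
import json
import urllib.request

payload = json.dumps({
    "model": "llama3.1",   # any model you've pulled with `ollama run`
    "prompt": "Explain the KV cache in one sentence.",
    "stream": False,       # one JSON object instead of a token stream
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;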



&lt;p&gt;&lt;strong&gt;8B at Q4_K_M&lt;/strong&gt;: Fits on any 8GB+ GPU. Great for coding, summarization, general chat. Not competitive with GPT-4 on complex reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;70B at Q4_K_M&lt;/strong&gt;: This is where Llama 3.1 really shines — competitive with GPT-4 on many benchmarks. Needs ~40GB VRAM, so two 3090s or a single A100 80GB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;405B&lt;/strong&gt;: Research-grade. Even at Q4 the weights alone are ~228 GB, so realistically 4+ A100 80GB (the table assumes 8× for headroom). Not practical for most individuals.&lt;/p&gt;

&lt;h3&gt;
  
  
  DeepSeek V3 / R1 — The MoE Giants
&lt;/h3&gt;

&lt;p&gt;DeepSeek V3 (671B) uses &lt;strong&gt;Mixture of Experts&lt;/strong&gt; — only ~37B parameters active per token, but all 671B must fit in memory. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At Q4_K_M: ~376 GB VRAM minimum&lt;/li&gt;
&lt;li&gt;Realistic minimum: &lt;strong&gt;5× A100 80GB&lt;/strong&gt; (400 GB total)&lt;/li&gt;
&lt;li&gt;On consumer hardware: not feasible for the full model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;But&lt;/strong&gt;: distilled DeepSeek R1 versions (Qwen- and Llama-based) exist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-7B&lt;/strong&gt;: 4.5 GB at Q4 — runs on any modern GPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-14B&lt;/strong&gt;: 8.5 GB at Q4 — RTX 4060 Ti&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-32B&lt;/strong&gt;: 18 GB at Q4 — RTX 3090&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-70B&lt;/strong&gt;: 40 GB at Q4 — 2× RTX 3090&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The distilled 32B is arguably the best reasoning model you can run on a single consumer GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qwen2.5 — Best for Coding
&lt;/h3&gt;

&lt;p&gt;Alibaba's Qwen2.5 series excels at code generation. The -Coder variants are particularly strong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Qwen2.5-Coder-14B — best coding model for 16GB GPUs&lt;/span&gt;
ollama run qwen2.5-coder:14b

&lt;span class="c"&gt;# Qwen2.5-32B — strong general model for 24GB GPUs&lt;/span&gt;
ollama run qwen2.5:32b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Qwen2.5-Coder-14B&lt;/strong&gt; at Q4_K_M (~8.5 GB) is the sweet spot for developer use. It handles Python, JavaScript, Rust, Go with impressive accuracy and fits on a 12GB card.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistral — Efficient and Fast
&lt;/h3&gt;

&lt;p&gt;Mistral models are known for good quality-to-size ratio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Mistral Small 24B — best quality under 16GB&lt;/span&gt;
ollama run mistral-small

&lt;span class="c"&gt;# Mistral Large 123B — needs serious hardware&lt;/span&gt;
ollama run mistral-large
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mistral Small 24B&lt;/strong&gt; at Q4_K_M (~14 GB) is the best general-purpose model for 16GB GPUs. Solid reasoning, good instruction following, fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Setup Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Beginner Setup (~$400)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: RTX 4060 Ti 16GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Qwen2.5-14B, Mistral-Small-24B (Q4), Llama 3.1 8B (Q8)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt;: Ollama + Open WebUI&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Enthusiast Setup (~$700)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: Used RTX 3090 24GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Qwen2.5-32B, DeepSeek-R1-32B, any 34B model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt;: Ollama or ExLlamaV2 + TabbyAPI&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Power User Setup (~$1,400)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPUs&lt;/strong&gt;: 2× Used RTX 3090 (48GB total)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Llama 3.1 70B, Qwen2.5-72B, Mixtral 8x22B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt;: llama.cpp with &lt;code&gt;--tensor-split 24,24&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prosumer Setup (~$2,000)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: RTX 4090 + used RTX 3090&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Same as above, faster inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt;: ExLlamaV2 with tensor parallelism&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance Tips
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Use the right quantization
&lt;/h3&gt;

&lt;p&gt;Q4_K_M for most models. Go Q5 or Q6 only if VRAM allows — the quality gain is marginal but measurable on reasoning.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Optimize KV cache
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# llama.cpp: limit context to what you need&lt;/span&gt;
llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; model.gguf &lt;span class="nt"&gt;-c&lt;/span&gt; 4096  &lt;span class="c"&gt;# instead of the model's full trained context&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Halving context length saves significant VRAM.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Flash Attention
&lt;/h3&gt;

&lt;p&gt;Requires CC 8.0+ (RTX 30-series or newer). Enabled by default in most frameworks. Reduces attention memory for long contexts from O(n²) to O(n).&lt;/p&gt;

&lt;h3&gt;
  
  
  4. CPU offloading for oversized models
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# llama.cpp: offload only some layers to GPU&lt;/span&gt;
llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; model.gguf &lt;span class="nt"&gt;-ngl&lt;/span&gt; 20  &lt;span class="c"&gt;# 20 layers on GPU, rest on CPU&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Slower but lets you run models that don't fully fit. Expect ~2-5 tok/s for CPU layers vs ~30+ for GPU.&lt;/p&gt;
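
&lt;p&gt;If you're unsure what to pass to &lt;code&gt;-ngl&lt;/code&gt;, here's a back-of-the-envelope sketch. It assumes layers are roughly uniform in size and reserves a couple of GB for KV cache and buffers, so treat the result as a starting point:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough estimate of how many layers fit on the GPU for llama.cpp's -ngl.
def estimate_ngl(model_gb, num_layers, vram_gb, kv_and_overhead_gb=2.0):
    layer_gb = model_gb / num_layers        # assumes roughly uniform layers
    usable = vram_gb - kv_and_overhead_gb   # leave room for KV cache, buffers
    return max(0, min(num_layers, int(usable / layer_gb)))

# Example: Llama 3.1 70B at Q4_K_M (~40 GB, 80 layers) on a 24 GB card
print(estimate_ngl(model_gb=40, num_layers=80, vram_gb=24))  # → 44
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;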

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The local LLM ecosystem has matured enormously. For most developers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with Ollama&lt;/strong&gt; — zero-friction setup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Get at least 16GB VRAM&lt;/strong&gt; — opens up 24B models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;24GB (RTX 3090) is the sweet spot&lt;/strong&gt; — runs everything up to 34B comfortably&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two GPUs if you need 70B+&lt;/strong&gt; — pipeline parallelism just works&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The quality gap between local 32B models and cloud GPT-4 has narrowed significantly, especially for coding and domain-specific tasks. For many workflows, local is now good enough.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your local LLM setup? Drop your GPU + favorite model in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A Developer's Guide to Choosing a GPU for Machine Learning in 2025-2026</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:04:11 +0000</pubDate>
      <link>https://forem.com/maxvyaznikov/a-developers-guide-to-choosing-a-gpu-for-machine-learning-in-2025-2026-5d4f</link>
      <guid>https://forem.com/maxvyaznikov/a-developers-guide-to-choosing-a-gpu-for-machine-learning-in-2025-2026-5d4f</guid>
      <description>&lt;p&gt;Choosing the right GPU for ML is confusing. Marketing specs don't tell you what matters for training and inference. Here's what actually counts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Specs That Matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. VRAM (Most Important)
&lt;/h3&gt;

&lt;p&gt;VRAM determines &lt;strong&gt;what models you can run&lt;/strong&gt;. No amount of compute power helps if your model doesn't fit in memory.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;What Fits (Inference)&lt;/th&gt;
&lt;th&gt;What Fits (Training)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;7B at Q4&lt;/td&gt;
&lt;td&gt;7B QLoRA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12 GB&lt;/td&gt;
&lt;td&gt;13B at Q4&lt;/td&gt;
&lt;td&gt;7B QLoRA comfortably&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;24B at Q4&lt;/td&gt;
&lt;td&gt;13B QLoRA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;34B at Q5&lt;/td&gt;
&lt;td&gt;13B full fine-tune, 34B QLoRA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;48 GB&lt;/td&gt;
&lt;td&gt;70B at Q4&lt;/td&gt;
&lt;td&gt;34B full fine-tune&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;80 GB&lt;/td&gt;
&lt;td&gt;70B at FP16&lt;/td&gt;
&lt;td&gt;70B QLoRA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb&lt;/strong&gt;: buy the most VRAM you can afford. You can't upgrade VRAM later.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Memory Bandwidth
&lt;/h3&gt;

&lt;p&gt;For LLM inference, throughput is limited by how fast you can read model weights from VRAM. This is the &lt;strong&gt;memory bandwidth&lt;/strong&gt; spec.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Bandwidth&lt;/th&gt;
&lt;th&gt;Llama 8B Q4 tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4060&lt;/td&gt;
&lt;td&gt;272 GB/s&lt;/td&gt;
&lt;td&gt;~35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4070&lt;/td&gt;
&lt;td&gt;504 GB/s&lt;/td&gt;
&lt;td&gt;~60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090&lt;/td&gt;
&lt;td&gt;936 GB/s&lt;/td&gt;
&lt;td&gt;~85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;1,008 GB/s&lt;/td&gt;
&lt;td&gt;~105&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A100 80GB&lt;/td&gt;
&lt;td&gt;2,039 GB/s&lt;/td&gt;
&lt;td&gt;~180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H100&lt;/td&gt;
&lt;td&gt;3,350 GB/s&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Higher bandwidth = faster token generation. This is why a 3090 feels faster for LLMs than a 4070 Ti despite being older.&lt;/p&gt;
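
&lt;p&gt;You can sanity-check these numbers with a simple roofline estimate: generating one token reads every weight once, so bandwidth divided by model size gives a theoretical ceiling. Measured throughput typically lands at roughly half to two-thirds of that ceiling:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Roofline ceiling: tokens/sec = bandwidth / bytes read per token.
def ceiling_tok_s(bandwidth_gb_s, model_size_gb):
    return bandwidth_gb_s / model_size_gb

# Llama 3.1 8B at Q4_K_M is ~5 GB on disk:
for gpu, bw in [("RTX 4060", 272), ("RTX 3090", 936), ("RTX 4090", 1008)]:
    print(f"{gpu}: ceiling ~{ceiling_tok_s(bw, 5.0):.0f} tok/s")
# RTX 3090: ceiling ~187 tok/s; the measured ~85 above is ~45% of that
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;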

&lt;h3&gt;
  
  
  3. Tensor Cores
&lt;/h3&gt;

&lt;p&gt;Tensor Cores accelerate matrix multiplication — the core operation in neural networks. They matter most for &lt;strong&gt;training&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;CC&lt;/th&gt;
&lt;th&gt;Supported Precisions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st (Volta)&lt;/td&gt;
&lt;td&gt;7.0&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2nd (Turing)&lt;/td&gt;
&lt;td&gt;7.5&lt;/td&gt;
&lt;td&gt;FP16, INT8, INT4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3rd (Ampere)&lt;/td&gt;
&lt;td&gt;8.x&lt;/td&gt;
&lt;td&gt;FP16, BF16, TF32, INT8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4th (Ada)&lt;/td&gt;
&lt;td&gt;8.9&lt;/td&gt;
&lt;td&gt;FP16, BF16, TF32, FP8, INT8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5th (Blackwell)&lt;/td&gt;
&lt;td&gt;10.0 (B200) / 12.0 (RTX 50)&lt;/td&gt;
&lt;td&gt;All above + FP4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;BF16 support (Ampere+)&lt;/strong&gt; is especially important — it's the default training precision for modern models and avoids the NaN issues that FP16 can cause.&lt;/p&gt;
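
&lt;p&gt;You can verify this on your own card straight from PyTorch; a quick check using the public API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

# True on Ampere (CC 8.0+) and newer, False on Turing/Pascal cards
print(torch.cuda.is_bf16_supported())

# The underlying reason: compute capability major version 8 or higher
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;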

&lt;h3&gt;
  
  
  4. CUDA Compute Capability
&lt;/h3&gt;

&lt;p&gt;CC determines what frameworks and features your GPU supports. As of 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimum CC 5.0&lt;/strong&gt; for PyTorch/TensorFlow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 7.0+&lt;/strong&gt; for Tensor Cores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 8.0+&lt;/strong&gt; for Flash Attention, BF16&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 8.9&lt;/strong&gt; for FP8&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can look up any GPU's compute capability at &lt;a href="https://gpuark.com/en/cuda-compute-capability/" rel="noopener noreferrer"&gt;gpuark.com&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Recommendations by Budget
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Under $400: RTX 4060 Ti 16GB
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;16 GB VRAM — runs 24B models at Q4&lt;/li&gt;
&lt;li&gt;CC 8.9 (Ada Lovelace) — all modern features&lt;/li&gt;
&lt;li&gt;165W TDP — low power&lt;/li&gt;
&lt;li&gt;Limitation: 128-bit bus, 288 GB/s bandwidth (slow for LLMs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  $500-700: Used RTX 3090
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;24 GB VRAM&lt;/strong&gt; — the sweet spot&lt;/li&gt;
&lt;li&gt;CC 8.6 — BF16, Flash Attention, everything you need&lt;/li&gt;
&lt;li&gt;936 GB/s bandwidth — fast LLM inference&lt;/li&gt;
&lt;li&gt;350W TDP — needs a beefy PSU&lt;/li&gt;
&lt;li&gt;Best value in ML GPUs right now&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  $1,500-1,800: RTX 4090
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;24 GB VRAM (same as 3090)&lt;/li&gt;
&lt;li&gt;2× training throughput vs 3090&lt;/li&gt;
&lt;li&gt;Better power efficiency&lt;/li&gt;
&lt;li&gt;CC 8.9 — FP8 support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  $3,000-5,000: Used A100 40GB/80GB
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Professional GPU with ECC memory&lt;/li&gt;
&lt;li&gt;80GB version fits 70B at FP16&lt;/li&gt;
&lt;li&gt;2 TB/s bandwidth&lt;/li&gt;
&lt;li&gt;NVLink support for multi-GPU&lt;/li&gt;
&lt;li&gt;Best for research labs and startups&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  "More CUDA cores = better for ML"
&lt;/h3&gt;

&lt;p&gt;Not always. Compare a 4070 (5,888 cores) with a 3090 (10,496 cores): the 3090 is better for ML despite being a generation older, because VRAM and bandwidth matter more.&lt;/p&gt;

&lt;h3&gt;
  
  
  "I need the latest generation"
&lt;/h3&gt;

&lt;p&gt;The RTX 3090 (2020) is still one of the best ML GPUs in 2026. Unless you specifically need FP8 or newer features, older high-end cards often beat newer mid-range ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Gaming benchmarks predict ML performance"
&lt;/h3&gt;

&lt;p&gt;Gaming uses completely different GPU capabilities. A GPU that's 20% faster in games might be 50% slower for training if it has less VRAM or lower bandwidth.&lt;/p&gt;

&lt;h3&gt;
  
  
  "I'll just use the cloud"
&lt;/h3&gt;

&lt;p&gt;Cloud GPUs cost $1-4/hour. If you train regularly, a $700 used 3090 pays for itself in ~3-6 months compared to cloud rentals.&lt;/p&gt;
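
&lt;p&gt;The break-even arithmetic, as a sketch; the hourly rate and monthly usage are assumptions you should adjust to your situation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Months until a purchased GPU beats renting in the cloud.
def breakeven_months(gpu_price, cloud_rate_per_hour, hours_per_month):
    return gpu_price / (cloud_rate_per_hour * hours_per_month)

# $700 used 3090 vs an assumed $1.50/hr cloud GPU at ~100 hrs/month:
print(f"{breakeven_months(700, 1.50, 100):.1f} months")  # 4.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;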

&lt;h2&gt;
  
  
  Quick Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Best Choice&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max VRAM per $&lt;/td&gt;
&lt;td&gt;Used RTX 3090&lt;/td&gt;
&lt;td&gt;24GB at ~$650&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training speed&lt;/td&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;2× faster than 3090&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference tok/s&lt;/td&gt;
&lt;td&gt;RTX 3090 or 4090&lt;/td&gt;
&lt;td&gt;Best bandwidth at consumer price&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM 70B+&lt;/td&gt;
&lt;td&gt;2× Used 3090&lt;/td&gt;
&lt;td&gt;48GB for ~$1,300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Professional&lt;/td&gt;
&lt;td&gt;A100 80GB&lt;/td&gt;
&lt;td&gt;80GB, NVLink, ECC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Building an ML rig? Drop your budget and use case in the comments — happy to help pick components!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>RTX 4090 vs RTX 3090 for AI/ML: Is the Upgrade Worth It?</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:03:04 +0000</pubDate>
      <link>https://forem.com/maxvyaznikov/rtx-4090-vs-rtx-3090-for-aiml-is-the-upgrade-worth-it-c68</link>
      <guid>https://forem.com/maxvyaznikov/rtx-4090-vs-rtx-3090-for-aiml-is-the-upgrade-worth-it-c68</guid>
      <description>&lt;p&gt;The RTX 3090 and RTX 4090 are the two most popular consumer GPUs for AI/ML work. Both have 24GB VRAM, but the price gap is massive. Let's break down when each one makes sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Specs Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;RTX 3090&lt;/th&gt;
&lt;th&gt;RTX 4090&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Ampere (CC 8.6)&lt;/td&gt;
&lt;td&gt;Ada Lovelace (CC 8.9)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VRAM&lt;/td&gt;
&lt;td&gt;24 GB GDDR6X&lt;/td&gt;
&lt;td&gt;24 GB GDDR6X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory Bandwidth&lt;/td&gt;
&lt;td&gt;936 GB/s&lt;/td&gt;
&lt;td&gt;1,008 GB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CUDA Cores&lt;/td&gt;
&lt;td&gt;10,496&lt;/td&gt;
&lt;td&gt;16,384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tensor Cores&lt;/td&gt;
&lt;td&gt;328 (3rd gen)&lt;/td&gt;
&lt;td&gt;512 (4th gen)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TDP&lt;/td&gt;
&lt;td&gt;350W&lt;/td&gt;
&lt;td&gt;450W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP16 Tensor&lt;/td&gt;
&lt;td&gt;142 TFLOPS&lt;/td&gt;
&lt;td&gt;330 TFLOPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New Price (2026)&lt;/td&gt;
&lt;td&gt;Discontinued&lt;/td&gt;
&lt;td&gt;~$1,800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Used Price (2026)&lt;/td&gt;
&lt;td&gt;~$600-700&lt;/td&gt;
&lt;td&gt;~$1,400-1,500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a detailed side-by-side with all specifications, see the &lt;a href="https://gpuark.com/en/gpu/nvidia-geforce-rtx-4090-vs-nvidia-geforce-rtx-3090/" rel="noopener noreferrer"&gt;RTX 4090 vs RTX 3090 comparison page on gpuark.com&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training Performance
&lt;/h2&gt;

&lt;p&gt;The 4090 is roughly &lt;strong&gt;1.7-2× faster&lt;/strong&gt; for training due to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;56% more CUDA cores&lt;/li&gt;
&lt;li&gt;4th gen Tensor Cores (better FP8, BF16 throughput)&lt;/li&gt;
&lt;li&gt;Higher clock speeds&lt;/li&gt;
&lt;li&gt;Better power efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real-world training benchmarks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;RTX 3090&lt;/th&gt;
&lt;th&gt;RTX 4090&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ResNet-50 (BS=64)&lt;/td&gt;
&lt;td&gt;780 img/s&lt;/td&gt;
&lt;td&gt;1,420 img/s&lt;/td&gt;
&lt;td&gt;1.82×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BERT fine-tune (BS=32)&lt;/td&gt;
&lt;td&gt;145 samples/s&lt;/td&gt;
&lt;td&gt;268 samples/s&lt;/td&gt;
&lt;td&gt;1.85×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stable Diffusion training&lt;/td&gt;
&lt;td&gt;2.1 it/s&lt;/td&gt;
&lt;td&gt;3.8 it/s&lt;/td&gt;
&lt;td&gt;1.81×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLaMA 7B LoRA (r=16)&lt;/td&gt;
&lt;td&gt;1.4 it/s&lt;/td&gt;
&lt;td&gt;2.6 it/s&lt;/td&gt;
&lt;td&gt;1.86×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Inference Performance (LLMs)
&lt;/h2&gt;

&lt;p&gt;For LLM inference, the gap narrows because it's &lt;strong&gt;memory-bandwidth bound&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;RTX 3090&lt;/th&gt;
&lt;th&gt;RTX 4090&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B Q4 (tok/s)&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;1.24×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 70B Q4 (tok/s)&lt;/td&gt;
&lt;td&gt;doesn't fit&lt;/td&gt;
&lt;td&gt;doesn't fit&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral 7B Q4 (prompt)&lt;/td&gt;
&lt;td&gt;1,200 tok/s&lt;/td&gt;
&lt;td&gt;1,800 tok/s&lt;/td&gt;
&lt;td&gt;1.50×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Memory bandwidth difference is only 8% (936 vs 1,008 GB/s), so for pure token generation the 4090 advantage is modest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Decision
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Buy a 4090 if:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Training throughput is your bottleneck (research, frequent fine-tuning)&lt;/li&gt;
&lt;li&gt;You need FP8 features (CC 8.9 vs 8.6)&lt;/li&gt;
&lt;li&gt;Power efficiency matters (performance per watt is much better)&lt;/li&gt;
&lt;li&gt;You want one powerful card, not multi-GPU hassle&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Buy a used 3090 (or two) if:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;VRAM is your bottleneck (most LLM use cases)&lt;/li&gt;
&lt;li&gt;Budget matters — two 3090s = 48GB for ~$1,300 vs one 4090 = 24GB for ~$1,500&lt;/li&gt;
&lt;li&gt;You primarily do inference&lt;/li&gt;
&lt;li&gt;You want to run 34B+ models&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The multi-GPU argument
&lt;/h3&gt;

&lt;p&gt;Two used 3090s give you &lt;strong&gt;48GB total VRAM&lt;/strong&gt; for less than one 4090:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can run Llama 3.1 70B at Q4_K_M&lt;/li&gt;
&lt;li&gt;Pipeline parallelism with llama.cpp works out of the box&lt;/li&gt;
&lt;li&gt;Training with FSDP/DeepSpeed ZeRO-3 across both cards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The catch: inter-GPU communication over PCIe is slower than a single card's internal bandwidth. For training, expect ~1.5-1.7× scaling (not 2×). For inference with pipeline parallelism, the latency penalty is minimal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Power Consumption
&lt;/h2&gt;

&lt;p&gt;Often overlooked but significant (figures below assume ~$0.12/kWh):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;TDP&lt;/th&gt;
&lt;th&gt;Annual electricity (24/7)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1× RTX 3090&lt;/td&gt;
&lt;td&gt;350W&lt;/td&gt;
&lt;td&gt;~$370/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1× RTX 4090&lt;/td&gt;
&lt;td&gt;450W&lt;/td&gt;
&lt;td&gt;~$475/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2× RTX 3090&lt;/td&gt;
&lt;td&gt;700W&lt;/td&gt;
&lt;td&gt;~$740/year&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If running 24/7 as an inference server, the 4090's better perf/watt matters. For occasional use, it doesn't.&lt;/p&gt;
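
&lt;p&gt;Here's the arithmetic behind those figures, so you can plug in your own electricity rate and duty cycle:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Annual electricity cost for a GPU at a given average draw.
def annual_cost(watts, rate_per_kwh=0.12, hours_per_day=24):
    kwh_per_year = watts / 1000 * hours_per_day * 365
    return kwh_per_year * rate_per_kwh

for config, watts in [("1× RTX 3090", 350), ("1× RTX 4090", 450), ("2× RTX 3090", 700)]:
    print(f"{config}: ${annual_cost(watts):.0f}/year")
# 350W → $368, 450W → $473, 700W → $736 at $0.12/kWh, matching the table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;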

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;The RTX 3090 at $600-700 used is the &lt;strong&gt;best value proposition in ML hardware&lt;/strong&gt; right now. The 4090 is a better card in every metric except price-per-VRAM-GB, but the 3090 gives you 80% of the capability at 40% of the price.&lt;/p&gt;

&lt;p&gt;If you're VRAM-limited (and you probably are if you're running LLMs), two 3090s beat one 4090 every time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running ML workloads on consumer GPUs? Share your setup in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
    </item>
    <item>
      <title>CUDA Compute Capability: What It Is and Why It Matters for ML Engineers</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 03:45:21 +0000</pubDate>
      <link>https://forem.com/maxvyaznikov/cuda-compute-capability-what-it-is-and-why-it-matters-for-ml-engineers-1mhg</link>
      <guid>https://forem.com/maxvyaznikov/cuda-compute-capability-what-it-is-and-why-it-matters-for-ml-engineers-1mhg</guid>
      <description>&lt;p&gt;If you've ever seen an error like "CUDA error: no kernel image is available for execution on the device" or "minimum required Cuda capability is 3.5" — you've run into &lt;strong&gt;Compute Capability&lt;/strong&gt; issues. Here's everything you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Compute Capability?
&lt;/h2&gt;

&lt;p&gt;CUDA Compute Capability (CC) is a &lt;strong&gt;version number&lt;/strong&gt; assigned to every NVIDIA GPU that identifies its &lt;strong&gt;architecture and supported feature set&lt;/strong&gt;. It's NOT a performance score.&lt;/p&gt;

&lt;p&gt;Format: &lt;code&gt;Major.Minor&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Major&lt;/strong&gt; = GPU architecture generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minor&lt;/strong&gt; = incremental improvements within that generation
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GeForce GTX 1080  → CC 6.1 (Pascal)
GeForce RTX 3090  → CC 8.6 (Ampere)
GeForce RTX 4090  → CC 8.9 (Ada Lovelace)
H100              → CC 9.0 (Hopper)
RTX 5090          → CC 12.0 (Blackwell)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Framework compatibility
&lt;/h3&gt;

&lt;p&gt;Modern ML frameworks have &lt;strong&gt;minimum CC requirements&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Minimum CC&lt;/th&gt;
&lt;th&gt;What's excluded&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PyTorch 2.x&lt;/td&gt;
&lt;td&gt;3.7&lt;/td&gt;
&lt;td&gt;Kepler below 3.7 (K40 and older)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TensorFlow 2.15+&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;Kepler and older&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JAX latest&lt;/td&gt;
&lt;td&gt;5.2&lt;/td&gt;
&lt;td&gt;Kepler, first-gen Maxwell (5.0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flash Attention 2&lt;/td&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;td&gt;Everything before Ampere&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If your GPU's CC is below the minimum, the framework &lt;strong&gt;will not use it&lt;/strong&gt; — you'll silently fall back to CPU or get a hard error.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Feature availability
&lt;/h3&gt;

&lt;p&gt;Each CC level unlocks hardware features:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CC&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Key ML Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5.0-5.2&lt;/td&gt;
&lt;td&gt;Maxwell&lt;/td&gt;
&lt;td&gt;Basic CUDA, cuDNN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6.0-6.1&lt;/td&gt;
&lt;td&gt;Pascal&lt;/td&gt;
&lt;td&gt;FP16 compute, unified memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7.0&lt;/td&gt;
&lt;td&gt;Volta&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Tensor Cores&lt;/strong&gt; (1st gen), WMMA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7.5&lt;/td&gt;
&lt;td&gt;Turing&lt;/td&gt;
&lt;td&gt;INT8/INT4 Tensor Cores, mixed precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;td&gt;Ampere&lt;/td&gt;
&lt;td&gt;3rd gen Tensor Cores, BF16, TF32, sparsity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8.6&lt;/td&gt;
&lt;td&gt;Ampere (consumer)&lt;/td&gt;
&lt;td&gt;Same features, fewer SMs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8.9&lt;/td&gt;
&lt;td&gt;Ada Lovelace&lt;/td&gt;
&lt;td&gt;FP8, 4th gen Tensor Cores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;Hopper&lt;/td&gt;
&lt;td&gt;Transformer Engine, FP8 matmul, DPX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10.0 / 12.0&lt;/td&gt;
&lt;td&gt;Blackwell&lt;/td&gt;
&lt;td&gt;5th gen Tensor Cores, FP4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Compilation targets
&lt;/h3&gt;

&lt;p&gt;When you compile CUDA code (or when PyTorch ships prebuilt binaries), it targets specific CC versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Compile for multiple architectures&lt;/span&gt;
nvcc &lt;span class="nt"&gt;-gencode&lt;/span&gt; &lt;span class="nb"&gt;arch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;compute_80,code&lt;span class="o"&gt;=&lt;/span&gt;sm_80 &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-gencode&lt;/span&gt; &lt;span class="nb"&gt;arch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;compute_86,code&lt;span class="o"&gt;=&lt;/span&gt;sm_86 &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-gencode&lt;/span&gt; &lt;span class="nb"&gt;arch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;compute_89,code&lt;span class="o"&gt;=&lt;/span&gt;sm_89 &lt;span class="se"&gt;\&lt;/span&gt;
     my_kernel.cu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PyTorch wheels on PyPI typically include CC 5.0, 6.0, 7.0, 7.5, 8.0, 8.6, 8.9, 9.0. If your GPU isn't covered, you may need to build from source.&lt;/p&gt;
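
&lt;p&gt;You can check whether your installed wheel actually covers your GPU; a short sketch using PyTorch's public API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

major, minor = torch.cuda.get_device_capability()
my_arch = f"sm_{major}{minor}"          # e.g. "sm_86" on an RTX 3090

compiled = torch.cuda.get_arch_list()   # architectures baked into this wheel
print(f"This wheel targets: {compiled}")
if my_arch not in compiled:
    print(f"{my_arch} not covered: expect kernel errors or a PTX JIT fallback")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;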

&lt;h2&gt;
  
  
  How to Check Your GPU's CC
&lt;/h2&gt;

&lt;h3&gt;
  
  
  nvidia-smi (easiest, no CUDA toolkit needed)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;compute_cap &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader
&lt;span class="c"&gt;# Output: 8.6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python (PyTorch)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="n"&gt;major&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_capability&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compute Capability: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;major&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;minor&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python (TensorFlow)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
&lt;span class="n"&gt;gpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_physical_devices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GPU&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;gpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;details&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compute_capability&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  C++ (CUDA Runtime)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;cudaDeviceProp&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;cudaGetDeviceProperties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CC: %d.%d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;major&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;minor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lookup table
&lt;/h3&gt;

&lt;p&gt;Don't have the GPU installed yet? The &lt;a href="https://gpuark.com/en/cuda-compute-capability/" rel="noopener noreferrer"&gt;CUDA Compute Capability table on gpuark.com&lt;/a&gt; covers every NVIDIA GPU from Kepler to Blackwell.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common CC-Related Errors and Fixes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  "no kernel image is available for execution on the device"
&lt;/h3&gt;

&lt;p&gt;Your PyTorch/TensorFlow binary wasn't compiled for your GPU's CC. Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install PyTorch with the right CUDA version&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;torch &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu124
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or build from source with your CC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;TORCH_CUDA_ARCH_LIST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"8.6"&lt;/span&gt; pip &lt;span class="nb"&gt;install &lt;/span&gt;torch &lt;span class="nt"&gt;--no-binary&lt;/span&gt; torch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  "minimum required Cuda capability is X.X"
&lt;/h3&gt;

&lt;p&gt;Your GPU is too old for the framework version. Options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use an older framework version&lt;/li&gt;
&lt;li&gt;Upgrade your GPU&lt;/li&gt;
&lt;li&gt;Use CPU mode: &lt;code&gt;CUDA_VISIBLE_DEVICES="" python train.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Flash Attention requires CC ≥ 8.0
&lt;/h3&gt;

&lt;p&gt;Flash Attention 2 only works on Ampere (RTX 3000) and newer. For older GPUs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Use xformers instead (supports CC ≥ 6.0)
&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;xformers&lt;/span&gt;
&lt;span class="c1"&gt;# Or use PyTorch's built-in SDPA
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scaled_dot_product_attention&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Practical Advice for GPU Shopping
&lt;/h2&gt;

&lt;p&gt;When buying a GPU for ML:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Minimum CC 7.5&lt;/strong&gt; (Turing) for mixed precision training — gives you Tensor Cores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 8.0+&lt;/strong&gt; (Ampere) strongly recommended — BF16, Flash Attention, much better ML performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CC 8.9&lt;/strong&gt; (Ada) for bleeding-edge features like FP8 quantization-aware training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM matters more than CC&lt;/strong&gt; in most cases — a 3090 (CC 8.6, 24GB) beats a 4070 (CC 8.9, 12GB) for LLMs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;CC tells you &lt;em&gt;what features your GPU supports&lt;/em&gt;. VRAM tells you &lt;em&gt;how big a model fits&lt;/em&gt;. Both matter, but for LLM inference, VRAM is usually the bottleneck.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What GPU are you running your ML workloads on? Have you hit CC compatibility issues? Let me know in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>nvidia</category>
    </item>
    <item>
      <title>How Much VRAM Do You Actually Need to Run LLMs Locally?</title>
      <dc:creator>Max Vyaznikov</dc:creator>
      <pubDate>Thu, 12 Mar 2026 03:44:13 +0000</pubDate>
      <link>https://forem.com/maxvyaznikov/how-much-vram-do-you-actually-need-to-run-llms-locally-2604</link>
      <guid>https://forem.com/maxvyaznikov/how-much-vram-do-you-actually-need-to-run-llms-locally-2604</guid>
      <description>&lt;p&gt;Running large language models locally has become increasingly practical — but figuring out exactly how much VRAM you need can be confusing. Here's a concrete breakdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Simple Formula
&lt;/h2&gt;

&lt;p&gt;For &lt;strong&gt;inference&lt;/strong&gt; (running a model, not training):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VRAM ≈ Parameters × Bytes per Weight + KV Cache + Overhead
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where bytes per weight depends on quantization:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Bytes/Param&lt;/th&gt;
&lt;th&gt;Example: 7B model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FP32&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;td&gt;28 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP16/BF16&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;td&gt;14 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT8 (Q8)&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;7 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT4 (Q4_K_M)&lt;/td&gt;
&lt;td&gt;0.56&lt;/td&gt;
&lt;td&gt;~4 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT4 (Q4_0)&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;3.5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Add &lt;strong&gt;10-20% overhead&lt;/strong&gt; for KV cache (more for longer contexts) and runtime buffers.&lt;/p&gt;
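
&lt;p&gt;Putting the formula and the table together, a small estimator; the bytes-per-param values come from the table above and the overhead factor is the same hedged 10-20% guess:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Inference VRAM estimate per the formula above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "q8": 1.0, "q4_k_m": 0.56, "q4_0": 0.5}

def inference_vram_gb(params_billions, precision, overhead=0.15):
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1 + overhead)  # +10-20% for KV cache and buffers

print(f"{inference_vram_gb(7, 'q4_k_m'):.1f} GB")   # 7B at Q4_K_M → ~4.5 GB
print(f"{inference_vram_gb(70, 'q4_k_m'):.1f} GB")  # 70B at Q4_K_M → ~45 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;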

&lt;h2&gt;
  
  
  Practical VRAM Requirements by Model
&lt;/h2&gt;

&lt;p&gt;Here's what you can actually run on common GPUs:&lt;/p&gt;

&lt;h3&gt;
  
  
  8 GB VRAM (RTX 4060, RTX 3070)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 8B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;Qwen2.5 7B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;Mistral 7B at Q5_K_M ✅&lt;/li&gt;
&lt;li&gt;Phi-3.5 Mini (3.8B) at Q8 ✅&lt;/li&gt;
&lt;li&gt;13B models at Q4 ⚠️ (tight, short context only)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  12 GB VRAM (RTX 4070, RTX 3060 12GB)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;13B models at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;Llama 3.1 8B at Q8 ✅&lt;/li&gt;
&lt;li&gt;CodeQwen 14B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;20B models at Q4 ⚠️&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  16 GB VRAM (RTX 4080, RTX 5070 Ti)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Mistral Small 24B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;Qwen2.5-Coder 14B at Q6_K ✅&lt;/li&gt;
&lt;li&gt;20B models at Q5-Q6 ✅&lt;/li&gt;
&lt;li&gt;34B models at Q4 ⚠️&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  24 GB VRAM (RTX 3090, RTX 4090)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 70B at Q4_K_M ⚠️ (with partial offload)&lt;/li&gt;
&lt;li&gt;34B models at Q5-Q6 ✅&lt;/li&gt;
&lt;li&gt;Qwen2.5 32B at Q5_K_M ✅&lt;/li&gt;
&lt;li&gt;DeepSeek-Coder-V2-Lite 16B at FP16 ✅&lt;/li&gt;
&lt;li&gt;Mistral Small 24B at Q8 ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  48 GB VRAM (2× RTX 3090, A6000)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 70B at Q4_K_M ✅&lt;/li&gt;
&lt;li&gt;DeepSeek V3 671B — not enough, even at Q2&lt;/li&gt;
&lt;li&gt;Mixtral 8x22B at Q4 ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Quantization Sweet Spot
&lt;/h2&gt;

&lt;p&gt;Q4_K_M is the most popular quantization for local inference and for good reason:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality:&lt;/strong&gt; ~1-2% degradation vs FP16 on most benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size:&lt;/strong&gt; ~56% of the Q8 size (0.56 vs 1.0 bytes per param)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; Fastest on most consumer GPUs (memory-bandwidth bound)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Going lower (Q3, Q2) introduces noticeable quality degradation, especially on reasoning tasks. Going higher (Q6, Q8) gives marginal quality improvement but costs significantly more VRAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  What About Training?
&lt;/h2&gt;

&lt;p&gt;Training needs &lt;strong&gt;much more&lt;/strong&gt; memory than inference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Training VRAM ≈ Model weights + Gradients + Optimizer states + Activations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For full fine-tuning with Adam optimizer at FP32:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights: 4 bytes/param&lt;/li&gt;
&lt;li&gt;Gradients: 4 bytes/param&lt;/li&gt;
&lt;li&gt;Adam states: 8 bytes/param&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total: ~16 bytes/param&lt;/strong&gt; (before activations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 7B model needs &lt;strong&gt;~112 GB&lt;/strong&gt; for full FP32 training. That's why techniques like &lt;strong&gt;LoRA&lt;/strong&gt; (which only trains ~1-2% of parameters) and &lt;strong&gt;QLoRA&lt;/strong&gt; (quantized base + LoRA) are so popular:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;QLoRA fine-tuning&lt;/strong&gt; of 7B: ~6-8 GB VRAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QLoRA fine-tuning&lt;/strong&gt; of 13B: ~10-12 GB VRAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QLoRA fine-tuning&lt;/strong&gt; of 70B: ~40-48 GB VRAM&lt;/li&gt;
&lt;/ul&gt;
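
&lt;p&gt;The same arithmetic as a rough sketch: the full fine-tune number follows the 16 bytes/param breakdown above, while the QLoRA figure uses an assumed ~0.6 bytes/param for the quantized base plus a small flat allowance for adapters and optimizer state:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Full fine-tune: weights (4) + gradients (4) + Adam states (8) = 16 bytes/param.
# Activation memory is omitted; it depends on batch size and sequence length.
def full_finetune_gb(params_billions):
    return params_billions * 16

# QLoRA: 4-bit base weights plus a small trainable adapter (assumption: ~0.6
# bytes/param for the base, ~2 GB flat for adapters and optimizer state).
def qlora_rough_gb(params_billions):
    return params_billions * 0.6 + 2

print(full_finetune_gb(7))       # 112 GB, matching the figure above
print(round(qlora_rough_gb(7)))  # ~6 GB, in the 6-8 GB range above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;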

&lt;h2&gt;
  
  
  KV Cache: The Hidden VRAM Consumer
&lt;/h2&gt;

&lt;p&gt;When generating long texts, the KV cache grows with context length:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KV cache ≈ 2 × num_layers × hidden_dim × context_length × bytes_per_element
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Llama 3.1 8B at FP16 with 8K context: ~1 GB&lt;br&gt;
For Llama 3.1 8B at FP16 with 128K context: ~16 GB&lt;/p&gt;

&lt;p&gt;This is why you might load a model fine but run out of memory during long conversations.&lt;/p&gt;
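
&lt;p&gt;Worked numbers for Llama 3.1 8B (32 layers, 8 KV heads via grouped-query attention, head dim 128, FP16 cache) reproduce the figures above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# KV cache size per the formula above. Llama 3.1 8B uses grouped-query
# attention, so the KV width is 8 heads × 128 dims, not the 4096 hidden dim.
def kv_cache_gb(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    elems = 2 * num_layers * num_kv_heads * head_dim * context_len  # K and V
    return elems * bytes_per_elem / 1024**3

print(f"{kv_cache_gb(32, 8, 128, 8_192):.1f} GB")    # ~1 GB at 8K context
print(f"{kv_cache_gb(32, 8, 128, 131_072):.1f} GB")  # ~16 GB at 128K context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;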

&lt;h2&gt;
  
  
  Tools for Estimating
&lt;/h2&gt;

&lt;p&gt;Rather than doing this math by hand every time, there's a &lt;a href="https://gpuark.com/en/vram-calculator/" rel="noopener noreferrer"&gt;VRAM calculator&lt;/a&gt; that estimates memory requirements — plug in the model size, quantization level, and context length to see if it fits your GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Budget&lt;/th&gt;
&lt;th&gt;Best GPU&lt;/th&gt;
&lt;th&gt;What You Can Run&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;~$300&lt;/td&gt;
&lt;td&gt;RTX 4060 8GB&lt;/td&gt;
&lt;td&gt;7-8B models at Q4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~$400&lt;/td&gt;
&lt;td&gt;RTX 4060 Ti 16GB&lt;/td&gt;
&lt;td&gt;Up to 24B at Q4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~$600&lt;/td&gt;
&lt;td&gt;Used RTX 3090 24GB&lt;/td&gt;
&lt;td&gt;Up to 34B at Q5, 70B at Q3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~$1200&lt;/td&gt;
&lt;td&gt;2× Used RTX 3090&lt;/td&gt;
&lt;td&gt;70B at Q4, most models comfortably&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~$1800&lt;/td&gt;
&lt;td&gt;RTX 4090 24GB&lt;/td&gt;
&lt;td&gt;Same as 3090 but ~2× faster training&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most cost-effective option for serious local LLM use in 2025-2026 is still a &lt;strong&gt;used RTX 3090&lt;/strong&gt; — 24 GB of VRAM at a fraction of the 4090 price.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your local LLM setup? Drop a comment with your GPU and favorite model!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
    </item>
  </channel>
</rss>
