<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Thurmon Demich</title>
    <description>The latest articles on Forem by Thurmon Demich (@thurmon_demich).</description>
    <link>https://forem.com/thurmon_demich</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3900489%2F09f665d8-a7ab-491e-a6b5-8fc8f6fc1992.png</url>
      <title>Forem: Thurmon Demich</title>
      <link>https://forem.com/thurmon_demich</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/thurmon_demich"/>
    <language>en</language>
    <item>
      <title>I Built a GPU Dataset for LLM Inference — Here’s What I Learned</title>
      <dc:creator>Thurmon Demich</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:49:49 +0000</pubDate>
      <link>https://forem.com/thurmon_demich/i-built-a-gpu-dataset-for-llm-inference-heres-what-i-learned-2ida</link>
      <guid>https://forem.com/thurmon_demich/i-built-a-gpu-dataset-for-llm-inference-heres-what-i-learned-2ida</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR: Most GPU advice for LLMs is either outdated or too generic. I started collecting real-world data (VRAM, model fit, tokens/sec), and the patterns are surprisingly consistent.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;Why I built this&lt;/h2&gt;

&lt;p&gt;If you’ve tried running LLMs locally (Ollama, llama.cpp, vLLM), you’ve probably hit this problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Can my GPU run this model?”&lt;/li&gt;
&lt;li&gt;“Why does a 13B model barely fit but run so slowly?”&lt;/li&gt;
&lt;li&gt;“Do I really need 24GB VRAM?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The answers online are all over the place.&lt;/p&gt;

&lt;p&gt;So I started putting together a small dataset based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;community benchmarks&lt;/li&gt;
&lt;li&gt;VRAM limits that hold up consistently across reports&lt;/li&gt;
&lt;li&gt;patterns that repeat across different setups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Repo: &lt;a href="https://github.com/airdropkalami/awesome-gpu-for-llm" rel="noopener noreferrer"&gt;https://github.com/airdropkalami/awesome-gpu-for-llm&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;The pattern is simpler than people think&lt;/h2&gt;

&lt;p&gt;After aggregating the data, a few rules kept showing up.&lt;/p&gt;

&lt;h3&gt;🧠 Practical rules (that actually work)&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7B models&lt;/strong&gt; → run on most GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;13B models&lt;/strong&gt; → need ~16GB for comfortable use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;34B models&lt;/strong&gt; → require 24GB-class GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;70B models&lt;/strong&gt; → usually better on cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren’t theoretical — they show up consistently across different frameworks.&lt;/p&gt;
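&lt;p&gt;To make these rules concrete, here’s a minimal sketch that encodes them as a lookup. The thresholds mirror the bullets above (rules of thumb for Q4-quantized models, not hard limits), and the names are mine, purely for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rule-of-thumb minimum VRAM (GB) for comfortable Q4 inference.
# Mirrors the bullets above; real needs vary with context length.
MIN_VRAM_GB = {"7B": 8, "13B": 16, "34B": 24, "70B": 40}  # 70B: usually cloud anyway

def fits(model_size: str, vram_gb: float) -&amp;gt; bool:
    """True if the model should run comfortably on this much VRAM."""
    return vram_gb &amp;gt;= MIN_VRAM_GB[model_size]

print(fits("13B", 24))  # True: a 24GB card handles 13B easily
print(fits("34B", 16))  # False: 34B wants a 24GB-class GPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;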




&lt;h2&gt;VRAM matters more than compute&lt;/h2&gt;

&lt;p&gt;One thing became obvious quickly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If it doesn’t fit in VRAM, it doesn’t run.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can optimize speed, but you can’t “optimize around” missing VRAM.&lt;/p&gt;
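&lt;p&gt;A quick back-of-envelope check makes this tangible: weight memory is roughly parameter count times bytes per parameter (bits / 8), plus some headroom for the KV cache and runtime buffers. The ~20% overhead factor below is my own rough assumption; long contexts push it higher:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def est_vram_gb(params_b: float, bits: int = 4, overhead: float = 1.2) -&amp;gt; float:
    """Very rough estimate: weights (params * bits/8) plus ~20% for
    KV cache and buffers. Treat it as a floor, not a guarantee."""
    return params_b * (bits / 8) * overhead

print(round(est_vram_gb(7), 1))   # 4.2  (GB): fine on an 8GB card
print(round(est_vram_gb(13), 1))  # 7.8  (GB): tight on 8GB, easy on 16GB
print(round(est_vram_gb(34), 1))  # 20.4 (GB): needs a 24GB-class GPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;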




&lt;h2&gt;Real-world dataset (simplified)&lt;/h2&gt;

&lt;p&gt;Here’s a small snapshot from what I collected:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;th&gt;Fit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4060&lt;/td&gt;
&lt;td&gt;8GB&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~35&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4060&lt;/td&gt;
&lt;td&gt;8GB&lt;/td&gt;
&lt;td&gt;13B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~18&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;24GB&lt;/td&gt;
&lt;td&gt;13B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~45&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;24GB&lt;/td&gt;
&lt;td&gt;34B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~25&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090&lt;/td&gt;
&lt;td&gt;24GB&lt;/td&gt;
&lt;td&gt;34B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~22&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(✅ = fits comfortably in VRAM; ⚠️ = tight fit, expect offloading or a reduced context window.)&lt;/p&gt;

&lt;p&gt;👉 Full dataset (updated):&lt;br&gt;
&lt;a href="https://github.com/airdropkalami/awesome-gpu-for-llm/blob/main/benchmark/dataset.md" rel="noopener noreferrer"&gt;https://github.com/airdropkalami/awesome-gpu-for-llm/blob/main/benchmark/dataset.md&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;What surprised me&lt;/h2&gt;
&lt;h3&gt;1. 13B is the real sweet spot&lt;/h3&gt;

&lt;p&gt;Not 7B. Not 70B.&lt;/p&gt;

&lt;p&gt;13B gives the best balance of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;quality&lt;/li&gt;
&lt;li&gt;speed&lt;/li&gt;
&lt;li&gt;hardware cost&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;2. 24GB is a hard ceiling for most users&lt;/h3&gt;

&lt;p&gt;Once you go beyond that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cost explodes&lt;/li&gt;
&lt;li&gt;scaling becomes inefficient&lt;/li&gt;
&lt;li&gt;cloud often makes more sense&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;3. Benchmarks don’t reflect real usage&lt;/h3&gt;

&lt;p&gt;A lot of GPU comparisons focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FLOPS&lt;/li&gt;
&lt;li&gt;synthetic benchmarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But for LLMs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VRAM &amp;gt; everything else
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;When NOT to buy a GPU&lt;/h2&gt;

&lt;p&gt;This is where most people overspend.&lt;/p&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;want to run 70B models&lt;/li&gt;
&lt;li&gt;only experiment occasionally&lt;/li&gt;
&lt;li&gt;don’t need local inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 you’re better off using cloud GPUs.&lt;/p&gt;
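&lt;p&gt;A simple break-even calculation makes the trade-off concrete. The prices below are placeholders I picked for illustration; plug in current numbers for your provider and region:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Break-even: after this many hours of use, buying beats renting.
# Prices are illustrative placeholders, not quotes.
gpu_price = 1600.00   # e.g., a 24GB consumer card
cloud_rate = 0.50     # assumed cloud GPU price per hour

print(gpu_price / cloud_rate)   # 3200.0 hours (~1.5 years at 40h/week)

# Occasional experimenting, say 5 hours a week:
print(5 * 52 * cloud_rate)      # 130.0 dollars per year on cloud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;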




&lt;h2&gt;If you want a deeper breakdown&lt;/h2&gt;

&lt;p&gt;I wrote more detailed guides here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://bestgpuforllm.com/articles/best-gpu-for-ollama/" rel="noopener noreferrer"&gt;https://bestgpuforllm.com/articles/best-gpu-for-ollama/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bestgpuforllm.com/articles/how-much-vram-for-llm/" rel="noopener noreferrer"&gt;https://bestgpuforllm.com/articles/how-much-vram-for-llm/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Why I’m sharing this&lt;/h2&gt;

&lt;p&gt;I’m still expanding the dataset, and I’m trying to keep it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;practical (not theoretical)&lt;/li&gt;
&lt;li&gt;consistent across setups&lt;/li&gt;
&lt;li&gt;easy to use for decision making&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have real benchmark data or setups, feel free to contribute.&lt;/p&gt;




&lt;h2&gt;Final thought&lt;/h2&gt;

&lt;p&gt;The “best GPU” isn’t the fastest.&lt;/p&gt;

&lt;p&gt;It’s the one that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fits your model&lt;/li&gt;
&lt;li&gt;matches your budget&lt;/li&gt;
&lt;li&gt;and actually works in your setup&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you’re building with local LLMs, I’d love to know what GPU + model combo you’re running.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How to Choose the Right GPU for Local LLMs (Without Wasting Money)</title>
      <dc:creator>Thurmon Demich</dc:creator>
      <pubDate>Mon, 27 Apr 2026 13:15:35 +0000</pubDate>
      <link>https://forem.com/thurmon_demich/how-to-choose-the-right-gpu-for-local-llms-without-wasting-money-2c9d</link>
      <guid>https://forem.com/thurmon_demich/how-to-choose-the-right-gpu-for-local-llms-without-wasting-money-2c9d</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR: Most people overspend on GPUs for local LLMs. If you match &lt;strong&gt;model size ↔ VRAM ↔ quantization&lt;/strong&gt;, you can save hundreds (or thousands) and still get great results.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;Why this matters&lt;/h2&gt;

&lt;p&gt;If you’re running local LLMs (Ollama, llama.cpp, vLLM, etc.), the biggest mistake I see is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buying a GPU that’s &lt;strong&gt;too powerful (and too expensive)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Or worse, buying one with &lt;strong&gt;not enough VRAM&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both lead to frustration.&lt;/p&gt;

&lt;p&gt;This guide breaks down how to choose the &lt;strong&gt;right GPU for your actual workload&lt;/strong&gt; — not just benchmarks.&lt;/p&gt;




&lt;h2&gt;Step 1 — Understand what actually limits you&lt;/h2&gt;

&lt;p&gt;For LLM inference, &lt;strong&gt;VRAM matters more than raw compute&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;Rough VRAM requirements&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model Size&lt;/th&gt;
&lt;th&gt;Typical VRAM (quantized)&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;6–8GB&lt;/td&gt;
&lt;td&gt;Entry-level, very easy to run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13B&lt;/td&gt;
&lt;td&gt;10–16GB&lt;/td&gt;
&lt;td&gt;Sweet spot for many users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;34B&lt;/td&gt;
&lt;td&gt;20–24GB&lt;/td&gt;
&lt;td&gt;High-end consumer GPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;70B&lt;/td&gt;
&lt;td&gt;40GB+&lt;/td&gt;
&lt;td&gt;Usually cloud or multi-GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you remember one thing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;VRAM determines what you &lt;em&gt;can&lt;/em&gt; run. Compute determines how fast it runs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
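&lt;p&gt;To see which tier your own machine lands in, you can compare total VRAM against the table. The sketch below assumes an NVIDIA GPU with &lt;code&gt;nvidia-smi&lt;/code&gt; on the PATH, and uses the table’s lower bounds as thresholds:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

# Total VRAM of the first GPU, in MiB (NVIDIA only).
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.total",
     "--format=csv,noheader,nounits"], text=True)
vram_gb = int(out.split()[0]) / 1024

# Lower bounds from the table above (quantized models).
for size, need in [("70B", 40), ("34B", 20), ("13B", 10), ("7B", 6)]:
    if vram_gb &amp;gt;= need:
        print(f"{vram_gb:.0f}GB detected: up to {size} should be feasible")
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;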




&lt;h2&gt;Step 2 — Pick your use case first (not the GPU)&lt;/h2&gt;

&lt;p&gt;Before looking at GPUs, define your goal:&lt;/p&gt;

&lt;h3&gt;1. Lightweight local assistant (7B–13B)&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Coding assistant&lt;/li&gt;
&lt;li&gt;Chatbot&lt;/li&gt;
&lt;li&gt;RAG experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 You don’t need a flagship GPU.&lt;/p&gt;

&lt;h3&gt;2. Serious local inference (13B–34B)&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Better reasoning&lt;/li&gt;
&lt;li&gt;Higher quality outputs&lt;/li&gt;
&lt;li&gt;More stable pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This is where most developers should aim.&lt;/p&gt;

&lt;h3&gt;3. Large models (70B+)&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;High-end research&lt;/li&gt;
&lt;li&gt;Production-level inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Local becomes expensive very quickly.&lt;/p&gt;




&lt;h2&gt;Step 3 — Real GPU recommendations (2026)&lt;/h2&gt;

&lt;p&gt;Here’s a practical breakdown:&lt;/p&gt;

&lt;h3&gt;Best budget option&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RTX 4060 / 4060 Ti (8–16GB)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Good for: 7B–13B models&lt;/li&gt;
&lt;li&gt;Limitation: VRAM ceiling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Best overall value&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RTX 4090 (24GB)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Good for: 13B–34B models&lt;/li&gt;
&lt;li&gt;Why: Enough VRAM + strong performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Used value pick&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RTX 3090 (24GB)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Still extremely relevant for LLMs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;High-end / no-compromise&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RTX 5090-class&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Only if budget is not a concern&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Step 4 — When NOT to buy a GPU&lt;/h2&gt;

&lt;p&gt;This is where most people get it wrong.&lt;/p&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Want to run &lt;strong&gt;70B models&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Don’t need constant local inference&lt;/li&gt;
&lt;li&gt;Are just experimenting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;strong&gt;Use cloud GPUs instead.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s often cheaper and far more flexible.&lt;/p&gt;




&lt;h2&gt;Step 5 — Common mistakes&lt;/h2&gt;

&lt;h3&gt;❌ Mistake 1: Buying for benchmarks&lt;/h3&gt;

&lt;p&gt;Benchmarks ≠ your real workload.&lt;/p&gt;

&lt;h3&gt;❌ Mistake 2: Ignoring VRAM&lt;/h3&gt;

&lt;p&gt;You can’t “optimize around” missing VRAM.&lt;/p&gt;

&lt;h3&gt;❌ Mistake 3: Overbuying&lt;/h3&gt;

&lt;p&gt;A $1600 GPU for a 7B model is overkill.&lt;/p&gt;

&lt;h3&gt;❌ Mistake 4: Forcing everything local&lt;/h3&gt;

&lt;p&gt;Cloud exists for a reason.&lt;/p&gt;




&lt;h2&gt;Step 6 — Simple decision guide&lt;/h2&gt;

&lt;p&gt;If you just want a quick answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Beginner / budget → RTX 4060&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Most users → RTX 4090&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tight budget but want 24GB → used 3090&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Need 70B → go cloud&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
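&lt;p&gt;And the same guide as code, if you prefer your decisions executable. The function and thresholds are mine, a toy version of the list above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def pick_gpu(budget_usd: int, largest_model: str = "13B") -&amp;gt; str:
    """Toy version of the decision list above; thresholds are illustrative."""
    if largest_model == "70B":
        return "go cloud"
    if budget_usd &amp;gt;= 1600:
        return "RTX 4090 (24GB)"
    if largest_model == "34B":
        return "used RTX 3090 (24GB)"  # tight budget, still 24GB
    return "RTX 4060 class (8-16GB)"

print(pick_gpu(500))          # RTX 4060 class (8-16GB)
print(pick_gpu(900, "34B"))   # used RTX 3090 (24GB)
print(pick_gpu(2000, "70B"))  # go cloud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;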




&lt;h2&gt;Want a deeper breakdown?&lt;/h2&gt;

&lt;p&gt;I put together a more detailed guide (including VRAM charts and specific model compatibility):&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://bestgpuforllm.com/articles/best-gpu-for-ollama/" rel="noopener noreferrer"&gt;https://bestgpuforllm.com/articles/best-gpu-for-ollama/&lt;/a&gt;&lt;br&gt;
👉 &lt;a href="https://bestgpuforllm.com/articles/how-much-vram-for-llm/" rel="noopener noreferrer"&gt;https://bestgpuforllm.com/articles/how-much-vram-for-llm/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;Final thought&lt;/h2&gt;

&lt;p&gt;The best GPU isn’t the most expensive one.&lt;/p&gt;

&lt;p&gt;It’s the one that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fits your &lt;strong&gt;model size&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Matches your &lt;strong&gt;budget&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;And doesn’t lock you into unnecessary cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you get those 3 right, you’re already ahead of most people building local AI setups.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Curious what setups others are running? Drop your GPU + model combo below — I’m collecting real-world configs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
