<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: tomohiro takada</title>
    <description>The latest articles on Forem by tomohiro takada (@leagames0221sys).</description>
    <link>https://forem.com/leagames0221sys</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3925651%2Fc74f5c5f-21e5-41d2-910a-3587f3cc85a1.png</url>
      <title>Forem: tomohiro takada</title>
      <link>https://forem.com/leagames0221sys</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/leagames0221sys"/>
    <language>en</language>
    <item>
      <title>Counterintuitive: WSL2 + vllm cannot fit Qwen2.5-7B-1M on 6GB VRAM where Windows transformers can</title>
      <dc:creator>tomohiro takada</dc:creator>
      <pubDate>Mon, 11 May 2026 18:49:24 +0000</pubDate>
      <link>https://forem.com/leagames0221sys/counterintuitive-wsl2-vllm-cannot-fit-qwen25-7b-1m-on-6gb-vram-where-windows-transformers-can-597b</link>
      <guid>https://forem.com/leagames0221sys/counterintuitive-wsl2-vllm-cannot-fit-qwen25-7b-1m-on-6gb-vram-where-windows-transformers-can-597b</guid>
      <description>&lt;p&gt;TL;DR: I tried to run Qwen2.5-7B-Instruct-1M on a consumer laptop (RTX 3050 Laptop, 6GB VRAM) and mapped the exact feasibility frontier. All evidence is committed as JSON and enforced by drift CI. Three honest findings:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;4k context is the hard ceiling&lt;/strong&gt; on Windows with transformers + bitsandbytes int4 NF4. 5k, 6k, and 8k all OOM at the first attention forward pass. The 4k cell passes only because Windows' WDDM driver overcommits VRAM, spilling allocations to system RAM over PCIe at roughly a 10x latency tax: peak measured usage was 10.8GB on a 6GB GPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WSL2 + vllm cannot even fit the model.&lt;/strong&gt; vllm 0.7.3's memory profiler logs, verbatim: "model weights take 5.43GiB; PyTorch activation peak memory takes 1.42GiB; the rest of the memory reserved for KV Cache is &lt;strong&gt;-0.94GiB&lt;/strong&gt;". Zero GPU cache blocks allocated, 0.00x concurrency at 4200 tokens. The Linux NVIDIA driver offers no equivalent shared-memory fallback, so vllm sees only the physical 6GB and refuses to start. The conventional wisdom that "vllm beats transformers on memory efficiency" is disproven at this hardware tier: vllm fails harder, because the enabler was the Windows OS, not the inference engine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud free tier is also capped, and unevenly.&lt;/strong&gt; On the GitHub Models free tier (no credit card, gh OAuth only): gpt-4.1-mini PASS @ 4k in 8.54s (~30x faster than local); llama-3.3-70b-instruct PASS @ 4k in 5.17s. But &lt;strong&gt;gpt-5 returns &lt;code&gt;unavailable_model&lt;/code&gt; at any context size&lt;/strong&gt; on the free tier, DeepSeek-V3 and gpt-5 are capped at 4,000 input tokens, and Anthropic Claude is &lt;strong&gt;not in the GitHub Models catalog at all&lt;/strong&gt;: no credit card + Claude = no path.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
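
&lt;p&gt;A back-of-the-envelope memory model explains why the cliff sits near 4k. Here is a sketch with assumed Qwen2.5-7B shapes (28 layers, 28 query heads, 4 GQA key/value heads, head dim 128, ~7.6B params); these come from the public model config, not from the repo's evidence cells:&lt;/p&gt;

```python
# Rough memory model for int4 Qwen2.5-7B inference on a 6 GiB GPU.
# Shapes below are assumptions from the public Qwen2.5-7B config.
GIB = 2 ** 30
N_PARAMS = 7.6e9     # total parameters
N_LAYERS = 28
N_HEADS = 28         # query heads
N_KV_HEADS = 4       # GQA key/value heads
HEAD_DIM = 128

def weights_gib_int4():
    # NF4 packs roughly 0.5 byte per param (ignores fp16 embeddings
    # and quantization metadata, so this is a floor, not a ceiling)
    return N_PARAMS * 0.5 / GIB

def kv_cache_gib(seq_len, bytes_per_el=2):
    # K and V tensors, per layer, per KV head, fp16
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * seq_len * bytes_per_el / GIB

def eager_attn_scores_gib(seq_len, bytes_per_el=2):
    # one layer's transient [heads, seq, seq] score matrix in eager attention
    return N_HEADS * seq_len * seq_len * bytes_per_el / GIB

for seq in (4096, 8192):
    print(seq,
          round(weights_gib_int4(), 2),
          round(kv_cache_gib(seq), 2),
          round(eager_attn_scores_gib(seq), 2))
```

&lt;p&gt;Weights (~3.5 GiB) and KV cache (~0.22 GiB at 4k) fit, but the eager-attention score tensor is quadratic in sequence length: ~0.88 GiB per layer at 4k and 3.5 GiB at 8k, which is consistent with OOM striking at the first attention forward pass. If the run used SDPA or FlashAttention the transient is smaller, so treat this as an upper-bound sketch, not a reconstruction of the actual allocator trace.&lt;/p&gt;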

&lt;p&gt;Full numbers + 11 JSON evidence cells + 3 ADRs at: &lt;a href="https://github.com/leagames0221-sys/longctx-bench-honest" rel="noopener noreferrer"&gt;https://github.com/leagames0221-sys/longctx-bench-honest&lt;/a&gt;&lt;/p&gt;
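
&lt;p&gt;Finding 2 falls straight out of vllm's startup budgeting: KV-cache space is whatever remains of the utilization-capped VRAM budget after weights and peak activations. A minimal sketch using the logged figures; the 0.90 fraction is vllm's documented default for gpu_memory_utilization and is an assumption here, since the actual run's setting is not in the excerpt:&lt;/p&gt;

```python
# vllm-style budget: kv_cache = util * total_vram - weights - activation_peak
def kv_cache_budget_gib(total_gib, weights_gib, activation_gib, util=0.90):
    return util * total_gib - weights_gib - activation_gib

# Figures from the logged profile: 5.43 GiB weights, 1.42 GiB activations
budget = kv_cache_budget_gib(6.0, 5.43, 1.42)
print(round(budget, 2))  # prints -1.45
```

&lt;p&gt;Weights plus activations alone (6.85 GiB) already exceed the physical 6 GiB, so the budget is negative under any utilization fraction at or below 1.0; the exact -0.94GiB in the log just implies a slightly different effective budget than this sketch's 0.90 x 6.0. Either way vllm allocates zero cache blocks and refuses to serve.&lt;/p&gt;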

&lt;p&gt;Hardware: RTX 3050 Laptop 6GB / driver 560.94 / CUDA 12.6 / Windows 11 + WSL2 Ubuntu 24.04. Software: torch 2.5.1+cu124, transformers (5.8.0 Win / 4.48.3 WSL), bitsandbytes 0.49.2, vllm 0.7.3. Everything fully reproducible — uv.lock committed, runners under examples/.&lt;/p&gt;
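
&lt;p&gt;For the cloud cells, GitHub Models speaks the OpenAI chat-completions wire format against a token-authenticated endpoint. A hedged sketch that only builds the request; the endpoint URL, the publisher-prefixed model id, and the header names are assumptions from GitHub's public docs, not copied from the repo's runners:&lt;/p&gt;

```python
import json
import os

# Assumed GitHub Models inference endpoint; check the current docs.
# Auth is a plain GitHub token via OAuth, no credit card involved.
ENDPOINT = "https://models.github.ai/inference/chat/completions"

def build_request(model, prompt, max_tokens=512):
    headers = {
        "Authorization": "Bearer " + os.environ.get("GITHUB_TOKEN", ""),
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return headers, json.dumps(payload)

headers, body = build_request("openai/gpt-4.1-mini", "long-context probe text")
print(json.loads(body)["model"])
```

&lt;p&gt;POST the body to the endpoint with any HTTP client; per the findings above, expect a 4k input cap on some models and an unavailable_model error for gpt-5 on the free tier.&lt;/p&gt;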

&lt;p&gt;A sibling repo applies the same constraints to browser RPA (a 5-layer defense-in-depth journey, with 5 honest failures documented in JSON): &lt;a href="https://github.com/leagames0221-sys/browser-agent-demo" rel="noopener noreferrer"&gt;https://github.com/leagames0221-sys/browser-agent-demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cross-repo thesis is "constraint-optimized AI engineering": map the feasibility frontier under hard constraints (no credit card, a consumer laptop, public OSS only, drift CI enforced) and publish both the working zone AND the boundary. Happy to answer questions about the methodology or specific runner code.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
