<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: tomohiro takada</title>
    <description>The latest articles on Forem by tomohiro takada (@leagames0221sys).</description>
    <link>https://forem.com/leagames0221sys</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3925651%2Fc74f5c5f-21e5-41d2-910a-3587f3cc85a1.png</url>
      <title>Forem: tomohiro takada</title>
      <link>https://forem.com/leagames0221sys</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/leagames0221sys"/>
    <language>en</language>
    <item>
      <title>Counterintuitive: WSL2 + vllm cannot fit Qwen2.5-7B-1M on 6GB VRAM where Windows transformers can</title>
      <dc:creator>tomohiro takada</dc:creator>
      <pubDate>Mon, 11 May 2026 18:49:24 +0000</pubDate>
      <link>https://forem.com/leagames0221sys/counterintuitive-wsl2-vllm-cannot-fit-qwen25-7b-1m-on-6gb-vram-where-windows-transformers-can-597b</link>
      <guid>https://forem.com/leagames0221sys/counterintuitive-wsl2-vllm-cannot-fit-qwen25-7b-1m-on-6gb-vram-where-windows-transformers-can-597b</guid>
      <description>&lt;p&gt;TL;DR: I tried to run Qwen2.5-7B-Instruct-1M on a consumer laptop (RTX 3050 Laptop, 6GB VRAM) and mapped the exact feasibility frontier. All evidence is committed as JSON and enforced by drift CI. Three honest findings:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;4k context is the hard ceiling&lt;/strong&gt; on Windows with transformers + bitsandbytes int4 NF4. 5k, 6k, and 8k all OOM at the first attention forward pass. The 4k cell passes only because Windows' WDDM driver overcommits VRAM, spilling allocations to system RAM over PCIe at roughly a 10x latency tax: peak measured usage was 10.8GB on a 6GB GPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WSL2 + vllm cannot even fit the model.&lt;/strong&gt; vllm 0.7.3's memory profiler logs, verbatim: "model weights take 5.43GiB; PyTorch activation peak memory takes 1.42GiB; the rest of the memory reserved for KV Cache is &lt;strong&gt;-0.94GiB&lt;/strong&gt;". Zero GPU cache blocks allocated, 0.00x concurrency at 4200 tokens. The Linux NVIDIA driver offers no equivalent shared-memory fallback, so vllm sees only the physical 6GB and refuses to start. The conventional wisdom that "vllm beats transformers on memory efficiency" is disproven at this hardware tier: vllm fails harder, because the enabler was the Windows OS, not the inference engine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud free tier is also capped, and unevenly.&lt;/strong&gt; On the GitHub Models free tier (no credit card, gh OAuth only): gpt-4.1-mini PASS @ 4k in 8.54s (~30x faster than local); llama-3.3-70b-instruct PASS @ 4k in 5.17s. But &lt;strong&gt;gpt-5 returns &lt;code&gt;unavailable_model&lt;/code&gt; at any context size&lt;/strong&gt; on the free tier, DeepSeek-V3 and gpt-5 are capped at 4,000 input tokens, and Anthropic Claude is &lt;strong&gt;not in the GitHub Models catalog at all&lt;/strong&gt;: no credit card + Claude = no path.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
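
&lt;p&gt;A back-of-the-envelope memory model explains why the cliff sits near 4k. Here is a sketch with assumed Qwen2.5-7B shapes (28 layers, 28 query heads, 4 GQA key/value heads, head dim 128, ~7.6B params); these come from the public model config, not from the repo's evidence cells:&lt;/p&gt;

```python
# Rough memory model for int4 Qwen2.5-7B inference on a 6 GiB GPU.
# Shapes below are assumptions from the public Qwen2.5-7B config.
GIB = 2 ** 30
N_PARAMS = 7.6e9     # total parameters
N_LAYERS = 28
N_HEADS = 28         # query heads
N_KV_HEADS = 4       # GQA key/value heads
HEAD_DIM = 128

def weights_gib_int4():
    # NF4 packs roughly 0.5 byte per param (ignores fp16 embeddings
    # and quantization metadata, so this is a floor, not a ceiling)
    return N_PARAMS * 0.5 / GIB

def kv_cache_gib(seq_len, bytes_per_el=2):
    # K and V tensors, per layer, per KV head, fp16
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * seq_len * bytes_per_el / GIB

def eager_attn_scores_gib(seq_len, bytes_per_el=2):
    # one layer's transient [heads, seq, seq] score matrix in eager attention
    return N_HEADS * seq_len * seq_len * bytes_per_el / GIB

for seq in (4096, 8192):
    print(seq,
          round(weights_gib_int4(), 2),
          round(kv_cache_gib(seq), 2),
          round(eager_attn_scores_gib(seq), 2))
```

&lt;p&gt;Weights (~3.5 GiB) and KV cache (~0.22 GiB at 4k) fit, but the eager-attention score tensor is quadratic in sequence length: ~0.88 GiB per layer at 4k and 3.5 GiB at 8k, which is consistent with OOM striking at the first attention forward pass. If the run used SDPA or FlashAttention the transient is smaller, so treat this as an upper-bound sketch, not a reconstruction of the actual allocator trace.&lt;/p&gt;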

&lt;p&gt;Full numbers + 11 JSON evidence cells + 3 ADRs at: &lt;a href="https://github.com/leagames0221-sys/longctx-bench-honest" rel="noopener noreferrer"&gt;https://github.com/leagames0221-sys/longctx-bench-honest&lt;/a&gt;&lt;/p&gt;
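
&lt;p&gt;Finding 2 falls straight out of vllm's startup budgeting: KV-cache space is whatever remains of the utilization-capped VRAM budget after weights and peak activations. A minimal sketch using the logged figures; the 0.90 fraction is vllm's documented default for gpu_memory_utilization and is an assumption here, since the actual run's setting is not in the excerpt:&lt;/p&gt;

```python
# vllm-style budget: kv_cache = util * total_vram - weights - activation_peak
def kv_cache_budget_gib(total_gib, weights_gib, activation_gib, util=0.90):
    return util * total_gib - weights_gib - activation_gib

# Figures from the logged profile: 5.43 GiB weights, 1.42 GiB activations
budget = kv_cache_budget_gib(6.0, 5.43, 1.42)
print(round(budget, 2))  # prints -1.45
```

&lt;p&gt;Weights plus activations alone (6.85 GiB) already exceed the physical 6 GiB, so the budget is negative under any utilization fraction at or below 1.0; the exact -0.94GiB in the log just implies a slightly different effective budget than this sketch's 0.90 x 6.0. Either way vllm allocates zero cache blocks and refuses to serve.&lt;/p&gt;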

&lt;p&gt;Hardware: RTX 3050 Laptop 6GB / driver 560.94 / CUDA 12.6 / Windows 11 + WSL2 Ubuntu 24.04. Software: torch 2.5.1+cu124, transformers (5.8.0 Win / 4.48.3 WSL), bitsandbytes 0.49.2, vllm 0.7.3. Everything fully reproducible — uv.lock committed, runners under examples/.&lt;/p&gt;
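
&lt;p&gt;For the cloud cells, GitHub Models speaks the OpenAI chat-completions wire format against a token-authenticated endpoint. A hedged sketch that only builds the request; the endpoint URL, the publisher-prefixed model id, and the header names are assumptions from GitHub's public docs, not copied from the repo's runners:&lt;/p&gt;

```python
import json
import os

# Assumed GitHub Models inference endpoint; check the current docs.
# Auth is a plain GitHub token via OAuth, no credit card involved.
ENDPOINT = "https://models.github.ai/inference/chat/completions"

def build_request(model, prompt, max_tokens=512):
    headers = {
        "Authorization": "Bearer " + os.environ.get("GITHUB_TOKEN", ""),
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return headers, json.dumps(payload)

headers, body = build_request("openai/gpt-4.1-mini", "long-context probe text")
print(json.loads(body)["model"])
```

&lt;p&gt;POST the body to the endpoint with any HTTP client; per the findings above, expect a 4k input cap on some models and an unavailable_model error for gpt-5 on the free tier.&lt;/p&gt;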

&lt;p&gt;A sibling repo applies the same constraints to browser RPA (a 5-layer defense-in-depth journey, with 5 honest failures documented in JSON): &lt;a href="https://github.com/leagames0221-sys/browser-agent-demo" rel="noopener noreferrer"&gt;https://github.com/leagames0221-sys/browser-agent-demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cross-repo thesis is "constraint-optimized AI engineering": map the feasibility frontier under hard constraints (no credit card, a consumer laptop, public OSS only, drift CI enforced) and publish both the working zone AND the boundary. Happy to answer questions about the methodology or specific runner code.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
