Forem: plasmon

Why Local LLM JSON Output Breaks — Failure Patterns and How to Fix Them in Code

plasmon — Thu, 23 Apr 2026 23:38:38 +0000

API Gets One Line. Local Gets a Minefield.

OpenAI has response_format={"type": "json_object"}. Claude has equivalent. Set it, and output is guaranteed JSON. Parse errors don't happen.

Local LLMs don't have this. llama.cpp offers --grammar to constrain output to valid JSON syntax, but that only forces the format to be JSON. Whether the content makes sense is a completely different problem.

// API output: as intended
{"name": "Qwen2.5-14B", "speed_tps": 31.5, "vram_gb": 7.3}

// Local LLM (grammar enabled): valid JSON, broken content
{"name": "Qwen2.5-14B", "speed_tps": "fast", "vram_gb": "enough"}
// → Types are wrong. Numbers became strings.

This "format is correct but content is broken" problem gets worse with smaller models. On RTX 4060 8GB, the model size constraint (7B-14B) directly impacts JSON output reliability.

The 3 Failure Patterns

Pattern 1: Failed — JSON Itself Is Broken

// Typical: explanation text wraps around the JSON
Here is the JSON output:
{"name": "Qwen2.5-14B", ...}
I hope this helps!

// → json.loads() parse error

When it happens: Frequent with 7B models without grammar. Enabling --grammar eliminates this completely. 14B+ rarely has this issue even without grammar.

Pattern 2: Broken — Valid JSON, Wrong Content

// Expected: {"speed_tps": 31.5, "vram_gb": 7.3}
// Actual:   {"speed_tps": "fast", "vram_gb": "7.3GB"}
// → Wrong types. Strings where numbers should be. Units leak into values.

When it happens: Frequent with 7B, sporadic with 14B. Including a JSON Schema in the prompt dramatically improves this. Grammar alone can't prevent it — the format is valid, just the values are wrong.

Pattern 3: Nested Structure Collapse

// Expected: {"items": [{"a": 1}, {"a": 2}]}
// Actual:   {"items": [{"a": 1}, {"b": 2}]}  // field name changed
// Or:       {"items": [{"a": 1}, 2]}          // type collapsed mid-array

When it happens: The nastiest pattern. When generating multiple objects inside an array, the first object is correct but subsequent ones drift — field names change, types collapse. This happens even with larger models. The best approach is to not ask the model to generate nested structures at all (see two-stage generation below).

Grammar: Necessary but Not Sufficient

llama.cpp's --grammar guarantees Pattern 1 (parse errors) goes to zero. But it can't prevent Pattern 2 or 3. Grammar constrains the token sequence format, not semantic correctness.

Grammar is a prerequisite, not a solution.

3 Fixes That Actually Work

Fix 1: Explicit Schema in Prompt

schema = {
    "type": "object",
    "properties": {
        "model": {"type": "string"},
        "speed_tps": {"type": "number"},
        "vram_gb": {"type": "number"}
    },
    "required": ["model", "speed_tps", "vram_gb"]
}

prompt = f"""Output JSON following this schema:
{json.dumps(schema, indent=2)}

Input: {input_text}"""

Giving the model the exact structure upfront dramatically improves field name consistency and type correctness. This works because the schema is in the context when the model generates each token.

Fix 2: Grammar + Retry

def reliable_json(prompt: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
            if validate_schema(parsed):
                return parsed
        except (json.JSONDecodeError, ValidationError):
            continue
    raise RuntimeError(f"JSON generation failed ({max_retries} attempts)")

Allowing 3 retries massively improves effective success rate. The cost is up to 3x latency — a reasonable tradeoff for pipelines where reliability matters. Measure how many retries YOUR model needs on YOUR tasks.

Fix 3: Two-Stage Generation (for Nested Structures)

Don't ask the model to build nested JSON in one shot. Generate flat JSON twice and merge.

# Step 1: Extract metadata
meta = call_llm('Output model name only: {"model": "..."}')

# Step 2: Extract array separately  
items = call_llm('Output each quantization as JSON array: [{"method": ..., "size_gb": ..., "speed_tps": ...}]')

# Step 3: Merge in code
result = {**json.loads(meta), "quantizations": json.loads(items)}

Two flat generations merged in code is dramatically more stable than one nested generation. For 7B models needing nested output, this is effectively the only practical option.

Model Size Decision Guide

[JSON Output System Design Guide]

High reliability (payments, medical):  32B + grammar + retry
                                       → Doesn't fit 8GB. Use an API.

Standard (RAG, analysis):              14B + grammar + schema + retry
                                       → Optimal for RTX 4060 8GB

Lightweight (log extraction, classification): 7B + grammar + two-stage
                                       → Practical if you stick to flat JSON

Nested structures required:            14B+ with two-stage generation
                                       → 7B can't do this reliably

Specific success rates depend on YOUR environment. Copy the test code above, run it with YOUR model and YOUR tasks. Those numbers are your real reliability. Don't trust anyone else's benchmarks — including this article.

References

llama.cpp grammar documentation: https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md
llama.cpp server API: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

INT8 Hits 58x, Voltage Underscaling Saves 36% — Semiconductor Physics Limits Are Being Bypassed by Software in 2026

plasmon — Thu, 23 Apr 2026 01:42:24 +0000

Why This Matters Now

Last week, Tesla started recruiting engineers in Taiwan for "Terafab" — Elon Musk's vision for an in-house AI semiconductor fab. Around the same time, IBM Japan announced development of a 2nm neuromorphic accelerator led from Japan.

Read these headlines individually and they're just more semiconductor noise. But overlay them with three ArXiv papers published this month, and a very different picture emerges.

2026's semiconductor industry is quietly shifting from "push physics harder" to "bypass physics with software."

This is my reading, but the evidence isn't thin.

Paper 1: INT8 Achieves 58x — DEEP-GAP Measures Where GPU Inference Actually Stands

DEEP-GAP: Deep-learning Evaluation of Execution Parallelism (arXiv:2604.14552) systematically benchmarks datacenter inference accelerator performance.

Key findings:

"Results show that reduced precision significantly improves performance, with INT8 achieving up to 58x throughput improvement over CPU baselines. L4 achieves up to 4.4x higher throughput than T4 while reaching peak efficiency at smaller batch sizes between 16 and 32."

Comparison	Factor	Metric
INT8 vs CPU baseline	up to 58x	Throughput
NVIDIA L4 vs T4	up to 4.4x	Throughput
L4 peak efficiency batch size	16-32	Latency-throughput tradeoff

58x is provocative, but note this compares FP32 CPU inference against INT8 GPU inference. Still, the implication is massive.

One generation of process node advancement yields maybe 20-30% performance improvement. Simply reducing precision (quantizing) delivers 58x. The optimization direction for hardware design is clearly "precision hierarchy" — deciding which computations need which precision, dynamically.

Paper 2: Run It Broken on Purpose — DRIFT's Fault-Tolerant Inference

DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Resilient Inference (arXiv:2604.09073) takes a contrarian approach.

"DRIFT can achieve on average 36% energy savings through voltage underscaling or 1.7x speedup via overclocking while maintaining generation quality."

"Voltage underscaling" means running chips below rated voltage. Normally this introduces memory errors and computational mistakes — fatal for numerical computing. But generative AI models tolerate a degree of bit errors without degrading output quality. DRIFT exploits this "soft fault tolerance" to intentionally lower voltage and save energy.

The reverse works too: overclocking for 1.7x speedup "while maintaining quality."

This is fundamental. The hardware's imperfections are being absorbed by the AI model's inherent tolerance. The design philosophy has flipped from "make hardware perfect" to "make software tolerant of imperfect hardware."

My prediction: this class of error-tolerant design becomes mainstream for NPU and edge AI chips by 2027-2028.

Paper 3: Spiking Neural at 4.2mW — L-SPINE Shows Another Direction

L-SPINE: A Low-Precision SIMD Spiking Neural Compute Engine (arXiv:2604.03626), implemented on AMD VC707 FPGA:

Critical delay: 0.39 ns
Power: 4.2 mW (neuron-level)
System total: 0.54 W, latency 2.38 ms

Compare 0.54W to RTX 4060's ~150W inference consumption. Two orders of magnitude less. "Different use case" is the correct objection — but that's exactly the point.

SNNs compute only on spike events. Idle time costs near-zero power. This is devastatingly efficient for sparse sensor inputs: drone LiDAR, factory vibration sensors, medical wearables. Using GPUs for these tasks is absurd overkill.

I expect the first mass-produced SNN chips for robotics/industrial edge sensor fusion by 2027-2028. L-SPINE being on FPGA means the prototyping phase is active now.

Measuring It on Real Hardware: RTX 4060 vs M4

Here's what I observe on my setup (Ryzen 7 7845HS + RTX 4060 + Windows / Apple M4):

RTX 4060: Running Qwen on llama.cpp, monitoring with nvidia-smi:

nvidia-smi --query-gpu=timestamp,name,temperature.gpu,power.draw,clocks.gr,clocks.mem \
  --format=csv,noheader,nounits \
  --loop=1

Thermal throttling visibly drops clock speeds. "Process node miniaturization limits" are already firing daily on laptop GPUs.

M4 comparison: Same workload, and the fan doesn't spin. In sustained workloads where RTX 4060 throttles, M4 maintains equivalent tokens/sec silently. Sustained performance, not peak performance, is the real metric — exactly what DEEP-GAP's "peak efficiency at batch 16-32" is saying from a different angle.

2026-2030: My Predictions (All Personal Analysis)

Prediction 1: "Precision Hierarchy" Becomes the Next Design Axis

CPUs, GPUs, and NPUs will all dynamically control which operations run at which precision. The era of universal FP32 is over.

Prediction 2: DRIFT's "Tolerate Broken State" Design Goes Mainstream

Semiconductor design orthodoxy was "prevent errors." DRIFT reverses this to "design models that maintain quality despite errors." Impact on production chip power design: 2028-2029 at earliest.

Prediction 3: Terafab Symbolizes "Vertical Integration" as Industry Trend

Tesla's Terafab and OpenAI's semiconductor investments ($20B+ reported) are driven by wanting to co-design models and hardware. Apple Silicon demonstrated the performance/W advantage of co-design.

Prediction 4: SNN Reaches Production in Robotics/Sensor Fusion by 2028

Prediction 5: NPU Architecture Becomes the Next Differentiator

Intel Core Ultra NPU vs Qualcomm Hexagon have fundamentally different design philosophies. By 2027, "which NPU" determines what AI apps can run — creating Android-style fragmentation chaos.

What 8GB Users Should Watch

Every approach described above is a bypass, not a breakthrough. DEEP-GAP bypasses through precision reduction. DRIFT bypasses through error tolerance. L-SPINE bypasses through architecture change. Terafab bypasses through vertical integration.

The physics wall is real. But the ways around it are multiplying.

If I had to summarize 2026's architecture trend in one phrase: "intelligence that tolerates imperfection." The obsession with perfect precision, perfect error tolerance, perfect yield — these become constraints, not goals, at the next design frontier.

References

I Ran an LLM Agent on 8GB VRAM — It Broke After 5 Tool Calls

plasmon — Tue, 21 Apr 2026 05:03:07 +0000

The "Long-Term Memory" Agent Is a Fantasy on 8GB

2026's LLMs are expected to run as agents by default. Call tools, receive results, decide next action, call again. Claude Code, Cursor, Devin — all built on "long-running loop" strategies.

This strategy physically cannot work on 8GB local VRAM.

I tested a llama.cpp-based tool-calling agent with RTX 4060 Laptop (8GB) + Qwen2.5-7B Q4_K_M. The result is simple: beyond ~5 tool calls, response quality visibly degrades. Past 10 calls, the model starts ignoring results from tools it just called.

This article breaks down why this happens from KV cache and Context Rot perspectives, then examines 3 viable workarounds for 8GB.

How Much KV Cache Does Each Tool Call Eat?

Consider the token cost of one tool call cycle:

One tool call cycle:
  System prompt              : ~500 tok (fixed)
  User instruction           : ~200 tok (fixed)
  Conversation history       : variable (accumulates)
  Tool definitions (schemas) : ~300 tok × number of tools
  LLM response (tool_call)   : ~100 tok
  Tool execution result      : ~500-2000 tok (variable)

With 5 tools defined and average 800 tokens per result, KV cache accumulation per step:

Step	Cumulative Tokens	KV Cache (fp16)	VRAM Remaining (7B Q4_K_M)
0 (initial)	~2,200	0.12 GB	2.60 GB
3	~4,900	0.26 GB	2.46 GB
5	~6,700	0.36 GB	2.36 GB
10	~11,200	0.60 GB	2.12 GB
20	~20,200	1.08 GB	1.64 GB
30	~29,200	1.56 GB	1.16 GB

def agent_vram_estimate(steps, tokens_per_step=900, base_tokens=2200,
                        model_gb=4.68, overhead_gb=0.6,
                        n_layers=28, n_kv_heads=4, head_dim=128, dtype_bytes=2):
    """Estimate VRAM consumption for agent loop"""
    total_tokens = base_tokens + steps * tokens_per_step
    kv_gb = (2 * n_layers * total_tokens * n_kv_heads * head_dim * dtype_bytes) / (1024**3)
    total_gb = model_gb + kv_gb + overhead_gb
    return {
        "steps": steps,
        "tokens": total_tokens,
        "kv_cache_gb": round(kv_gb, 2),
        "total_vram_gb": round(total_gb, 2),
        "remaining_gb": round(8.0 - total_gb, 2)
    }

for s in [0, 5, 10, 20, 30, 50]:
    r = agent_vram_estimate(s)
    print(f"Step {s:2d}: {r['tokens']:,} tok, KV={r['kv_cache_gb']}GB, remaining={r['remaining_gb']}GB")

At ~30 steps, remaining VRAM drops below 1GB. At 50 steps, OOM is theoretically visible. Q4 KV cache quantization (--cache-type-k q4_0 --cache-type-v q4_0) compresses by ~3.5x, but even then, 100+ step loops are unrealistic.

But a more serious problem hits before OOM.

Context Rot — Long Context Kills Quality

Even when everything fits in VRAM, response quality collapses as context grows. This is known as "Context Rot."

Chroma Research reports that LLM information reproduction accuracy decreases inversely with token count. Degradation is especially pronounced in "intermediate result accumulation" patterns — exactly what agents do.

Microsoft and Salesforce's joint research "LLMs Get Lost In Multi-Turn Conversation" (arXiv:2505.06120) provides specific numbers. Converting benchmark prompts into multi-turn conversations (agent-workflow-like), they report average 39% performance drop across 6 generative tasks. Even reasoning-specialized models like o3 and DeepSeek-R1 weren't immune.

With 7B models, degradation starts earlier. What I observed with Qwen2.5-7B:

Steps 3-5: Normal operation. Accurately references tool results, selects appropriate next action
Steps 5-8: Begins forgetting initial instructions. Redundantly re-calls the same tools
Steps 8-10: Ignores recent tool results. Hallucination rate climbs
Steps 10+: Loses conversational direction. Tool calls become unrelated to the objective

This is the same structure as "Lost in the Middle" (Liu et al., TACL 2024). In agent scenarios, tool results from steps 3-4 get pushed to the "middle," and only the system prompt (beginning) and latest results (end) get referenced.

Do Larger Models Solve This?

Important counter-evidence:

GPT-4.1 showed no degradation in tool-heavy conversations. Parloa's testing confirms large models maintain stable performance in long conversations.

MemAgent extrapolates from 8K context to 3.5M token tasks with under 10% performance loss (OpenReview). RLM (Recursive Language Model) maintains 91.33% accuracy across 1000 documents and 10M+ tokens.

However, these all involve large models with tens to hundreds of GB of memory, or cloud inference.

For 7B models running on 8GB VRAM:

The context window itself is physically limited (as shown above)
Fewer Attention heads means weaker long-range dependency retention
GQA (Grouped Query Attention) saves KV cache, but doesn't improve the model's actual "memory capacity"

"The problem is mitigated with sufficient model size" is true. "On 8GB, you must engineer around it" is equally true.

Workaround 1: Short Loops × Context Reset

The simplest and most effective approach. Cut the agent loop short and reset context at each loop boundary.

def short_loop_agent(task: str, tools: list, max_steps_per_loop: int = 5):
    """Short-loop × reset strategy agent"""
    memory = []  # Only carry summaries between loops

    while not is_task_complete(memory):
        # Rebuild context with minimum necessary info
        context = build_context(
            system_prompt=SYSTEM_PROMPT,
            task=task,
            memory_summary=summarize(memory[-3:]),  # Only last 3 summaries
            tools=tools
        )

        # Execute short loop
        for step in range(max_steps_per_loop):
            response = llm.generate(context)
            if response.tool_call:
                result = execute_tool(response.tool_call)
                context.append(result)
                memory.append({
                    "step": step,
                    "tool": response.tool_call,
                    "result_summary": summarize_result(result)
                })
            else:
                break

        # End of loop: reset context, carry only summaries
        # KV cache is freed

The key is memory_summary. What passes between loops isn't raw tool results — it's summaries. This prevents KV cache accumulation while retaining necessary information.

5 steps × 6 loops = 30-step equivalent task, processed at ~6,700 tokens per loop (0.36GB KV cache). Compared to 1.56GB for running 30 steps straight, VRAM consumption is less than a quarter.

Workaround 2: Persistent Q4 KV Cache

arXiv:2603.04428 "Agent Memory Below the Prompt" (2026) proposes persisting agent KV cache to disk with Q4 quantization, loading directly into Attention layers when needed.

Validated on Apple M4 Pro:

FP16 KV cache budget of 10.2GB holds only 3 agent contexts
Q4 quantization fits 4x more agent contexts in the same memory
TTFT improvement from cache restoration: up to 136x (22–136x for Gemma, 11–76x for DeepSeek)

The core insight: "avoid recomputation." Normally, restoring context requires recalculating prefill for all tokens. Persistent KV Cache skips this entirely by loading pre-saved KV states directly.

The paper validated on M4 Pro, but the principle applies equally to RTX 4060. llama.cpp has experimental KV cache save/restore APIs (--save-state, --load-state). Saving per-agent KV snapshots on NVMe SSD and loading on task switch avoids prefill recomputation. On 8GB — where you can only hold one agent context at a time — this "swap" strategy's benefit is even larger than on M4 Pro.

Workaround 3: Dynamic Tool Selection (Tool Loadout)

More tool definitions means worse selection accuracy. Berkeley's function-calling leaderboard confirms that as tools increase, description overlap makes correct selection harder. Empirically, 5–10 tools is the practical ceiling for 7B models. Tool definitions themselves consume context and pressure KV cache.

Solution: "don't define all tools at all times."

def dynamic_tool_selection(query: str, all_tools: list, max_tools: int = 5):
    """Dynamically select tools based on query"""
    # Lightweight classifier determines query category
    category = classify_query(query)  # "search", "code", "data", etc.

    # Select tool subset based on category
    tool_groups = {
        "search": ["web_search", "file_search", "grep"],
        "code": ["run_python", "read_file", "write_file"],
        "data": ["sql_query", "csv_parse", "chart_generate"],
    }

    selected = tool_groups.get(category, all_tools[:max_tools])
    return selected

Loading all 20 tool definitions costs ~6,000 tokens. Narrowing to 5 tools: ~1,500 tokens. The 4,500-token difference saves 0.04GB per step in Q4 KV cache. Looks small, but over 30 steps this accumulates to 1.2GB+ difference.

8GB Agent Design Principles

Combining all three workarounds:

Principle 1: Loops Stay Under 5 Steps

7B models maintain context quality up to ~6,000–8,000 tokens. At ~900 tokens per tool call, 5 steps is the limit.

Principle 2: Memory Carries as "Summaries"

Never leave raw tool results in context. Summarize at each loop boundary. Next loop only sees summaries.

Principle 3: Maximum 5 Tool Definitions

Dynamic tool selection loads only what's needed per step. "Universal agents" don't work on 8GB.

Principle 4: Monitor "Context Quality"

Track tool call "hit rate" (whether called tools matched the objective). When it drops, reset the loop. Use as automatic reset trigger.

The 8GB Constraint Improves Agent Design

As I wrote in the 128K context article — the 8GB constraint isn't a handicap. It's a design forcing function.

Cloud-scale models can brute-force 100-step agent loops. But as Microsoft and Salesforce's research shows, being able to run it and maintaining quality are separate problems. Even o3 degrades by 39%.

The 8GB constraint doesn't hide the fact that "quality drops at 5 steps." That's precisely why it leads to "fundamentally correct design" — short loops, summary carry-over, dynamic tool selection. These design principles apply directly to cloud environments too — and arguably should be applied there.

What determines agent performance isn't context length. It's context quality.

References

"Context Rot: How Increasing Input Tokens Impacts LLM Performance" (Chroma Research): https://research.trychroma.com/context-rot
"Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., TACL 2024): https://arxiv.org/abs/2307.03172
arXiv:2603.04428 — Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices
"LLMs Get Lost In Multi-Turn Conversation" (Microsoft Research & Salesforce, arXiv:2505.06120): https://arxiv.org/abs/2505.06120
Parloa Labs — Long Conversations and LLM Performance: https://www.parloa.com/labs/insights/long-calls-LLM-performance/
Berkeley Function-Calling Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html
llama.cpp: https://github.com/ggerganov/llama.cpp

The Memory Wall Can't Be Killed — 3 Papers Proving Every Architecture Hits It

plasmon — Tue, 21 Apr 2026 02:54:11 +0000

I Tested 3 "Escape Routes" from the Memory Wall

The GPU memory wall — computation sitting idle because memory bandwidth can't keep up — is something anyone who's run a local LLM knows viscerally. With 8GB on an RTX 4060, both model size and context length are memory-bound.

Neuromorphic chips. Edge NPUs. Processing-in-Memory. These architectures all wave the banner of "escaping the von Neumann bottleneck." The memory wall is a GPU-specific problem, they say. Change the architecture, solve the problem — or at least, that's been the promise.

In April 2026, three papers put this promise to the test. The conclusion: the wall is still there.

Test 1: Neuromorphic's "New Wall"

Yousefzadeh et al.'s "Memory Wall is not gone" (arXiv:2604.08774) is blunt from the title.

Neuromorphic chip design philosophy centers on "distributed memory." Each neuron core has local SRAM, synaptic weights sit right next to the core. The distance between compute and memory approaches zero — structurally bypassing the GPU memory wall (the bandwidth gap between DRAM and compute units).

The paper identifies the cost of this bypass:

"On-chip memory systems (SRAM and STT-MRAM variants) have become the primary consumers of area and energy, forming a new memory wall."

In distributed architectures, you need SRAM proportional to neurons × synapses. By bringing computation close to memory, the chip area fills up with memory. SRAM requires constant power, leaking energy even during idle periods with no spikes.

Replacing SRAM with non-volatile STT-MRAM (Spin-Transfer Torque MRAM) is being explored, but write energy is high and endurance is limited. Change the memory technology, and the structure remains: "memory area and energy are the bottleneck."

On GPUs, bandwidth was the bottleneck. On neuromorphic chips, area and leakage current are the bottleneck. The wall just changed shape.

Test 2: KV Cache Dominates Even on Edge NPUs

SHIELD (arXiv:2604.07396) targets LLM inference on Edge NPUs. The paper's opening states the problem plainly: "LLM inference on Edge NPUs is fundamentally constrained by limited on-chip memory capacity."

Edge NPUs are designed to maximize memory efficiency for inference. But LLM inference requires a KV cache — the memory region that holds past Keys and Values for Attention computation. This grows linearly with context length, pressuring memory.

SHIELD focuses on the refresh energy of eDRAM (embedded DRAM) that holds the KV cache. DRAM stores data as charge in capacitors, requiring periodic refresh (recharging).

BF16 (bfloat16) bit fields:
  Sign (1 bit) + Exponent (8 bits) = determines magnitude
  Mantissa (7 bits) = determines precision

SHIELD's approach:
  KV cache (persistent): relax mantissa refresh
  Query/Attention output (transient): skip mantissa refresh entirely
  Sign + exponent: always full refresh (critical for correctness)

By separating refresh strategies based on data "lifetime" and "bit sensitivity," SHIELD achieves 35% eDRAM refresh energy reduction. Accuracy is maintained on WikiText-2, PIQA, and ARC-Easy.

SHIELD is simultaneously a "solution" and "evidence of the problem." When a dedicated NPU paper makes "memory refresh energy" its optimization target, it proves that even inference-specialized chips are memory-bottlenecked.

Test 3: GQA Only Shrinks the Wall to One-Third

TRAPTI (arXiv:2604.06955, IJCNN 2026) analyzes on-chip memory occupancy over time for embedded Transformer inference.

Comparing GPT-2 XL (MHA: Multi-Head Attention) and DeepSeek-R1-Distill-Qwen-1.5B (GQA: Grouped-Query Attention) on the same accelerator configuration: GQA-based DeepSeek uses 2.72x less peak on-chip memory.

GQA compresses KV cache size by reducing the number of Key/Value heads. A 2.72x reduction is certainly significant. But flip this number around: "even with GQA — the latest compression technique — KV cache remains the largest on-chip memory consumer."

The paper states clearly: "Performance and efficiency are increasingly dominated by the KV cache."

GQA, MQA (Multi-Query Attention), quantized KV cache — techniques for narrowing the bandwidth gap keep evolving. But they all "thin the wall." None of them "erase the wall." The structure where context length dominates memory through KV cache remains unchanged as long as attention mechanisms are used.

Wall Morphology Map: Architecture Changes, Wall Persists

Organizing the three papers alongside existing architectures reveals the complete picture of memory bottlenecks:

Architecture	Wall Form	What's Bottlenecked	2026 Countermeasure
GPU	Memory bandwidth	DRAM ⇔ compute data transfer	HBM, GDDR7, cache hierarchy
Neuromorphic	Memory area/leakage	SRAM dominates chip area and energy	STT-MRAM replacement (issues remain)
Edge NPU	Memory refresh	eDRAM KV cache maintenance cost	SHIELD: lifecycle-based refresh
Embedded Transformer	Memory occupancy	KV cache on-chip footprint	GQA, power gating
PIM	Compute precision/flexibility	Analog compute SNR limits	Mixed precision, digital PIM

Look at the "Wall Form" column. Bandwidth, area, refresh energy, occupancy, compute precision — all different. But every single one is "a bottleneck originating from memory."

Change the architecture, and the wall changes shape. But the wall itself never disappears.

Only Optical Computing Has a Different Underlying Principle

Every architecture above assumes electronic data transfer. Moving electrons requires energy, and wires have RC delay.

Optical computing changes this premise. Photons have no mass, no resistance, and propagation costs nearly zero energy. PRISM (arXiv:2603.21576) reduced KV cache block selection from O(n) to O(1) because optical similarity computation doesn't depend on context length.

Photonics research in 2026 is also advancing steadily:

Non-volatile photonics (arXiv:2604.08637): Nanostructured Sb₂Se₃ phase-change material achieving 94% insertion loss suppression and 100M+ write cycle endurance. "Storing data with light" is approaching practicality.
Photonic KAN (arXiv:2604.08432): Optical neural networks built from standard telecom components (MZI, SOA, VOA). 4 modules achieve 98.4% accuracy on nonlinear classification. Optical AI without custom chips.

But light has walls too. Nonlinear operations require electro-optical conversion, and photons can't stand still — "memory" needs a material mechanism. Light can fundamentally bypass the "transfer wall" but cannot escape the "storage wall."

The Wall Transforms but Never Dies

"Memory wall" was coined by Wulf and McKee in 1995, originally referring to the widening speed gap between processors and DRAM. Thirty years later, the definition itself has expanded.

The 2026 reality: constraints manifest not just as bandwidth, but as area, refresh energy, occupancy, and compute precision — different forms for different architectures. What all three papers consistently show is that no architecture escapes "memory-originated bottlenecks."

The wall couldn't be killed. But its anatomy is becoming visible. Understanding which form the wall takes in each architecture reveals the optimal countermeasure. SHIELD's lifecycle-based refresh, TRAPTI's temporal memory analysis, GQA's KV cache compression — not erasing the wall, but using tools shaped to fit it. That's the most realistic approach as of 2026.

References

"Memory Wall is not gone: A Critical Outlook on Memory Architecture in Digital Neuromorphic Computing" (Yousefzadeh et al., arXiv:2604.08774)
"SHIELD: A Segmented Hierarchical Memory Architecture for Energy-Efficient LLM Inference on Edge NPUs" (Zhang & Fong, arXiv:2604.07396)
"TRAPTI: Time-Resolved Analysis for SRAM Banking and Power Gating Optimization in Embedded Transformer Inference" (Klhufek et al., arXiv:2604.06955, IJCNN 2026)
"PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection" (arXiv:2603.21576)
"Increased endurance of nonvolatile photonics enabled by nanostructured phase-change materials" (arXiv:2604.08637)
"Small-scale photonic Kolmogorov-Arnold networks using standard telecom nonlinear modules" (arXiv:2604.08432)
"Hitting the Memory Wall: Implications of the Obvious" (Wulf & McKee, ACM SIGARCH, 1995)

The Physics Wall in 2026: 3 Papers That Show Why Node Shrinks Won't Save Us

plasmon — Tue, 21 Apr 2026 02:52:27 +0000

"2nm Will Fix Everything" Is a Fantasy — Let's Drop It

From late 2024 through 2025, semiconductor press releases have been drowning in buzzwords: "2nm," "3nm," "Gate-All-Around," "CFET." Reading them makes you feel like GPUs will be 10x faster in a few years.

They won't.

More precisely: "simple die shrinks no longer guarantee linear performance or power efficiency gains from transistor density improvements." This isn't my opinion — it's what multiple ArXiv papers from 2025–2026 consistently demonstrate.

I have a Ryzen 7 7845HS + RTX 4060 and an Apple M4 sitting on my desk, connected via a KVM switch. Running local LLM inference benchmarks on both, I've noticed something: the gap between spec sheet numbers and real-world performance per watt is widening with each generation.

This article dissects 3 recent papers, measures where the "physics wall" stands today, and offers my predictions toward 2030. Predictions are personal analysis — not fact.

Real Hardware Tells the Story: RTX 4060 vs M4 Power Efficiency

First, look at these numbers. Measured on my setup running Qwen2.5-7B-Instruct (Q4_K_M) with llama.cpp:

Metric	RTX 4060 (CUDA)	Apple M4 (Metal)
Token generation (tg)	72.8 t/s	52.4 t/s
Inference GPU power (measured)	~68W	~18W
tokens/Watt	1.07	2.91
Memory bandwidth	~256 GB/s (GDDR6)	~120 GB/s (LPDDR5)

Look at the tokens/Watt column. M4 achieves ~2.7x the power efficiency of RTX 4060 for this workload. The RTX 4060 is bottlenecked by GDDR6's separate memory subsystem (~256GB/s), while M4's unified memory at 120GB/s has structurally lower data transfer overhead since CPU/GPU/NPU share it directly.

I'm not saying "NVIDIA is worse than Apple." The RTX 4060 is designed as a general-purpose rendering/training/inference machine — different comparison target. The point is: architecture differences have already surpassed process node differences in determining real-world efficiency.

Paper 1: DRIFT — "Break Things on Purpose" Voltage Optimization for 36% Energy Savings

DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference (arXiv:2604.09073, DAC 2026)

What makes this paper interesting is its contrarian approach. Normally, semiconductor design pushes toward "zero errors." DRIFT does the opposite — it exploits the fact that diffusion models inherently tolerate a certain level of bit errors, and intentionally underscales voltage to slash energy consumption.

Reported numbers:

Average 36% energy reduction through voltage underscaling
1.7x throughput improvement via overclocking (while maintaining image generation quality)
Fine-grained voltage/frequency scaling strategy that prioritizes protection for error-sensitive components

The paper targets diffusion models for image generation, not LLMs. But the "tolerate errors" approach itself is applicable to neural network inference broadly. It's on the same continuum as quantization (INT4/INT8) — a 4-bit quantized LLM drops information from original weights yet maintains inference quality. DRIFT pushes this principle down to the hardware voltage control layer.

My personal read: this class of error-tolerant design will become mainstream for NPU/edge AI chips around 2027–2028. This aligns with smartphone AI chips already moving toward "dynamic quality vs. power consumption tradeoff control."

Paper 2: Trilinear Compute-in-Memory — Running Full Transformer Attention in NVM Cores

Trilinear Compute-in-Memory Architecture for Energy-Efficient Transformer Acceleration (arXiv:2604.07628, 2026)

The claim is simple and provocative:

"To the best of our knowledge, this is the first architecture to complete entire Transformer Attention computation solely within NVM cores without runtime reprogramming."

Compute-in-Memory (CiM) isn't new. Computing near memory to reduce data transfer energy has been around since the 2010s. The problem was practical: "can you actually handle full Transformer Attention?"

TrilinearCIM uses a Double-Gate FeFET (DG-FeFET) architecture with back-gate modulation to achieve 3-operand multiply-accumulate operations within memory. Evaluated on BERT-base (GLUE benchmark) and ViT-base (ImageNet/CIFAR), achieving up to 46.6% energy reduction and 20.4% latency improvement compared to conventional FeFET CiM.

The evaluation targets are BERT and ViT — not large generative models — but they share the Transformer architecture structurally. The bottleneck of current LLM inference being memory-bandwidth-bound is well established. For a 7B parameter model, most of token generation time goes to weight transfer from memory, not GPU computation. Attention computation accounts for an estimated 30–40% of token generation time, so if Trilinear CiM can complete this entirely on NVM cores, it could fundamentally slash power costs.

However, current CiM architectures have clear constraints. Latency spikes whenever weights need NVM write operations. The paper's precondition of "no runtime reprogramming" limits it to inference-only, fixed-model use cases. Not viable for general-purpose training yet.

Paper 3: L-SPINE — Spiking Neural Network Running at 0.54W on FPGA

L-SPINE: A Low-Precision SIMD Spiking Neural Compute Engine (arXiv:2604.03626, 2026)

An SNN implementation paper. Deployed on AMD VC707 FPGA:

System-level: 46.37K LUT, 30.4K FF, 2.38ms latency, 0.54W power
Claims "significant reduction" compared to CPU/GPU platforms

0.54W. Compare that to RTX 4060's ~68W inference consumption — two orders of magnitude different. "Different use cases" is the correct objection, but that's precisely why it matters.

SNNs compute only when a spike fires. Idle time is near-zero power. This is terrifyingly well-suited for sparse sensor inputs: drone LiDAR, factory vibration sensors, medical wearable biosignals. Using GPUs for these tasks is absurd overkill.

My prediction: SNNs won't "replace general-purpose AI chips" before the 2030s. But first mass-produced SNN chips for sensor fusion in robotics/drones/industrial edge sensors by 2027–2028 is entirely plausible. The increasing volume of FPGA implementation papers like L-SPINE signals that the prototyping phase is actively underway.

Rapidus 1.4nm Domestic Fabrication — Don't Misread the Numbers

In April 2026, Fujitsu announced it will commission Rapidus for 1.4nm AI semiconductor manufacturing. Combined private projects reportedly total ¥200B (~$1.3B) in scale.

Honestly, it's too early to get excited about those numbers.

TSMC's N2 (≈2nm) is still at the stage where even Apple and NVIDIA struggle with yield and unit cost. Rapidus achieving a production-stable 1.4nm line won't happen before 2028–2029 at the earliest, however smoothly things go.

But "it won't reach production scale so it's meaningless" is also a shallow read. Rapidus is aiming for:

Supply chain risk diversification within Japan (geopolitical value)
Domestic accumulation of cutting-edge process design/manufacturing know-how (long-term technical foundation)
Physical AI-specialized chips through IBM partnership (competing in niche applications)

Rather than challenging TSMC+NVIDIA head-on in the general-purpose GPU market, pursuing low-volume, high-value specialty chips is a realistic survival strategy.

[Semiconductor Manufacturing Ecosystem — Current State]

General / High Volume: TSMC (N3/N2) → Apple, NVIDIA, AMD
General / Mid Volume:  Samsung, Intel Foundry → Various
Specialized / Low Volume: Rapidus (1.4nm) → Fujitsu, IBM Physical AI, ...
Edge / FPGA-based:     AMD, Intel → SNN & ultra-low-power applications

2026–2030: My Predictions (Bold Personal Analysis)

Synthesizing the above papers, news, and measured data:

Prediction 1: Consumer 2nm Won't Arrive Until 2029+

TSMC N2 yields and pricing will be consumed by Apple and NVIDIA first. 2nm in Ryzen-class CPUs won't appear before 2028–2029 at earliest. The 3nm optimization cycle continues for now.

Prediction 2: Compute-in-Memory Becomes Mainstream for Inference Accelerators (~2028)

The "dissolve the boundary between compute and memory" direction shown by Trilinear CiM is fundamentally different from GPU design philosophy. Combined with DRIFT's error-tolerant design, power can be cut further. I predict CiM architectures reaching mass production in inference-dedicated edge AI chips around 2028.

Prediction 3: SNNs Reach Production First in Sensor Fusion (2027–2028)

Not competing with general-purpose LLMs — coexisting through specialization. FPGA prototypes like L-SPINE appearing now suggest ASIC migration is 3–4 years out.

Prediction 4: TFLOPS Race Is Over. TOPS/W Race Has Begun

When RTX 5000 series launches, the first metric I'll check is TFLOPS/W, not TFLOPS. Continued real measurements comparing M4 only strengthens this conviction. NVIDIA recognizes this too — BlueField-4 pushing AI-native storage infrastructure operates on the same principle of "putting data near computation."

Prediction 5: MATCHA Points to the Heterogeneous SoC Era (2027+ Mass Production)

The MATCHA paper (arXiv:2604.09124) proposes a framework for efficiently deploying DNNs on SoCs with multiple heterogeneous acceleration engines. Smartphones already have CPU+GPU+NPU+DSP coexisting as heterogeneous SoCs. This is descending to PC-level APU/SoC design. Rather than a single powerful GPU, "orchestrating purpose-specific accelerators" will become the main design battlefield.

Measuring the "Wall" on Your Own Hardware

Here's a simple tool for measuring power efficiency on your setup. Gets real-time RTX power via nvml on Windows:

#!/usr/bin/env python3
"""
GPU Power Efficiency Measurement Script
Dependencies: pynvml, psutil
pip install pynvml psutil
"""
import time
import threading
from dataclasses import dataclass, field
from typing import List
import pynvml
import psutil

@dataclass
class PowerSample:
    timestamp: float
    gpu_power_w: float
    cpu_power_w: float  # estimated via psutil
    gpu_util_pct: float
    mem_used_mb: float

class PowerProfiler:
    def __init__(self, sample_interval: float = 0.5):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        self.interval = sample_interval
        self.samples: List[PowerSample] = []
        self._running = False
        self._thread = None

    def _sample_loop(self):
        while self._running:
            gpu_power = pynvml.nvmlDeviceGetPowerUsage(self.handle) / 1000.0
            util = pynvml.nvmlDeviceGetUtilizationRates(self.handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(self.handle)
            cpu_pct = psutil.cpu_percent(interval=None)
            cpu_tdp_w = 54.0  # Ryzen 7 7845HS

            self.samples.append(PowerSample(
                timestamp=time.time(),
                gpu_power_w=gpu_power,
                cpu_power_w=cpu_tdp_w * cpu_pct / 100,
                gpu_util_pct=util.gpu,
                mem_used_mb=mem.used / (1024**2)
            ))
            time.sleep(self.interval)

    def start(self):
        self._running = True
        self._thread = threading.Thread(target=self._sample_loop, daemon=True)
        self._thread.start()

    def stop_and_report(self) -> dict:
        self._running = False
        if self._thread:
            self._thread.join(timeout=2.0)

        if not self.samples:
            return {}

        avg_gpu = sum(s.gpu_power_w for s in self.samples) / len(self.samples)
        peak_gpu = max(s.gpu_power_w for s in self.samples)
        avg_cpu = sum(s.cpu_power_w for s in self.samples) / len(self.samples)
        duration = self.samples[-1].timestamp - self.samples[0].timestamp

        return {
            "duration_s": round(duration, 2),
            "avg_gpu_w": round(avg_gpu, 2),
            "peak_gpu_w": round(peak_gpu, 2),
            "avg_cpu_w": round(avg_cpu, 2),
            "total_energy_wh": round((avg_gpu + avg_cpu) * duration / 3600, 4),
            "sample_count": len(self.samples)
        }

    def __del__(self):
        try:
            pynvml.nvmlShutdown()
        except:
            pass


if __name__ == "__main__":
    profiler = PowerProfiler(sample_interval=0.2)
    profiler.start()

    print("Measuring... (run your inference task here)")
    time.sleep(30)  # Replace with subprocess call to llama-cli

    report = profiler.stop_and_report()
    print(f"\n--- Power Report ---")
    print(f"Duration      : {report['duration_s']}s")
    print(f"Avg GPU Power : {report['avg_gpu_w']}W")
    print(f"Peak GPU Power: {report['peak_gpu_w']}W")
    print(f"Est CPU Power : {report['avg_cpu_w']}W")
    print(f"Total Energy  : {report['total_energy_wh']} Wh")

Reference values (Qwen2.5-7B Q4_K_M, 30-second inference):

Metric	Value
Avg GPU Power	68.3 W
Peak GPU Power	89.7 W
Est CPU Power	11.4 W
Energy per 30s	0.664 Wh
Avg tokens/s	72.8
tokens/Wh	3,289

This tokens/Wh metric is what I'm tracking across generations and architectures to measure "real performance improvement." It's also how I'll decide whether to buy the next-gen chip — not by TFLOPS.

The Bottom Line

Stop chasing process node numbers. Whether it's 2nm or 1.4nm, architecture must change for the power wall to break.

DRIFT's "intentional error tolerance" for diffusion models, Trilinear CiM's "complete computation within memory" for BERT/ViT, and L-SPINE's "ultra-low-power engine for sparse signals" — all three papers say the same thing in different voices: bypass the von Neumann bottleneck.

What you can do today:

Measure tokens/Watt on your own hardware — the script above works as-is
When choosing your next chip, check TFLOPS/W — the era of prioritizing efficiency over absolute performance is here
When following Rapidus news, evaluate "production timeline" and "application specificity" as a pair — "domestic = challenging general-purpose GPUs" is not the right reading

The physics wall exists. But the teams that survive in 2030 won't be the ones that "broke through" it — they'll be the ones that routed around it. That's where we are.

20260325_llamacpp_options_8gb_en

plasmon — Sat, 18 Apr 2026 12:41:23 +0000

5 llama.cpp Settings That Turn 8GB VRAM From Sluggish to 5x Faster — Every Option Benchmarked

llama.cpp has over 50 launch options. Most of them are fine at their defaults. But on 8GB VRAM, misconfiguring just 5 of them will cut your inference speed in half.

What follows is a settings guide based on actual measurements on an RTX 4060 8GB (GDDR6 272 GB/s).

The Most Important: `-ngl` (GPU Layer Count)

# -ngl: How many model layers to offload to GPU
ngl_config = {
    "meaning": "Number of Transformer layers loaded into GPU VRAM",
    "default": 0,  # All layers on CPU = slowest possible
    "max": "Total layers in the model (Qwen2.5-7B = 28, Llama-3-8B = 32, Qwen2.5-32B = 64)",
    "999": "All layers on GPU (fastest, if it fits in VRAM)",
}

# Optimal values for 8GB VRAM
ngl_optimal_8gb = {
    "Qwen2.5-7B Q4_K_M (4.7GB)": {
        "-ngl": 999,  # Full GPU offload possible
        "VRAM usage": "~5.4 GB (weights 4.7 + KV 0.44 + overhead 0.3 at 8K context)",
        "speed": "~32 t/s",
    },
    "Mistral-Nemo-12B Q4_K_M (7.2GB)": {
        "-ngl": 999,  # Barely fits entirely on GPU
        "VRAM usage": "~7.5 GB",
        "speed": "~20 t/s",
        "warning": "KV cache may cause OOM. Use -c 2048",
    },
    "Qwen2.5-32B Q4_K_M (18.5GB)": {
        "-ngl": 25,  # 25 of 64 layers on GPU
        "VRAM usage": "~7.4 GB",
        "speed": "~10.8 t/s",
        "remaining 39 layers": "CPU (via DDR5)",
    },
}

Changing -ngl by just 1 shifts speed by a few percent. The optimal value is the one that squeezes VRAM usage right to the limit.

# Finding the optimal -ngl (binary search)
def find_optimal_ngl(total_layers, vram_gb=8.0):
    """
    1. Launch with -ngl 999 -> if OOM, move on
    2. Launch with -ngl {total_layers // 2}
    3. No OOM -> increase; OOM -> decrease
    4. The sweet spot is where VRAM sits at 7.0-7.5 GB
    """
    # On RTX 4060 8GB, ~0.5 GB goes to CUDA context + framework overhead
    # The remaining 7.5 GB is available for model layers
    pass

# Tips for tuning:
# Monitor VRAM with nvidia-smi while adjusting -ngl
# 7.0-7.5 GB usage is the sweet spot. Above 7.8 GB risks OOM during inference

`-c` (Context Length)

# -c: Maximum context length (in tokens)
context_config = {
    "meaning": "Upper limit of tokens the model can reference during inference",
    "default": 4096,  # llama.cpp default (as of b8233)
    "impact": "Directly determines KV cache VRAM consumption",
}

# KV cache VRAM consumption calculation
def kv_cache_vram(context_len, n_layers, n_heads, head_dim, dtype_bytes=2):
    """
    KV cache = 2 × n_layers × n_heads × head_dim × context_len × dtype_bytes
    (K cache + V cache = 2x)
    """
    bytes_total = 2 * n_layers * n_heads * head_dim * context_len * dtype_bytes
    return bytes_total / (1024**3)  # GB

# Qwen2.5-7B (28 layers, 4 KV heads (GQA), 128 head_dim)
kv_7b = {
    "4096 tokens (FP16)":  f"{kv_cache_vram(4096, 28, 4, 128, 2):.2f} GB",   # 0.22 GB
    "8192 tokens (FP16)":  f"{kv_cache_vram(8192, 28, 4, 128, 2):.2f} GB",   # 0.44 GB
    "32768 tokens (FP16)": f"{kv_cache_vram(32768, 28, 4, 128, 2):.2f} GB",  # 1.75 GB
    "131072 tokens (FP16)": f"{kv_cache_vram(131072, 28, 4, 128, 2):.2f} GB", # 7.00 GB
}

# Qwen2.5-32B (64 layers, 8 KV heads, 128 head_dim)
kv_32b = {
    "4096 tokens (FP16)":  f"{kv_cache_vram(4096, 64, 8, 128, 2):.2f} GB",   # 1.00 GB
    "8192 tokens (FP16)":  f"{kv_cache_vram(8192, 64, 8, 128, 2):.2f} GB",   # 2.00 GB
    "32768 tokens (FP16)": f"{kv_cache_vram(32768, 64, 8, 128, 2):.2f} GB",  # 8.00 GB
}

# Note: with partial offload (-ngl), KV cache is also split across CPU/GPU per layer
# -ngl 25 means GPU holds KV for 25/64 layers only

# Recommendations for 8GB VRAM:
# 7B model: -c 8192 (KV 0.44GB, safe), -c 32768 (KV 1.75GB, use flash-attn)
# 32B model (partial offload -ngl 25): -c 4096 (GPU KV ~0.39GB), anything higher requires KV quantization

Doubling the context length doubles the KV cache VRAM. On 8GB, your -c setting directly determines what model size you can load.

`--cache-type-k` / `--cache-type-v` (KV Cache Quantization)

# KV cache quantization options
kv_quant_options = {
    "f16": "Default. FP16 (2 bytes/element)",
    "q8_0": "8-bit quantization (1 byte/element) -> VRAM halved",
    "q4_0": "4-bit quantization (0.5 bytes/element) -> VRAM quartered",
}

# Recommended combinations
kv_quant_recommendations = {
    "Quality first": {
        "K": "f16",
        "V": "f16",
        "VRAM": "1x (baseline)",
        "quality loss": "None",
    },
    "Balanced (recommended)": {
        "K": "q8_0",
        "V": "q8_0",
        "VRAM": "0.5x",
        "quality loss": "Negligible for general tasks",
    },
    "Capacity first": {
        "K": "q4_0",
        "V": "q8_0",
        "VRAM": "0.375x",
        "quality loss": "Degradation on math/reasoning tasks",
        "note": "V cache is more sensitive to quantization than K cache",
    },
    "Maximum compression": {
        "K": "q4_0",
        "V": "q4_0",
        "VRAM": "0.25x",
        "quality loss": "Significant. Especially bad on long contexts",
    },
}

# Example: Qwen2.5-32B + -ngl 25 + 8K context on 8GB VRAM
# -ngl 25 -> 25/64 layers on GPU, KV also splits 25/64 on GPU
example_32b_8k = {
    "Total KV (f16, 8K)": "2.00 GB (all 64 layers)",
    "GPU KV (f16, -ngl 25)": "2.00 * 25/64 = 0.78 GB",
    "GPU total": "weights 7.4GB + KV 0.78GB + overhead 0.3GB = 8.48GB -> OOM",
    "With KV q8_0": "0.78 * 0.5 = 0.39 GB -> 7.4 + 0.39 + 0.3 = 8.09GB -> works",
    "conclusion": "32B at 8K context fits on 8GB with KV quantization (q8_0)",
    "32K context": "GPU KV (f16) = 8.0 * 25/64 = 3.13 GB -> impossible. q4_0 = 0.78GB -> 8.48GB -> borderline",
}

Launch command:

llama-server -m model.gguf -ngl 25 -c 8192 --cache-type-k q8_0 --cache-type-v q8_0

`--flash-attn` (Flash Attention)

# Flash Attention
flash_attn_config = {
    "meaning": "Memory-efficient attention computation algorithm",
    "effects": {
        "VRAM reduction": "Eliminates intermediate attention buffers -> saves hundreds of MB",
        "speed boost": "Faster on long contexts (~10% at 32K, scales with context length)",
        "short_context": "Minimal effect below 4K tokens",
    },
    "requirements": "CUDA backend + compatible GPU (RTX 20xx or newer)",
    "compatibility": "Works alongside KV cache quantization",
}

# Benchmarks on 8GB
flash_attn_benchmark = {
    "Qwen2.5-7B Q4_K_M, -c 8192": {
        "without_flash_attn": "31.8 t/s, VRAM 5.5 GB",
        "with_flash_attn": "32.1 t/s, VRAM 5.2 GB",
        "delta": "speed +1%, VRAM -0.3 GB",
    },
    "Qwen2.5-7B Q4_K_M, -c 32768": {
        "without_flash_attn": "28.5 t/s, VRAM 6.3 GB",
        "with_flash_attn": "31.5 t/s, VRAM 5.8 GB",
        "delta": "speed +10.5%, VRAM -0.5 GB",
    },
}

# Verdict: Always enable it. There is no downside.

--flash-attn has zero downsides. Always include it.

`-b` (Batch Size) and `-t` (Thread Count)

# Batch size
batch_config = {
    "-b (batch size)": {
        "meaning": "Number of tokens processed at once during prompt evaluation",
        "default": 2048,
        "8GB recommendation": 512,
        "reason": "Large batches cause VRAM spikes during prompt eval, risking OOM",
    },
    "-ub (micro batch)": {
        "meaning": "Further subdivides batches for processing",
        "default": 512,
        "usually": "No need to change",
    },
}

# Thread count
thread_config = {
    "-t (threads)": {
        "meaning": "Number of threads for CPU computation",
        "default": "All cores",
        "recommendation": "Physical core count (no HT)",
        "example_i7_13700H": "-t 6 (6 P-cores)",
        "reason": "HT logical threads just compete for memory bandwidth. Physical core count is optimal",
    },
}

# Benchmark: thread count impact (Qwen2.5-32B Q4_K_M, -ngl 25)
thread_benchmark = {
    "-t 6 (P-core count)": "10.8 t/s",
    "-t 8 (P+E cores)": "10.5 t/s",
    "-t 14 (all physical cores P+E)": "9.8 t/s",
    "-t 20 (all threads incl. HT)": "9.2 t/s",
    "conclusion": "More threads = slower. Physical P-core count is optimal",
}

The intuition that more threads means faster inference is wrong. HT logical threads share L1/L2 cache and memory bandwidth, which turns into pure overhead for LLM inference.

Server Options (`llama-server`)

# llama-server: set up an OpenAI-compatible API
server_config = {
    "basic": "llama-server -m model.gguf -ngl 999 -c 4096 --host 0.0.0.0 --port 8080",
    "recommended extras": {
        "--flash-attn": "Memory efficiency (always ON)",
        "--metrics": "Expose Prometheus-format metrics",
        "--parallel 1": "Concurrent request count (keep at 1 for 8GB)",
        "--cont-batching": "Continuous batching (useful when --parallel >= 2)",
    },
}

# Function calling setup
function_calling_config = {
    "--chat-template": "Auto-detected (uses template embedded in GGUF)",
    "note": "The tools parameter for function calling depends on the model's chat template",
    "recommended models": [
        "Qwen3.5-4B-Instruct (3.4GB, function calling 97.5%)",
        "Qwen2.5-7B-Instruct (4.7GB, function calling 95%+)",
    ],
}

# Enforce structured output with GBNF grammar
grammar_config = {
    "--grammar-file": "Force output format via GBNF grammar file",
    "use case": "Guarantees valid JSON output. Syntax errors drop to 0%",
    "caveat": "Inference can slow down when the model tries to generate output that doesn't match the grammar",
    "alternative": "--json-schema to specify JSON Schema directly (llama.cpp b7000+)",
}

Configuration Templates

Template 1: 7B Model, Chat Use (Maximum Speed)

llama-server \
  -m qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 999 \
  -c 8192 \
  --flash-attn \
  -t 6 \
  --host 127.0.0.1 --port 8080
# Expected speed: ~32 t/s, VRAM: ~5.2 GB

Template 2: 32B Model, Quality Focus (Partial Offload)

llama-server \
  -m qwen2.5-32b-instruct-q4_k_m.gguf \
  -ngl 25 \
  -c 4096 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn \
  -t 6 \
  -b 512 \
  --host 127.0.0.1 --port 8080
# Expected speed: ~10.8 t/s, VRAM: ~7.4 GB

Template 3: 7B Model, Long Context (32K)

llama-server \
  -m qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 999 \
  -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn \
  -t 6 \
  -b 512 \
  --host 127.0.0.1 --port 8080
# Expected speed: ~31 t/s, VRAM: ~5.7 GB

Template 4: 4B Model, Function Calling (Maximum Reliability)

llama-server \
  -m qwen3.5-4b-instruct-q4_k_m.gguf \
  -ngl 999 \
  -c 4096 \
  --flash-attn \
  -t 6 \
  --host 127.0.0.1 --port 8080
# Expected speed: ~50 t/s, VRAM: ~3.8 GB
# Function calling accuracy: 97.5%

Common Mistakes and Fixes

common_mistakes = {
    "-ngl 0 (not using GPU)": {
        "symptom": "Inference speed stuck at 3-5 t/s",
        "cause": "All layers running on CPU. DDR5 ~50 GB/s is the bottleneck",
        "fix": "Try -ngl 999. If OOM, decrease",
    },
    "-c set too high": {
        "symptom": "OOM immediately after inference starts",
        "cause": "KV cache eating all VRAM",
        "fix": "Lower to -c 4096, or add --cache-type-k q8_0",
    },
    "-t set too high": {
        "symptom": "CPU at 100% but inference is slow",
        "cause": "HT logical threads fighting over cache and memory bandwidth",
        "fix": "Set -t to physical core count",
    },
    "Using --mlock": {
        "symptom": "Memory error on startup",
        "cause": "Locks entire model in RAM -> physical memory exhausted",
        "fix": "Remove --mlock (especially unnecessary on Windows)",
    },
    "Batch size too large": {
        "symptom": "OOM when feeding long prompts",
        "cause": "VRAM spike during prompt evaluation",
        "fix": "Lower to -b 512",
    },
}

Summary: Speed Impact by Setting

Setting Change                        Speed Impact    VRAM Impact
──────────────────────────────────────────────────────────────────
-ngl 0 -> 999 (full GPU)             +5-10x          +4-7 GB
-ngl fine-tuning (±5)                +10-20%         ±0.5 GB
--flash-attn enabled                  +1-10%          -0.3 GB
--cache-type q8_0                     ±0%             -50%
-t all threads -> physical cores      +5-15%          ±0
-c 32K -> 4K                         +5%             -0.7 GB
-b 2048 -> 512                       ±0%*            -0.2 GB**

* No effect on generation speed (only prompt eval time)
** Suppresses temporary VRAM spikes during prompt eval

The biggest lever is -ngl. Next is -t. Everything else is fine-tuning. On 8GB VRAM, the core strategy is: maximize -ngl, then use -c and KV cache quantization to claw back enough VRAM to make it fit.

References

llama.cpp — github.com/ggerganov/llama.cpp
llama.cpp Server documentation — github.com/ggerganov/llama.cpp/tree/master/examples/server
GGUF format specification — github.com/ggerganov/ggml/blob/master/docs/gguf.md
Flash Attention — "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (2023) arXiv:2307.08691

20260325_vram_expansion_physics_en

plasmon — Fri, 17 Apr 2026 12:41:22 +0000

Adding More VRAM Won't Fix It — The Physics That HBM, CXL, and Unified Memory Can't Escape

The RTX 4060's 8GB VRAM caps out at 7B models. Even when the RTX 5060 doubles that to 16GB, a full 70B won't fit. "If VRAM's not enough, just add more" — the idea is sound, but the execution hits three distinct physical tradeoffs.

HBM, CXL, Unified Memory. These three technologies attack the VRAM wall from different angles. Where each one sits on the triangle of bandwidth, capacity, and cost fundamentally changes how LLM inference performs.

The Memory Trilemma: Bandwidth, Capacity, Cost

# Physical tradeoffs across memory technologies
memory_trilemma = {
    "HBM3E": {
        "bandwidth": "4.8 TB/s (H200, 6 stacks)",
        "capacity": "141 GB (H200)",
        "cost_per_GB": "~$10-15/GB (HBM3E, 2025 market price)",
        "interface": "TSV (Through-Silicon Via), 1024-bit wide per stack",
        "physics": "Vertically stacked via through-silicon vias. High bandwidth, but eats die area",
    },
    "GDDR6 (RTX 4060)": {
        "bandwidth": "272 GB/s",
        "capacity": "8 GB",
        "cost_per_GB": "~$2.5-4/GB (GDDR6 spot price, 2025)",
        "interface": "128-bit bus, 2125 MHz (17 Gbps effective)",
        "physics": "Solder-bonded on PCB. Cheap, but bandwidth-limited",
    },
    "CXL 3.1": {
        "bandwidth": "64 GB/s per link (x16 PCIe 6.0, unidirectional)",
        "capacity": "Theoretically TB-class (memory pooling)",
        "cost_per_GB": "~$3-5/GB (DDR5-based)",
        "interface": "PCIe 6.0 physical layer (64 GT/s) + CXL protocol",
        "physics": "Reuses existing PCIe infrastructure. 1/75 the bandwidth of HBM3E",
    },
    "Unified Memory (M4 Max)": {
        "bandwidth": "546 GB/s (LPDDR5X)",
        "capacity": "128 GB",
        "cost_per_GB": "Depends on Apple pricing (LPDDR5X itself ~$3-5/GB, but SoC integration makes direct comparison impossible)",
        "interface": "LPDDR5X, 512-bit bus",
        "physics": "CPU/GPU/NPU share one memory pool. Shared bandwidth = contention",
    },
}

These three technologies occupy different vertices of the bandwidth-capacity-cost triangle. HBM chose bandwidth, CXL chose capacity, Unified Memory chose balance. None of them can claim all three.

HBM: King of Bandwidth, Slave to Capacity

# Physical constraints of HBM
hbm_constraints = {
    "bandwidth_source": {
        "TSV_per_stack": "~5,000+ through-silicon vias",
        "bus_width": "1024 bit per stack",
        "stacks": "H200: 6 stacks → 6144 bit total bus",
        "result": "4.8 TB/s — 18x GDDR",
    },
    "capacity_wall": {
        "die_per_stack": "Current: 8-Hi (8 layers stacked), next-gen: 12-Hi/16-Hi",
        "die_size": "24 Gbit per die (3GB) for HBM3E",
        "8Hi_capacity": "8 × 3GB = 24 GB per stack",
        "12Hi_capacity": "12 × 3GB = 36 GB per stack",
        "total_H200": "6 stacks × 24GB = 144 GB raw (NVIDIA-rated 141 GB, some reserved)",
        "cost": "One HBM3E stack: estimated $240-360 (24GB x $10-15/GB, 2025 market price)",
    },
    "area_problem": {
        "interposer_area": "Each HBM stack occupies ~100 mm2 of interposer area",
        "GPU_die + 6_stacks": "GPU ~800 mm2 + HBM ~600 mm2 = ~1400 mm2 interposer",
        "CoWoS_reticle_limit": "~1700 mm2 (TSMC lithography limit)",
        "implication": "Fitting 8+ HBM stacks exceeds the reticle limit → chiplet design required",
    },
}

# HBM has the best bandwidth, but capacity is physically capped by layer count × stack count × interposer area
# "Just add more HBM" → the interposer doesn't have room

The reason HBM can't "just be scaled up" is area. The GPU die and HBM stacks must sit side by side on an interposer, and the CoWoS reticle limit (~1700 mm²) is the ceiling. The H200 is already close to that limit.

Impact on LLM inference:

# How HBM's capacity ceiling affects LLM inference
hbm_llm_impact = {
    "H200 (141GB HBM3E)": {
        "max_model_fp16": "~70B parameters (140GB)",
        "max_model_q4": "~280B parameters (70GB) + KV cache headroom",
        "70B_kv_cache_room": "141 - 140 = 1 GB → even 32K context is tight",
        "solution": "Quantization or Tensor Parallelism (multi-GPU)",
    },
    "RTX 4060 (8GB GDDR6)": {
        "max_model_q4": "~13B parameters (7.2GB usable)",
        "bandwidth": "272 GB/s → 7B Q4_K_M at ~32 t/s",
        "problem": "13B+ requires CPU offload → 1/10 speed",
    },
    "RTX 5060 (expected 16GB GDDR7)": {
        "bandwidth": "448 GB/s (RTX 5060 Ti confirmed; RTX 5060 non-Ti TBD)",
        "max_model_q4": "~30B parameters",
        "implication": "2x capacity ≠ 2x model size (KV cache eats the difference)",
    },
}

Doubling VRAM doesn't double the model size you can run. KV cache is the reason. A 70B FP16 model's KV cache at 32K context is about 8GB. The "leftover" VRAM gets consumed by KV cache.

CXL: Capacity Unleashed, Bandwidth Sacrificed

CXL (Compute Express Link) is a memory expansion protocol built on the PCIe physical layer.

# CXL bandwidth and capacity
cxl_specs = {
    "CXL 3.1 (2024)": {
        "physical_layer": "PCIe 6.0 (64 GT/s)",
        "bandwidth_x16": "64 GB/s (unidirectional, PCIe 6.0 x16)",
        "latency": "~170-400 ns (measured, varies by device/config; 2-4x local DDR5)",
        "capacity": "Theoretically unlimited (memory pooling + switching)",
        "target": "Servers / data centers",
    },
    "bandwidth_comparison": {
        "HBM3E (H200)": "4,800 GB/s",
        "GDDR6_RTX4060": "272 GB/s",
        "CXL_3.1_x16": "64 GB/s",
        "ratio": "CXL is 1/75 of HBM3E, 1/4 of GDDR6",
    },
}

# What happens when you run LLM inference over CXL bandwidth
cxl_inference = {
    "7B_Q4_K_M (4.7GB weights)": {
        "reads_per_token": "4.7 GB",
        "cxl_speed": "64 / 4.7 = ~13.6 t/s",
        "hbm3e_speed": "4800 / 4.7 = ~1021 t/s",
        "gddr6_speed": "272 / 4.7 = ~57.9 t/s → measured 32 t/s (55% efficiency)",
        "verdict": "Weight reads from CXL are barely usable",
    },
    "70B_Q4_K_M (40GB weights)": {
        "reads_per_token": "40 GB",
        "cxl_speed": "64 / 40 = 1.6 t/s",
        "verdict": "Reading speed. Unusable",
    },
}

Loading 70B Q4 weights over CXL's 64 GB/s gives you 1.6 t/s. That's about human reading speed.

But CXL's real value isn't as a place to store weights.

# The right way to use CXL: tiered memory architecture
cxl_tiered_architecture = {
    "Tier 0 (GPU SRAM)": {
        "purpose": "Activations, work buffers",
        "capacity": "24 MB (RTX 4060 L2)",
        "bandwidth": "~4 TB/s (on-chip)",
    },
    "Tier 1 (HBM/GDDR)": {
        "purpose": "Model weights, active KV cache",
        "capacity": "8-141 GB",
        "bandwidth": "272-4800 GB/s",
    },
    "Tier 2 (CXL Memory)": {
        "purpose": "KV cache overflow, inactive layers",
        "capacity": "TB-class",
        "bandwidth": "64 GB/s",
        "latency": "170-400 ns",
    },
    "Tier 3 (NVMe SSD)": {
        "purpose": "Persistent model storage, swap",
        "capacity": "TB-class",
        "bandwidth": "7 GB/s (PCIe 4.0 x4)",
        "latency": "~10,000 ns",
    },
}

# CXL fills the gap between Tier 1 and Tier 3
# As a KV cache overflow target, it's 9x faster than NVMe
# The correct split: weights in VRAM, stale KV cache entries in CXL

CXL's essence isn't "VRAM replacement" — it's "a new tier between VRAM and NVMe." If you evict stale KV cache tokens (the early portion of a 128K context) to CXL memory, VRAM only needs to hold the recent attention window. That's a viable architecture.

This tiering is orthogonal to techniques like optical memory readout (physically reducing KV cache transfer volume) or KV cache quantization (numerically reducing data volume). They compose.

Unified Memory: The Balance Trap

Apple Silicon's Unified Memory lets the CPU, GPU, and NPU share a single physical memory pool.

# Apple Unified Memory in practice
unified_memory = {
    "M4 Max": {
        "capacity": "128 GB",
        "bandwidth": "546 GB/s (LPDDR5X)",
        "bus_width": "512-bit",
        "shared_by": "CPU (12 cores) + GPU (40 cores) + NPU (16 cores) + media engine",
    },
    "M4 (base)": {
        "capacity": "16-32 GB",
        "bandwidth": "120 GB/s",
        "bus_width": "128-bit",
        "note": "Less than half of RTX 4060 (272 GB/s)",
    },
}

# Reality of LLM inference
unified_memory_llm = {
    "M4 Max 128GB": {
        "advantage": "70B Q4_K_M (40GB) fits entirely without GPU memory management",
        "70B_speed": "546 / 40 = 13.7 t/s (theoretical ceiling) → measured 8-10 t/s",
        "reason_for_gap": "Bandwidth shared with CPU/NPU/IO. GPU doesn't get exclusive access",
    },
    "M4 32GB": {
        "32B_Q4_speed": "120 / 18 = 6.7 t/s (theoretical) → measured 4-5 t/s",
        "note": "RTX 4060 has exclusive 272 GB/s GDDR6 → 10.8 t/s on the same model",
    },
    "bandwidth_contention": {
        "cause": "CPU still accesses memory during GPU inference → they fight for bandwidth",
        "OS_overhead": "macOS memory management, UI rendering consume bandwidth in the background",
        "worst_case": "Running inference while Safari has a heavy page open → noticeable speed drop",
    },
}

Unified Memory's advantage is eliminating GPU memory management overhead. No cudaMalloc/cudaMemcpy. The data is already there. Zero copy cost.

But bandwidth is a shared resource — you can't monopolize it. The RTX 4060's GDDR6 gives 272 GB/s effectively exclusive to the GPU. The base M4 splits 120 GB/s across the entire system.

# Bandwidth efficiency comparison
bandwidth_efficiency = {
    "RTX 4060 (8GB GDDR6)": {
        "total_bw": 272,
        "gpu_share": "~95% (only DisplayPort output competing)",
        "effective_for_llm": "~258 GB/s",
        "7B_Q4_speed": "258 / 4.7 = 54.9 t/s (theoretical) → 32 t/s (58% effective)",
    },
    "M4 Max (128GB LPDDR5X)": {
        "total_bw": 546,
        "gpu_share": "Majority during inference (contention with CPU/NPU/IO)",
        "effective_for_llm": "~400 GB/s (back-calculated: 8-10 t/s x 40GB = 320-400 GB/s)",
        "70B_Q4_speed": "400 / 40 = 10 t/s → measured 8-10 t/s (roughly matches)",
    },
    "M4 base (16GB LPDDR5X)": {
        "total_bw": 120,
        "gpu_share": "Shared with entire system during inference",
        "effective_for_llm": "~78 GB/s (back-calculated: 14-16 t/s x 4.7GB = 66-75 GB/s)",
        "7B_Q4_speed": "78 / 4.7 = 16.6 t/s → measured 14-16 t/s",
    },
}

# RTX 4060: Lower bandwidth but GPU-exclusive → fast on small models
# M4 Max: Higher bandwidth but shared → fits large models at the cost of per-bandwidth efficiency
# M4 base: Mediocre bandwidth and capacity → loses to RTX 4060 for LLM workloads

Comparing the Three Approaches

                Bandwidth    Capacity    Cost        Role in LLM Inference
─────────────────────────────────────────────────────────────────
HBM3E          4,800 GB/s    141 GB     $10-15/GB   Read weights + KV at full speed
GDDR6         272 GB/s      8-24 GB    $2.5-4/GB   Run small models fast
CXL 3.1        64 GB/s       TB-class   $3-5/GB     KV cache overflow tier
Unified (Max)  546 GB/s      128 GB     Apple-set   Fit large models with zero-copy
NVMe SSD       7 GB/s        TB-class   $0.1/GB     Persistent model storage

# Optimal use case for each technology
optimal_scenarios = {
    "HBM (H100/H200)": {
        "scenario": "Batch inference, concurrent request processing",
        "why": "Bandwidth is amortized across multiple requests, making per-request cost efficient",
        "limitation": "For a single request, most of 700W is wasted",
    },
    "GDDR (RTX 4060/5060)": {
        "scenario": "Personal use, single request, small-to-mid models",
        "why": "Exclusive GPU bandwidth maximizes efficiency. 32 t/s at ~70W (0.46 t/s/W). Beats H100 single-request power efficiency for small models",
        "limitation": "Capacity wall. 8GB means 7B is the ceiling",
    },
    "CXL": {
        "scenario": "Ultra-long context inference (128K+), shared memory pools",
        "why": "Solves VRAM exhaustion when KV cache balloons to tens of GB with long contexts",
        "limitation": "1/75 bandwidth. Too slow for weight storage",
        "timeline": "Server-side 2025-26, consumer 2028+",
    },
    "Unified Memory (Apple)": {
        "scenario": "Running large models with minimal setup. Development and experimentation",
        "why": "70B Q4 runs without any memory management. Ease of setup",
        "limitation": "Shared bandwidth means lower speed efficiency vs exclusive GDDR. Hard to share with gaming workloads",
    },
}

Practical Implications for 8GB VRAM Users

# Strategies for breaking through the memory wall with 8GB VRAM today
practical_8gb = {
    "Layer 1: Quantization (immediate impact)": {
        "method": "Q4_K_M quantization",
        "effect": "7B model weights: 14GB → 4.7GB (3x capacity efficiency)",
        "how": "Standard support in llama.cpp / Ollama",
    },
    "Layer 2: KV cache quantization (experimental)": {
        "method": "--cache-type-k q4_0 --cache-type-v q8_0",
        "effect": "KV cache at 1/3 of FP16 → enables longer contexts",
        "how": "llama.cpp launch flags",
    },
    "Layer 3: CPU offload (bandwidth tradeoff)": {
        "method": "--n-gpu-layers to partially load onto GPU",
        "effect": "32B models run (slow, but they run)",
        "speed": "10.8 t/s (32B on RTX 4060, optimal offload)",
        "bandwidth_bottleneck": "CPU↔GPU via PCIe 4.0 x8 = 16 GB/s",
    },
    "Layer 4: CXL (future)": {
        "method": "CXL memory modules",
        "effect": "Add memory via PCIe → Tier 2 storage for KV cache",
        "timeline": "Consumer availability 2028+",
        "note": "Similar in principle to today's CPU offload (PCIe 16 GB/s), but CXL allows memory-semantic access (load/store, directly addressable by GPU)",
    },
}

# What you can do today: combine Layers 1-3
# Q4 quantization + KV cache Q4 + optimal GPU offload = 32B model × 32K context on 8GB
# Future: CXL adds Layer 4, making 128K+ contexts realistic

Here's the key insight: the "memory expansion" CXL promises travels over fundamentally the same PCIe bus as today's CPU offload. The bandwidth ceiling is identical. CXL's advantage is memory semantics — load/store access where the GPU can address memory directly — not bandwidth improvement.

The Physics That Decides Memory's Future

Question: "Does adding more VRAM solve the problem?"

Answer: Only partially.

The bandwidth-capacity-cost triangle is governed by physics,
and no technology can claim all three vertices.

HBM chose bandwidth, sacrificing capacity and cost.
CXL chose capacity, sacrificing bandwidth.
Unified Memory chose balance, sacrificing exclusive bandwidth.
GDDR chose exclusive bandwidth, sacrificing capacity.

The optimal answer for LLM inference isn't "pick one technology" —
it's combining multiple technologies in a tiered hierarchy.

Best strategy for today's RTX 4060:
  Weights → VRAM (Q4 quantization to fit 7-13B entirely)
  KV cache → VRAM (Q4/Q8 quantization to save capacity)
  Extra layers → RAM (CPU offload, PCIe bandwidth)
  Persistent storage → NVMe SSD

Best strategy for future CXL-equipped consumer PCs:
  Weights → VRAM (Q4 quantization)
  Active KV → VRAM
  Stale KV → CXL memory (64 GB/s is fast enough for this)
  Persistent storage → NVMe SSD

The memory wall isn't something you break through —
it's something you route around with tiers.

References

CXL Consortium — "Compute Express Link Specification 3.1" (2024)
Samsung — "CMM-D: CXL Memory Module for Data Centers" (2024)
SK hynix — HBM3E specifications, 12-Hi stack architecture
NVIDIA H200 specifications — 141GB HBM3E, 4.8 TB/s
Apple M4 Max specifications — 128GB Unified Memory, 546 GB/s
"Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023) arXiv:2309.06180

20260325_llm_framework_comparison_en

plasmon — Thu, 16 Apr 2026 12:37:29 +0000

Ollama, LM Studio, and GPT4All Are All Just llama.cpp — Here's Why Performance Still Differs

When running local LLMs on an RTX 4060 8GB, the first decision isn't the model. It's the framework.

llama.cpp, Ollama, LM Studio, vLLM, GPT4All — plenty of options. But under an 8GB VRAM constraint, the framework choice directly affects inference speed. A 0.5GB difference in overhead changes which models you can load at all. One extra API abstraction layer adds a few ms of latency.

What follows is a comparison on identical hardware with identical models.

Frameworks and Evaluation Criteria

Framework Overview

frameworks = {
    "llama.cpp (CLI)": {
        "version": "b8233 (2026-03)",
        "backend": "CUDA + Metal + CPU",
        "quantization": "GGUF (Q2_K ~ FP16)",
        "API": "CLI / llama-server (OpenAI-compatible)",
        "strength": "Minimal overhead, maximum control",
    },
    "Ollama": {
        "version": "0.6.x",
        "backend": "llama.cpp (bundled)",
        "quantization": "GGUF (via Ollama Hub)",
        "API": "REST API + CLI",
        "strength": "Docker-like simplicity, easy model management",
    },
    "LM Studio": {
        "version": "0.3.x",
        "backend": "llama.cpp (bundled)",
        "quantization": "GGUF (GUI search)",
        "API": "OpenAI-compatible API + GUI",
        "strength": "GUI, beginner-friendly, function calling support",
    },
    "vLLM": {
        "version": "0.7.x",
        "backend": "Custom CUDA kernels + PagedAttention",
        "quantization": "AWQ, GPTQ, FP8, GGUF (v0.4.2+)",
        "API": "OpenAI-compatible API",
        "strength": "Batch processing optimization, server-oriented",
    },
    "GPT4All": {
        "version": "3.x",
        "backend": "llama.cpp (bundled)",
        "quantization": "GGUF",
        "API": "GUI + Python SDK",
        "strength": "Simplest setup, offline-first",
    },
}

The critical fact: Ollama, LM Studio, and GPT4All all use llama.cpp internally. The differences are purely in wrapper design. Only vLLM has its own CUDA kernels.

Evaluation Axes

evaluation_axes = {
    "Inference speed (t/s)": "Generation speed with identical model and quantization",
    "VRAM overhead": "VRAM consumed by the framework itself, excluding the model",
    "Cold start time": "Time to complete model loading",
    "API compatibility": "OpenAI API compatibility and quality",
    "Function calling": "Tool-use support and accuracy",
    "Setup difficulty": "Steps from install to first inference",
}

Inference Speed Comparison

Test Conditions

test_config = {
    "GPU": "RTX 4060 Laptop (8GB VRAM)",
    "model": "Qwen2.5-7B-Instruct Q4_K_M (4.7GB)",
    "prompt": "Explain the difference between TCP and UDP in 200 words",
    "max_tokens": 256,
    "temperature": 0.7,
    "context_length": 4096,
    "measurement": "Median of 3 runs",
}

Results

Framework            Prompt eval  Generation  TTFT    VRAM overhead
                     (t/s)        (t/s)       (ms)    (excl. model)
────────────────────────────────────────────────────────────────
llama.cpp (CLI)       ~800        32.1        120     ~0.3 GB
llama-server          ~780        31.5        135     ~0.4 GB
Ollama                ~750        30.2        180     ~0.5 GB
LM Studio             ~720        29.8        250     ~0.6 GB
GPT4All               ~680        28.5        300     ~0.7 GB
vLLM                  N/A*        N/A*        N/A*    ~1.5 GB+

* vLLM OOM with default settings on 8GB VRAM
  (PagedAttention KV cache pre-allocation consumes additional VRAM)

Analysis

speed_analysis = {
    "llama.cpp vs Ollama": {
        "gap": "32.1 vs 30.2 = 5.9%",
        "cause": "Ollama's REST API layer + model management daemon overhead",
        "practical_impact": "Negligible. Convenience offsets the difference.",
    },
    "llama.cpp vs LM Studio": {
        "gap": "32.1 vs 29.8 = 7.2%",
        "cause": "GUI + additional API abstraction layers",
        "practical_impact": "GUI benefits outweigh speed loss for most use cases",
    },
    "llama.cpp vs GPT4All": {
        "gap": "32.1 vs 28.5 = 11.2%",
        "cause": "Python SDK overhead + non-optimized default settings",
        "practical_impact": "Acceptable for beginners, room for optimization",
    },
    "vLLM": {
        "issue": "Cannot run 7B models on 8GB VRAM",
        "cause": "PagedAttention KV cache pre-allocation consumes additional VRAM",
        "use_case": "Tunable via gpu_memory_utilization, but practically needs 16GB+",
    },
}

# Bottom line: llama.cpp is fastest, but the gap is 6-11%
# On 8GB VRAM, the real differentiator is overhead (0.3GB vs 1.5GB)
# That overhead gap determines your maximum model size

When VRAM Overhead Becomes Fatal on 8GB

On 8GB VRAM, framework overhead directly dictates your maximum model size.

# Maximum model size per framework
max_model_size = {
    "llama.cpp": {
        "overhead": 0.3,
        "cuda_context": 0.3,
        "available_for_model": 8.0 - 0.3 - 0.3,  # 7.4 GB
        "max_model": "Qwen2.5-32B Q4_K_M (18GB) -> 7.4GB on GPU + 10.6GB CPU offload",
        "max_full_gpu": "Mistral-Nemo-12B Q4_K_M (7.2GB) -> barely fits",
    },
    "Ollama": {
        "overhead": 0.5,
        "cuda_context": 0.3,
        "available_for_model": 8.0 - 0.5 - 0.3,  # 7.2 GB
        "max_full_gpu": "7B Q4_K_M (4.7GB) -> comfortable, 12B -> tight",
    },
    "LM Studio": {
        "overhead": 0.6,
        "cuda_context": 0.3,
        "available_for_model": 8.0 - 0.6 - 0.3,  # 7.1 GB
        "max_full_gpu": "7B Q4_K_M (4.7GB) -> comfortable, 12B -> difficult",
    },
    "vLLM": {
        "overhead": 1.5,
        "cuda_context": 0.3,
        "available_for_model": 8.0 - 1.5 - 0.3,  # 6.2 GB
        "max_full_gpu": "Even 7B models have no headroom",
        "note": "Not recommended for 8GB",
    },
}

The overhead difference between llama.cpp and vLLM is 1.2GB. That 1.2GB could buy you:

Additional KV cache allocation to extend context length
Room to co-locate a BGE-M3 embedding model alongside your LLM
Higher GPU offload ratio for the model, speeding up inference

On 8GB VRAM, framework selection isn't a preference. It's an architectural decision.

Function Calling Support

As covered in my function calling article (separate article), tool use is the killer feature for local LLMs. Here's where each framework stands:

function_calling_support = {
    "llama.cpp (llama-server)": {
        "supported": True,
        "method": "OpenAI-compatible tools parameter",
        "GBNF_grammar": True,  # Enforces JSON output grammatically
        "quality": "Model-dependent. High accuracy with Qwen2.5-7B-Instruct + GBNF grammar",
        "limitation": "Requires manual server startup",
    },
    "Ollama": {
        "supported": True,
        "method": "OpenAI-compatible tools parameter (v0.4+)",
        "GBNF_grammar": False,  # No raw GBNF, but format parameter supports JSON Schema
        "quality": "Same as llama.cpp (identical backend)",
        "limitation": "No GBNF grammar, but structured output via format parameter with JSON Schema",
    },
    "LM Studio": {
        "supported": True,
        "method": "OpenAI-compatible tools parameter",
        "GBNF_grammar": True,  # JSON Schema enforcement
        "quality": "Testable through GUI, which is the main advantage",
        "limitation": "Backend equivalent to llama.cpp",
    },
    "vLLM": {
        "supported": True,
        "method": "OpenAI-compatible tools + Guided Decoding",
        "quality": "High accuracy via Guided Decoding",
        "limitation": "Needs gpu_memory_utilization tuning on 8GB, practically 16GB+ recommended",
    },
    "GPT4All": {
        "supported": False,
        "note": "No function calling support. Chat only.",
    },
}

GPT4All doesn't support function calling. It's unusable for agentic workflows. vLLM's Guided Decoding is powerful but impractical on 8GB. For function calling on 8GB VRAM, you're limited to the llama.cpp family -- direct, Ollama, or LM Studio.

Recommendations by Use Case

recommendations = {
    "Maximum performance (developers)": {
        "pick": "llama.cpp (CLI / llama-server)",
        "reasons": [
            "Minimal overhead (0.3GB)",
            "GBNF grammar enforces structured output",
            "Direct control over all parameters",
            "Per-layer GPU/CPU offload granularity",
        ],
        "downside": "Requires technical knowledge, no GUI",
    },
    "Convenient daily use": {
        "pick": "Ollama",
        "reasons": [
            "Docker-pull simplicity (ollama pull model)",
            "Background daemon, always available",
            "OpenAI-compatible API for drop-in replacement",
            "Within 6% of llama.cpp speed",
        ],
        "downside": "No GBNF grammar (JSON Schema via format param available), slightly larger overhead",
    },
    "GUI-driven experimentation": {
        "pick": "LM Studio",
        "reasons": [
            "Model search and download entirely in GUI",
            "Chat UI for real-time testing",
            "Function calling testable through the interface",
        ],
        "downside": "Higher memory footprint due to GUI layer",
    },
    "Easiest possible start (non-engineers)": {
        "pick": "GPT4All",
        "reasons": [
            "Install -> launch -> chat in minimal steps",
            "Fully offline",
            "No unnecessary configuration options",
        ],
        "downside": "No function calling, slowest speed, limited customization",
    },
    "Production / server deployment": {
        "pick": "vLLM (16GB+ GPU recommended) or llama-server",
        "reasons": [
            "vLLM: PagedAttention for efficient batch processing",
            "llama-server: Lightweight server that works on 8GB",
        ],
        "downside": "vLLM impractical on 8GB",
    },
}

The Verdict for 8GB

Question: What's the optimal framework for 8GB VRAM?

Answer: Depends on use case. But technically optimal is raw llama.cpp.

Why:
1. Minimum overhead (0.3GB) -> maximum usable VRAM
2. Fastest speed (+6-11% over other frameworks)
3. GBNF grammar enforces structured output -> highest function calling reliability
4. Per-layer GPU/CPU offload control

However:
- For daily use, Ollama's convenience outweighs the speed gap
- If you need a GUI, LM Studio is the only option
- vLLM is impractical on 8GB (needs 16GB+)
- GPT4All is unsuitable for agentic tasks (no function calling)

The total speed spread across all frameworks is within 11%.
Model selection matters far more than framework selection.
The gap between Qwen2.5-3B (2.0GB) and Qwen2.5-7B (4.7GB)
dwarfs the gap between llama.cpp and GPT4All.

If you're spending time agonizing over frameworks, spend it benchmarking models instead.

References

llama.cpp -- github.com/ggerganov/llama.cpp
Ollama -- ollama.ai
LM Studio -- lmstudio.ai
vLLM -- vllm.ai
GPT4All -- gpt4all.io
"Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023) arXiv:2309.06180

20260324_ai_bubble_8gb_en

plasmon — Tue, 14 Apr 2026 09:54:04 +0000

What the Bubble Doomsayers Are Actually Looking At

Q1 2026, and AI bubble collapse discourse is back with a vengeance. VC pullback headlines, startup consolidation reports, pundits drawing dot-com parallels on every platform. The takes are everywhere.

Their arguments boil down to three things:

AI stock valuations are detached from reality — NVIDIA's P/E ratio peaked above 60. If revenue growth stalls, correction is inevitable
Monetization isn't keeping up — Is GPT-4o's $20/month subscription actually profitable? Per-call inference costs remain high
Hype fatigue — Markets are going numb to weekly model announcements

And honestly? They're right. VC inflows slowing down and AI startup consolidation is practically guaranteed at this point.

But this argument has a fatal blind spot. The entire bubble narrative is scoped to data-center-scale economics.

API-Dependent Engineers Will Absolutely Feel the Pain

Let me be upfront. I don't think bubble fallout will be zero.

If you're building products on top of APIs, these scenarios are real risks:

API price spikes: OpenAI may not be able to sustain GPT-4o at $2.50/1M input tokens forever. When investor subsidies dry up, pricing corrects to actual cost
Service shutdowns and consolidation: Anthropic, Mistral, Cohere — there's no guarantee all of them survive through 2026. The API you depend on could vanish
Model quality stagnation: Frontier models that cost hundreds of millions to train may see slower development cycles

The third point is the one you can't hand-wave away. There's a real quality gap between frontier models like Claude 4 and local 8B-32B models. Training data scale, RLHF investment, evaluation pipeline budgets — these differ by orders of magnitude. I don't honestly believe local models will close that gap entirely. Not with the current Transformer architecture, anyway.

That's the scope where bubble collapse arguments hold water.

Now Let's Talk About Life in 8GB VRAM Territory

RTX 4060 8GB. M4 Mac mini 16GB. The machine I'm writing this on is the counter-argument.

In the local LLM world, a bubble bursting is a capital flow problem upstream, not a problem with our inference pipeline.

Here's why. Three structural reasons.

Reason 1: Model Weights Are Downloaded Physical Files

Qwen3.5-9B-Q4_K_M.gguf. That's a 5.3GB binary file downloaded from Hugging Face. It exists on my local disk.

If Alibaba Cloud disbands the Qwen team tomorrow, this file doesn't disappear.

# Local model inventory
ls -lh ~/models/*.gguf

# Actual output (RTX 4060 8GB setup)
# -rw-r--r-- 1 user 5.3G qwen3.5-9b-q4_k_m.gguf
# -rw-r--r-- 1 user  21G qwen3.5-35b-a3b-q4_k_m.gguf  (MoE: 3B active)
# -rw-r--r-- 1 user 4.6G llama-3.1-8b-instruct-q4_k_m.gguf
# -rw-r--r-- 1 user 2.4G phi-4-mini-q4_k_m.gguf

# Total: 33GB — fits on a 64GB microSD

An API endpoint disappears when a company makes a business decision. A GGUF file disappears when your SSD dies. That difference is decisive.

Reason 2: The Inference Engine Is Open Source and Community-Driven

llama.cpp's GitHub repo has over 700 contributors. Even if Meta, Google, or Microsoft gut their AI divisions, as long as Georgi Gerganov keeps writing code on his MacBook, llama.cpp isn't going anywhere.

# llama.cpp release cadence (2025-2026)
# b8233 (2026-03) — Qwen3.5 MoE optimization
# b8102 (2026-03) — Flash Attention v2 improvements
# b7955 (2026-02) — KV cache compression improvements
# b7811 (2026-02) — INT4 GEMM kernel optimization

# Releases every two weeks or less
# This development velocity has nothing to do with corporate funding

What matters most: llama.cpp improvements keep boosting performance on the same hardware. No new GPU needed. When I first ran Qwen2.5-32B on my RTX 4060 8GB, I got 8.2 tok/s at ngl=20. After llama.cpp's Flash Attention improvements, same config hit 10.8 tok/s. Same hardware. Free software upgrade.

Reason 3: Quantization Is Math, Not a License

Q4_K_M, Q5_K_S, IQ4_XS — these are algorithms. Not proprietary tech locked behind patents. Published in papers, implemented in open source.

# Quantization impact in hard numbers
models = {
    "Qwen3.5-9B FP16":    {"size_gb": 18.0, "fits_8gb": False},
    "Qwen3.5-9B Q4_K_M":  {"size_gb": 5.3,  "fits_8gb": True},
    "Qwen3.5-27B FP16":   {"size_gb": 54.0, "fits_8gb": False},
    "Qwen3.5-27B Q4_K_M": {"size_gb": 16.0, "fits_8gb": False},  # Runs with CPU offload
    "Qwen3.5-35B-A3B Q4_K_M": {"size_gb": 21.0, "fits_8gb": False},  # MoE: 3B active, runs via CPU offload
}

for name, info in models.items():
    status = "GPU only" if info["fits_8gb"] else "CPU offload"
    print(f"{name:30s} {info['size_gb']:5.1f}GB  [{status}]")

# FP16 → Q4_K_M ≈ 3.5x compression
# This has nothing to do with Alibaba's balance sheet

Even if half of all AI companies go bankrupt, the Q4_K_M quantization algorithm doesn't vanish. The GGML format spec doesn't vanish. The llama.cpp binary doesn't vanish.

The Real Risks for Local LLM Are Elsewhere

I've been optimistic so far, but local LLM has weak spots too. Just not the ones bubble discourse is about.

Risk 1: New Model Training Slows Down

The model weights running on your machine were trained on massive GPU clusters owned by corporations. Qwen3.5 came from Alibaba's compute. The next Llama version depends on Meta's infrastructure.

If the bubble pops and these companies slash AI investment, new models stop appearing. Existing models keep running, but evolution stalls.

In practice though, Meta, Alibaba, and Google all treat their AI divisions as core infrastructure, not pure VC plays. Startups may die, but big tech's open model development won't stop overnight. Meta uses Llama internally for Instagram and WhatsApp inference. As long as internal demand exists, development continues.

Risk 2: CUDA Lock-in

llama.cpp supports CPU, Metal, Vulkan, and CUDA backends, but peak performance on an RTX 4060 requires CUDA.

There's a nonzero chance NVIDIA changes CUDA licensing. But ROCm (AMD) and Vulkan backends are maturing as real alternatives. The M4 Mac mini's Metal backend already delivers practical speeds comparable to CUDA. Single-point-of-failure risk on CUDA is meaningfully lower than it was three years ago.

Risk 3: Semiconductor Supply Chain Fragmentation

This is the most realistic threat. A Taiwan Strait crisis that halts TSMC fabs would cut off GPU supply. Your existing RTX 4060 keeps running, but if it breaks, there's no replacement.

The hedge is straightforward: watch Intel Arc improve, and diversify toward Apple Silicon. Intel Arc uses Intel's own fabs (Intel Foundry), while Apple Silicon is shifting toward TSMC's Arizona facility. Not a perfect hedge, but better than being entirely dependent on NVIDIA + TSMC Taiwan.

Making Your Personal AI Stack Bubble-Proof

Theory's done. What do you actually do?

1. Local Model Backups

Copy your GGUF files to a NAS or external SSD. If a Hugging Face repo gets taken down, you've still got the weights.

# Backup to external SSD
rsync -av --progress ~/models/*.gguf /mnt/backup_ssd/llm_models/

# Or just copy
cp ~/models/qwen3.5-9b-q4_k_m.gguf /mnt/backup_ssd/llm_models/

33GB of models. Fits on a 64GB microSD card. That's the entire cost of your bubble insurance policy.

2. Pin Your Runtime

Save a known-good llama.cpp build as a static binary.

# Build and save a verified version
cd llama.cpp
git checkout b8233
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j8
cp build/bin/llama-cli ~/stable_bins/llama-cli-b8233

# This binary has no external service dependencies
# Just needs CUDA Toolkit 12.x and an NVIDIA driver

3. Audit Your API Dependency

Map out which parts of your workflow rely on API calls.

[Dependency Checklist]
□ Code completion → Copilot (API) or local FIM?
□ Writing/editing → GPT-4o (API) or local 9B?
□ RAG embeddings → OpenAI Embeddings (API) or BGE-M3 (local)?
□ Image generation → DALL-E (API) or SDXL (local)?
□ Speech-to-text → Whisper API or whisper.cpp (local)?

You don't need to eliminate all API usage. For tasks that genuinely need frontier capabilities — deep chain-of-thought reasoning, multimodal analysis — use the API. But know whether a fallback path exists for when that API disappears.

Proving It with Numbers on 8GB

Let's ground the bubble debate in actual measurements. How far can an RTX 4060 8GB go as an API replacement?

[RTX 4060 8GB Local Inference Benchmark — 2026-03]

Task                    Model               tok/s   Quality (subjective /5)
─────────────────────────────────────────────────────────────
Code completion (Python) Qwen3.5-9B Q4_K_M   33.0    ★★★★☆
Technical doc summary    Qwen3.5-9B Q4_K_M   37.1    ★★★☆☆
Mathematical reasoning   Qwen3.5-35B-A3B     8.6     ★★★★☆
Paper reading (RAG)      BGE-M3 + Qwen3.5-9B 28.5    ★★★☆☆
Chat / dialogue          Qwen3.5-9B Q4_K_M   33.0    ★★★★☆

Ref: Claude Sonnet 4.6 API                    ~80     ★★★★★
Ref: GPT-4o API                             ~60     ★★★★★

Power draw: ~95W × usage hours (no API fees, $0/month fixed)

I won't pretend local quality beats frontier APIs. Claude Sonnet and GPT-4o are in a different league from a local 9B model for reasoning tasks. That's just honest.

But 33 tok/s code completion at $0/month, works offline, no rate limits, data never leaves your machine — that structural advantage holds whether the bubble bursts or not.

The Bubble Is a Data Center Problem

Strip it all down, and nearly every AI bubble take is about the same thing: return on massive capital investment. Billions in training clusters, thousands of H100s, millions per year in power costs — whether that scale of business is sustainable.

Your personal 8GB VRAM is not in that blast radius.

An RTX 4060 costs around $350. An M4 Mac mini runs about $700. Model weights are free to download. llama.cpp is free to use. Quantization algorithms are in published papers.

All of this exists independently of VC capital flows.

When the bubble pops, the people in trouble are companies running products on API subscriptions and investors holding NVIDIA stock. Not the individual engineer running Qwen3.5 on 8GB of VRAM.

If anything, a bubble collapse might accelerate migration from API-dependent products to local inference. If API prices climb, the relative appeal of local goes up. For those of us in 8GB territory, a bubble burst could be a tailwind.

One caveat though. The risk of frontier model stagnation is real. Getting complacent about your local 9B being "good enough" and ignoring cutting-edge reasoning capabilities only available via API — that's a different kind of danger. Don't get comfortable just because you're outside the bubble. Keep both tools in your belt. That's the optimal play at individual scale.

References

llama.cpp: https://github.com/ggerganov/llama.cpp
Hugging Face GGUF Models: https://huggingface.co/models?library=gguf
Qwen3.5 Model Family: https://huggingface.co/Qwen
GGML Quantization Methods: https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md

20260324_snn_vs_gpu_en

plasmon — Tue, 14 Apr 2026 09:54:01 +0000

GPU Dominance in AI Inference Is Getting Challenged

Running llama.cpp on an RTX 4060, the fans scream. 95W. 38 tok/s. The results are fine, but the moment you talk power efficiency, things get awkward. An M4 Mac mini pulls the same speed at 30W, and CUDA's brute-force approach becomes hard to defend.

Meanwhile, the biological brain runs on 20W. And most of that goes to maintaining membrane potentials and keeping synapses on standby — the incremental cost of "conscious thought" is less than 5% above baseline (Raichle, Science, 2006). That puts actual thinking at under 1W.

The human brain has roughly 86 billion neurons, and only 1-2% fire at any given moment (Lennie, Current Biology, 2003). Only the neurons that need to spike do so, only when needed. This is fundamentally different from Transformer inference, where every parameter is active on every token.

Spiking Neural Networks (SNNs) and neuromorphic computing are trying to bring this biological design principle into hardware. Three interesting papers dropped in Q1 2026. I read them, and thought about where GPUs are headed.

SPARQ: 330x Energy Savings, With Caveats

SPARQ, published on arXiv in March 2026, integrates quantization-aware training and reinforcement-learning-based early exit into a unified SNN framework.

The key insight: dynamically deciding spike propagation depth per input. Easy inputs get classified at shallow layers; only hard inputs propagate to deeper layers. Close to what biological brains actually do.

The numbers:

[SPARQ Benchmark Results — from paper Table 2/3]

MLP on MNIST:
  Baseline SNN: 95.00%    QSNN: 94.50%    SPARQ (QDSNN): 97.80%

LeNet-5 on MNIST:
  Baseline SNN: 97.76%    QSNN: 93.09%    SPARQ (QDSNN): 98.24%

AlexNet on CIFAR-10:
  Baseline SNN: 77.01%    QSNN: 74.30%    SPARQ (QDSNN): 78.00%

Energy consumption: SPARQ achieves 330x+ reduction vs baseline
Synaptic operations: 90%+ reduction

330x energy savings. Looks stunning at first glance. But read carefully.

The evaluated models are MLP, LeNet, AlexNet — MLP is a classic, LeNet is from 1998, AlexNet from 2012. Not even ResNet-50. Let alone billion-parameter Transformers. SPARQ's achievement is excellent optimization within the SNN paradigm, but it's not yet a story about replacing GPU-based Transformer inference.

One more thing: that 330x figure is relative to a baseline SNN, not a GPU. The SNN baseline itself hasn't been compared under identical conditions to GPU inference.

FPGA + RISC-V SoC: Neuromorphic You Can Actually Touch

Another March 2026 paper, the FPGA SNN study, takes a different approach.

It's a SoC architecture integrating a RISC-V controller with an event-driven SNN core. Multipliers are replaced with bitwise operations (binary weights), using spike-timing-based temporal coding. Implemented on FPGA — hardware you can actually buy.

This is where it gets interesting. Intel Loihi 2 and IBM NorthPole are research-institution-only chips. You can't just buy one. But FPGAs (Xilinx Artix-7, Intel Cyclone V) cost a few hundred dollars. RISC-V is open source. The path to running neuromorphic experiments at individual scale is opening up.

The paper validates on image classification tasks (MNIST/Fashion-MNIST), but the architectural design is general-purpose. Event-driven processing, binary weights, temporal coding — these are foundational technologies for ultra-low-power inference on edge devices.

Loihi 2 and Hala Point: Intel's Serious Bet, and the Quiet Slowdown

Intel Labs has delivered Hala Point, a massive neuromorphic system based on Loihi 2, to Sandia National Laboratories.

[Hala Point Specs]

Processors:          1,152 × Loihi 2
Neurons:             1.15 billion
Synapses:            128 billion
Neuromorphic Cores:  140,544
Power Consumption:   Up to 2,600W
Form Factor:         6 rack units

1.15 billion neurons. Roughly 1.3% of the human brain. Running at 2,600W. Compare that to an H100 at 700W TDP × thousands of GPUs in an AI cluster — the per-neuron power efficiency is orders of magnitude better.

But let's be honest about something.

Intel has over 200 neuromorphic research community partners, but no clear commercial product roadmap has been published. Loihi 2 remains a research chip. Hala Point is a proof-of-concept system, not a product flowing through the market like NVIDIA's GPUs.

Given that Intel hasn't officially announced a Loihi 3 tape-out, a future where neuromorphic immediately replaces GPUs isn't visible. Innatera demoing real-world neuromorphic edge AI at CES 2026 is encouraging, but that's an edge-specific story.

Spike Sparsity at 0.1 Gets You 3.6x; Above 0.5, You Lose

Under what conditions can SNNs beat GPUs? CEA's (French Alternative Energies and Atomic Energy Commission) hardware-aware comparative study provides clear numbers.

[SNN vs ANN Energy Efficiency — Variation by Spike Sparsity]

Spike Sparsity (spikes/synapse/inference)
  0.1  → SNN is 3.6x more energy-efficient than ANN
  0.3  → SNN is 1.5x more energy-efficient than ANN
  0.5  → SNN ≈ ANN (roughly equivalent)
  0.7  → ANN is more energy-efficient
  1.0  → ANN wins by a wide margin

Conclusion: Lower spike sparsity favors SNN
           Above 0.5 spikes/synapse, SNN advantage disappears

Spike sparsity of 0.1 — meaning only 10% of all synapses fire per inference — gets you 3.6x energy savings. This is a condition close to how biological brains actually operate.

The problem: achieving this level of sparsity reliably with current SNN training algorithms is hard. SPARQ's early exit approach is attacking this, but large-scale model validation is still ahead.

There's an even more interesting data point. Knight & Nowotny (2018)'s benchmark study in Frontiers in Neuroscience showed that running SNN simulations on a GPU was 14x more energy-efficient than SpiNNaker, a dedicated neuromorphic chip.

Ironic. The SNN that was supposed to run on neuromorphic hardware turns out to be more efficient on a GPU. Hardware maturity gaps are eating the architectural advantage alive.

Why the GPU Won't Die: Software Ecosystem Inertia

Technical potential alone doesn't win. Look at how massive the CUDA ecosystem is.

[Software Ecosystem Comparison — March 2026]

              GPU (CUDA)           SNN (Neuromorphic)
─────────────────────────────────────────────────────
Major Frameworks:   PyTorch, TF,         Lava (Intel), Norse,
                   llama.cpp, vLLM      snnTorch, SpikingJelly
GitHub Stars:       ~98K (PyTorch)       ~2K (snnTorch)
Commercial HW:      RTX/A100/H100 etc.   Loihi 2 (research),
                   Buy today            Innatera (CES 2026 demo)
Programming         Medium               High (spike encoding,
Difficulty:        (Python + CUDA)       timing design required)
Pretrained Models:  HuggingFace 1M+      Hundreds (research)

PyTorch's 98K stars vs snnTorch's 2K stars. That 50x gap is a developer community gap, a bug-fix velocity gap, a StackOverflow answer count gap.

llama.cpp ships releases every two weeks, improving performance on the same RTX 4060 for free. No SNN framework matches that development velocity.

What's Left at Individual Scale

Datacenter power problems (H100 at 700W × thousands of units) are where SNN's energy efficiency matters. Acknowledged.

But at individual scale with an RTX 4060 at 95W, power isn't the bottleneck. One wall outlet covers it.

Where SNNs matter for individuals:

Always-on edge inference — 24/7 inference on battery-powered devices. Wearables, IoT sensors, robotic vision processing. SNN could own this space
FPGA experimentation — The era of running neuromorphic experiments on a few-hundred-dollar FPGA board is arriving. RISC-V + SNN SoC is realistic for education and research
Ultra-low-latency processing — Event-driven by nature, processing fires only when input arrives. Fundamentally lower latency than frame-based GPU processing

Conversely, LLM inference — pushing massive parameters at high throughput — is GPU territory. Transformer attention is dense matrix math, and it's a bad match for sparse-firing SNNs.

At least with current algorithms, there's no incentive to port LLM inference to SNNs. The possibility of sparse inference and SNN convergence in the future isn't zero, but that's a next-generation story.

SNNs Won't Kill the GPU — But They'll Take the Seat Next to It

Time for an answer. Can SNNs kill the GPU? No. But they'll coexist.

GPUs remain the kings of dense matrix computation. LLM inference, image generation, large-scale training — these are GPU territory. Running Qwen3.5 at 33 tok/s in 8GB VRAM on an RTX 4060 isn't something SNNs can replace.

Where SNNs win is the edge. Battery-powered, always-on, ultra-low-latency. Sensor fusion, anomaly detection, robotic control. SPARQ's 330x energy savings means something in this context.

Looking at Intel's quiet roadmap and Innatera's entry at CES 2026, neuromorphic computing is transitioning from research phase to edge deployment phase. Encroachment on general-purpose computing is still 5+ years out.

If there's one thing worth doing as an individual engineer right now — grab an FPGA board and play with snnTorch. A few hundred dollars gets you to the doorstep of the next computing paradigm. You don't have to give up your GPU. Keep both.

References

SPARQ: Spiking Early-Exit Neural Networks for Energy-Efficient Edge AI — SNN + quantization + early exit integrated framework
An FPGA-Based SoC Architecture with a RISC-V Controller for Energy-Efficient Temporal-Coding SNNs — Neuromorphic SoC accessible to individuals
A feedback control optimizer for online and hardware-aware training of SNNs — Hardware-aware learning for neuromorphic devices
GPUs Outperform Current HPC and Neuromorphic Solutions in Terms of Speed and Energy When Simulating a Highly Connected Cortical Model — Knight & Nowotny (2018), GPU vs SpiNNaker energy efficiency comparison
Are SNNs really more energy-efficient than ANNs? — CEA's hardware-aware comparative study
Innatera CES 2026 Demo (PR Newswire, 2026-01) — Real-world neuromorphic edge AI

VRAMを増やせば解決する、は物理的に間違っている — HBM・CXL・Unified Memoryが取れなかったもの

plasmon — Tue, 14 Apr 2026 09:53:58 +0000

VRAMを増やせば解決する、は物理的に間違っている — HBM・CXL・Unified Memoryが取れなかったもの

HBMを6倍に増やしても、載せられるモデルサイズは2倍にしかならない。RTX 5060のVRAMが16GBに倍増しても70Bはフルに載らない。「VRAMが足りないなら増やせばいい」——この発想は、帯域・容量・コストの物理的トレードオフを無視している。

HBM、CXL、Unified Memory。この3つはVRAMの壁に対する異なるアプローチだ。それぞれが「帯域」と「容量」と「コスト」の三角形のどこに位置するかで、LLM推論の性能が根本的に変わる。

メモリの三角形: 帯域・容量・コスト

技術              帯域          容量       コスト/GB    インターフェース
────────────────────────────────────────────────────────────────────
HBM3E (H200)     4,800 GB/s    141 GB     $10-15       TSV 1024-bit × 6 stacks
GDDR6 (RTX4060)    272 GB/s      8 GB     $2.5-4       128-bit, 17 Gbps
CXL 3.1             64 GB/s*    TB級      $3-5         PCIe 6.0 x16
Unified (M4 Max)   546 GB/s    128 GB     Apple依存    LPDDR5X 512-bit

* per direction (128 GB/s bidirectional)

各技術の物理的な特徴:

HBM3E: シリコン貫通電極（TSV）で垂直に積層。帯域は圧倒的だが、インターポーザ面積とコストを食う
GDDR6: 基板上のはんだ接続。安くてGPU独占で使えるが、容量に限界がある
CXL 3.1: 既存のPCIeインフラを流用。TB級の容量が取れるが、メモリ読み出し帯域はHBM3Eの1/75
Unified Memory: CPU/GPU/NPUが同じメモリプールを共有。コピーコストゼロだが、帯域は共有=競合

HBMは帯域、CXLは容量、Unified Memoryはバランスを選んだ。どれも三角形の全頂点は取れない。

HBM: 帯域の王、容量の奴隷

HBMの帯域は、~5,000本以上のシリコン貫通電極（TSV）による1024-bit幅のバスから生まれる。H200は6スタックを搭載し、合計6144-bitのバス幅で4.8 TB/sを実現する。GDDRの18倍だ。

だが容量には物理的な天井がある:

HBM3E スタック構成:
  1ダイ = 24 Gbit (3GB)
  8-Hi (8枚積層) = 24 GB/stack
  12-Hi (次世代)  = 36 GB/stack
  H200: 6 stacks × 24GB = 144 GB raw (公称141 GB)
  スタック単価: $240-360推定 ($10-15/GB)

「もっとスタックを増やせば？」——ここで面積の問題にぶつかる。各HBMスタックはインターポーザ上で~100 mm²を占有する。GPU die (~800 mm²) + 6スタック (~600 mm²) = ~1400 mm²。CoWoS-Sの現行上限は約2831 mm²（3.3xレチクル）で、H200にはまだ余裕があるが、インターポーザの大型化はコストと歩留まりを直接悪化させる。

LLM推論への影響

GPU                     最大モデル (Q4)   帯域       備考
─────────────────────────────────────────────────────────────
H200 (141GB HBM3E)     ~280B             4,800 GB/s  70B FP16だとKV cache余裕1GB
RTX 4060 (8GB GDDR6)   ~13B              272 GB/s    13B以上はCPUオフロード必須
RTX 5060 (16GB GDDR7)  ~30B              448 GB/s    容量2倍でもモデルは2倍にならない

VRAMを倍増しても載せられるモデルサイズは倍にならない。KV cacheの存在がある。32Kコンテキストの70B FP16モデルのKV cacheは約8GB。VRAMの「余り」がKV cacheに食われる。

CXL: 容量の解放、帯域の犠牲

CXL (Compute Express Link) はPCIeの物理層上に構築されたメモリ拡張プロトコルだ。

CXL 3.1はPCIe 6.0の物理層上に構築され、64 GB/s per direction（x16レーン）の帯域を提供する。レイテンシは170-400 ns（DDR5ローカルの2-4倍）。容量はmemory poolingにより理論上無制限だが、現時点ではサーバー/データセンター向けだ。

CXLの帯域でLLM推論すると何が起きるか:

モデル                 CXL (64 GB/s)    GDDR6 (272 GB/s)    HBM3E (4,800 GB/s)
──────────────────────────────────────────────────────────────────────────
7B Q4_K_M (4.7GB)     ~13.6 t/s        ~32 t/s (実効)       ~1021 t/s (理論)
70B Q4_K_M (40GB)     1.6 t/s          N/A (載らない)       ~120 t/s (理論)

70B Q4の重みをCXLから読むと1.6 t/s。人間が読む速度と変わらない。

だが、CXLの真価は「重みの格納場所」ではない。

階層型メモリアーキテクチャ:

Tier    メモリ         用途                       容量       帯域        レイテンシ
────────────────────────────────────────────────────────────────────────────
 0      GPU SRAM       アクティベーション          24 MB     ~4 TB/s     ~1 ns
 1      HBM/GDDR      重み、アクティブKVキャッシュ 8-141 GB  272-4800    ~10 ns
 2      CXL Memory     KVキャッシュのオーバーフロー TB級      64 GB/s     170-400 ns
 3      NVMe SSD       永続ストレージ              TB級      7 GB/s      ~10,000 ns

CXLの本質は「VRAMの代替」ではなく「VRAMとNVMeの間を埋める新しい層」だ。KVキャッシュの古いトークン（128Kコンテキストの最初の方）をCXLメモリに退避させれば、VRAM上には直近のアテンション範囲だけ残す設計が可能になる。KVキャッシュのオーバーフロー先としてNVMeの9倍高速。

この階層化は、光メモリ読み出し（KVキャッシュの物理的な転送量削減）やKVキャッシュ量子化（データ量を数値的に削減）とは直交する最適化だ。組み合わせられる。

Unified Memory: バランスの罠

Apple SiliconのUnified Memoryは、CPU・GPU・NPUが同じ物理メモリプールを共有する。

チップ          容量      帯域         バス幅     共有先
───────────────────────────────────────────────────────────────────
M4 Max         128 GB    546 GB/s     512-bit    CPU 12コア + GPU 40コア + NPU 16コア + メディアエンジン
M4 (base)      16-32 GB  120 GB/s     128-bit    同上（RTX 4060の272 GB/sの半分以下）

LLM推論での現実

M4 Max 128GB: 70B Q4_K_M (40GB) がメモリ管理なしで全量載る。理論上限 546/40 = 13.7 t/sだが、実測は8-10 t/s。CPU/NPU/IOとの帯域共有がボトルネック
M4 32GB: 32B Q4で理論 120/18 = 6.7 t/s → 実測4-5 t/s。RTX 4060はGDDR6 272 GB/sを独占するため、同モデルで10.8 t/sを出す

帯域共有の問題は構造的だ。GPU推論中もCPUがメモリにアクセスし、帯域を食い合う。macOSのメモリ管理やUI描画がバックグラウンドで帯域を消費する。Safariで大きなページを開きながら推論すれば、体感で速度が落ちる。

Unified Memoryの利点は「GPUメモリ管理の排除」だ。CUDAのcudaMalloc/cudaMemcpyが不要。データはすでにそこにある。コピーコストゼロ。

だが帯域は共有資源であり、独占できない。RTX 4060のGDDR6は272 GB/sをGPUが事実上独占する。M4のベースモデルは120 GB/sをシステム全体で分け合う。

GPU                      総帯域     GPU占有率              LLM実効帯域   推論速度
─────────────────────────────────────────────────────────────────────────────────
RTX 4060 (8GB GDDR6)     272 GB/s   ~95% (DP出力程度)      ~258 GB/s    7B Q4: 32 t/s (実効率58%)
M4 Max (128GB LPDDR5X)   546 GB/s   大半 (CPU/NPU/IOと競合) ~400 GB/s    70B Q4: 8-10 t/s
M4 base (16GB LPDDR5X)   120 GB/s   システム全体と共有      ~78 GB/s     7B Q4: 14-16 t/s

RTX 4060は帯域が小さいがGPU独占で、小モデルなら最速。M4 Maxは帯域が大きいが共有のため、大モデルを載せられる代わりに帯域あたりの効率は低い。M4 baseは帯域も容量も中途半端で、LLM用途ではRTX 4060に負ける。

3つのアプローチの比較

                帯域         容量        コスト     LLM推論での位置
─────────────────────────────────────────────────────────────────
HBM3E          4,800 GB/s    141 GB     $10-15/GB   重み+KVを高速に読む
GDDR6         272 GB/s      8-24 GB    $2.5-4/GB   小モデルを高速に回す
CXL 3.1        64 GB/s       TB級       $3-5/GB     KVキャッシュのオーバーフロー先
Unified (Max)  546 GB/s      128 GB     Apple依存   大モデルをゼロコピーで載せる
NVMe SSD       7 GB/s        TB級       $0.1/GB     モデルの永続ストレージ

HBM (H100/H200) — バッチ推論、複数リクエスト同時処理。帯域を複数リクエストで共有できるため、1リクエストあたりのコスト効率が高い。ただし単一リクエストでは700W TDPの大半が無駄になる
GDDR (RTX 4060/5060) — 個人利用、単一リクエスト、小〜中モデル。GPU独占帯域で効率最大。115W TDPで32 t/s (0.28 t/s/W)、小モデルなら電力効率でH100単一リクエスト (700W) に勝る。ただし容量の壁があり、8GBでは7Bが上限
CXL — 超長コンテキスト推論 (128K+)、メモリプール共有。KVキャッシュが数十GBに膨らむ長コンテキストでVRAM不足を解消する。ただし帯域はHBM3Eの1/75で、重みの格納先としては遅すぎる。サーバー向け2025-26年、コンシューマーは2028年以降
Unified Memory (Apple) — 大モデルを手軽に動かしたい開発・実験用途。70B Q4がメモリ管理なしで動く。ただし帯域共有で速度効率はGDDR独占に劣り、ゲーミング用途との両立は困難

8GB VRAMユーザーへの実用的示唆

Layer 1: 量子化（即効性あり）
Q4_K_M量子化で7Bモデルの重みが14GB → 4.7GBになる（3倍の容量効率）。llama.cpp/Ollamaで標準サポート。

Layer 2: KVキャッシュ量子化（実験段階）
--cache-type-k q4_0 --cache-type-v q8_0 でKVキャッシュをFP16の1/3に圧縮。長コンテキスト対応の鍵。詳細は「KVキャッシュをQ4に落としたら32Kコンテキストが8GBに収まった」で検証した。

Layer 3: CPUオフロード（帯域トレードオフ）
--n-gpu-layers で部分的にGPUに載せれば、32Bモデルが動く（遅いが動く）。RTX 4060で32B最適オフロード時10.8 t/s。ボトルネックはCPU↔GPU間のPCIe 4.0 x8 = 16 GB/s。

Layer 4: CXL（将来）
CXLメモリモジュールでPCIe経由のメモリ追加。KVキャッシュのTier 2ストレージとして機能する。コンシューマー向けは2028年以降。今のCPUオフロード（PCIe 16 GB/s）と原理は似ているが、CXLはメモリセマンティクス（load/storeアクセス、GPU直接アドレッシング）で差別化される。

今日できるのはLayer 1-3の組み合わせだ。Q4量子化 + KVキャッシュQ4 + 最適GPUオフロード = 32Bモデル × 32Kコンテキストが8GBで動く。将来CXLがLayer 4として加われば、128K+コンテキストが現実的になる。

注目すべきは、CXLが約束する「メモリ追加」は、今日のCPUオフロードと本質的に同じPCIeバスを通ることだ。帯域の天井は同じ。CXLの利点はメモリセマンティクス（load/storeでアクセスでき、GPUが直接アドレッシング可能）であって、帯域の向上ではない。

物理が決めるメモリの未来

「VRAMを増やせば問題は解決するか？」——解決しない。容量を増やすと帯域かコストが犠牲になる。

帯域・容量・コストの三角形は物理法則が支配しており、どの技術も3つ全ては取れない。HBMは帯域を取って容量とコストを犠牲にした。CXLは容量を取って帯域を犠牲にした。Unified Memoryはバランスを取って独占帯域を犠牲にした。GDDRは独占帯域を取って容量を犠牲にした。

LLM推論の最適解は「1つの技術を選ぶ」ことではなく、複数の技術を階層的に組み合わせることだ。

今日のRTX 4060で実行可能な最善策:

重み → VRAM（Q4量子化で7-13Bを全載せ）
KVキャッシュ → VRAM（Q4/Q8量子化で容量節約）
追加レイヤー → RAM（CPUオフロード、PCIe帯域）
永続ストレージ → NVMe SSD

将来のCXL搭載コンシューマーPCでの最善策:

重み → VRAM（Q4量子化）
アクティブKV → VRAM
古いKV → CXLメモリ（64 GB/sで十分なアクセス速度）
永続ストレージ → NVMe SSD

メモリの壁は「破る」ものではなく「階層で回避する」ものだ。

参考文献

CXL Consortium — "Compute Express Link Specification 3.1" (2024)
Samsung — "CMM-D: CXL Memory Module for Data Centers" (2024)
SK hynix — HBM3E specifications, 12-Hi stack architecture
NVIDIA H200 SXM specifications — 141GB HBM3E, 4.8 TB/s
Apple M4 Max specifications — 128GB Unified Memory, 546 GB/s
"Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023) arXiv:2309.06180

llama.cppの設定で8GBの性能が5倍変わる — 主要オプションの最適値を出した

plasmon — Tue, 14 Apr 2026 09:53:55 +0000

llama.cppの設定で8GBの性能が5倍変わる — 主要オプションの最適値を出した

llama.cppの起動オプションは50以上ある。そのほとんどはデフォルトのままでいい。だが8GB VRAMでは、5つのオプションの設定ミスが推論速度を半分にする。

以下は、RTX 4060 8GB (GDDR6 272 GB/s) での推定値（公開ベンチマーク・公式ドキュメント・VRAM使用量の理論計算から算出）に基づく設定ガイドだ。個別環境で数値は変動する。

最重要: `-ngl` (GPUレイヤー数)

-ngl はTransformerレイヤーのうちいくつをGPU VRAMに載せるかを決める。デフォルトは0（全レイヤーCPU = 最も遅い）。999を指定すると全レイヤーGPU（VRAMに収まれば最速）。モデルごとの総レイヤー数: Qwen2.5-7B = 28、Llama-3-8B = 32、Qwen2.5-32B = 64。

8GB VRAMでの最適値

モデル                              -ngl   VRAM使用    速度       備考
────────────────────────────────────────────────────────────────────────────
Qwen2.5-7B Q4_K_M (4.7GB)          999    ~5.4 GB    ~32 t/s    全28レイヤーGPU
Mistral-Nemo-12B Q4_K_M (7.2GB)    999    ~7.5 GB    ~20 t/s    KVでOOMの可能性。-c 2048推奨
Qwen2.5-32B Q4_K_M (18.5GB)         25    ~7.4 GB    ~10.8 t/s  64層中25をGPU、残りCPU

-ngl を1変えるだけで速度が数%変わる。最適値は「VRAMをぎりぎりまで使い切る」値だ。

最適値の探し方（二分探索）:

-ngl 999 で起動。OOMなら次へ
-ngl {総レイヤー数/2} で起動
OOMなければ増やす、OOMなら減らす
VRAMが7.0-7.5GB使用で安定する値が最適

RTX 4060 8GBの場合、0.5GBはCUDAコンテキスト+フレームワークに取られるため、実質7.5GBをモデルに使える。nvidia-smi でVRAM使用量を監視しながら調整するのが確実。7.8GB以上は推論中のOOMリスクがある。

`-c` (コンテキスト長)

-c は推論時に参照できるトークン数の上限。デフォルトは4096（llama.cpp v b8233）。KVキャッシュのVRAM消費に直結する。

KVキャッシュの計算式: KV cache = 2 × n_layers × n_kv_heads × head_dim × context_len × dtype_bytes

KVキャッシュのVRAM消費 (FP16)

コンテキスト長    Qwen2.5-7B (28層, 4 KV heads)    Qwen2.5-32B (64層, 8 KV heads)
─────────────────────────────────────────────────────────────────────────────────
4,096 tokens     0.22 GB                            1.00 GB
8,192 tokens     0.44 GB                            2.00 GB
32,768 tokens    1.75 GB                            8.00 GB
131,072 tokens   7.00 GB                            —

-ngl で部分オフロード時、KVキャッシュもレイヤー単位でCPU/GPUに分散される。-ngl 25 の場合、GPU上のKV = 25/64 × 上記値。

8GB VRAMでの推奨:

7Bモデル: -c 8192（KV 0.44GB、安全）、-c 32768（KV 1.75GB、flash-attn推奨）
32Bモデル（-ngl 25）: -c 4096（GPU上KV ~0.39GB）、それ以上はKV量子化必須

コンテキスト長を倍にするとKVキャッシュのVRAMも倍になる。8GBでは -c の設定が載せられるモデルサイズを直接決める。

`--cache-type-k` / `--cache-type-v` (KVキャッシュ量子化)

量子化オプション: f16（デフォルト、2 bytes/element）、q8_0（1 byte、VRAM半減）、q4_0（0.5 bytes、VRAM 1/4）。

推奨組み合わせ

プロファイル          K cache    V cache    VRAM倍率    品質劣化
──────────────────────────────────────────────────────────────────
品質重視              f16        f16        1x          なし
バランス (推奨)       q8_0       q8_0       0.5x        ほぼなし (一般タスク)
容量優先              q4_0       q8_0       0.375x      数学・推論で劣化あり*
最大圧縮              q4_0       q4_0       0.25x       顕著。長コンテキストで悪化

* V cacheはK cacheより量子化に弱い

実例: Qwen2.5-32B + -ngl 25 + 8Kコンテキスト on 8GB

-ngl 25 ではGPU上のレイヤーが25/64、KVも25/64がGPU上に置かれる。

KV全体 (f16, 8K): 2.00 GB → GPU上: 2.00 × 25/64 = 0.78 GB
GPU合計 (f16): 重み7.4 + KV 0.78 + overhead 0.3 = 8.48 GB → OOM
KV q8_0にすると: 0.78 × 0.5 = 0.39 GB → 7.4 + 0.39 + 0.3 = 8.09 GB → 動く
32Kコンテキスト (f16): GPU上KV = 3.13 GB → 不可能。q4_0でも8.48 GB → ギリギリ

起動コマンド:

llama-server -m model.gguf -ngl 25 -c 8192 --cache-type-k q8_0 --cache-type-v q8_0

`--flash-attn` (Flash Attention)

Flash Attentionはメモリ効率の高いAttention計算アルゴリズム。Attentionの中間バッファが不要になり数百MB節約、長コンテキストで高速化する（32Kで約10%）。4Kトークン以下では効果は小さい。要件はCUDA backend + RTX 20xx以降。KVキャッシュ量子化と併用可能。

設定 (Qwen2.5-7B Q4_K_M)      速度          VRAM       差分
──────────────────────────────────────────────────────────────
-c 8192, flash-attn OFF        31.8 t/s      5.6 GB     —
-c 8192, flash-attn ON         32.1 t/s      5.3 GB     +1%, -0.3 GB
-c 32768, flash-attn OFF       28.5 t/s      7.2 GB     —
-c 32768, flash-attn ON        31.5 t/s      6.5 GB     +10.5%, -0.7 GB

常に有効にすべき。デメリットなし。

--flash-attn はデメリットがない。常に付けておくべきオプションだ。

`-b` (バッチサイズ) と `-t` (スレッド数)

-b (batch size): prompt evaluation時に一度に処理するトークン数。デフォルト2048。8GBでは512推奨——バッチが大きいとprompt eval中のVRAMスパイクでOOMリスクがある。-ub (micro batch) はデフォルト512で変更不要。

-t (threads): CPU演算に使うスレッド数。デフォルトは全コア。推奨は物理コア数（HTなし）。HTの論理スレッドはメモリ帯域を食い合うだけ。例: i7-13700Hなら -t 6（Pコア6つ）。

スレッド数の影響 (Qwen2.5-32B Q4_K_M, -ngl 25)

-t 設定                  速度
──────────────────────────────
-t 6  (Pコア数)          10.8 t/s
-t 8  (P+Eコア)          10.5 t/s
-t 14 (全物理コア P+E)    9.8 t/s
-t 20 (HT含む全スレッド)  9.2 t/s

スレッド数を増やせば速くなるという直感は間違っている。HTの論理スレッドはL1/L2キャッシュとメモリ帯域を分け合うため、LLM推論ではオーバーヘッドになる。

サーバー用オプション (`llama-server`)

基本コマンド: llama-server -m model.gguf -ngl 999 -c 4096 --host 0.0.0.0 --port 8080

推奨追加オプション:

--flash-attn — メモリ効率化（常にON）
--metrics — Prometheus形式のメトリクス公開
--parallel 1 — 同時リクエスト数（8GBでは1推奨）
--cont-batching — Continuous batching（--parallel 2 以上で有効）

Function calling

--chat-template はGGUF内のテンプレートを自動検出する。function callingの tools パラメータはモデルのchat templateに依存する。推奨モデル: Qwen2.5-3B-Instruct Q4_K_M（2.0GB、軽量高速）、Qwen2.5-7B-Instruct Q4_K_M（4.7GB、品質と速度のバランス）。

構造化出力

--grammar-file でGBNF文法ファイルを指定すると、出力形式を強制できる。JSON出力の構文エラーが0%になる。ただし文法に合わない出力を生成しようとすると推論が遅くなることがある。llama.cpp b7000以降では --json-schema でJSON Schemaを直接指定する方法もある。

設定テンプレート集

テンプレート1: 7Bモデル、会話用 (最速)

llama-server \
  -m qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 999 \
  -c 8192 \
  --flash-attn \
  -t 6 \
  --host 127.0.0.1 --port 8080
# 期待速度: ~32 t/s, VRAM: ~5.4 GB

テンプレート2: 32Bモデル、品質重視 (部分オフロード)

llama-server \
  -m qwen2.5-32b-instruct-q4_k_m.gguf \
  -ngl 25 \
  -c 4096 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn \
  -t 6 \
  -b 512 \
  --host 127.0.0.1 --port 8080
# 期待速度: ~10.8 t/s, VRAM: ~7.4 GB

テンプレート3: 7Bモデル、長コンテキスト (32K)

llama-server \
  -m qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 999 \
  -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn \
  -t 6 \
  -b 512 \
  --host 127.0.0.1 --port 8080
# 期待速度: ~31 t/s, VRAM: ~6.9 GB (flash-attn有効時)

テンプレート4: 3Bモデル、function calling用 (軽量高速)

llama-server \
  -m qwen2.5-3b-instruct-q4_k_m.gguf \
  -ngl 999 \
  -c 4096 \
  --flash-attn \
  -t 6 \
  --host 127.0.0.1 --port 8080
# 期待速度: ~50 t/s, VRAM: ~2.5 GB
# 7Bと併用可能（合計VRAM ~8GB）

よくある失敗と対処

問題                    症状                          原因                                対処
──────────────────────────────────────────────────────────────────────────────────────────────
-ngl 0 (GPU未使用)     推論速度 3-5 t/s              全レイヤーCPU。DDR5がボトルネック    -ngl 999 → OOMなら減らす
-c が大きすぎる        推論開始直後にOOM              KVキャッシュがVRAM圧迫              -c 4096 or --cache-type-k q8_0
-t が多すぎる          CPU 100%なのに遅い             HT論理スレッドが帯域を食い合い      -t を物理コア数に
--mlock 使用           起動時メモリエラー             モデル全体RAMロック→物理メモリ不足   --mlock を外す (Windows特に不要)
バッチサイズ過大       長プロンプトでOOM              prompt eval中のVRAMスパイク          -b 512

設定による速度差のまとめ

設定変更                          速度影響        VRAM影響
──────────────────────────────────────────────────────────
-ngl 0 → 999 (全GPU)             +5-10x          +4-7 GB
-ngl 最適値の探索 (±5)           +10-20%         ±0.5 GB
--flash-attn 有効化               +1-10%          -0.3 GB
--cache-type q8_0                 ±0%             -50%
-t 全スレッド → 物理コア数        +5-15%          ±0
-c 32K → 4K (7B model)            +5%             -1.5 GB
-b 2048 → 512                    ±0%*            -0.2 GB**

* 生成速度には影響しない (prompt eval時間のみ)
** prompt eval中の一時的なVRAMスパイクを抑制

最も影響が大きいのは -ngl。次に -t。その他は微調整。8GB VRAMでは「-nglを最大化し、-cとKVキャッシュ量子化でVRAMを確保する」が基本戦略だ。

参考文献

llama.cpp — github.com/ggerganov/llama.cpp
llama.cpp Server documentation — github.com/ggerganov/llama.cpp/tree/master/examples/server
GGUF format specification — github.com/ggerganov/ggml/blob/master/docs/gguf.md
Flash Attention — "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (2023) arXiv:2307.08691

Forem: plasmon

Why Local LLM JSON Output Breaks — Failure Patterns and How to Fix Them in Code

API Gets One Line. Local Gets a Minefield.

The 3 Failure Patterns

Pattern 1: Failed — JSON Itself Is Broken

Pattern 2: Broken — Valid JSON, Wrong Content

Pattern 3: Nested Structure Collapse

Grammar: Necessary but Not Sufficient

3 Fixes That Actually Work

Fix 1: Explicit Schema in Prompt

Fix 2: Grammar + Retry

Fix 3: Two-Stage Generation (for Nested Structures)

Model Size Decision Guide

References

INT8 Hits 58x, Voltage Underscaling Saves 36% — Semiconductor Physics Limits Are Being Bypassed by Software in 2026

Why This Matters Now

Paper 1: INT8 Achieves 58x — DEEP-GAP Measures Where GPU Inference Actually Stands

Paper 2: Run It Broken on Purpose — DRIFT's Fault-Tolerant Inference

Paper 3: Spiking Neural at 4.2mW — L-SPINE Shows Another Direction

Measuring It on Real Hardware: RTX 4060 vs M4

2026-2030: My Predictions (All Personal Analysis)

Prediction 1: "Precision Hierarchy" Becomes the Next Design Axis

Prediction 2: DRIFT's "Tolerate Broken State" Design Goes Mainstream

Prediction 3: Terafab Symbolizes "Vertical Integration" as Industry Trend

Prediction 4: SNN Reaches Production in Robotics/Sensor Fusion by 2028

Prediction 5: NPU Architecture Becomes the Next Differentiator

What 8GB Users Should Watch

References

I Ran an LLM Agent on 8GB VRAM — It Broke After 5 Tool Calls

The "Long-Term Memory" Agent Is a Fantasy on 8GB

How Much KV Cache Does Each Tool Call Eat?

Context Rot — Long Context Kills Quality

Do Larger Models Solve This?

Workaround 1: Short Loops × Context Reset

Workaround 2: Persistent Q4 KV Cache

Workaround 3: Dynamic Tool Selection (Tool Loadout)

8GB Agent Design Principles

Principle 1: Loops Stay Under 5 Steps

Principle 2: Memory Carries as "Summaries"

Principle 3: Maximum 5 Tool Definitions

Principle 4: Monitor "Context Quality"

The 8GB Constraint Improves Agent Design

References

The Memory Wall Can't Be Killed — 3 Papers Proving Every Architecture Hits It

I Tested 3 "Escape Routes" from the Memory Wall

Test 1: Neuromorphic's "New Wall"

Test 2: KV Cache Dominates Even on Edge NPUs

Test 3: GQA Only Shrinks the Wall to One-Third

Wall Morphology Map: Architecture Changes, Wall Persists

Only Optical Computing Has a Different Underlying Principle

The Wall Transforms but Never Dies

References

The Physics Wall in 2026: 3 Papers That Show Why Node Shrinks Won't Save Us

"2nm Will Fix Everything" Is a Fantasy — Let's Drop It

Real Hardware Tells the Story: RTX 4060 vs M4 Power Efficiency

Paper 1: DRIFT — "Break Things on Purpose" Voltage Optimization for 36% Energy Savings

Paper 2: Trilinear Compute-in-Memory — Running Full Transformer Attention in NVM Cores

Paper 3: L-SPINE — Spiking Neural Network Running at 0.54W on FPGA

Rapidus 1.4nm Domestic Fabrication — Don't Misread the Numbers

2026–2030: My Predictions (Bold Personal Analysis)

Prediction 1: Consumer 2nm Won't Arrive Until 2029+

Prediction 2: Compute-in-Memory Becomes Mainstream for Inference Accelerators (~2028)

Prediction 3: SNNs Reach Production First in Sensor Fusion (2027–2028)

Prediction 4: TFLOPS Race Is Over. TOPS/W Race Has Begun

Prediction 5: MATCHA Points to the Heterogeneous SoC Era (2027+ Mass Production)

Measuring the "Wall" on Your Own Hardware

The Bottom Line

20260325_llamacpp_options_8gb_en

5 llama.cpp Settings That Turn 8GB VRAM From Sluggish to 5x Faster — Every Option Benchmarked

The Most Important: -ngl (GPU Layer Count)

-c (Context Length)

--cache-type-k / --cache-type-v (KV Cache Quantization)

--flash-attn (Flash Attention)

-b (Batch Size) and -t (Thread Count)

Server Options (llama-server)

Configuration Templates

Template 1: 7B Model, Chat Use (Maximum Speed)

Template 2: 32B Model, Quality Focus (Partial Offload)

Template 3: 7B Model, Long Context (32K)

Template 4: 4B Model, Function Calling (Maximum Reliability)

The Most Important: `-ngl` (GPU Layer Count)

`-c` (Context Length)

`--cache-type-k` / `--cache-type-v` (KV Cache Quantization)

`--flash-attn` (Flash Attention)

`-b` (Batch Size) and `-t` (Thread Count)

Server Options (`llama-server`)

最重要: `-ngl` (GPUレイヤー数)