Forem: gen

luckrig: a concept for tasting LLM rigs, not just models

gen — Fri, 22 May 2026 17:39:03 +0000

luckrig: a concept for tasting LLM rigs, not just models

HuggingFace Spaces lets you try models.
LMSys Arena lets you compare models.

Neither lets you try a specific rig.

Exact GPU. Exact quantization. Exact context length.
Someone's actual tuning notes — with your own prompt, right now.

That's the gap. luckrig is a concept to fill it.

If Arena maps models, luckrig maps the rigs.

Service	What you taste	Hardware visible?
HF Spaces	Author's model wrap	Whatever they printed
LMSys Arena	Blind A/B models	Model name. Nothing else.
AI Horde	Any worker that fits	Abstracted away
luckrig	A specific rig	GPU · quant · ctx · tuning

AI Horde abstracts the worker away.
luckrig makes the hardware the star.

Access earned by contribution, not money.

Inspired by Hotline Connect — the early-2000s Mac P2P tool where
contribution score, not payment, determined access rights.

Three seed nodes exist in the POC — not yet public.

first-5090-qwen3 — RTX 5090, Qwen3-35B-A3B, Q4_K_XL, 267 tok/s
weekend-m3max — Apple M3 Max, Qwen2.5-14B, Q5_K_M
shed-pi5 — Raspberry Pi 5, llama3.2-1B, 2.3 tok/s

These are local test nodes to demonstrate the concept.
Looking for early contributors who want to register a real node.

Rarity-first, not leaderboard.

The Pi node ranks higher than the 5090 because it's rarer.
Not a speed competition — a showcase of diversity.

Working POC. No external dependencies.

git clone github.com/prospectorlabs/luckrig
cd luckrig
npm start
→ http://127.0.0.1:8787

Concept + full spec + working code, all open.

https://github.com/prospectorlabs/luckrig
https://prospectorlabs.dev/luckrig/

I hid an entire webpage inside a cat face

gen — Mon, 18 May 2026 13:46:38 +0000

The source of this page is just a cat face:

https://nothing-to-see-here.surge.sh/

View source. You'll see (=･ω･=) and nothing else.
But the page runs. Rainbow animation, layout, everything.

The entire JavaScript is encoded as invisible Unicode
Variation Selectors attached to the cat emoji.

Unicode VS (U+FE00–FE0F and U+E0100–E01EF) map
precisely to 256 byte values. Any byte sequence can
ride inside normal text, invisible to readers,
surviving copy-paste across Slack, X, LINE, iMessage.

That's subtext.

https://prospectorlabs.dev/subtext

267 tok/s local inference on RTX 5090 – llama.cpp MTP + Qwen3-35B-A3B MoE

gen — Mon, 18 May 2026 12:56:14 +0000

Been running Qwen3-35B-A3B (MoE) with llama.cpp's Multi-Token Prediction
(MTP / speculative decoding) on an RTX 5090 under WSL2. Results surprised me:

Model	Speed
Ollama stock (35B MoE)	171 tok/s
27B Dense + MTP	104 tok/s
35B MoE + MTP	267 tok/s ← this

For context: Claude Haiku runs ~150 tok/s via API, billed at $150/MTok.
This setup runs on electricity only.

The interesting finding is that MoE and speculative decoding have unusual
synergy. With a dense model, MTP gave a modest speedup (or none).
With MoE, it nearly doubled throughput.

My hypothesis: MoE's sparse activation pattern leaves compute headroom that
speculative decoding can exploit. The draft tokens are cheap to verify because
most experts stay inactive during verification passes.

Setup:

RTX 5090, WSL2 (Ubuntu 24)
llama.cpp with MTP draft, n-max 2
Qwen3-35B-A3B-Instruct Q4_K_XL
ctx 65536, OpenAI-compatible API on localhost

Happy to share the exact llama-server launch flags if anyone wants to reproduce.