Just finished my GGUF-Shard

idkcallme — Tue, 22 Jul 2025 01:59:28 +0000

I'm happy to announce my latest open‑source project Sharded Suite
it's sharding + caching system that lets you load 70‑billion‑parameter language models on a single 24 GB GPU without quantization or multi‑GPU setups.

it works like this:

Page‑level sharding

The model file is sliced into tiny 4 KB “pages,” each with its own CRC checksum.

Atlas GPU cache
A CUDA LRU cache keeps only the hot pages in VRAM and streams the rest from disk.

Zero‑copy, hardware CRC
Integrity is verified on‑GPU at ~0.03 ms per page, so there’s practically no overhead.

allowing to save 7‑10× VRAM while holding 85‑95 % of the original throughput.

it's on my GitHub https://github.com/idkcallme/Sharded-Suite.git
I got a few ideas on what to do with it, but I would love if I can get ideas on what to improve or more of what I can use it for.

So, for I can use it for
Running LLaMA‑style model locally for research/tinkering and cutting cloud GPU costs by swapping gigantic instances for a single GPU box.

Forem: idkcallme

Just finished my GGUF-Shard