Reviving glyph-v8: From a Forgotten Prototype to STRIDE - a Field-Aware Integer Coder

contour — Mon, 25 May 2026 09:56:04 +0000

Executive Summary

STRIDE is a field‑aware integer coder that revives the abandoned glyph‑v8 prototype and turns it into a practical, measurable, deterministic compression primitive for binary protocols.
It profiles integer fields, builds per‑field models, and selects a codec for each field, with the goal of outperforming general compressors like zstd on integer‑heavy data (full benchmarks are still pending).

What I Built

STRIDE — Structured Integer Decoder/Encoder.

A field‑aware integer coder for binary protocols. Not a general compressor.
A primitive that does one thing extremely well: exploit the fact that integer fields in Protobuf, MessagePack, and Thrift are not random — they have highly skewed, predictable distributions.

zstd doesn’t know field boundaries.
STRIDE does.

Built on top of the revived glyph‑v8 prototype.

Demo

• GitHub: https://github.com/yasha1971-coder/glyph-v8
• Replit demo: https://replit.com/@yasha1971/Glyph-Search

Initial profiling on a Protobuf corpus shows that 60–70% of fields are integer‑type (timestamps, IDs, counters, enums).
Full benchmark results vs zstd will be added before June 7.

STRIDE Architecture (Why It Works)

┌──────────────────────────────────────────────┐
│                    STRIDE                    │
│     Structured Integer Decoder / Encoder     │
└──────────────────────────────────────────────┘

    ┌──────────────────────────────┐
    │ 1. Profiling Layer           │
    │------------------------------│
    │ • Parse corpus               │
    │ • Detect integer fields      │
    │ • Build per-field histograms │
    │ • Estimate entropy           │
    └──────────────────────────────┘
                 │
                 ▼
    ┌──────────────────────────────┐
    │ 2. Model Builder             │
    │------------------------------│
    │ • Choose best codec per field│
    │   (Delta, Rice, Elias, Dict) │
    │ • Produce compact model.json │
    └──────────────────────────────┘
                 │
                 ▼
    ┌──────────────────────────────┐
    │ 3. Encoder                   │
    │------------------------------│
    │ • Apply field-aware coding   │
    │ • Attach model header        │
    │ • Output compressed stream   │
    └──────────────────────────────┘
                 │
                 ▼
    ┌──────────────────────────────┐
    │ 4. Decoder                   │
    │------------------------------│
    │ • Load model                 │
    │ • Decode deterministically   │
    │ • Reconstruct original data  │
    └──────────────────────────────┘
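
As a rough illustration of the profiling layer in the diagram, the sketch below builds a histogram per integer field and estimates its Shannon entropy. The record layout (plain dicts) and the names `profile_fields`, `status`, and `user_id` are assumptions made for the example; they are not taken from the STRIDE code.

```python
import math
from collections import Counter, defaultdict

def profile_fields(records):
    """Build a histogram and a Shannon entropy estimate for each integer field."""
    histograms = defaultdict(Counter)
    for record in records:
        for name, value in record.items():
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, int) and not isinstance(value, bool):
                histograms[name][value] += 1

    profile = {}
    for name, hist in histograms.items():
        total = sum(hist.values())
        # Entropy in bits per value: -sum(p * log2(p)) over observed values
        entropy = -sum((c / total) * math.log2(c / total) for c in hist.values())
        profile[name] = {
            "count": total,
            "distinct": len(hist),
            "entropy_bits": entropy,
        }
    return profile

# Hypothetical decoded messages; only the integer fields are profiled
records = [
    {"status": 200, "user_id": 1, "path": "/a"},
    {"status": 200, "user_id": 2, "path": "/b"},
    {"status": 200, "user_id": 3, "path": "/c"},
    {"status": 404, "user_id": 4, "path": "/d"},
]
profile = profile_fields(records)
for name, stats in profile.items():
    print(name, stats)
```

In this toy input the skewed `status` field comes out at roughly 0.81 bits per value, while the uniformly distributed `user_id` field needs 2 bits. That gap is the signal a per-field model can exploit.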

Before / After — The Revival Story

┌──────────────────────────────┐ ┌────────────────────────────────┐
│ BEFORE                       │ │ AFTER                          │
├──────────────────────────────┤ ├────────────────────────────────┤
│ • glyph-v8 abandoned         │ │ • STRIDE implemented           │
│ • no docs, no roadmap        │ │ • profiling + encoding layers  │
│ • no demo                    │ │ • Replit demo + GitHub release │
│ • no architecture            │ │ • full architecture + context  │
│ • code sitting on OVH        │ │ • revived project with purpose │
└──────────────────────────────┘ └────────────────────────────────┘

Why STRIDE Matters

Binary protocols like Protobuf, Thrift, and MessagePack move billions of messages per day.
Most of these messages contain highly structured integer fields:

• timestamps
• counters
• IDs
• status codes
• enums

General compressors see them as an undifferentiated byte stream.
STRIDE treats each field as its own predictable distribution.

This is where the compression gains come from.
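
To make the source of the gains concrete, here is a minimal sketch that encodes a column of timestamps two ways: each value as a plain varint, and as zigzag-encoded deltas from the previous value. The 15-second interval and the helper names are assumptions for the example, and this is a generic delta + varint scheme, not necessarily the one STRIDE uses.

```python
def zigzag(n):
    """Map signed ints to unsigned so small magnitudes stay small (0,-1,1,-2 -> 0,1,2,3)."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def varint(n):
    """LEB128-style varint for a non-negative integer, 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_plain(values):
    return b"".join(varint(v) for v in values)

def encode_delta(values):
    out, prev = bytearray(), 0
    for v in values:
        out += varint(zigzag(v - prev))
        prev = v
    return bytes(out)

def decode_delta(data):
    values, prev, i = [], 0, 0
    while i < len(data):
        n, shift = 0, 0
        while True:  # read one varint
            byte = data[i]
            i += 1
            n |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        prev += (n >> 1) if n % 2 == 0 else -((n + 1) >> 1)  # undo zigzag
        values.append(prev)
    return values

# Hypothetical field: 1000 timestamps, one every 15 seconds
timestamps = [1_779_703_000 + 15 * i for i in range(1000)]
plain = encode_plain(timestamps)
delta = encode_delta(timestamps)
assert decode_delta(delta) == timestamps
print(len(plain), len(delta))  # prints: 5000 1004
```

Knowing that these bytes form one monotonically increasing field is what allows the delta transform. A byte-oriented compressor has to rediscover that regularity from the raw stream.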

STRIDE vs zstd — Conceptual Comparison

┌──────────────────────────────┬──────────────────────────────┬──────────────────────────────┐
│ Feature                      │ zstd                         │ STRIDE                       │
├──────────────────────────────┼──────────────────────────────┼──────────────────────────────┤
│ Field awareness              │ No                           │ Yes                          │
│ Integer distribution model   │ No                           │ Per-field adaptive           │
│ Timestamp delta modeling     │ No                           │ Yes                          │
│ Status code compression      │ No                           │ Dictionary / RLE             │
│ Schema-aware                 │ No                           │ Yes                          │
│ Deterministic decode         │ Yes                          │ Yes                          │
│ Expected compression ratio   │ 3–4×                         │ 6–8× (integer-heavy data)    │
└──────────────────────────────┴──────────────────────────────┴──────────────────────────────┘

STRIDE Pipeline

1. Load Protobuf corpus
2. Extract integer fields
3. Build histograms
4. Compute entropy
5. Select codec per field
6. Generate model.json
7. Encode data
8. Decode deterministically
9. Benchmark vs zstd
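
The codec-selection and model-generation steps of this pipeline could look roughly like the sketch below: estimate the cost in bits of each candidate codec for a field, keep the cheapest, and serialize the choice as JSON. The cost functions are deliberately crude, assume non-empty columns of non-negative integers, and cover only three hypothetical candidates (`fixed`, `delta`, `dict`). The real STRIDE cost model and the layout of model.json are not described in this post.

```python
import json

def bits_fixed(values):
    """Fixed width: every value costs as many bits as the largest one needs."""
    return max(1, max(values).bit_length()) * len(values)

def bits_delta(values):
    """First value as-is, then zigzagged deltas at a shared fixed width."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    zz = [(d << 1) if d >= 0 else ((-d << 1) - 1) for d in deltas]
    width = max(1, max(zz, default=0).bit_length())
    return max(1, values[0].bit_length()) + width * len(deltas)

def bits_dict(values):
    """Dictionary: a table of distinct values plus a fixed-width index per value."""
    distinct = sorted(set(values))
    index_width = max(1, (len(distinct) - 1).bit_length())
    table = sum(max(1, v.bit_length()) for v in distinct)
    return table + index_width * len(values)

# Hypothetical candidate set; a fuller model would add Rice and Elias codes
CODECS = {"fixed": bits_fixed, "delta": bits_delta, "dict": bits_dict}

def select_codecs(columns):
    """Pick the cheapest candidate codec for each field."""
    model = {}
    for name, values in columns.items():
        costs = {codec: estimate(values) for codec, estimate in CODECS.items()}
        best = min(costs, key=costs.get)
        model[name] = {"codec": best, "estimated_bits": costs[best]}
    return model

columns = {
    "timestamp": [1_779_703_000 + 15 * i for i in range(100)],
    "status": [200] * 90 + [404] * 8 + [500] * 2,
}
model = select_codecs(columns)
print(json.dumps(model, indent=2))
```

On this toy input the timestamp column selects `delta` (526 bits against 3100 for fixed width) and the status column selects `dict` (226 bits against 900). Because the choice depends only on the profiled data, an encoder and decoder that share the model stay in agreement.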

Technical Highlights

• One‑pass profiling of integer fields
• Entropy estimation per field
• Adaptive codec selection (Delta, Rice, Elias, Dictionary)
• Compact model header
• Deterministic decode (no ML, no heuristics)
• Schema‑aware compression for Protobuf
• Benchmark pipeline with SHA256 verification
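
As one example of the codecs listed above, here is a minimal Rice coder with a round-trip check via SHA-256, in the spirit of the verification step. The bit stream is kept as a Python list of 0/1 values for readability, and the parameter `k`, the sample counters, and the 8-byte serialization used for hashing are assumptions for the example.

```python
import hashlib

def rice_encode(values, k):
    """Rice-code non-negative ints: unary quotient (q ones, then a zero), then k remainder bits."""
    bits = []
    for n in values:
        q, r = n >> k, n & ((1 << k) - 1)
        bits.extend([1] * q)
        bits.append(0)
        bits.extend((r >> i) & 1 for i in reversed(range(k)))
    return bits

def rice_decode(bits, k, count):
    """Inverse of rice_encode; count is the number of values to read."""
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == 1:  # unary part
            q += 1
            pos += 1
        pos += 1  # skip the terminating zero
        r = 0
        for _bit in range(k):  # fixed-width remainder, most significant bit first
            r = (r << 1) | bits[pos]
            pos += 1
        values.append((q << k) | r)
    return values

def digest(values):
    """SHA-256 over an 8-byte big-endian serialization of the values."""
    h = hashlib.sha256()
    for v in values:
        h.update(v.to_bytes(8, "big"))
    return h.hexdigest()

# Hypothetical small-counter field
counters = [3, 0, 7, 1, 2, 12, 0, 5]
encoded = rice_encode(counters, k=2)
decoded = rice_decode(encoded, k=2, count=len(counters))
assert digest(decoded) == digest(counters)
print(len(encoded), "bits, versus", 8 * len(counters), "bits at one byte per value")
```

Rice coding pays off when most values are small relative to 2^k. A profiler can pick `k` per field from the observed distribution, which is where the per-field entropy estimate becomes useful.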

My Experience with GitHub Copilot

Copilot Contributions

✓ Reconstructed project context

✓ Designed STRIDE architecture

✓ Implemented integer field profiler

✓ Structured benchmark pipeline

✓ Helped write documentation

✓ Assisted in preparing the submission

Copilot didn’t just autocomplete code — it helped rebuild a forgotten project into a structured system.

What’s Next

STRIDE is the third primitive in a family:

• ACEAPEX — parallel LZ77 decode, 9,903 MB/s, merged into lzbench
• GLYPH — deterministic byte‑exact retrieval, 6,888× faster than grep
• STRIDE — field‑aware integer coding for binary protocols

Roadmap:

• Add full benchmark suite (STRIDE vs zstd vs LZ4)
• Add streaming encoder
• Add MessagePack and Thrift adapters
• Add visualization of field distributions
• Publish STRIDE as a standalone Python package

Conclusion

This challenge gave me the push to revive glyph‑v8 and transform it into STRIDE — a practical, measurable, deterministic compression primitive for structured integer data.

Thanks to GitHub, MLH, and Copilot for making this revival possible.

I built a retrieval engine that answers in 0.017ms where grep takes 115ms.

contour — Sat, 16 May 2026 23:33:31 +0000

I built a deterministic byte-exact retrieval engine. Here’s what I learned about correctness the hard way.

Not a search engine. Not a vector DB. Not a grep replacement. Something else.

Last year I started building something I couldn’t find anywhere else: a retrieval system that makes a hard guarantee.

Not “probably found it.” Not “semantically similar.” Not “ranked by relevance.”

Just: these exact bytes exist at these exact offsets. Every time. Same query, same result. No exceptions.

The project is called GLYPH. It’s built on suffix array + BWT + FM-index over raw bytes. It’s experimental. It has known limitations. And building it taught me more about correctness than anything I’ve worked on before.

This is the story of what went wrong, what I fixed, and what “determin...

I built a retrieval engine that makes one hard guarantee: same bytes, same result, every time.

No ranking. No embeddings. No “probably found it.”

Just: these exact bytes exist at these exact offsets.

The bug that taught me the most: FM-index counts were wrong on HDFS 1GB. SA correct. BWT correct. C-table correct. The culprit was one missing byte — the terminal sentinel wasn’t physically appended to the corpus, only accounted for symbolically. Off by one byte. Wrong counts.

Fix: append a real 0x00. Verify against Python oracle. Formalize as an invariant. Write a regression test.
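
The invariant can be illustrated with a deliberately naive FM-index in Python, checked against a brute-force oracle. This is a toy sketch with quadratic-time construction and an uncompressed occurrence table, not GLYPH's implementation; all function names are made up for the example. The point is that the 0x00 sentinel is physically part of the text the suffix array, BWT, and C-table are built from.

```python
def build_fm_index(corpus: bytes):
    """Naive FM-index over corpus + a physically appended 0x00 sentinel."""
    assert 0 not in corpus, "sentinel byte must not occur in the corpus"
    text = corpus + b"\x00"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # toy suffix array
    bwt = bytes(text[i - 1] for i in sa)  # i == 0 wraps around to the sentinel

    # c_table[c] = number of bytes in text strictly smaller than c
    counts = [0] * 256
    for b in text:
        counts[b] += 1
    c_table, total = [0] * 256, 0
    for c in range(256):
        c_table[c] = total
        total += counts[c]

    # occ[c][i] = occurrences of byte c in bwt[:i]
    occ = {}
    for c in set(text):
        row = [0] * (len(bwt) + 1)
        for i, b in enumerate(bwt):
            row[i + 1] = row[i] + (b == c)
        occ[c] = row
    return c_table, occ, len(text)

def fm_count(index, pattern: bytes) -> int:
    """Backward search: narrow the suffix-array interval one byte at a time."""
    c_table, occ, n = index
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in occ:
            return 0
        lo = c_table[c] + occ[c][lo]
        hi = c_table[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

def oracle_count(corpus: bytes, pattern: bytes) -> int:
    """Brute-force count of (possibly overlapping) occurrences."""
    return sum(corpus.startswith(pattern, i)
               for i in range(len(corpus) - len(pattern) + 1))

corpus = b"abracadabra"
index = build_fm_index(corpus)
for pattern in (b"abra", b"a", b"bra", b"cad", b"xyz", b"aa"):
    assert fm_count(index, pattern) == oracle_count(corpus, pattern)
print(fm_count(index, b"abra"))  # prints: 2
```

The loop at the end is the regression test in miniature: every count from the index must equal the oracle's count, so any disagreement between the text and the structures built from it shows up as a failed assertion rather than a silently wrong answer.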

That shift — from “fixed a bug” to “formalized a contract” — changed how I think about correctness entirely.

Benchmark reality, honestly:

grep 1GB scan:          11.5 sec
GLYPH persistent FM:    0.0167 ms/query  ← index in RAM
GLYPH verified CLI:     ~19 ms/query     ← subprocess + integrity check

Two different systems. Most benchmarks show only the fast number. Both matter.

RAM cost: 9.4 GB for a 1 GB corpus. Not hiding it. Compressed SA is next.

This isn’t a vector DB killer. It’s a verification layer beneath probabilistic systems — for when you need to know if a chunk was actually in the source, not just semantically similar.

git clone https://github.com/yasha1971-coder/glyph-engine
./examples/mini/build_mini.sh
# count: 2

Apache-2.0. Experimental. Critique welcome, especially on RAM economics.

glyph.rs · contact@glyph.rs