<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: contour</title>
    <description>The latest articles on Forem by contour (@yasha1971coder).</description>
    <link>https://forem.com/yasha1971coder</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935596%2F0df6e97a-b14f-429a-9d8d-18f701448faa.jpg</url>
      <title>Forem: contour</title>
      <link>https://forem.com/yasha1971coder</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/yasha1971coder"/>
    <language>en</language>
    <item>
      <title>Reviving glyph-v8: From a Forgotten Prototype to STRIDE - a Field-Aware Integer Coder</title>
      <dc:creator>contour</dc:creator>
      <pubDate>Mon, 25 May 2026 09:56:04 +0000</pubDate>
      <link>https://forem.com/yasha1971coder/reviving-glyph-v8-from-a-forgotten-prototype-to-stride-a-field-aware-integer-coder-h24</link>
      <guid>https://forem.com/yasha1971coder/reviving-glyph-v8-from-a-forgotten-prototype-to-stride-a-field-aware-integer-coder-h24</guid>
      <description>&lt;p&gt;Executive Summary&lt;/p&gt;

&lt;p&gt;STRIDE is a field‑aware integer coder that revives the abandoned glyph‑v8 prototype and turns it into a practical, measurable, deterministic compression primitive for binary protocols.&lt;br&gt;
It profiles integer fields, builds per‑field models, selects a codec for each field, and is expected to outperform general compressors like zstd on integer‑heavy data (benchmarks are still pending).&lt;/p&gt;




&lt;p&gt;What I Built&lt;/p&gt;

&lt;p&gt;STRIDE — Structured Integer Decoder/Encoder.&lt;/p&gt;

&lt;p&gt;A field‑aware integer coder for binary protocols. Not a general compressor.&lt;br&gt;
A primitive that does one thing extremely well: exploit the fact that integer fields in Protobuf, MessagePack, and Thrift are not random — they have highly skewed, predictable distributions.&lt;/p&gt;

&lt;p&gt;zstd doesn’t know field boundaries.&lt;br&gt;
STRIDE does.&lt;/p&gt;

&lt;p&gt;Built on top of the revived glyph‑v8 prototype.&lt;/p&gt;




&lt;p&gt;Demo&lt;/p&gt;

&lt;p&gt;• GitHub: &lt;a href="https://github.com/yasha1971-coder/glyph-v8" rel="noopener noreferrer"&gt;https://github.com/yasha1971-coder/glyph-v8&lt;/a&gt;&lt;br&gt;
• Replit demo: &lt;a href="https://replit.com/@yasha1971/Glyph-Search" rel="noopener noreferrer"&gt;https://replit.com/@yasha1971/Glyph-Search&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Initial profiling on a Protobuf corpus shows:&lt;br&gt;
60–70% of fields are integer‑type (timestamps, IDs, counters, enums).&lt;br&gt;
Full benchmark results vs zstd will be added before June 7.&lt;/p&gt;




&lt;p&gt;STRIDE Architecture (Why It Works)&lt;/p&gt;

&lt;p&gt;┌──────────────────────────────────────────────┐&lt;br&gt;
│                  STRIDE                      │&lt;br&gt;
│   Structured Integer Decoder / Encoder       │&lt;br&gt;
└──────────────────────────────────────────────┘&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    ┌──────────────────────────────┐
    │ 1. Profiling Layer           │
    │------------------------------│
    │ • Parse corpus               │
    │ • Detect integer fields      │
    │ • Build per-field histograms │
    │ • Estimate entropy           │
    └──────────────────────────────┘
                 │
                 ▼
    ┌──────────────────────────────┐
    │ 2. Model Builder             │
    │------------------------------│
    │ • Choose best codec per field│
    │   (Delta, Rice, Elias, Dict) │
    │ • Produce compact model.json │
    └──────────────────────────────┘
                 │
                 ▼
    ┌──────────────────────────────┐
    │ 3. Encoder                   │
    │------------------------------│
    │ • Apply field-aware coding   │
    │ • Attach model header        │
    │ • Output compressed stream   │
    └──────────────────────────────┘
                 │
                 ▼
    ┌──────────────────────────────┐
    │ 4. Decoder                   │
    │------------------------------│
    │ • Load model                 │
    │ • Decode deterministically   │
    │ • Reconstruct original data  │
    └──────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Before / After — The Revival Story&lt;/p&gt;

&lt;p&gt;┌──────────────────────────────┐      ┌────────────────────────────────┐&lt;br&gt;
│            BEFORE             │      │              AFTER             │&lt;br&gt;
├──────────────────────────────┤      ├────────────────────────────────┤&lt;br&gt;
│ • glyph-v8 abandoned          │      │ • STRIDE implemented           │&lt;br&gt;
│ • no docs, no roadmap         │      │ • profiling + encoding layers  │&lt;br&gt;
│ • no demo                     │      │ • Replit demo + GitHub release │&lt;br&gt;
│ • no architecture             │      │ • full architecture + context  │&lt;br&gt;
│ • code sitting on OVH         │      │ • revived project with purpose │&lt;br&gt;
└──────────────────────────────┘      └────────────────────────────────┘&lt;/p&gt;




&lt;p&gt;Why STRIDE Matters&lt;/p&gt;

&lt;p&gt;Binary protocols like Protobuf, Thrift, and MessagePack move billions of messages per day.&lt;br&gt;
Most of these messages contain highly structured integer fields:&lt;/p&gt;

&lt;p&gt;• timestamps&lt;br&gt;
• counters&lt;br&gt;
• IDs&lt;br&gt;
• status codes&lt;br&gt;
• enums&lt;/p&gt;

&lt;p&gt;General compressors see them as an undifferentiated byte stream, with no knowledge of field boundaries.&lt;br&gt;
STRIDE treats them as predictable distributions.&lt;/p&gt;

&lt;p&gt;This is where the compression gains come from.&lt;/p&gt;
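&lt;p&gt;To make the timestamp case concrete, here is a minimal sketch of delta coding followed by a zigzag and varint step. It is an illustration under stated assumptions, not STRIDE’s actual encoder: the function names and the sample timestamps are made up for this example.&lt;/p&gt;

```python
def zigzag(n):
    # Fold signed deltas into unsigned ints: 0, -1, 1, -2 map to 0, 1, 2, 3.
    return 2 * n if n == abs(n) else -2 * n - 1


def unzigzag(z):
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def varint(u):
    # LEB128-style: 7 payload bits per byte, high bit set on all but the last.
    out = bytearray()
    while True:
        low, u = u % 128, u // 128
        if u == 0:
            out.append(low)
            return bytes(out)
        out.append(low + 128)


def read_varints(buf):
    values, cur, scale = [], 0, 1
    for b in buf:
        cur += (b % 128) * scale
        if b // 128 == 1:  # continuation bit is set, more bytes follow
            scale *= 128
        else:
            values.append(cur)
            cur, scale = 0, 1
    return values


def encode_deltas(values):
    # Store each value as the difference from its predecessor.
    out, prev = bytearray(), 0
    for v in values:
        out += varint(zigzag(v - prev))
        prev = v
    return bytes(out)


def decode_deltas(buf):
    values, prev = [], 0
    for z in read_varints(buf):
        prev += unzigzag(z)
        values.append(prev)
    return values


# Hypothetical sample: one event every 30 seconds.
timestamps = [1716630000 + 30 * i for i in range(1000)]
plain = b"".join(varint(t) for t in timestamps)
packed = encode_deltas(timestamps)
print(len(plain), len(packed))
```

&lt;p&gt;On this regular sample, every delta after the first fits in one byte, while each raw timestamp needs five. Real timestamp fields are noisier, so the gain will be smaller.&lt;/p&gt;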




&lt;p&gt;STRIDE vs zstd — Conceptual Comparison&lt;/p&gt;

&lt;p&gt;┌──────────────────────────────┬──────────────────────────────┬──────────────────────────────┐&lt;br&gt;
│ Feature                      │ zstd                         │ STRIDE                       │&lt;br&gt;
├──────────────────────────────┼──────────────────────────────┼──────────────────────────────┤&lt;br&gt;
│ Field awareness              │ No                           │ Yes                          │&lt;br&gt;
│ Integer distribution model   │ No                           │ Per-field adaptive           │&lt;br&gt;
│ Timestamp delta modeling     │ No                           │ Yes                          │&lt;br&gt;
│ Status code compression      │ No                           │ Dictionary / RLE             │&lt;br&gt;
│ Schema-aware                 │ No                           │ Yes                          │&lt;br&gt;
│ Deterministic decode         │ Yes                          │ Yes                          │&lt;br&gt;
│ Expected compression ratio   │ 3–4×                         │ 6–8× (integer-heavy data)    │&lt;br&gt;
└──────────────────────────────┴──────────────────────────────┴──────────────────────────────┘&lt;/p&gt;




&lt;p&gt;STRIDE Pipeline&lt;/p&gt;


&lt;ol&gt;
&lt;li&gt;Load Protobuf corpus&lt;/li&gt;
&lt;li&gt;Extract integer fields&lt;/li&gt;
&lt;li&gt;Build histograms&lt;/li&gt;
&lt;li&gt;Compute entropy&lt;/li&gt;
&lt;li&gt;Select codec per field&lt;/li&gt;
&lt;li&gt;Generate model.json&lt;/li&gt;
&lt;li&gt;Encode data&lt;/li&gt;
&lt;li&gt;Decode deterministically&lt;/li&gt;
&lt;li&gt;Benchmark vs zstd&lt;/li&gt;
&lt;/ol&gt;
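&lt;p&gt;Steps 2 to 6 can be sketched roughly as follows. This is a simplified illustration, not the project’s code: records are plain Python dicts rather than parsed Protobuf messages, only two candidate codecs are compared, and the selection rule (pick the stream with the lowest order‑0 entropy) is an assumption made for this example.&lt;/p&gt;

```python
import json
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy of the empirical distribution, in bits per value."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def profile(records):
    """Collect integer fields into columns: {field_name: [v0, v1, ...]}."""
    columns = {}
    for rec in records:
        for name, value in rec.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, int) and not isinstance(value, bool):
                columns.setdefault(name, []).append(value)
    return columns


def choose_codec(values):
    """Hypothetical rule: compare the entropy of raw values and of deltas."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    costs = {"dict": entropy_bits(values)}
    if deltas:
        costs["delta"] = entropy_bits(deltas)
    return min(costs, key=costs.get)


def build_model(records):
    return {name: choose_codec(col) for name, col in profile(records).items()}


# Made-up corpus: regular timestamps and a skewed status code.
records = [
    {"ts": 1716630000 + 30 * i, "status": 404 if i % 10 == 0 else 200, "ok": True}
    for i in range(100)
]
model = build_model(records)
print(json.dumps(model, sort_keys=True))
```

&lt;p&gt;The resulting dictionary plays the role of model.json here. A fuller version would also score Rice and Elias codes and store their parameters per field.&lt;/p&gt;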




&lt;p&gt;Technical Highlights&lt;/p&gt;

&lt;p&gt;• One‑pass profiling of integer fields&lt;br&gt;
• Entropy estimation per field&lt;br&gt;
• Adaptive codec selection (Delta, Rice, Elias, Dictionary)&lt;br&gt;
• Compact model header&lt;br&gt;
• Deterministic decode (no ML, no heuristics)&lt;br&gt;
• Schema‑aware compression for Protobuf&lt;br&gt;
• Benchmark pipeline with SHA256 verification&lt;/p&gt;
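&lt;p&gt;As an illustration of one of the listed codecs together with hash‑based verification, here is a small Rice coder with a SHA256 round‑trip check. It is a sketch under assumptions: bits are kept in a Python string for readability, the parameter k is fixed by hand, and the digest helper is hypothetical rather than the project’s benchmark code.&lt;/p&gt;

```python
import hashlib


def rice_encode(values, k):
    """Rice code for non-negative ints: unary quotient, then k-bit remainder."""
    parts = []
    for v in values:
        q, r = divmod(v, 2 ** k)
        parts.append("1" * q + "0")  # quotient in unary, closed by a 0
        parts.append(format(r, "b").zfill(k) if k else "")
    return "".join(parts)


def rice_decode(bits, k, count):
    values, pos = [], 0
    for _ in range(count):
        end = bits.index("0", pos)  # the unary run stops at the first 0
        q = end - pos
        pos = end + 1
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append(q * 2 ** k + r)
    return values


def digest(values):
    # Hash a canonical text form so input and output can be compared.
    return hashlib.sha256(",".join(map(str, values)).encode()).hexdigest()


counters = [0, 1, 5, 18, 3]  # made-up small counter values
bits = rice_encode(counters, k=2)
restored = rice_decode(bits, k=2, count=len(counters))
print(len(bits), digest(restored) == digest(counters))
```

&lt;p&gt;Rice codes suit fields where small values dominate. A larger k shortens the unary part for big values but adds k bits to every value.&lt;/p&gt;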




&lt;p&gt;My Experience with GitHub Copilot&lt;/p&gt;

&lt;h2&gt;
  
  
  Copilot Contributions
&lt;/h2&gt;

&lt;p&gt;✓ Reconstructed project context&lt;br&gt;&lt;br&gt;
✓ Designed STRIDE architecture&lt;br&gt;&lt;br&gt;
✓ Implemented integer field profiler&lt;br&gt;&lt;br&gt;
✓ Structured benchmark pipeline&lt;br&gt;&lt;br&gt;
✓ Helped write documentation&lt;br&gt;&lt;br&gt;
✓ Assisted in preparing the submission  &lt;/p&gt;

&lt;p&gt;Copilot didn’t just autocomplete code — it helped rebuild a forgotten project into a structured system.&lt;/p&gt;




&lt;p&gt;What’s Next&lt;/p&gt;

&lt;p&gt;STRIDE is the third primitive in a family:&lt;/p&gt;

&lt;p&gt;• ACEAPEX — parallel LZ77 decode, 9,903 MB/s, merged into lzbench&lt;br&gt;
• GLYPH — deterministic byte‑exact retrieval, 6,888× faster than grep&lt;br&gt;
• STRIDE — field‑aware integer coding for binary protocols&lt;/p&gt;

&lt;p&gt;Roadmap:&lt;/p&gt;

&lt;p&gt;• Add full benchmark suite (STRIDE vs zstd vs LZ4)&lt;br&gt;
• Add streaming encoder&lt;br&gt;
• Add MessagePack and Thrift adapters&lt;br&gt;
• Add visualization of field distributions&lt;br&gt;
• Publish STRIDE as a standalone Python package&lt;/p&gt;




&lt;p&gt;Conclusion&lt;/p&gt;

&lt;p&gt;This challenge gave me the push to revive glyph‑v8 and transform it into STRIDE — a practical, measurable, deterministic compression primitive for structured integer data.&lt;/p&gt;

&lt;p&gt;Thanks to GitHub, MLH, and Copilot for making this revival possible.&lt;/p&gt;




</description>
      <category>githubfinishupathon</category>
      <category>devchallenge</category>
      <category>githubchallenge</category>
    </item>
    <item>
      <title>I built a retrieval engine that answers in 0.017ms where grep takes 115ms.</title>
      <dc:creator>contour</dc:creator>
      <pubDate>Sat, 16 May 2026 23:33:31 +0000</pubDate>
      <link>https://forem.com/yasha1971coder/description-deterministic-byte-exact-retrieval-over-static-corpora-4793</link>
      <guid>https://forem.com/yasha1971coder/description-deterministic-byte-exact-retrieval-over-static-corpora-4793</guid>
      <description>&lt;h1&gt;
  
  
  I built a deterministic byte-exact retrieval engine. Here’s what I learned about correctness the hard way.
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Not a search engine. Not a vector DB. Not a grep replacement. Something else.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Last year I started building something I couldn’t find anywhere else: a retrieval system that makes a hard guarantee.&lt;/p&gt;

&lt;p&gt;Not “probably found it.” Not “semantically similar.” Not “ranked by relevance.”&lt;/p&gt;

&lt;p&gt;Just: &lt;strong&gt;these exact bytes exist at these exact offsets. Every time. Same query, same result. No exceptions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The project is called GLYPH. It’s built on suffix array + BWT + FM-index over raw bytes. It’s experimental. It has known limitations. And building it taught me more about correctness than anything I’ve worked on before.&lt;/p&gt;

&lt;p&gt;This is the story of what went wrong, what I fixed, and what “determin...&lt;/p&gt;

&lt;h1&gt;
  
  
  I built a retrieval engine that makes one hard guarantee: same bytes, same result, every time.
&lt;/h1&gt;

&lt;p&gt;No ranking. No embeddings. No “probably found it.”&lt;/p&gt;

&lt;p&gt;Just: &lt;strong&gt;these exact bytes exist at these exact offsets.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;The bug that taught me the most: FM-index counts were wrong on HDFS 1GB. SA correct. BWT correct. C-table correct. The culprit was one missing byte — the terminal sentinel wasn’t physically appended to the corpus, only accounted for symbolically. Off by one byte. Wrong counts.&lt;/p&gt;

&lt;p&gt;Fix: append a real &lt;code&gt;0x00&lt;/code&gt;. Verify against a Python oracle. Formalize it as an invariant. Write a regression test.&lt;/p&gt;
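&lt;p&gt;A toy version of that kind of oracle check might look like the sketch below. It is not GLYPH’s code: the suffix array is built naively, the helper names are made up, and it assumes the corpus itself contains no &lt;code&gt;0x00&lt;/code&gt; byte.&lt;/p&gt;

```python
from collections import Counter


def build_fm(corpus):
    """Naive FM-index over bytes with the sentinel physically appended."""
    assert 0 not in corpus, "sketch assumes the corpus has no 0x00 byte"
    text = corpus + b"\x00"  # the invariant: the sentinel is a real byte
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = bytes(text[i - 1] for i in sa)  # i == 0 wraps to the sentinel
    # c_table[c] = number of bytes in text that sort before c
    c_table, total = {}, 0
    counts = Counter(text)
    for c in sorted(counts):
        c_table[c] = total
        total += counts[c]
    return bwt, c_table


def fm_count(bwt, c_table, pattern):
    """Backward search: narrow the half-open SA range [lo, hi) per byte."""
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in c_table:
            return 0
        lo = c_table[c] + bwt.count(c, 0, lo)
        hi = c_table[c] + bwt.count(c, 0, hi)
        if lo == hi:
            return 0
    return hi - lo


def oracle_count(corpus, pattern):
    """Brute-force count of (possibly overlapping) occurrences."""
    return sum(corpus.startswith(pattern, i) for i in range(len(corpus)))


corpus = b"abracadabra"
bwt, c_table = build_fm(corpus)
for pattern in (b"abra", b"a", b"cad", b"xyz"):
    # Regression check: the index must agree with the oracle.
    assert fm_count(bwt, c_table, pattern) == oracle_count(corpus, pattern)
```

&lt;p&gt;Because the index is built from the text with the sentinel included, its length and C‑table always describe the same bytes, which is the invariant the fix formalizes.&lt;/p&gt;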

&lt;p&gt;That shift — from “fixed a bug” to “formalized a contract” — changed how I think about correctness entirely.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Benchmark reality, honestly:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grep 1GB scan:          115 ms/query
GLYPH persistent FM:    0.0167 ms/query  ← index in RAM
GLYPH verified CLI:     ~19 ms/query     ← subprocess + integrity check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two different systems. Most benchmarks show only the fast number. Both matter.&lt;/p&gt;

&lt;p&gt;RAM cost: 9.4 GB for a 1 GB corpus. Not hiding it. Compressed SA is next.&lt;/p&gt;




&lt;p&gt;This isn’t a vector DB killer. It’s a verification layer beneath probabilistic systems — for when you need to know if a chunk was actually in the source, not just semantically similar.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/yasha1971-coder/glyph-engine
&lt;span class="nb"&gt;cd &lt;/span&gt;glyph-engine
./examples/mini/build_mini.sh
&lt;span class="c"&gt;# count: 2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apache-2.0. Experimental. Critique welcome, especially on RAM economics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://glyph.rs" rel="noopener noreferrer"&gt;glyph.rs&lt;/a&gt; · &lt;a href="mailto:contact@glyph.rs"&gt;contact@glyph.rs&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;code&gt;#systems&lt;/code&gt; &lt;code&gt;#retrieval&lt;/code&gt; &lt;code&gt;#infrastructure&lt;/code&gt; &lt;code&gt;#cpp&lt;/code&gt; &lt;code&gt;#algorithms&lt;/code&gt;&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>computerscience</category>
      <category>showdev</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
