<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Buffer Overflow</title>
    <description>The latest articles on Forem by Buffer Overflow (@atomsrkuul).</description>
    <link>https://forem.com/atomsrkuul</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3859861%2F4ddd8a66-6ad8-44ed-b41e-4d793e193bdd.png</url>
      <title>Forem: Buffer Overflow</title>
      <link>https://forem.com/atomsrkuul</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/atomsrkuul"/>
    <language>en</language>
    <item>
      <title>The ESCAPE Byte Problem: How We Beat Brotli by Separating Token Streams</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Tue, 14 Apr 2026 00:47:39 +0000</pubDate>
      <link>https://forem.com/atomsrkuul/the-escape-byte-problem-how-we-beat-brotli-by-separating-token-streams-2i6j</link>
      <guid>https://forem.com/atomsrkuul/the-escape-byte-problem-how-we-beat-brotli-by-separating-token-streams-2i6j</guid>
      <description>&lt;p&gt;Part 4 of the Glasik Notation series.&lt;/p&gt;

&lt;p&gt;Previous articles covered the sliding window tokenizer, Aho-Corasick O(n) matching, and GN's first verified benchmarks against gzip.&lt;/p&gt;

&lt;p&gt;The Waste Was Hidden in Plain Sight&lt;/p&gt;

&lt;p&gt;After implementing Aho-Corasick O(n) matching, GN was fast. Sub-millisecond per chunk, competitive with brotli on latency. But the ratio numbers kept coming back flat:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gzip-6:   2.18x
GN AC:    2.20x  (+0.9% vs gzip)
brotli-6: 2.47x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We were barely beating gzip. Brotli was 12% ahead. The vocabulary was real — 31,248 tokens per 200 chunks, 190 tokens per chunk on average. The matches were happening. So where were the bits going?&lt;br&gt;
We ran a token stream entropy analysis:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter
import math

token_ids = []
for c in sample:
    raw = slider.encode_ac_raw(c)
    i = 0
    while i &amp;lt; len(raw):
        if raw[i] == ESCAPE and i + 1 &amp;lt; len(raw):
            token_ids.append(raw[i + 1])
            i += 2
        else:
            i += 1

counter = Counter(token_ids)
total = sum(counter.values())
entropy = -sum(c / total * math.log2(c / total) for c in counter.values())
print(f"Token entropy: {entropy:.3f} bits/token")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Result: 7.758 bits/token.&lt;/p&gt;

&lt;p&gt;We were encoding each token as 2 bytes: ESCAPE + id. That's 16 bits per token. The theoretical minimum was 7.758 bits. We were wasting 51.5% of every token encoding. That's where the bits were going.&lt;/p&gt;
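The 51.5% figure follows directly from those two numbers; a quick check:

```python
# Waste of the 2-byte ESCAPE+id encoding relative to measured token entropy.
encoding_bits = 16     # ESCAPE byte + id byte per token
entropy_bits = 7.758   # measured bits/token from the analysis above
waste = (encoding_bits - entropy_bits) / encoding_bits
print(f"{waste:.1%}")  # 51.5%
```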

&lt;p&gt;Why the Mixed Stream Was Hurting Us&lt;/p&gt;

&lt;p&gt;Our tokenized output looked like this:&lt;/p&gt;

&lt;p&gt;[ESCAPE][id][ESCAPE][id][lit][lit][lit][ESCAPE][id][lit][ESCAPE][id]...&lt;/p&gt;

&lt;p&gt;Every token costs 2 bytes: an ESCAPE byte (0x01) followed by the ID. We fed this into deflate expecting it to compress well. But deflate uses LZ77 — it looks for repeated byte sequences in a sliding window. The ESCAPE bytes were fragmenting every pattern.&lt;/p&gt;

&lt;p&gt;Where deflate might have seen:&lt;br&gt;
" the " " the " " the "   ← repeating 6-byte sequence, compresses well&lt;br&gt;
It was instead seeing:&lt;br&gt;
[01][04] " t" "he" [01][04] " t" ...   ← ESCAPE bytes breaking the pattern&lt;br&gt;
The ESCAPE byte was acting like static on a radio signal. Present in every token, making the mixed stream look noisier than it actually was.&lt;/p&gt;

&lt;p&gt;The Insight: Separate the Streams&lt;/p&gt;

&lt;p&gt;What if we just... didn't mix them?&lt;br&gt;
Instead of one interleaved stream, emit two:&lt;/p&gt;

&lt;p&gt;Token stream: just the IDs — [04][04][38][20][04][07]...&lt;br&gt;
Literal stream: just the literal bytes — "t" "h" "e" " " "a" ...&lt;/p&gt;

&lt;p&gt;Then compress each independently with raw deflate.&lt;br&gt;
The token stream is pure symbols. Token ID 4 (" the") fires 483 times in 200 chunks. That's a highly skewed distribution — deflate loves it. The literal stream is clean text with no ESCAPE pollution. It compresses the way text is supposed to compress.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;toks, lits = slider.encode_ac_split(chunk)
dt = zlib.compress(toks, 6)[2:-4]  # strip zlib header/checksum: raw deflate
dl = zlib.compress(lits, 6)[2:-4]
frame = struct.pack('&amp;gt;H', len(dt)) + dt + dl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is the same insight behind why PNG separates prediction from entropy coding, why video codecs separate motion vectors from residual — when you have structurally different data, compress the structures separately.&lt;/p&gt;
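The effect is easy to reproduce. The sketch below uses synthetic data (a hand-made skewed id distribution and repetitive literals, not GN's real tokenizer output) to compare deflating one interleaved stream against deflating the two streams separately:

```python
import random
import zlib

random.seed(0)
ESCAPE = 0x01
# Skewed token ids (a few ids dominate, like " the") plus repetitive literals.
ids = random.choices(range(2, 64), weights=[100 // i for i in range(2, 64)], k=3000)
literals = (b"plain english text with ordinary redundancy " * 70)[:3000]

# Mixed: every token costs ESCAPE+id, interleaved with literal bytes.
mixed = b"".join(bytes([ESCAPE, t, l]) for t, l in zip(ids, literals))
mixed_size = len(zlib.compress(mixed, 6))

# Split: ids alone and literals alone, compressed independently.
split_size = len(zlib.compress(bytes(ids), 6)) + len(zlib.compress(literals, 6))

print(mixed_size, split_size)  # the split streams come out smaller
```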

&lt;p&gt;The Numbers&lt;/p&gt;

&lt;p&gt;We ran this across 4 corpora, 3 seeds each — 12 independent measurements. Standard protocol: warm 500 chunks, test next 300.&lt;br&gt;
Batch size matters. Each chunk has ~37 token IDs. Deflate header overhead (~18 bytes) dominates a tiny stream. Batching solves this — concatenate N chunks before compressing the token stream:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GN split b=1:   2.226x   0.043ms   -6.6% vs brotli   ← header overhead dominates
GN split b=4:   2.385x   0.036ms   +0.1% vs brotli   ← already matching brotli
GN split b=8:   2.456x   0.036ms   +3.1% vs brotli   ← production sweet spot
GN split b=16:  2.542x   0.037ms   +6.7% vs brotli   ← diminishing returns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;b=8 is the production choice. Beyond b=16 the marginal gain flattens and you're accumulating more latency budget than the ratio improvement justifies.&lt;/p&gt;
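The batching arithmetic can be sketched with zlib directly; `[2:-4]` strips the zlib header and checksum to leave raw deflate, as in the snippet above. The token values here are made up for illustration:

```python
import zlib

def compress_token_batch(token_streams, level=6):
    # Pay deflate's fixed overhead (header + Huffman trees) once per batch
    # instead of once per chunk; record lengths so the batch can be re-split.
    lengths = [len(t) for t in token_streams]
    payload = zlib.compress(b"".join(token_streams), level)[2:-4]  # raw deflate
    return lengths, payload

# Eight small token streams (~36 ids each), per-chunk vs batched:
streams = [bytes([4, 4, 38, 20, 4, 7] * 6) for _ in range(8)]
per_chunk = sum(len(zlib.compress(s, 6)[2:-4]) for s in streams)
lengths, batched = compress_token_batch(streams)
print(per_chunk, len(batched))  # batched is smaller
```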

&lt;p&gt;Full 12-measurement verification at b=8:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Corpus       GN split b=8   vs gzip   vs brotli   p50       p99
ShareGPT     2.49–2.52x     +15%      +2%         0.043ms   0.061ms
WildChat     2.48–2.51x     +15%      +2%         0.042ms   0.073ms
LMSYS        2.50–2.56x     +14%      +2%         0.044ms   0.079ms
Ubuntu-IRC   2.06–2.09x     +49%      +28%        0.008ms   0.013ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every single measurement beats both gzip and brotli.&lt;br&gt;
And on tail latency: GN split b=8 p99 never exceeds 0.123ms. Brotli-6 p99 reaches 0.226ms. GN has 2–4x better tail latency than brotli while achieving better compression ratio.&lt;/p&gt;

&lt;p&gt;Why This Works (The Information Theory)&lt;/p&gt;

&lt;p&gt;The mixed tokenized stream had:&lt;/p&gt;

&lt;p&gt;Token entropy: 7.758 bits/token&lt;br&gt;
Encoding cost: 16 bits/token&lt;br&gt;
Waste: 51.5%&lt;/p&gt;

&lt;p&gt;The split stream:&lt;/p&gt;

&lt;p&gt;Token stream: pure symbols, deflate compresses ~2–3x on its own&lt;br&gt;
Literal stream: clean text, no structural noise, deflate compresses ~1.9x&lt;/p&gt;

&lt;p&gt;Combined result: 2.49–2.56x on the original input&lt;/p&gt;

&lt;p&gt;The separation lets each compressor do what it was designed to do. This isn't a trick — it's giving deflate the data structure it can actually exploit.&lt;/p&gt;

&lt;p&gt;The Frame Format&lt;/p&gt;

&lt;p&gt;Simple and self-contained:&lt;/p&gt;

&lt;p&gt;[2B tok_deflated_len][tok_deflated][lit_deflated]&lt;br&gt;
Two bytes of length prefix for the token stream, then the two compressed streams concatenated. Given the vocabulary, you can decode it without any other external state.&lt;br&gt;
The Rust implementation in codon.rs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub fn encode_ac_split(buf: &amp;amp;[u8], ac: &amp;amp;AhoCorasick) -&amp;gt; (Vec&amp;lt;u8&amp;gt;, Vec&amp;lt;u8&amp;gt;) {
    let mut tok_ids: Vec&amp;lt;u8&amp;gt; = Vec::new();
    let mut literals: Vec&amp;lt;u8&amp;gt; = Vec::new();
    let mut pos = 0usize;

    for m in ac.find_iter(buf) {
        for &amp;amp;b in &amp;amp;buf[pos..m.start()] {
            literals.push(b);
        }
        let pat_idx = m.pattern().as_usize();
        if pat_idx &amp;lt; 254 {
            tok_ids.push((pat_idx + 1) as u8);
        } else {
            for &amp;amp;b in &amp;amp;buf[m.start()..m.end()] {
                literals.push(b);
            }
        }
        pos = m.end();
    }
    for &amp;amp;b in &amp;amp;buf[pos..] { literals.push(b); }
    (tok_ids, literals)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;O(n) scan, single pass, clean split.&lt;/p&gt;
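Reading the frame back is symmetric; a minimal Python sketch (zlib's `wbits=-15` decompresses raw deflate):

```python
import struct
import zlib

def split_frame(frame):
    # Parse [2B tok_deflated_len][tok_deflated][lit_deflated], inflate both.
    (tok_len,) = struct.unpack('>H', frame[:2])
    toks = zlib.decompress(frame[2:2 + tok_len], wbits=-15)  # raw deflate
    lits = zlib.decompress(frame[2 + tok_len:], wbits=-15)
    return toks, lits

# Round-trip against the frame construction shown earlier:
toks, lits = b"\x04\x04\x26\x14\x04\x07", b"plain literal bytes"
dt = zlib.compress(toks, 6)[2:-4]
dl = zlib.compress(lits, 6)[2:-4]
frame = struct.pack('>H', len(dt)) + dt + dl
assert split_frame(frame) == (toks, lits)
```

Reassembling the original bytes from the two recovered streams additionally needs the shared vocabulary, as the round-trip section below notes.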

&lt;p&gt;Lossless Round-Trip&lt;/p&gt;

&lt;p&gt;The split-stream is lossless when encoder and decoder share the same vocabulary. Token IDs are indices — to decode, you need to know what pattern each ID maps to.&lt;br&gt;
GN uses a stateful model in production. Encoder and decoder share a synchronized sliding window; each frame carries a 2-byte dict_version. If they diverge, the decoder requests a resync. This keeps frames small while guaranteeing correctness.&lt;br&gt;
Round-trip verified: 5/5 test cases pass including empty buffers, raw ESCAPE bytes in input, and 10,000-byte repetitive inputs.&lt;/p&gt;

&lt;p&gt;What's Next: Fractal Dictionary Sharding&lt;/p&gt;

&lt;p&gt;The split-stream insight revealed something deeper: token and literal streams have fundamentally different statistical structure. Taking that further — different types of content have different vocabulary entirely.&lt;br&gt;
Code blocks repeat function, return, const. System messages repeat role definitions. User messages repeat question structures. Compressing them with a single shared vocabulary leaves ratio on the table.&lt;br&gt;
We're implementing fractal dictionary sharding: four vocabulary tiers (L0 universal, L1 domain, L2 session, L3 chunk) with per-shard-type routing and deterministic crystal identity per shard — same content always produces the same compressed shape. The FractalCompressor is implemented, wired into the napi production path, and passing roundtrip verification across all shard types.&lt;br&gt;
More on that in Article 5.&lt;/p&gt;

&lt;p&gt;Code and Paper&lt;/p&gt;

&lt;p&gt;GitHub: github.com/atomsrkuul/glasik-core (MIT)&lt;br&gt;
npm: gni-compression@1.0.0&lt;br&gt;
arXiv: pending cs.IR endorsement — if you're a qualified author (3+ cs papers): code 7HWUBA&lt;/p&gt;

&lt;p&gt;Robert Rider is an independent researcher building Glasik, an open-source compression and context management system for LLM deployments.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>compression</category>
      <category>algorithms</category>
      <category>llm</category>
    </item>
    <item>
      <title>GN Beats Gzip and Brotli: How a Learning Sliding Window Outperforms Static Compressors</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Tue, 07 Apr 2026 19:14:49 +0000</pubDate>
      <link>https://forem.com/atomsrkuul/gn-beats-gzip-and-brotli-how-a-learning-sliding-window-outperforms-static-compressors-2dg8</link>
      <guid>https://forem.com/atomsrkuul/gn-beats-gzip-and-brotli-how-a-learning-sliding-window-outperforms-static-compressors-2dg8</guid>
      <description>&lt;p&gt;When we published our last article, GN was within 10% of gzip on LLM conversation data. We said the remaining gap was in the entropy backend. We were wrong about the solution — but right about the problem.&lt;br&gt;
This week GN beats gzip on every corpus we tested. And on all three corpora, it beats brotli.&lt;br&gt;
Here is what we learned.&lt;/p&gt;

&lt;p&gt;The ANS Dead End&lt;/p&gt;

&lt;p&gt;Our first instinct was to improve the entropy coder. Gzip uses Huffman coding. zstd uses ANS (Asymmetric Numeral Systems). We implemented byte-renorm ANS, bit-renorm ANS, and Order-1 ANS from scratch in Rust.&lt;/p&gt;

&lt;p&gt;Results on ShareGPT:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Codec      Ratio
gzip-6     2.082x
byte-ANS   1.233x
bit-ANS    1.212x
O1-ANS     0.551x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
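zlib itself can illustrate what is happening here: its `Z_HUFFMAN_ONLY` strategy runs the entropy stage without LZ77 matching, and on repetitive data the ratio collapses the same way:

```python
import zlib

data = b"[2026-04-07T10:00:00Z] assistant: checking the repository now\n" * 100

def deflate_size(strategy):
    # level 6, deflate method, default window and memLevel, chosen strategy
    co = zlib.compressobj(6, zlib.DEFLATED, zlib.MAX_WBITS, 8, strategy)
    return len(co.compress(data) + co.flush())

huffman_only = deflate_size(zlib.Z_HUFFMAN_ONLY)    # entropy coding alone
lz_huffman = deflate_size(zlib.Z_DEFAULT_STRATEGY)  # LZ77 first, then Huffman
print(huffman_only, lz_huffman)  # the LZ77 pass does most of the work
```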

&lt;p&gt;ANS without an LZ-style preprocessing pass is worse than gzip. Every time. The reason is fundamental: entropy coders compress symbol frequency distributions. But gzip's real advantage comes from LZ77 — the sliding window that eliminates repeated byte sequences before entropy coding runs. ANS cannot fix what LZ77 needs to do first.&lt;br&gt;
We kept ANS in the codebase as a primitive for future work and moved on.&lt;/p&gt;

&lt;p&gt;The Real Problem: Per-Frame Dictionary Overhead&lt;/p&gt;

&lt;p&gt;GN has a sliding window tokenizer — it learns domain vocabulary across batches and compresses using that vocabulary. But there was a critical architectural flaw: the dictionary was serialized into every compressed frame.&lt;/p&gt;

&lt;p&gt;200 entries × ~10 bytes = ~2KB overhead per chunk. On 500-byte chunks, the dictionary cost more than the compression saved.&lt;/p&gt;

&lt;p&gt;v1 on 1000 LLM chunks: 0.502x  (expanding the data)&lt;br&gt;
The fix: stop putting the dictionary in the frame. Keep it in shared state, reference it by version number. This is exactly how brotli's static dictionary and zstd's dictionary mode work.&lt;/p&gt;

&lt;p&gt;Frame v1: magic + full_dictionary + payload  (~2KB overhead)&lt;/p&gt;

&lt;p&gt;Frame v2: magic + dict_version(4 bytes) + payload  (8 bytes overhead)&lt;/p&gt;
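The v2 header is trivially small; a sketch with struct (the 4-byte magic value here is hypothetical, since the article only gives the field sizes):

```python
import struct

MAGIC = b"GNv2"  # hypothetical 4-byte magic; the article only specifies sizes

def frame_v2(dict_version, payload):
    # magic (4B) + dict_version (4B) = 8 bytes of overhead, replacing the
    # ~2KB serialized dictionary that frame v1 carried.
    return MAGIC + struct.pack('>I', dict_version) + payload

frame = frame_v2(7, b"deflated-payload")
assert len(frame) - len(b"deflated-payload") == 8
```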

&lt;p&gt;The Corpus Window (Level 2)&lt;/p&gt;

&lt;p&gt;With the overhead fixed, we increased the window to 10,000 entries and made it global — one sliding window shared across all compression calls in the process. Every session, every shard, every conversation feeds the same accumulating vocabulary.&lt;/p&gt;

&lt;p&gt;Results immediately improved:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Corpus     L1 (per-call)   L2 (corpus window)   gzip     brotli
ShareGPT   2.191x          2.402x               2.178x   2.453x
WildChat   2.035x          2.145x               2.025x   2.234x
LMSYS      2.094x          2.231x               2.079x   2.322x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;L2 beats gzip on every corpus. The gap to brotli narrowed to 2–4%.&lt;/p&gt;

&lt;p&gt;Retrieval-Warmed Compression (Level 3)&lt;/p&gt;

&lt;p&gt;The insight: before compressing a new chunk, feed similar prior chunks through the sliding window first. This warms the dictionary with related vocabulary so the new chunk compresses better. The act of retrieval changes the compression state.&lt;/p&gt;

&lt;p&gt;We benchmarked pressurize_k (the number of prior chunks used for warming) on WildChat — the hardest corpus due to topic diversity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pressurize_k        L3 ratio   vs brotli
0 (no pressurize)   2.164x     +3.54% gap
1                   2.199x     +1.89% gap
2                   2.251x     +0.5% ahead
3                   2.207x     +1.51% gap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;pressurize_k=2 is optimal for WildChat. For ShareGPT and LMSYS, pressurize_k=3 is optimal.&lt;/p&gt;

&lt;p&gt;The optimal pressurization depth varies by corpus vocabulary diversity — more diverse corpora benefit from shallower pressurization to avoid dictionary dilution.&lt;/p&gt;
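GN's warming feeds prior chunks through its own sliding window; zlib's preset-dictionary mode gives a rough, testable analogue of the same idea (an illustration of the concept, not GN's implementation):

```python
import zlib

def compress_warmed(chunk, prior_chunks, pressurize_k=2):
    # The last k related chunks become a preset dictionary for the new chunk,
    # so backreferences into "warm" vocabulary are available immediately.
    zdict = b"".join(prior_chunks[-pressurize_k:]) if pressurize_k else b""
    if zdict:
        co = zlib.compressobj(6, zlib.DEFLATED, zlib.MAX_WBITS, zdict=zdict)
    else:
        co = zlib.compressobj(6)
    return co.compress(chunk) + co.flush()

prior = [b"the user asked about JWT token refresh and expiry handling"]
chunk = b"the user asked about JWT token refresh again"
cold = compress_warmed(chunk, prior, pressurize_k=0)
warm = compress_warmed(chunk, prior, pressurize_k=1)
print(len(cold), len(warm))  # warming shrinks the output
```

The decoder needs the same dictionary to inflate, which mirrors GN's requirement that encoder and decoder share window state.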

&lt;p&gt;Final Results: L3 Beats Brotli on All Three Corpora&lt;/p&gt;

&lt;p&gt;Verified across 3 independent corpora, 3 random seeds each:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Corpus     GN L3    gzip-6   brotli-6   margin
ShareGPT   2.526x   2.145x   2.429x     +4.0% vs brotli
LMSYS      2.401x   2.031x   2.291x     +4.8% vs brotli
WildChat   2.251x   2.023x   2.240x     +0.5% vs brotli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;All three beat gzip by 11–18%. All three beat brotli.&lt;/p&gt;

&lt;p&gt;GN beats gzip on 100% of runs across all seeds and corpora. GN beats brotli on all three corpora when the window is sufficiently warmed.&lt;/p&gt;

&lt;p&gt;Why This Works&lt;/p&gt;

&lt;p&gt;Brotli ships with a 120KB static dictionary of common web phrases. It never changes. GN's sliding window learns the specific vocabulary of your data stream as it runs. LLM conversations have crystalline structure — repeated role markers, prompt scaffolding, tool call formats, JSON patterns, reasoning templates. After seeing a few thousand examples, GN knows these patterns better than any generic dictionary ever could.&lt;/p&gt;

&lt;p&gt;The critical property: GN's compression ratio improves with stream length. Gzip and brotli are static — they cannot improve.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ShareGPT at 500 chunks:  GN 2.304x   brotli 2.363x   (behind)
ShareGPT at 2000 chunks: GN 2.440x   brotli 2.436x   (pulls ahead)
ShareGPT at 5000 chunks: GN 2.517x   brotli 2.429x   (+3.6%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The longer GN runs on a domain-specific stream, the wider the gap grows.&lt;/p&gt;

&lt;p&gt;What Comes Next&lt;/p&gt;

&lt;p&gt;The current warming uses sequential proximity — the last N chunks before the current one. The next level uses semantic similarity — retrieve the most topically related prior chunks via embedding search, regardless of when they appeared.&lt;/p&gt;

&lt;p&gt;A conversation about JWT authentication should be warmed by other authentication conversations, not by whatever happened to come before it in the stream. This is Semantic Level 3, and it should further improve results on diverse corpora like WildChat where topic jumps are common.&lt;br&gt;
Beyond that: dictionary compression (compress the dictionary itself, fractal self-similarity), cross-session persistence (window state survives restarts), and pre-trained domain dictionaries (ship a base window trained on 50k LLM conversations).&lt;/p&gt;

&lt;p&gt;The goal is to make GN the brotli of LLMs — purpose-built, measurably better, and invisible infrastructure.&lt;/p&gt;

&lt;p&gt;GN is MIT licensed. Code: github.com/atomsrkuul/glasik-core&lt;/p&gt;

&lt;p&gt;npm: gni-compression@1.0.0&lt;/p&gt;

&lt;p&gt;NLNet NGI Zero Commons Fund application #2026-06-023&lt;/p&gt;

</description>
      <category>compression</category>
      <category>rust</category>
      <category>algorithms</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Within 10% of gzip: What GN’s Semantic Compression Teaches Us</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Sun, 05 Apr 2026 04:20:54 +0000</pubDate>
      <link>https://forem.com/atomsrkuul/within-10-of-gzip-what-gns-semantic-compression-teaches-us-4cp1</link>
      <guid>https://forem.com/atomsrkuul/within-10-of-gzip-what-gns-semantic-compression-teaches-us-4cp1</guid>
      <description>&lt;p&gt;When we first started building, the goal was never to make another gzip clone. Generic compression already does that job incredibly well.&lt;/p&gt;

&lt;p&gt;The real question was different:&lt;/p&gt;

&lt;p&gt;What happens if the compressor understands the shape of the data before it ever starts packing bytes?&lt;/p&gt;

&lt;p&gt;That question led us from the original JavaScript prototype into Glasik Core, a Rust implementation focused on semantic tokenization, rolling vocabulary windows, and domain-aware preprocessing for message and agent streams.&lt;/p&gt;

&lt;p&gt;This week we hit a milestone that feels small on paper but huge architecturally:&lt;/p&gt;

&lt;p&gt;GN is now within 10% of gzip on every benchmark corpus we tested.&lt;/p&gt;

&lt;p&gt;Not better. Not faster. Not “production solved.”&lt;/p&gt;

&lt;p&gt;Just consistently close, which is exactly why this stage is exciting.&lt;/p&gt;

&lt;p&gt;The benchmark reality&lt;/p&gt;

&lt;p&gt;Current corpus results:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Corpus          Glasik Core   gzip     Relative
MEMORY.md       1.849x        2.075x   89%
ShareGPT-1k     3.752x        3.945x   95%
Ubuntu-IRC-1k   2.122x        2.357x   90%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The most important one is ShareGPT-1k hitting 95% of gzip. That corpus is extremely close to the data GN was designed for:&lt;/p&gt;

&lt;p&gt;Repeated assistant roles&lt;br&gt;
Prompt scaffolding&lt;br&gt;
Tool formatting&lt;br&gt;
Structured JSON-like patterns&lt;br&gt;
Recurring conversational templates&lt;/p&gt;

&lt;p&gt;Even though we have not passed gzip yet, nearly matching it on LLM-native streams is a strong validation signal.&lt;/p&gt;

&lt;p&gt;Why being close matters more than winning right now&lt;/p&gt;

&lt;p&gt;The remaining gap is not where many would assume. The weak point is not semantic understanding anymore.&lt;/p&gt;

&lt;p&gt;The weak point is the final entropy backend. gzip still has decades of advantage in:&lt;/p&gt;

&lt;p&gt;Huffman tuning&lt;br&gt;
Backreference heuristics&lt;br&gt;
Lazy match parsing&lt;br&gt;
Highly optimized bit packing&lt;br&gt;
Mature DEFLATE edge cases&lt;/p&gt;

&lt;p&gt;That last 5–10% is the part generic compressors are legendary at.&lt;/p&gt;

&lt;p&gt;But the semantic layer is already doing the harder thing: understanding the structure of the stream before compression begins. That’s where the long-term leverage is.&lt;/p&gt;

&lt;p&gt;The real architectural lesson&lt;/p&gt;

&lt;p&gt;The simplest way to explain the difference:&lt;/p&gt;

&lt;p&gt;gzip remembers bytes. GN remembers meaning.&lt;/p&gt;

&lt;p&gt;As the rolling vocabulary fills, repeated structures stop being treated like raw strings and start being treated as stable semantic units. That includes:&lt;/p&gt;

&lt;p&gt;Timestamps&lt;br&gt;
Speaker roles&lt;br&gt;
Repeated tool calls&lt;br&gt;
Theorem blocks&lt;br&gt;
JSON keys&lt;br&gt;
Repeated prompt shells&lt;br&gt;
Agent trace scaffolding&lt;br&gt;
Channel metadata&lt;/p&gt;

&lt;p&gt;Performance improves the longer the stream runs. Instead of relying only on a fixed byte-history window, GN reinforces the vocabulary of the domain itself. That’s the core bet.&lt;/p&gt;

&lt;p&gt;Why Rust changed the debugging loop&lt;/p&gt;

&lt;p&gt;The JavaScript prototype proved the idea. Rust made it possible to trust the measurements.&lt;/p&gt;

&lt;p&gt;One concrete example: during corpus benchmarking we hit a rolling-frequency bug silently inflating token counts over long windows. Compression ratios looked “better,” but only because the vocabulary statistics were wrong.&lt;/p&gt;

&lt;p&gt;The fix only became obvious because Rust forced us to reason explicitly about integer width, overflow behavior, and ownership boundaries inside the rolling state machine.&lt;/p&gt;

&lt;p&gt;Fixing it tightened the corpus results and gave us confidence that the “within 10%” milestone is real, not a measurement artifact. That debugging loop alone justified the rewrite.&lt;/p&gt;

&lt;p&gt;What makes this exciting now&lt;/p&gt;

&lt;p&gt;The missing performance is now localized. We know exactly where the gap is:&lt;/p&gt;

&lt;p&gt;Residual encoding&lt;br&gt;
Entropy refinement&lt;br&gt;
Better state models&lt;br&gt;
Adaptive codon dictionaries&lt;br&gt;
Specialized chat residual codecs&lt;/p&gt;

&lt;p&gt;That is a much better place to be than wondering whether the entire idea works. The semantic layer is clearly competitive. Now it’s about tightening the backend until the semantic advantage outweighs gzip’s entropy maturity.&lt;/p&gt;

&lt;p&gt;What’s next&lt;/p&gt;

&lt;p&gt;Tonight’s most interesting work was deeper in the backend: we now have a reference-safe ANS entropy coder implemented from scratch in Rust, using the same family of techniques that powers zstd.&lt;/p&gt;

&lt;p&gt;The current version uses correctness-first binary renormalization so we can prove round-trip behavior before optimizing. Next step: bit-level state refinement and faster renormalization transforms.&lt;/p&gt;
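For intuition, here is a toy byte-renormalizing rANS coder in Python, one member of the family described; a sketch with a fixed 8-bit frequency model and correctness-first renormalization, not Glasik's Rust implementation:

```python
def rans_tables(freqs):
    # Cumulative frequency table; frequencies must sum to 256 (8-bit model).
    total = sum(freqs.values())
    assert total == 256
    cum, start = {}, 0
    for s, f in freqs.items():
        cum[s] = start
        start += f
    return cum, total

def rans_encode(data, freqs):
    cum, total = rans_tables(freqs)
    x, out = 1 << 16, bytearray()
    for s in reversed(data):             # rANS encodes in reverse order
        f = freqs[s]
        while x >= (f << 16):            # renormalize: shift a byte out
            out.append(x & 0xFF)
            x >>= 8
        x = (x // f) * total + cum[s] + (x % f)
    return x, bytes(reversed(out))       # decoder then reads bytes forward

def rans_decode(x, stream, freqs, n):
    cum, total = rans_tables(freqs)
    slot_to_sym = {}
    for s, f in freqs.items():
        for slot in range(cum[s], cum[s] + f):
            slot_to_sym[slot] = s
    out, pos = [], 0
    for _ in range(n):
        slot = x % total
        s = slot_to_sym[slot]
        x = freqs[s] * (x // total) + slot - cum[s]
        while x < (1 << 16) and pos < len(stream):
            x = (x << 8) | stream[pos]   # renormalize: pull a byte back in
            pos += 1
        out.append(s)
    return out

freqs = {0: 128, 1: 64, 2: 64}           # skewed 3-symbol model
data = [0, 1, 0, 0, 2, 1, 0, 0] * 20
x, stream = rans_encode(data, freqs)
assert rans_decode(x, stream, freqs, len(data)) == data
```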

&lt;p&gt;This work directly targets the exact 5–10% gap the benchmarks are still showing.&lt;/p&gt;

&lt;p&gt;The path forward is finally clear:&lt;/p&gt;

&lt;p&gt;Semantic understanding is already competitive&lt;br&gt;
Entropy packing is the remaining frontier&lt;br&gt;
The architecture now tells us exactly where to push&lt;/p&gt;

&lt;p&gt;At this point, GN (our semantic agent layer) and Glasik Core (the compression engine) feel less like an experiment and more like a real compression architecture.&lt;/p&gt;

</description>
      <category>compression</category>
      <category>rust</category>
      <category>algorithms</category>
      <category>opensource</category>
    </item>
    <item>
      <title>We Built Domain-Specific Compression for Messages. Here's What We Learned.</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Fri, 03 Apr 2026 17:35:19 +0000</pubDate>
      <link>https://forem.com/atomsrkuul/we-built-domain-specific-compression-for-messages-heres-what-we-learned-2o28</link>
      <guid>https://forem.com/atomsrkuul/we-built-domain-specific-compression-for-messages-heres-what-we-learned-2o28</guid>
      <description>&lt;p&gt;Why gzip loses to custom compression on chat data — and what we learned building a lossless message codec from scratch.&lt;br&gt;
The Problem&lt;br&gt;
Message data is expensive at scale. Discord servers, Slack workspaces, OpenClaw chat logs — each message is ~500 bytes. Generic compression gets 2-3x. That's good but messages have structure generic algorithms ignore.&lt;br&gt;
[2026-04-03T11:00:00Z] user: Hello, can you check the repo?&lt;br&gt;
[2026-04-03T11:00:15Z] bot: Checking repository...&lt;br&gt;
Timestamp format, role prefixes, platform names — these repeat thousands of times. A domain-specific dictionary front-loads that knowledge instead of discovering it slowly.&lt;br&gt;
The Architecture&lt;br&gt;
We built two layers that work together:&lt;br&gt;
GN (Glasik Notation) — semantic compression. Extracts structure before compression, maps repeated values to IDs, recognizes message templates, reduces entropy before any algorithm touches the data.&lt;br&gt;
GNI (Glasik Notation Interface) — transmission codec. Handles serialization, framing, integrity verification, wire protocol.&lt;br&gt;
Together they form a complete pipeline. Neither is useful alone — GN without GNI has no reliable wire protocol, GNI without GN has no semantic advantage over gzip.&lt;br&gt;
What We Shipped: GNI v1&lt;br&gt;
Phase 1 delivers the foundation:&lt;/p&gt;

&lt;p&gt;Canonical binary serialization (varint encoding)&lt;br&gt;
Versioned frame format (backward compatible forever)&lt;br&gt;
CRC32 integrity verification&lt;br&gt;
100% lossless round-trip recovery verified on 2,000+ real messages&lt;br&gt;
Zero external dependencies, 482 lines of JavaScript&lt;/p&gt;
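The post doesn't show GNI's wire format, but "varint encoding" conventionally means LEB128-style variable-length integers; a minimal sketch:

```python
def varint_encode(n):
    # 7 data bits per byte; the high bit marks continuation (LEB128 style).
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(buf, pos=0):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos

# Small ids stay at one byte; a Unix timestamp takes five:
assert varint_encode(1) == b"\x01"
assert varint_decode(varint_encode(1743744000)) == (1743744000, 5)
```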

&lt;p&gt;Compression ratios from semantic tokenization are a Phase 2 deliverable. The tokenizer is currently stubbed — Phase 2 implements the domain-specific dictionary that gives GN its advantage.&lt;/p&gt;

&lt;p&gt;The Bug We Caught&lt;/p&gt;

&lt;p&gt;During validation on 1,038,324 real dialogue messages we hit a CRC32 mismatch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stored CRC:   1428394006
Computed CRC: -1889366573
Mismatch!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The bug: checksum computed over (header + payload) instead of just (payload). Three-line fix, proper unsigned arithmetic (&amp;gt;&amp;gt;&amp;gt; 0). Full corpus re-validated in 25 seconds. 6/6 tests passed.&lt;br&gt;
Caught before production. Caught by us, not a user. This is why you validate before you ship.&lt;/p&gt;

&lt;p&gt;Try It&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const GNLz4V2 = require('./src/gn-lz4-v2-complete');
const codec = new GNLz4V2();

const messages = [
  { templateId: 0, ts: 1743744000, author: 1, channel: 1, payload: 'hello world' },
  { templateId: 0, ts: 1743744001, author: 2, channel: 1, payload: 'how are you?' }
];

const result = codec.compress(messages);
const recovered = codec.decompress(result.compressed);
console.log(recovered.length + ' messages recovered losslessly');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm test
# 37/37 passing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
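The CRC fix described above generalizes to a simple rule: checksum exactly the bytes the decoder will checksum, in an unsigned domain. A Python sketch of the payload-only scheme (Python's `zlib.crc32` is already unsigned, where the JavaScript fix needed `>>> 0`):

```python
import zlib

def frame(header, payload):
    # CRC covers the payload only (the bug was checksumming header+payload).
    return header + payload + zlib.crc32(payload).to_bytes(4, 'big')

def verify(buf, header_len):
    payload, stored = buf[header_len:-4], buf[-4:]
    return zlib.crc32(payload).to_bytes(4, 'big') == stored

f = frame(b"\x01\x00", b"hello world")
assert verify(f, 2)
```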

&lt;p&gt;What We Are Not Claiming&lt;/p&gt;

&lt;p&gt;Compression ratios are Phase 2. Phase 1 is serialization and framing.&lt;br&gt;
Not production-proven at scale. Validated on our own systems.&lt;br&gt;
No external users yet. Looking for third-party benchmarks and feedback.&lt;/p&gt;

&lt;p&gt;What We Are Claiming&lt;/p&gt;

&lt;p&gt;Solid foundation: lossless, versioned, integrity-verified&lt;br&gt;
Real validation: 1M+ messages, caught our own bugs before release&lt;br&gt;
Clear roadmap: Phase 2 connects GN semantic compression into GNI transmission layer&lt;br&gt;
Applied for NLNet NGI Zero funding (application 2026-06-023) to deliver Phase 2&lt;/p&gt;

&lt;p&gt;Links&lt;br&gt;
GitHub: &lt;a href="https://github.com/atomsrkuul/glasik-notation" rel="noopener noreferrer"&gt;https://github.com/atomsrkuul/glasik-notation&lt;/a&gt;&lt;br&gt;
License: MIT&lt;br&gt;
If you compress messages and want to share results, open an issue. That kind of external validation is what makes this real.&lt;/p&gt;

</description>
      <category>compression</category>
      <category>algorithms</category>
      <category>opensource</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
