<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Naveen Badiger</title>
    <description>The latest articles on Forem by Naveen Badiger (@naveen_badiger_47b35b3f61).</description>
    <link>https://forem.com/naveen_badiger_47b35b3f61</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3811044%2F6c254aad-2908-4fb6-9ccc-aa7fad6af6f5.jpg</url>
      <title>Forem: Naveen Badiger</title>
      <link>https://forem.com/naveen_badiger_47b35b3f61</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/naveen_badiger_47b35b3f61"/>
    <language>en</language>
    <item>
      <title>How I built a 39x compression pipeline with AES-256-GCM in Python (and why the dictionary is everything)</title>
      <dc:creator>Naveen Badiger</dc:creator>
      <pubDate>Sat, 07 Mar 2026 06:10:31 +0000</pubDate>
      <link>https://forem.com/naveen_badiger_47b35b3f61/how-i-built-a-39x-compression-pipeline-with-aes-256-gcm-in-python-and-why-the-dictionary-is-10nc</link>
      <guid>https://forem.com/naveen_badiger_47b35b3f61/how-i-built-a-39x-compression-pipeline-with-aes-256-gcm-in-python-and-why-the-dictionary-is-10nc</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvcep1r77zs6e5gao2nd.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvcep1r77zs6e5gao2nd.gif" alt="QUANTUM-PULSE — docker up, seal, unseal, benchmark" width="640" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I store LLM training data. Every tool I found either compresses it or encrypts it — nothing did both. So I built QUANTUM-PULSE.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;payload → MsgPack → Zstd-L22 + corpus dict → AES-256-GCM → SHA3-256 Merkle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: MsgPack over JSON
&lt;/h2&gt;

&lt;p&gt;Before compression, MsgPack shrinks the payload by ~22%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;msgpack&lt;/span&gt;
&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msgpack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;packb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_bin_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 22% smaller than json.dumps().encode() — better input = better downstream ratio
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: The dictionary insight
&lt;/h2&gt;

&lt;p&gt;Standard Zstd builds a probability model from scratch every time. For training records sharing the same schema, this is wasted work.&lt;/p&gt;

&lt;p&gt;Train once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zstandard&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;

&lt;span class="n"&gt;dict_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train_dictionary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;131072&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corpus_samples&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;cctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ZstdCompressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dict_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dict_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: &lt;strong&gt;28.46× with dict vs 14.64× vanilla — +94.4% improvement, 29% faster.&lt;/strong&gt;&lt;br&gt;
The dictionary retrains automatically every 24h via APScheduler as new data arrives.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3: AES-256-GCM with per-blob HKDF keys
&lt;/h2&gt;

&lt;p&gt;One passphrase → master key → unique key per blob:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cryptography.hazmat.primitives.kdf.hkdf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HKDF&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cryptography.hazmat.primitives.ciphers.aead&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AESGCM&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Unique key per blob — one compromise reveals nothing about others
&lt;/span&gt;&lt;span class="n"&gt;blob_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HKDF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;algorithm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hashes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SHA256&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;salt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;blob_salt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pulse_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;derive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;master_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;nonce&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urandom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# fresh per seal
&lt;/span&gt;&lt;span class="n"&gt;ciphertext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AESGCM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob_key&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compressed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: SHA3-256 Merkle tree
&lt;/h2&gt;

&lt;p&gt;Every unseal verifies a Merkle proof before returning any data. Silent corruption — bit rot, tampered storage, partial writes — is caught cryptographically, not by hoping checksums match.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Enc&lt;/th&gt;
&lt;th&gt;Integrity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;snappy&lt;/td&gt;
&lt;td&gt;12×&lt;/td&gt;
&lt;td&gt;1.3ms&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gzip-9&lt;/td&gt;
&lt;td&gt;62×&lt;/td&gt;
&lt;td&gt;9.9ms&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zstd-L3&lt;/td&gt;
&lt;td&gt;76×&lt;/td&gt;
&lt;td&gt;1.6ms&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;QUANTUM-PULSE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95×&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;590ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✓&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✓&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zstd-L22&lt;/td&gt;
&lt;td&gt;99×&lt;/td&gt;
&lt;td&gt;1745ms&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;brotli-11&lt;/td&gt;
&lt;td&gt;112×&lt;/td&gt;
&lt;td&gt;1441ms&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;QUANTUM-PULSE is the only option with both encryption and integrity — and it's 3× faster than vanilla zstd-L22.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No formal third-party crypto audit yet (private reporting in SECURITY.md)&lt;/li&gt;
&lt;li&gt;PBKDF2-SHA256 over Argon2 — Argon2 planned for v1.1&lt;/li&gt;
&lt;li&gt;MongoDB-first — S3/GCS backends on the roadmap&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Naveenub/quantum-pulse
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env   &lt;span class="c"&gt;# set QUANTUM_PASSPHRASE&lt;/span&gt;
docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

qp seal dataset.json &lt;span class="nt"&gt;--tag&lt;/span&gt; &lt;span class="nv"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;v1
qp unseal &amp;lt;pulse-id&amp;gt;

python scripts/benchmark_demo.py   &lt;span class="c"&gt;# reproduce the numbers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MIT license. 277 tests.&lt;br&gt;
→ &lt;a href="https://github.com/Naveenub/quantum-pulse" rel="noopener noreferrer"&gt;https://github.com/Naveenub/quantum-pulse&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>security</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
