<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alankrit Verma</title>
    <description>The latest articles on Forem by Alankrit Verma (@alankritverma).</description>
    <link>https://forem.com/alankritverma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3810413%2F534cb6dc-3366-4fa4-b44a-49ba12793a1b.jpg</url>
      <title>Forem: Alankrit Verma</title>
      <link>https://forem.com/alankritverma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alankritverma"/>
    <language>en</language>
    <item>
      <title>The Last Pivot: Why Quality Gates Killed My Final KV-Cache Speedup</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Mon, 27 Apr 2026 04:40:54 +0000</pubDate>
      <link>https://forem.com/alankritverma/the-last-pivot-why-quality-gates-killed-my-final-kv-cache-speedup-3m0f</link>
      <guid>https://forem.com/alankritverma/the-last-pivot-why-quality-gates-killed-my-final-kv-cache-speedup-3m0f</guid>
      <description>&lt;p&gt;I wanted to answer one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;After packed-codebook TurboQuant failed, was there still a credible latency path?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The short answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;there was a real speed ceiling, but no stable quality-preserving implementation path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Hardware-friendly int4 K/V passed byte gates but failed real-KV logit quality.&lt;/li&gt;
&lt;li&gt;Qwen2.5-7B work reduction had a real speed ceiling: &lt;code&gt;p_attn=0.334&lt;/code&gt;, with &lt;code&gt;1.20x&lt;/code&gt; to &lt;code&gt;1.21x&lt;/code&gt; projected speedup at 5% selector overhead.&lt;/li&gt;
&lt;li&gt;Oracle quality failed anyway: no implementable selector passed all 4 decode steps.&lt;/li&gt;
&lt;li&gt;The lesson was strict: a speed ceiling is only permission to run a quality gate, not permission to implement.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evidence
&lt;/h2&gt;

&lt;p&gt;I put the detailed benchmark notes in the public evidence repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Results ledger: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hardware-friendly int4 K/V quality probe: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/hfkv-quality-k0-prep-summary.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/hfkv-quality-k0-prep-summary.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Work-reduction speed and oracle-quality probe: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/work-reduction-oracle-summary.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/work-reduction-oracle-summary.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The rule after the pivot
&lt;/h2&gt;

&lt;p&gt;At this point, another TurboQuant variant would have been circular.&lt;/p&gt;

&lt;p&gt;So the rule changed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;no implementation before a speed ceiling and an oracle-quality gate pass on the same target.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That rule matters because each partial result can otherwise be overinterpreted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;byte compression can pass while quality fails&lt;/li&gt;
&lt;li&gt;synthetic quality can pass while real-KV quality fails&lt;/li&gt;
&lt;li&gt;attention-only speed can pass while full decode speed cannot move enough&lt;/li&gt;
&lt;li&gt;a speed ceiling can pass while no stable selector exists&lt;/li&gt;
&lt;li&gt;a row-level oracle pass can hide step-to-step instability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, this became the gate-discipline part of the series: deciding when not to build.&lt;/p&gt;

&lt;p&gt;The setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;storage compression had already separated itself from latency&lt;/li&gt;
&lt;li&gt;eager value-path approximations had failed&lt;/li&gt;
&lt;li&gt;fused packed-codebook logits had not beaten dense logits by enough&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By this point, the original packed-codebook TurboQuant latency path was closed.&lt;/p&gt;

&lt;p&gt;The evidence was not subtle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eager value-path variants failed&lt;/li&gt;
&lt;li&gt;exact cleanup did not move latency enough&lt;/li&gt;
&lt;li&gt;primitive feasibility failed badly&lt;/li&gt;
&lt;li&gt;fused packed-codebook logits beat eager but missed dense-speed bars&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the next move could not be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;one more TurboQuant variant&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It had to change the hypothesis.&lt;/p&gt;

&lt;p&gt;I tested two pivots:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;hardware-friendly int4 K/V&lt;/li&gt;
&lt;li&gt;long-context work reduction&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both taught useful things.&lt;/p&gt;

&lt;p&gt;Neither justified a runtime latency implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhot5ccus61zt6znf7l7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhot5ccus61zt6znf7l7.png" alt="Final pivot gates: HFKV failed quality, work reduction passed speed ceiling, oracle quality failed" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pivot 1: hardware-friendly int4 K/V
&lt;/h2&gt;

&lt;p&gt;The packed-codebook representation was expensive to consume.&lt;/p&gt;

&lt;p&gt;So the next idea was simpler:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if the representation is less clever but more hardware-friendly?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of codebooks, rotations, and residual machinery, use blockwise int4 K/V.&lt;/p&gt;

&lt;p&gt;Two formats were tested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;symmetric int4&lt;/li&gt;
&lt;li&gt;affine int4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both quantized over last-dimension blocks with &lt;code&gt;block_size=32&lt;/code&gt;.&lt;/p&gt;
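&lt;p&gt;For illustration only (this is not the probe's code), both flavors can be sketched in a few lines of NumPy: one scale per 32-element block, and for affine an additional per-block zero point:&lt;/p&gt;

```python
import numpy as np

BLOCK = 32  # block_size over the last dimension

def quant_sym_int4(x):
    # Symmetric int4: one scale per block, codes in [-7, 7].
    xb = x.reshape(-1, BLOCK)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)
    codes = np.clip(np.rint(xb / scale), -7, 7).astype(np.int8)
    return codes, scale

def quant_aff_int4(x):
    # Affine int4: per-block scale plus zero point, codes in [0, 15].
    xb = x.reshape(-1, BLOCK)
    lo = xb.min(axis=1, keepdims=True)
    hi = xb.max(axis=1, keepdims=True)
    scale = np.where(hi == lo, 1.0, (hi - lo) / 15.0)
    codes = np.clip(np.rint((xb - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequant(codes, scale, shift=None):
    # Symmetric dequant is one multiply; affine adds one shift.
    out = codes.astype(np.float32) * scale
    if shift is not None:
        out = out + shift
    return out
```

&lt;p&gt;The dequant path is the point: a multiply, optionally a shift, nothing exotic. That simplicity was the appeal of the pivot.&lt;/p&gt;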

&lt;p&gt;The hope was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simpler unpack/dequant path&lt;/li&gt;
&lt;li&gt;predictable memory layout&lt;/li&gt;
&lt;li&gt;fewer exotic operations&lt;/li&gt;
&lt;li&gt;easier future kernel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was not integrated into the public cache API.&lt;/p&gt;

&lt;p&gt;It was a quality and K0-prep probe only.&lt;/p&gt;

&lt;p&gt;The rule was strict:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if real model K/V quality fails, do not write the kernel.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  HFKV passed bytes and failed quality
&lt;/h2&gt;

&lt;p&gt;On the real-KV check with &lt;code&gt;HuggingFaceTB/SmolLM2-135M-Instruct&lt;/code&gt;, both formats compressed KV substantially.&lt;/p&gt;

&lt;p&gt;They also preserved next-token argmax.&lt;/p&gt;

&lt;p&gt;But both failed the hard decode-logit MSE gate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Top-k@10&lt;/th&gt;
&lt;th&gt;Argmax&lt;/th&gt;
&lt;th&gt;Decode-Logit MSE&lt;/th&gt;
&lt;th&gt;Required MSE&lt;/th&gt;
&lt;th&gt;KV Ratio&lt;/th&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;symmetric int4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.800&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.739284&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;=0.25&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3.56x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;affine int4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.800&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.360282&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;=0.25&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3.20x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tempting interpretation would be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;argmax survived, so maybe this is fine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is not a good enough bar.&lt;/p&gt;

&lt;p&gt;Top-k overlap only just hit the minimum bar of &lt;code&gt;0.800&lt;/code&gt;, and decode-logit MSE landed at roughly 5x to 7x the allowed ceiling.&lt;/p&gt;

&lt;p&gt;The correct decision was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;do not build HFKV-K0.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The reusable lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;byte compression is not quality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The synthetic K0-prep numbers looked fine, but synthetic random tensors were not predictive enough. Real model K/V was the gate, and it failed.&lt;/p&gt;
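&lt;p&gt;The gate itself is mechanical. A minimal sketch of the kind of check involved (hypothetical helper; thresholds taken from the table above):&lt;/p&gt;

```python
import numpy as np

def quality_gate(logits_ref, logits_q, k=10, mse_max=0.25, topk_min=0.8):
    # Hard decode-logit gate: MSE against the dense reference,
    # plus top-k overlap and argmax agreement.
    mse = float(np.mean((logits_ref - logits_q) ** 2))
    top_ref = set(np.argsort(logits_ref)[-k:].tolist())
    top_q = set(np.argsort(logits_q)[-k:].tolist())
    overlap = len(top_ref.intersection(top_q)) / k
    argmax_ok = int(np.argmax(logits_ref)) == int(np.argmax(logits_q))
    passed = np.less_equal(mse, mse_max) and np.greater_equal(overlap, topk_min) and argmax_ok
    return bool(passed), mse, overlap
```

&lt;p&gt;With the measured numbers, overlap and argmax pass while MSE sits at 1.36 to 1.74 against a ceiling of 0.25, so the gate fails. All three conditions have to hold at once.&lt;/p&gt;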

&lt;h2&gt;
  
  
  Pivot 2: work reduction
&lt;/h2&gt;

&lt;p&gt;After that, I stopped asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;can I compress the historical values?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;does the model actually need all historical tokens for stable decode?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a different latency hypothesis.&lt;/p&gt;

&lt;p&gt;It is not cache compression.&lt;/p&gt;

&lt;p&gt;It is dense-attention work reduction.&lt;/p&gt;

&lt;p&gt;The idea is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;full attention over history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;attention over a selected subset of history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If only a fraction &lt;code&gt;f&lt;/code&gt; of historical tokens are active, the idealized attention work shrinks.&lt;/p&gt;

&lt;p&gt;But this only matters if attention is a large enough part of full decode.&lt;/p&gt;

&lt;p&gt;So the first gate was a speed ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed ceiling math
&lt;/h2&gt;

&lt;p&gt;Let:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;p_attn&lt;/code&gt; be the fraction of decode time spent in attention&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;f&lt;/code&gt; be the active historical fraction&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;h&lt;/code&gt; be selector/masking overhead as a fraction of the original decode step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rough projected speedup is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;speedup = 1 / ((1 - p_attn) + p_attn * f + h)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This equation is intentionally simple.&lt;/p&gt;

&lt;p&gt;It asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;even if the selector existed, is there enough attention work to remove?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On small models, the answer was mostly no.&lt;/p&gt;

&lt;p&gt;For SmolLM2-135M, attention was not a large enough share of full decode. The quality signal was real, but the latency ceiling was too low.&lt;/p&gt;

&lt;p&gt;So I moved to a larger real target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Qwen/Qwen2.5-7B-Instruct
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;at roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;8192 prompt tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Qwen had a real speed ceiling
&lt;/h2&gt;

&lt;p&gt;The Qwen2.5-7B speed-ceiling result was the strongest latency signal in the whole project.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;bfloat16&lt;/code&gt;, the dense decode step was about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;20.447 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDPA real-model projection estimated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;p_attn = 0.334
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Projected speedups:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Active Fraction&lt;/th&gt;
&lt;th&gt;Projected Speedup, 0% Overhead&lt;/th&gt;
&lt;th&gt;Projected Speedup, 5% Overhead&lt;/th&gt;
&lt;th&gt;Projected Speedup, 10% Overhead&lt;/th&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0.337&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.28x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.21x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.14x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;pass at 5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0.350&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.28x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.20x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.13x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;pass at 5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0.376&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.26x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.19x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.12x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;near miss&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
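&lt;p&gt;The table is just the ceiling formula evaluated at the measured &lt;code&gt;p_attn = 0.334&lt;/code&gt;, which makes it easy to re-derive:&lt;/p&gt;

```python
def projected_speedup(p_attn, f, h):
    # speedup = 1 / ((1 - p_attn) + p_attn * f + h)
    return 1.0 / ((1.0 - p_attn) + p_attn * f + h)

P_ATTN = 0.334  # measured SDPA attention share for Qwen2.5-7B at 8192 context

for f in (0.337, 0.350, 0.376):
    row = [round(projected_speedup(P_ATTN, f, h), 2) for h in (0.0, 0.05, 0.10)]
    print(f, row)
```

&lt;p&gt;Rounding to two decimals reproduces the table, including the 1.28x / 1.21x / 1.14x row for &lt;code&gt;f=0.337&lt;/code&gt;.&lt;/p&gt;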

&lt;p&gt;This was not a fake result.&lt;/p&gt;

&lt;p&gt;There was real room.&lt;/p&gt;

&lt;p&gt;But speed ceiling is only half the story.&lt;/p&gt;

&lt;p&gt;The next question was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;can an implementable selector preserve quality while keeping only about 34-38% of history?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Oracle quality failed
&lt;/h2&gt;

&lt;p&gt;The official quality gate used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model: &lt;code&gt;Qwen/Qwen2.5-7B-Instruct&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;context: &lt;code&gt;8192&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;decode steps: &lt;code&gt;4&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;dtype: &lt;code&gt;bfloat16&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;active fraction gate: &lt;code&gt;0.376&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dense reference logits were finite (no NaN or Inf) on all steps.&lt;/p&gt;

&lt;p&gt;So this was a valid quality run.&lt;/p&gt;

&lt;p&gt;The headline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;15 selector-step rows passed,
but 0 selector configurations passed all 4 decode steps.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;A row-level pass says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this selector worked on this step.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An implementation pass needs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this selector family worked consistently across decode steps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No implementable selector did.&lt;/p&gt;
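&lt;p&gt;In code, the gap between those two claims is a single &lt;code&gt;all(...)&lt;/code&gt;. A sketch using per-step outcomes transcribed from the table below (&lt;code&gt;True&lt;/code&gt; means the step passed its gate):&lt;/p&gt;

```python
# Per-step gate outcomes for each selector config, steps 0..3.
step_results = {
    "global_block_mass:f=0.3760:b=16": [True, True, False, True],   # failed step 2
    "global_block_mass:f=0.3500:b=16": [True, False, False, True],  # failed steps 1, 2
    "global_block_mass:f=0.3370:b=16": [True, False, True, True],   # failed step 1
    "recent_sink:sink=4:recent=3072": [False, False, False, False],
}

# Row-level passes accumulate per step; a config passes only if every step passes.
row_passes = sum(sum(steps) for steps in step_results.values())
config_passes = [name for name, steps in step_results.items() if all(steps)]
```

&lt;p&gt;Individual step rows keep passing while &lt;code&gt;config_passes&lt;/code&gt; stays empty. That is the row-level trap.&lt;/p&gt;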

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Selector&lt;/th&gt;
&lt;th&gt;Active Hist Fraction&lt;/th&gt;
&lt;th&gt;Passed Steps&lt;/th&gt;
&lt;th&gt;Failed Steps&lt;/th&gt;
&lt;th&gt;Max MSE&lt;/th&gt;
&lt;th&gt;Main Failure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;global_block_mass:f=0.3760:b=16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.374&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0,1,3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.585789&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;step &lt;code&gt;2&lt;/code&gt; MSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;global_block_mass:f=0.3500:b=16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.348&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0,3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1,2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.900147&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;steps &lt;code&gt;1,2&lt;/code&gt; MSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;global_block_mass:f=0.3370:b=16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.335&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0,2,3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.919634&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;step &lt;code&gt;1&lt;/code&gt; MSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;recent_sink:sink=4:recent=3072&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.375&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0,1,2,3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.163465&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;layer-local relative L2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
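&lt;p&gt;The &lt;code&gt;recent_sink&lt;/code&gt; family at the bottom is the simplest to sketch: keep the first &lt;code&gt;sink&lt;/code&gt; tokens plus the trailing &lt;code&gt;recent&lt;/code&gt; tokens (a minimal illustration, not the probe's code):&lt;/p&gt;

```python
def recent_sink_keep(history_len, sink=4, recent=3072):
    # Keep-mask over history: attention sinks at the front
    # plus a sliding window of the most recent tokens.
    keep = [False] * history_len
    for i in range(min(sink, history_len)):
        keep[i] = True
    start = max(0, history_len - recent)
    for i in range(start, history_len):
        keep[i] = True
    return keep
```

&lt;p&gt;At 8192 tokens of history this keeps 3076 tokens, an active fraction of about 0.375, matching the table row.&lt;/p&gt;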

&lt;p&gt;The most tempting result was &lt;code&gt;recent_sink:sink=4:recent=3072&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;max decode-logit MSE &lt;code&gt;0.163465&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;top-k overlap at least &lt;code&gt;0.8&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;stable argmax&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it failed every step on layer-local median post-&lt;code&gt;o_proj&lt;/code&gt; relative L2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.1394 &amp;gt; 0.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, the dangerous move would be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;relax the quality gate because the result is close.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is how projects go in circles.&lt;/p&gt;

&lt;p&gt;The gate existed before the result.&lt;/p&gt;

&lt;p&gt;The result failed the gate.&lt;/p&gt;

&lt;p&gt;So the decision was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;do not build the runtime selector.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What survived
&lt;/h2&gt;

&lt;p&gt;The final result is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;everything was useless.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The surviving lessons are more precise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dense GPU decode is a serious baseline
&lt;/h3&gt;

&lt;p&gt;Dense attention is not naive.&lt;/p&gt;

&lt;p&gt;It has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simple data layout&lt;/li&gt;
&lt;li&gt;optimized kernels&lt;/li&gt;
&lt;li&gt;clean tensor operations&lt;/li&gt;
&lt;li&gt;no unpack/reconstruct overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any compressed path has to beat that, not just beat its own prototype.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory and latency are separate scorecards
&lt;/h3&gt;

&lt;p&gt;Cache compression can be valuable even if it does not reduce latency.&lt;/p&gt;

&lt;p&gt;The right memory/capacity metrics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cache bytes per token&lt;/li&gt;
&lt;li&gt;maximum context before OOM&lt;/li&gt;
&lt;li&gt;batch size at fixed VRAM&lt;/li&gt;
&lt;li&gt;throughput under memory pressure&lt;/li&gt;
&lt;li&gt;quality at long context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a different product goal from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;faster decode when dense already fits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Speed ceiling is necessary but not sufficient
&lt;/h3&gt;

&lt;p&gt;Qwen2.5-7B proved there can be enough attention share for work reduction to matter.&lt;/p&gt;

&lt;p&gt;But the selector also has to preserve quality.&lt;/p&gt;

&lt;p&gt;The oracle could not find one stable implementable selector under the hard gate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Paper claims and integration claims are different
&lt;/h3&gt;

&lt;p&gt;A paper can make a valid primitive or memory claim.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;transformers&lt;/code&gt; integration needs a different proof:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;full decode timing&lt;/li&gt;
&lt;li&gt;real model quality&lt;/li&gt;
&lt;li&gt;update cost&lt;/li&gt;
&lt;li&gt;value path&lt;/li&gt;
&lt;li&gt;generation overhead&lt;/li&gt;
&lt;li&gt;target hardware&lt;/li&gt;
&lt;li&gt;target dtype baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not interchangeable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final state
&lt;/h2&gt;

&lt;p&gt;The current honest state is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No active GPU decode-latency implementation path.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Closed as latency paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eager TurboQuant-family variants&lt;/li&gt;
&lt;li&gt;packed-codebook fused K1/residual/value integration&lt;/li&gt;
&lt;li&gt;hardware-friendly int4 K/V kernel work&lt;/li&gt;
&lt;li&gt;Qwen work-reduction selector implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still potentially useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;storage/capacity compression&lt;/li&gt;
&lt;li&gt;exact cleanup as baseline hygiene&lt;/li&gt;
&lt;li&gt;the measurement discipline&lt;/li&gt;
&lt;li&gt;the failure map&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next reasonable project, if the goal continues, is not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;one more latency variant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a separate memory/capacity plan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;with its own scorecard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The whole project started with a simple hope:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;smaller KV cache, faster transformers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The actual lesson was harder:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;compression only speeds up decode if the compressed representation is cheap to consume on the target execution path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That condition failed repeatedly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;in eager value approximations&lt;/li&gt;
&lt;li&gt;in packed-codebook primitive timing&lt;/li&gt;
&lt;li&gt;in fused logits upper bounds&lt;/li&gt;
&lt;li&gt;in simple int4 K/V quality&lt;/li&gt;
&lt;li&gt;in long-context work reduction quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not wasted work.&lt;/p&gt;

&lt;p&gt;It is a map of where the obvious traps are.&lt;/p&gt;

&lt;p&gt;And for performance engineering, a good negative map is often the thing that prevents the next six months of bad work.&lt;/p&gt;

&lt;p&gt;The final lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A speed ceiling is only permission to run a quality gate. It is not permission to implement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Qwen2.5-7B had enough attention share for work reduction to matter. The oracle still failed to find one stable implementable selector. That is why the latency path stopped.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>research</category>
      <category>benchmarking</category>
    </item>
    <item>
      <title>Beating Eager TurboQuant Was Not Enough: Why Dense GPU Attention Still Won</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Mon, 27 Apr 2026 04:37:16 +0000</pubDate>
      <link>https://forem.com/alankritverma/beating-eager-turboquant-was-not-enough-why-dense-gpu-attention-still-won-adn</link>
      <guid>https://forem.com/alankritverma/beating-eager-turboquant-was-not-enough-why-dense-gpu-attention-still-won-adn</guid>
      <description>&lt;p&gt;I wanted to answer one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If I remove eager overhead, can a TurboQuant-style compressed primitive beat dense GPU logits?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The short answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;it beat eager TurboQuant, but it did not beat dense FP16 logits by enough.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Exact weighted value decode was mathematically clean, but only improved &lt;code&gt;value_decode_sec&lt;/code&gt; by about &lt;code&gt;2.9%&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A fused packed-codebook logits kernel removed most eager overhead and beat eager TurboQuant main logits by &lt;code&gt;7x&lt;/code&gt; to &lt;code&gt;18x&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It still missed the dense FP16 logits gate: the best K0.2 result was &lt;code&gt;1.56x&lt;/code&gt; at 8192 and &lt;code&gt;1.99x&lt;/code&gt; at 16384, where the gate required &lt;code&gt;&amp;gt;=2x&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The lesson was strict: beating your own eager prototype is not enough. The compressed path must beat the real dense baseline with room left for softmax, values, residuals, and updates.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evidence
&lt;/h2&gt;

&lt;p&gt;I put the detailed benchmark notes in the public evidence repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Results ledger: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Exact reference value-decode rewrite: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/exact-reference-value-decode-summary.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/exact-reference-value-decode-summary.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Primitive feasibility probe: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/primitive-feasibility-summary.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/primitive-feasibility-summary.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fused packed-codebook proof: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/fused-kernel-proof-summary.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/fused-kernel-proof-summary.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The question after eager failed
&lt;/h2&gt;

&lt;p&gt;After the eager value-path failure, there were two possible interpretations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The compressed-attention idea was weak.&lt;/li&gt;
&lt;li&gt;The eager implementation level was weak.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those are different claims, so they needed different tests.&lt;/p&gt;

&lt;p&gt;The first test was exact cleanup: remove algebraic waste without changing the algorithm. If that produced a large win, I could keep improving the stable path.&lt;/p&gt;

&lt;p&gt;It did not.&lt;/p&gt;

&lt;p&gt;The second test was a primitive upper bound: remove most eager overhead and ask whether the compressed representation could beat dense GPU logits before adding softmax, values, residuals, and model integration.&lt;/p&gt;

&lt;p&gt;That is why the fused proof was intentionally narrow. A narrow upper bound is useful because it can kill a bad integration path before the integration work starts.&lt;/p&gt;

&lt;p&gt;So the next question was narrower:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If I remove the obvious eager overhead, can the compressed representation beat dense attention primitives by enough to justify real integration?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer was still no.&lt;/p&gt;

&lt;p&gt;But the reason is more interesting than "the kernel was slow."&lt;/p&gt;

&lt;p&gt;The fused path got much faster than eager TurboQuant. It just did not get fast enough versus dense GPU logits.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, an exact cleanup
&lt;/h2&gt;

&lt;p&gt;Before kernel work, there was one exact mathematical cleanup to test.&lt;/p&gt;

&lt;p&gt;The stable compressed-key baseline had a value path that effectively did this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;decoded_values_t = R^-1(z_t)
o = sum_t a_t decoded_values_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;z_t&lt;/code&gt; is the value representation in rotated space.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;R^-1&lt;/code&gt; is the inverse rotation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;a_t&lt;/code&gt; is the attention weight for token &lt;code&gt;t&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;o&lt;/code&gt; is the weighted value output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because &lt;code&gt;R^-1&lt;/code&gt; is linear:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum_t a_t R^-1(z_t) = R^-1(sum_t a_t z_t)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So I can first compute the weighted sum in rotated space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z_weighted = sum_t a_t z_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and then inverse-rotate once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o = R^-1(z_weighted)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a heuristic.&lt;/p&gt;

&lt;p&gt;It is exactly equivalent to the existing codec math, up to normal floating-point accumulation details.&lt;/p&gt;
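&lt;p&gt;The identity is cheap to verify numerically with a random orthogonal rotation (a standalone sanity check, not the project's codec):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 64, 128
R = np.linalg.qr(rng.standard_normal((d, d)))[0]  # random orthogonal rotation
R_inv = R.T
z = rng.standard_normal((T, d))                   # rotated-space values z_t
a = rng.random(T)
a = a / a.sum()                                   # attention weights

per_token = (a[:, None] * (z @ R_inv.T)).sum(axis=0)  # sum_t a_t R^-1(z_t)
fused = (a @ z) @ R_inv.T                             # R^-1(sum_t a_t z_t)
assert np.allclose(per_token, fused)
```

&lt;p&gt;Orthogonality is not even required here; any linear &lt;code&gt;R^-1&lt;/code&gt; commutes with the weighted sum.&lt;/p&gt;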

&lt;p&gt;For a decode step, this changes the rough output-side rotation count from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;H_kv * T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;H_kv * G * Q
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;H_kv&lt;/code&gt; is the number of key/value heads.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T&lt;/code&gt; is the historical context length.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;G&lt;/code&gt; is the number of query groups per KV head.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Q&lt;/code&gt; is the number of query positions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a single-token decode with long history, that is a large algebraic reduction.&lt;/p&gt;
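&lt;p&gt;As a hedged illustration of the scale of that reduction, here is the arithmetic with assumed Llama-70B-style GQA numbers (the shapes are illustrative, not measured config values):&lt;/p&gt;

```python
# Rough output-side rotation counts for one decode step.
H_kv = 8       # key/value heads (assumed GQA shape)
T = 16384      # historical context length
G = 8          # query groups per KV head
Q = 1          # single-token decode

before = H_kv * T      # one inverse rotation per (kv head, history token)
after = H_kv * G * Q   # one inverse rotation per (kv head, group, query)

print(before, after, before // after)  # 131072 64 2048
```

&lt;p&gt;A &lt;code&gt;2048x&lt;/code&gt; reduction in rotation count on paper, which is why the cleanup looked promising before the benchmark said otherwise.&lt;/p&gt;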

&lt;p&gt;It was worth doing.&lt;/p&gt;

&lt;p&gt;It was not enough.&lt;/p&gt;

&lt;p&gt;The cleanup passed correctness and made the value decode path slightly cleaner, but it did not become a real latency win. In the focused benchmark, &lt;code&gt;value_decode_sec&lt;/code&gt; improved by only about &lt;code&gt;2.9%&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;exact algebraic cleanup is good engineering, but it is not automatically a product-level speedup.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After that, the only serious TurboQuant-family latency question left was a primitive question.&lt;/p&gt;

&lt;h2&gt;
  
  
  The primitive question
&lt;/h2&gt;

&lt;p&gt;The dense key-logit computation is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = Q K^T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For decode, this means comparing the current query against all historical keys.&lt;/p&gt;

&lt;p&gt;The compressed-key hope is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L ~= compressed_logits(Q, codes(K), scales(K), residuals(K))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;without materializing full dense historical keys.&lt;/p&gt;

&lt;p&gt;This is the part that sounds like the headline TurboQuant promise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;store fewer key bytes&lt;/li&gt;
&lt;li&gt;compute attention logits from the compressed representation&lt;/li&gt;
&lt;li&gt;avoid dense historical key reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the compressed representation has its own costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unpacking low-bit codes&lt;/li&gt;
&lt;li&gt;codebook lookup&lt;/li&gt;
&lt;li&gt;radius or scale multiplication&lt;/li&gt;
&lt;li&gt;residual correction&lt;/li&gt;
&lt;li&gt;query rotation or transformed query math&lt;/li&gt;
&lt;li&gt;extra metadata reads&lt;/li&gt;
&lt;/ul&gt;
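&lt;p&gt;A toy sketch of those per-key costs, assuming a hypothetical 16-entry scalar codebook and per-token scale (this is an illustration of the consumption steps, not the repo's actual codec):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
codebook = rng.standard_normal(16)   # hypothetical 16-entry scalar codebook

# One key stored as packed int4 codes plus a per-token scale ("radius").
codes = rng.integers(0, 16, size=d).astype(np.uint8)
packed = (codes[0::2] << 4) | codes[1::2]   # two 4-bit codes per byte
scale = 0.37

# Consuming the compressed key takes several steps before any math happens:
hi = packed >> 4
lo = packed & 0x0F
unpacked = np.empty(d, dtype=np.uint8)
unpacked[0::2] = hi
unpacked[1::2] = lo                    # 1) unpack low-bit codes
k_hat = codebook[unpacked] * scale     # 2) codebook lookup, 3) scale multiply

q = rng.standard_normal(d)
logit = q @ k_hat                      # only now the actual dot product
```

&lt;p&gt;Dense FP16 keys skip straight to the dot product. Every step before it is pure interpretation overhead.&lt;/p&gt;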

&lt;p&gt;So the real primitive question was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Is the compressed representation cheaper to consume than dense FP16 keys?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not "is it smaller?"&lt;/p&gt;

&lt;p&gt;Not "is it faster than my Python/eager version?"&lt;/p&gt;

&lt;p&gt;The target was dense GPU logits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eager primitive feasibility failed hard
&lt;/h2&gt;

&lt;p&gt;The first primitive feasibility check used synthetic large GQA-style shapes.&lt;/p&gt;

&lt;p&gt;It measured whether the current compressed primitive could compete with dense attention/logits at the operation level.&lt;/p&gt;

&lt;p&gt;The result was not close.&lt;/p&gt;

&lt;p&gt;The compressed reference primitive was roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;12x to 24x slower than dense
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And encode-one-token cost alone was around:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.65 ms to 0.84 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That killed the idea that more eager PyTorch variants would solve the problem.&lt;/p&gt;

&lt;p&gt;But it still left one fair objection:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Of course eager PyTorch lost. What if a fused kernel removes the unpack and codebook overhead?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That objection was valid.&lt;/p&gt;

&lt;p&gt;So I wrote the fused proof.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fused K0 proof
&lt;/h2&gt;

&lt;p&gt;The fused proof intentionally started small.&lt;/p&gt;

&lt;p&gt;It did not implement full attention.&lt;/p&gt;

&lt;p&gt;It did not implement values.&lt;/p&gt;

&lt;p&gt;It did not implement residual correction.&lt;/p&gt;

&lt;p&gt;It did not integrate into model generation.&lt;/p&gt;

&lt;p&gt;It only implemented packed-codebook main logits.&lt;/p&gt;

&lt;p&gt;That made it an upper-bound test.&lt;/p&gt;

&lt;p&gt;If main logits alone could not beat dense logits by enough, then full attention would not have enough room either.&lt;/p&gt;

&lt;p&gt;The kernel did:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;packed int4 unpack&lt;/li&gt;
&lt;li&gt;codebook lookup&lt;/li&gt;
&lt;li&gt;radius multiplication&lt;/li&gt;
&lt;li&gt;dot products against query groups&lt;/li&gt;
&lt;li&gt;output logits&lt;/li&gt;
&lt;/ul&gt;
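&lt;p&gt;A batched numpy reference for what the kernel fuses into one pass (toy shapes, hypothetical scalar codebook; the real kernel operates on tiles in Triton):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, G = 8, 64, 4                    # head dim, history length, query group size (toy)
codebook = rng.standard_normal(16)    # hypothetical scalar codebook
codes = rng.integers(0, 16, size=(T, d)).astype(np.uint8)
packed = (codes[:, 0::2] << 4) | codes[:, 1::2]
radius = rng.random(T) + 0.5          # per-token scale
q = rng.standard_normal((G, d))       # query group

# The fused kernel's work, as one reference pass:
unpacked = np.empty((T, d), dtype=np.uint8)
unpacked[:, 0::2] = packed >> 4                 # packed int4 unpack
unpacked[:, 1::2] = packed & 0x0F
k_hat = codebook[unpacked] * radius[:, None]    # codebook lookup + radius multiply
logits = q @ k_hat.T                            # dot products -> (G, T) main logits
```

&lt;p&gt;The fused kernel does this without materializing &lt;code&gt;k_hat&lt;/code&gt; in global memory, which is where the win over the eager version comes from.&lt;/p&gt;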

&lt;p&gt;The benchmark compared:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dense logits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;against:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fused packed-codebook main logits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important target was not eager TurboQuant.&lt;/p&gt;

&lt;p&gt;The important target was dense logits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The kernel worked, but the path still failed
&lt;/h2&gt;

&lt;p&gt;The fused kernel did remove a lot of eager overhead.&lt;/p&gt;

&lt;p&gt;On the main-logits operation, it was roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;7x to 18x faster than eager TurboQuant main logits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a real engineering improvement.&lt;/p&gt;

&lt;p&gt;But the dense baseline was extremely strong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3bayl9q7avlrp6mu0ji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3bayl9q7avlrp6mu0ji.png" alt="Fused kernel proof: fused main logits beat eager TurboQuant but missed the dense-speed gate" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the &lt;code&gt;llama70b_gqa&lt;/code&gt; synthetic profile, the corrected &lt;code&gt;float16&lt;/code&gt; output K0.2 sweep looked like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Block Tokens&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Dense Logits&lt;/th&gt;
&lt;th&gt;Fused Main Logits&lt;/th&gt;
&lt;th&gt;Fused vs Dense&lt;/th&gt;
&lt;th&gt;Required&lt;/th&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;32&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8192&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.061 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.048 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.27x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;=2.0x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;32&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16384&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.093 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.049 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.90x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;=2.0x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;64&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8192&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.060 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.039 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.56x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;=2.0x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;64&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16384&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.092 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.046 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.99x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;=2.0x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;near miss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;128&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8192&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.058 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.108 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.54x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;=2.0x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;256&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8192&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.059 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.043 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.38x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;=2.0x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The best tile was &lt;code&gt;block_tokens=64&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It nearly reached &lt;code&gt;2x&lt;/code&gt; at &lt;code&gt;16k&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It did not clear the bar at &lt;code&gt;8k&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And this was still only main logits.&lt;/p&gt;

&lt;p&gt;Full attention would add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;online softmax&lt;/li&gt;
&lt;li&gt;value accumulation&lt;/li&gt;
&lt;li&gt;residual correction&lt;/li&gt;
&lt;li&gt;cache update cost&lt;/li&gt;
&lt;li&gt;model integration overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So a near miss on logits-only at &lt;code&gt;16k&lt;/code&gt; was not enough.&lt;/p&gt;
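&lt;p&gt;A hedged back-of-envelope makes the margin problem concrete. The logits timings are the &lt;code&gt;16k&lt;/code&gt; row from the table above; the downstream cost is an assumed illustrative number, not a measurement:&lt;/p&gt;

```python
# Why a ~2x logits-only result leaves too little room.
dense_logits = 0.092   # ms, dense logits at 16k (table above)
fused_logits = 0.046   # ms, fused main logits at 16k (table above)
extra = 0.030          # ms, ASSUMED shared cost: softmax + values + cache update

# Both paths still pay the downstream work, so the ratio shrinks.
dense_total = dense_logits + extra
fused_total = fused_logits + extra

print(round(dense_total / fused_total, 2))  # 1.61
```

&lt;p&gt;Any residual correction or integration overhead unique to the compressed path shrinks the ratio further, which is why the &lt;code&gt;&amp;gt;=2.0x&lt;/code&gt; logits-only gate existed in the first place.&lt;/p&gt;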

&lt;p&gt;The disciplined decision was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;stop the packed-codebook fused-kernel path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why dense logits were so hard to beat
&lt;/h2&gt;

&lt;p&gt;Dense FP16 logits have a boring shape, and that is exactly why they are strong.&lt;/p&gt;

&lt;p&gt;They use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;contiguous FP16 data&lt;/li&gt;
&lt;li&gt;regular tensor operations&lt;/li&gt;
&lt;li&gt;optimized GPU paths&lt;/li&gt;
&lt;li&gt;no unpacking&lt;/li&gt;
&lt;li&gt;no codebook lookup&lt;/li&gt;
&lt;li&gt;no token-specific scale reconstruction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The packed-codebook path stores fewer bytes, but those bytes are not immediately usable math.&lt;/p&gt;

&lt;p&gt;It has to be unpacked and interpreted.&lt;/p&gt;

&lt;p&gt;That interpretation cost is the whole fight.&lt;/p&gt;

&lt;p&gt;In my kernel, there was another practical issue: query-group padding.&lt;/p&gt;

&lt;p&gt;The kernel used &lt;code&gt;tl.dot&lt;/code&gt;, which wants a tile dimension of at least &lt;code&gt;16&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For GQA shapes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama70b_gqa&lt;/code&gt; had &lt;code&gt;G=8&lt;/code&gt;, so half the tile was padding.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama8b_gqa&lt;/code&gt; had &lt;code&gt;G=4&lt;/code&gt;, so three quarters of the tile was padding.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The lower-level kernel removed Python overhead, but it did not change the fact that the dense baseline had a cleaner GPU execution shape.&lt;/p&gt;
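&lt;p&gt;The padding cost is easy to quantify. Assuming the &lt;code&gt;tl.dot&lt;/code&gt; minimum tile dimension of &lt;code&gt;16&lt;/code&gt;:&lt;/p&gt;

```python
# Fraction of the query-side tile that is padding when G < 16.
MIN_TILE = 16

def padding_fraction(G: int) -> float:
    padded = max(G, MIN_TILE)  # tile is padded up to the minimum dot dimension
    return (padded - G) / padded

assert padding_fraction(8) == 0.5    # llama70b_gqa: half the tile is padding
assert padding_fraction(4) == 0.75   # llama8b_gqa: three quarters is padding
```

&lt;p&gt;Half to three quarters of every &lt;code&gt;tl.dot&lt;/code&gt; tile was wasted work the dense baseline never had to do.&lt;/p&gt;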

&lt;h2&gt;
  
  
  What about the TurboQuant claim?
&lt;/h2&gt;

&lt;p&gt;This is where benchmark language matters.&lt;/p&gt;

&lt;p&gt;The external TurboQuant claim is not the same as what I was trying to ship.&lt;/p&gt;

&lt;p&gt;For context, the public references are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Research blog: &lt;a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/" rel="noopener noreferrer"&gt;https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TurboQuant paper: &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2504.19874&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The public TurboQuant framing includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extreme KV/vector compression&lt;/li&gt;
&lt;li&gt;quality retention at low bits&lt;/li&gt;
&lt;li&gt;faster attention-logit computation under a specialized setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not identical to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;drop a cache implementation into Hugging Face generate()
and beat dense FP16/BF16 end-to-end decode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those are different bars.&lt;/p&gt;

&lt;p&gt;In particular:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;attention logits are only one part of full decode&lt;/li&gt;
&lt;li&gt;a primitive result does not include cache update, value path, model layers, or generation overhead&lt;/li&gt;
&lt;li&gt;comparing against FP32 unquantized keys is not the same as comparing against an optimized FP16/BF16 dense path&lt;/li&gt;
&lt;li&gt;H100/JAX-style specialized kernels are not the same environment as a general local &lt;code&gt;transformers&lt;/code&gt; fork&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So this work does not prove:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TurboQuant is wrong.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It proves:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;my packed-codebook TurboQuant-family path did not have enough room to become a full decode-latency win in this repo.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction is important.&lt;/p&gt;

&lt;p&gt;It is also why the failure was useful.&lt;/p&gt;

&lt;p&gt;It stopped me from turning a near-miss primitive into months of residual/value/model integration work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The reusable lesson
&lt;/h2&gt;

&lt;p&gt;The main lesson from this stage was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Beating your own unoptimized implementation is not enough.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A compressed path has to beat the real target baseline.&lt;/p&gt;

&lt;p&gt;For GPU decode, the real target baseline is dense optimized attention/logits, not the first eager prototype.&lt;/p&gt;

&lt;p&gt;The second lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A logits-only win needs enough margin to pay for the rest of attention.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the best logits-only result barely reaches the required threshold at one context and misses at another, full attention is not going to improve the situation for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this left the project
&lt;/h2&gt;

&lt;p&gt;After this stage, the packed-codebook TurboQuant latency path was closed.&lt;/p&gt;

&lt;p&gt;Not because compression was fake.&lt;/p&gt;

&lt;p&gt;Not because the math was useless.&lt;/p&gt;

&lt;p&gt;Because the compressed representation was not cheap enough to consume compared with dense FP16/BF16 GPU math.&lt;/p&gt;

&lt;p&gt;That left two possible pivots:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Try a more hardware-friendly KV representation.&lt;/li&gt;
&lt;li&gt;Stop compressing values and instead reduce dense attention work.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those became the final experiments.&lt;/p&gt;

&lt;p&gt;They also had to pass the same discipline: bytes, speed ceilings, and quality are separate gates.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;If you only remember one thing from this post, it should be this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A compressed kernel has to beat the real dense baseline, not just the unoptimized compressed prototype.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The fused logits kernel was useful because it answered that question before I spent time on residual correction, value accumulation, and model integration.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>gpu</category>
      <category>research</category>
      <category>transformers</category>
    </item>
    <item>
      <title>When A Good Approximation Still Loses</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sun, 26 Apr 2026 07:29:57 +0000</pubDate>
      <link>https://forem.com/alankritverma/when-a-good-approximation-still-loses-29c7</link>
      <guid>https://forem.com/alankritverma/when-a-good-approximation-still-loses-29c7</guid>
      <description>&lt;p&gt;I wanted to answer one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why did a mathematically reasonable value approximation still fail as a runtime optimization?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The short answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;active fraction is not runtime.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I tried to make historical value mixing cheaper after compressed-key attention was stable.&lt;/li&gt;
&lt;li&gt;Chunk summaries were cheap because they removed information; active chunks had a plausible error argument but terrible eager runtime.&lt;/li&gt;
&lt;li&gt;The final vectorized eager path kept active fraction low, but value decode got worse: &lt;code&gt;11.906 ms&lt;/code&gt; became about &lt;code&gt;26 ms&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The lesson was concrete: active token fraction is not runtime. Count gathers, decodes, reductions, scatters, and bookkeeping.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evidence
&lt;/h2&gt;

&lt;p&gt;I put the detailed benchmark notes in the public evidence repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Results ledger: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Active-chunk postmortem: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/active-chunk-background-postmortem.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/active-chunk-background-postmortem.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Final vectorized eager spike summary: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/spike-a-vectorized-active-background-summary.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/spike-a-vectorized-active-background-summary.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  My value-side bet
&lt;/h2&gt;

&lt;p&gt;My working hypothesis was not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;randomly approximate values and hope generation survives.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was more structured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Keep the compressed-key path stable.&lt;/li&gt;
&lt;li&gt;Preserve recent and sink values exactly because they are high-risk.&lt;/li&gt;
&lt;li&gt;Summarize low-risk historical values only if the quality loss is bounded.&lt;/li&gt;
&lt;li&gt;Select active historical chunks only when the approximation risk is high.&lt;/li&gt;
&lt;li&gt;Require the implementation to reduce real hot-path work, not just theoretical token count.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The mistake was that step &lt;code&gt;5&lt;/code&gt; was weaker than steps &lt;code&gt;1-4&lt;/code&gt; for too long.&lt;/p&gt;

&lt;p&gt;That is why this post is useful beyond this specific cache implementation. It is a case study in the difference between:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a reasonable approximation argument&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a runtime shape that the GPU and framework can execute cheaply.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a smaller KV cache is not automatically a faster attention path&lt;/li&gt;
&lt;li&gt;compressed keys can reduce part of the problem&lt;/li&gt;
&lt;li&gt;values still have to be mixed across history&lt;/li&gt;
&lt;li&gt;that value path became the bottleneck&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The earlier architecture lesson was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A smaller KV cache is not automatically a faster attention path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That pushed me toward compressed-attention execution instead of storage-only cache compression.&lt;/p&gt;

&lt;p&gt;I built a stable compressed-key baseline. It was not fast enough to be the final answer, but it was coherent. It gave me a way to separate the key side from the value side.&lt;/p&gt;

&lt;p&gt;This distinction matters for the rest of the post:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The experiments below are not a verdict on the official TurboQuant paper or every possible fused implementation. They are a verdict on the eager value-path family I built in this &lt;code&gt;transformers&lt;/code&gt; fork.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then the real problem became clear:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;even with compressed keys, the model still has to mix historical values.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The rest of this piece follows the experiments that tried to make that value path cheaper.&lt;/p&gt;

&lt;p&gt;None became the final answer.&lt;/p&gt;

&lt;p&gt;But each one taught something useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The value-side problem
&lt;/h2&gt;

&lt;p&gt;After attention weights are computed, the model still needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o = sum_t a_t v_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expensive part is that &lt;code&gt;t&lt;/code&gt; ranges over the history.&lt;/p&gt;

&lt;p&gt;If the implementation still decodes or processes most historical values every step, then it has not really escaped the long-context cost.&lt;/p&gt;

&lt;p&gt;So the value-path goal was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;keep enough value information for quality, but stop paying full historical value cost everywhere.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The hard part is that "enough" is query-dependent, head-dependent, and quality-sensitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  What survived early: exact recent and sink values
&lt;/h2&gt;

&lt;p&gt;One value-side ingredient consistently made sense:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;keep a small exact window for the most recent tokens and a few sink tokens.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Internally this was called &lt;code&gt;exact_recent_sink&lt;/code&gt;. The reader-facing name is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;exact recent/sink value window&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The reason is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;recent tokens often matter a lot&lt;/li&gt;
&lt;li&gt;sink tokens can have special attention behavior&lt;/li&gt;
&lt;li&gt;preserving them exactly helps stability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was not a speed breakthrough in eager execution, but it was the only value-side ingredient that kept looking defensible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What failed early: more hybrid logic
&lt;/h2&gt;

&lt;p&gt;The first broad hybrid value experiment mixed several ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exact recent/sink windows&lt;/li&gt;
&lt;li&gt;delayed value quantization&lt;/li&gt;
&lt;li&gt;saliency bookkeeping&lt;/li&gt;
&lt;li&gt;anchors&lt;/li&gt;
&lt;li&gt;selective residual logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Internally this was &lt;code&gt;legacy_hybrid_fast&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It lost.&lt;/p&gt;

&lt;p&gt;It added complexity without removing enough dominant work. It was slower than the stable compressed-key baseline and not cleanly better on fidelity.&lt;/p&gt;

&lt;p&gt;The reusable lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;hybrid logic must remove dominant work, not just add selective corrections.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I also tested saliency-heavy background variants and anchors.&lt;/p&gt;

&lt;p&gt;Those did not become forward paths either.&lt;/p&gt;

&lt;p&gt;The pattern was the same:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more bookkeeping&lt;/li&gt;
&lt;li&gt;more branches&lt;/li&gt;
&lt;li&gt;unclear runtime win&lt;/li&gt;
&lt;li&gt;not enough quality/runtime payoff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These branches are part of the work, but they do not each need their own long section. Their shared lesson was the same, and the later chunk experiments explain it more cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chunk summaries were fast because they were wrong
&lt;/h2&gt;

&lt;p&gt;The next idea was to summarize historical values by chunks.&lt;/p&gt;

&lt;p&gt;For each chunk &lt;code&gt;C_j&lt;/code&gt;, define a mean value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mu_j = (1 / c) sum_(t in C_j) v_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where &lt;code&gt;c&lt;/code&gt; is the chunk size.&lt;/p&gt;

&lt;p&gt;And define the attention mass on that chunk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M_j = sum_(t in C_j) a_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then approximate the chunk contribution as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M_j mu_j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum_(t in C_j) a_t v_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is attractive because it replaces many token-level value contributions with one chunk-level contribution.&lt;/p&gt;
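&lt;p&gt;A small numpy sketch shows both the appeal and the failure mode (toy shapes; the error behavior, not the numbers, is the point):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(3)
c, d = 32, 8                      # chunk size, head dim (toy)
a = rng.random(c); a /= a.sum()   # attention weights inside one chunk
v = rng.standard_normal((c, d))   # chunk values

exact = a @ v           # sum_t a_t v_t
M = a.sum()             # chunk attention mass M_j
mu = v.mean(axis=0)     # chunk mean value mu_j
approx = M * mu         # one contribution instead of c of them

err = np.linalg.norm(exact - approx)   # nonzero whenever values spread

# The approximation is exact only for a zero-spread chunk:
v_flat = np.ones((c, d))
assert np.allclose(a @ v_flat, a.sum() * v_flat.mean(axis=0))
```

&lt;p&gt;The summary is exact when the chunk's values are identical and degrades as their spread grows, which is precisely the information the blanket version threw away.&lt;/p&gt;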

&lt;p&gt;But the blanket summary version lost too much information.&lt;/p&gt;

&lt;p&gt;It could be faster, but it was fast for the wrong reason: it threw away too much of the information the task actually needed.&lt;/p&gt;

&lt;p&gt;Reusable lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a speedup that comes from destroying quality is not an optimization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Active chunks had better math and terrible runtime
&lt;/h2&gt;

&lt;p&gt;The next idea tried to keep the good part of chunk summaries without summarizing everything.&lt;/p&gt;

&lt;p&gt;Instead of treating every historical chunk the same:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keep important chunks active and exact&lt;/li&gt;
&lt;li&gt;summarize the inactive chunks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For inactive chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o_j_hat = M_j mu_j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For active chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o_j = sum_(t in C_j) a_t v_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To choose active chunks, I used a score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score_j = mass_j * spread_j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The intuition was reasonable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;high attention mass means the chunk matters&lt;/li&gt;
&lt;li&gt;high value spread means the mean may be a bad summary&lt;/li&gt;
&lt;li&gt;high &lt;code&gt;mass * spread&lt;/code&gt; means the chunk is risky to approximate&lt;/li&gt;
&lt;/ul&gt;
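&lt;p&gt;A minimal sketch of that selection rule, with an assumed active-chunk budget (toy shapes; &lt;code&gt;spread&lt;/code&gt; here is a plain standard deviation standing in for whatever spread statistic the real selector used):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(4)
n_chunks, c, d = 8, 16, 4
a = rng.random((n_chunks, c)); a /= a.sum()   # attention weights per chunk
v = rng.standard_normal((n_chunks, c, d))     # values per chunk

mass = a.sum(axis=1)          # M_j: attention mass per chunk
spread = v.std(axis=(1, 2))   # how badly a mean would summarize the chunk
score = mass * spread         # risk of approximating the chunk

k = 2                                # assumed active-chunk budget
active = np.argsort(score)[-k:]      # keep the riskiest chunks exact
```

&lt;p&gt;The scoring itself is cheap. The expensive part, as the benchmark showed, was everything done per selected chunk afterwards.&lt;/p&gt;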

&lt;p&gt;This was a real approximation argument.&lt;/p&gt;

&lt;p&gt;It was not a runtime argument.&lt;/p&gt;

&lt;p&gt;The eager implementation was disastrous.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Decode 2048&lt;/th&gt;
&lt;th&gt;Memory Context 2048&lt;/th&gt;
&lt;th&gt;Long Next-Token MSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;stable compressed-key baseline&lt;/td&gt;
&lt;td&gt;&lt;code&gt;54.607 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;13.717 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.349&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;exact recent/sink window&lt;/td&gt;
&lt;td&gt;&lt;code&gt;70.205 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;14.485 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.326&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;active-chunk value approximation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1148.843 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;311.692 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.494&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa08rv0uvnpp23o10llv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa08rv0uvnpp23o10llv7.png" alt="Active chunk approximation runtime blow-up" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The profile explained the failure.&lt;/p&gt;

&lt;p&gt;The active-chunk implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;selected active chunks&lt;/li&gt;
&lt;li&gt;looped over those chunks in Python&lt;/li&gt;
&lt;li&gt;decoded tiny slices repeatedly&lt;/li&gt;
&lt;li&gt;scattered small contributions back repeatedly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At decode &lt;code&gt;2048&lt;/code&gt;, value decode time went from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;9.080 ms -&amp;gt; 1079.421 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is not a small miss. That is an implementation shape failure.&lt;/p&gt;

&lt;p&gt;The major lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I had proved an approximation bound, not a runtime win.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The process lesson was just as important:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before a large run, every approximation needs a written hot-path shape: how many gathers, decodes, reductions, scatters, and Python-level loops are actually in the decode step.&lt;/p&gt;
&lt;/blockquote&gt;
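&lt;p&gt;That discipline can be as simple as a counting function. The &lt;code&gt;hot_path_ops&lt;/code&gt; helper below is hypothetical, and its per-chunk accounting is an assumption, but it captures why the Python-looped shape exploded:&lt;/p&gt;

```python
# Count per-decode-step operations implied by an approximation BEFORE benchmarking it.
def hot_path_ops(active_chunks: int, python_looped: bool) -> dict:
    # Assumed accounting: each active chunk needs one gather, one slice decode,
    # one reduction, and one scatter back into the output.
    per_chunk = {"gathers": 1, "decodes": 1, "reductions": 1, "scatters": 1}
    ops = {k: v * active_chunks for k, v in per_chunk.items()}
    # A Python loop launches every one of these as its own tiny kernel;
    # a vectorized path batches each kind into a single launch.
    ops["kernel_launches"] = sum(ops.values()) if python_looped else len(per_chunk)
    return ops

looped = hot_path_ops(active_chunks=64, python_looped=True)
vectorized = hot_path_ops(active_chunks=64, python_looped=False)
print(looped["kernel_launches"], vectorized["kernel_launches"])  # 256 4
```

&lt;p&gt;Same math, same active fraction, two orders of magnitude apart in launch count. That is the gap the profile exposed.&lt;/p&gt;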

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtqtsvw3rvvyvq1q9mai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtqtsvw3rvvyvq1q9mai.png" alt="Experiment timeline" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The final eager test: vectorize active chunks
&lt;/h2&gt;

&lt;p&gt;After that failure, the next question was narrow:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Was the active-chunk idea bad, or was the eager implementation shape bad?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I ran one final eager viability test.&lt;/p&gt;

&lt;p&gt;Internal name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectorized_active_background
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reader-facing name:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;final vectorized active-background value test&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The goal was to keep the same compressed-key baseline and redesign only historical value participation.&lt;/p&gt;

&lt;p&gt;The design used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exact sink tokens&lt;/li&gt;
&lt;li&gt;exact recent tokens&lt;/li&gt;
&lt;li&gt;a dense staging buffer for values that had aged out of the recent window&lt;/li&gt;
&lt;li&gt;fixed-size historical chunks&lt;/li&gt;
&lt;li&gt;active exact chunks selected by attention mass and spread&lt;/li&gt;
&lt;li&gt;inactive chunk summaries&lt;/li&gt;
&lt;li&gt;vectorized gather/decode/reduction instead of Python loops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The intended output decomposition was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o = o_sink + o_recent + o_staging + o_history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For historical chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o_history ~= sum_(j in A) sum_(t in C_j) a_t v_t
           + sum_(j not in A) M_j mu_j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where &lt;code&gt;A&lt;/code&gt; is the active chunk set.&lt;/p&gt;
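&lt;p&gt;The decomposition above can be sketched numerically. The chunk size, active count, and tensor shapes below are toy values, not the real configuration, and the per-chunk loop is written for clarity; the actual test vectorized the gather/decode/reduce steps:&lt;/p&gt;

```python
import numpy as np

# Active-background value decomposition: exact mixing for active chunks,
# summary M_j * mu_j (attention mass times mean value) for inactive chunks.
rng = np.random.default_rng(0)
T, d, chunk = 64, 8, 16
a = rng.random(T)
a /= a.sum()                            # attention weights over history
V = rng.standard_normal((T, d))         # historical values

n_chunks = T // chunk
mass = a.reshape(n_chunks, chunk).sum(axis=1)   # attention mass per chunk
active = set(np.argsort(mass)[-2:])             # top chunks by mass

o = np.zeros(d)
for j in range(n_chunks):
    sl = slice(j * chunk, (j + 1) * chunk)
    if j in active:
        o += a[sl] @ V[sl]                      # exact contribution
    else:
        o += mass[j] * V[sl].mean(axis=0)       # summary: M_j * mu_j

o_exact = a @ V                                 # dense reference for comparison
```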

&lt;p&gt;This test had a hard benchmark gate.&lt;/p&gt;

&lt;p&gt;It needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;at least &lt;code&gt;20%&lt;/code&gt; better decode step latency than the stable compressed-key baseline&lt;/li&gt;
&lt;li&gt;at least &lt;code&gt;40%&lt;/code&gt; lower value-decode time&lt;/li&gt;
&lt;li&gt;active token fraction at most &lt;code&gt;0.25&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;at least &lt;code&gt;20%&lt;/code&gt; better long-context latency&lt;/li&gt;
&lt;li&gt;no material fidelity collapse&lt;/li&gt;
&lt;/ul&gt;
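&lt;p&gt;The speed gates above can be written as explicit checks. The thresholds come from the gate list; the metric names are illustrative, and the fidelity gate is omitted here for brevity. Plugging in the real baseline and the first candidate from the tables below shows how the gate fires:&lt;/p&gt;

```python
# Hard benchmark gate as explicit checks: return the list of failed gate
# names (an empty list means the candidate passes).

def gate_failures(m, base):
    fails = []
    if m["decode_ms"] > 0.80 * base["decode_ms"]:
        fails.append("decode 20% faster")
    if m["value_decode_ms"] > 0.60 * base["value_decode_ms"]:
        fails.append("value decode 40% lower")
    if m["active_fraction"] > 0.25:
        fails.append("active fraction at most 0.25")
    if m["long_context_s"] > 0.80 * base["long_context_s"]:
        fails.append("long context 20% faster")
    return fails

baseline = {"decode_ms": 70.604, "value_decode_ms": 11.906, "long_context_s": 14.212}
candidate = {"decode_ms": 68.726, "value_decode_ms": 26.461,
             "active_fraction": 0.125, "long_context_s": 17.756}
failed = gate_failures(candidate, baseline)  # three speed gates fail
```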

&lt;p&gt;The benchmark grid was intentionally small:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reader Label&lt;/th&gt;
&lt;th&gt;Summary Chunk Size&lt;/th&gt;
&lt;th&gt;Active Chunk Ratio&lt;/th&gt;
&lt;th&gt;Min Active Chunks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chunk 16, 12.5% active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.125&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chunk 16, 25% active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.25&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chunk 32, 12.5% active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;32&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.125&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This was not a sweep for the sake of sweeping. It was a falsification test.&lt;/p&gt;

&lt;h2&gt;
  
  
  The result: active fraction was not runtime
&lt;/h2&gt;

&lt;p&gt;The stable compressed-key baseline for this run was:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decode &lt;code&gt;2048&lt;/code&gt; step latency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;70.604 ms&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode &lt;code&gt;2048&lt;/code&gt; value-decode time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;11.906 ms&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefill &lt;code&gt;2048&lt;/code&gt; latency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;58.128 ms&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-context latency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;14.212 s&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fidelity MSE vs dense baseline&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.2326&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fidelity top-k overlap&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The candidates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reader Label&lt;/th&gt;
&lt;th&gt;Decode Step&lt;/th&gt;
&lt;th&gt;Decode Improvement vs Baseline&lt;/th&gt;
&lt;th&gt;Value Decode&lt;/th&gt;
&lt;th&gt;Active Token Fraction&lt;/th&gt;
&lt;th&gt;Long-Context Latency&lt;/th&gt;
&lt;th&gt;Fidelity MSE&lt;/th&gt;
&lt;th&gt;Top-k&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chunk 16, 12.5% active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;68.726 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+2.7%&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;26.461 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.125&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;17.756 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.3115&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chunk 16, 25% active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;68.768 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+2.6%&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;26.149 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.25&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;17.884 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.3314&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.7&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chunk 32, 12.5% active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;68.386 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+3.1%&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;26.341 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.1333&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;17.780 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2.1598&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.4&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No configuration passed.&lt;/p&gt;

&lt;p&gt;The active token fraction stayed within budget.&lt;/p&gt;

&lt;p&gt;But the actual value-decode time got much worse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;11.906 ms -&amp;gt; about 26 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Long-context latency got about &lt;code&gt;25%&lt;/code&gt; worse.&lt;/p&gt;

&lt;p&gt;Fidelity regressed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugdzn49it1en04twa081.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugdzn49it1en04twa081.png" alt="Final eager gate results" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;active token fraction is not runtime.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The runtime is closer to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total_cost = selection
           + gather
           + active_decode
           + active_reduce
           + inactive_reduce
           + bookkeeping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total_cost = active_fraction * dense_cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final vectorized eager path removed the catastrophic Python loop, but it still did not produce a frontier-moving result.&lt;/p&gt;
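&lt;p&gt;The two cost models above can be compared with numbers. The per-component times below are purely illustrative, chosen only to show why the naive active-fraction model and the component model disagree so badly:&lt;/p&gt;

```python
# Naive model: runtime scales with active fraction alone.
def naive_cost(active_fraction, dense_cost_ms):
    return active_fraction * dense_cost_ms

# Component model: runtime is the sum of the real hot-path pieces.
def component_cost(c):
    return (c["selection"] + c["gather"] + c["active_decode"]
            + c["active_reduce"] + c["inactive_reduce"] + c["bookkeeping"])

dense_value_ms = 11.906           # baseline value-decode time from the post
components = {"selection": 3.0, "gather": 6.0, "active_decode": 8.0,
              "active_reduce": 4.0, "inactive_reduce": 3.0, "bookkeeping": 2.0}

print(naive_cost(0.125, dense_value_ms))   # about 1.5 ms: far too optimistic
print(component_cost(components))          # 26.0 ms: the shape of the observed result
```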

&lt;p&gt;So the disciplined decision was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;freeze eager variants in this compressed-attention family.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The test sequence was deliberately not broad:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;focused correctness tests&lt;/li&gt;
&lt;li&gt;decode at roughly 2048 prompt tokens&lt;/li&gt;
&lt;li&gt;prefill at roughly 2048 prompt tokens&lt;/li&gt;
&lt;li&gt;long-context memory/latency at roughly 2048 prompt tokens&lt;/li&gt;
&lt;li&gt;next-token logit fidelity&lt;/li&gt;
&lt;li&gt;profile counters for active gather/decode/reduce, inactive summary work, and staging flushes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That matters because this was a kill-gate, not a hyperparameter search.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this does and does not prove
&lt;/h2&gt;

&lt;p&gt;It does prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this eager value-path family should not keep getting new variants&lt;/li&gt;
&lt;li&gt;vectorizing a bad workload is not automatically enough&lt;/li&gt;
&lt;li&gt;runtime-cost modeling must happen before broad benchmarks&lt;/li&gt;
&lt;li&gt;quality gates matter as much as speed gates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does not prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;all compressed attention is useless&lt;/li&gt;
&lt;li&gt;TurboQuant-style key ideas are worthless&lt;/li&gt;
&lt;li&gt;lower-level fused implementations cannot work&lt;/li&gt;
&lt;li&gt;that the official TurboQuant result is invalid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The correct conclusion is narrower:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the eager implementation level failed for this family.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is still a useful result. It prevents the next round of work from being "one more eager variant with one more heuristic."&lt;/p&gt;

&lt;h2&gt;
  
  
  What came next
&lt;/h2&gt;

&lt;p&gt;At this point, the eager value-path family was frozen.&lt;/p&gt;

&lt;p&gt;But that did not yet answer a lower-level question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Was the compressed-attention idea weak, or was the eager implementation level weak?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before doing kernel work, there was one exact cleanup worth proving.&lt;/p&gt;

&lt;p&gt;The stable compressed-key baseline currently reconstructs rotated value information in a way that appears algebraically wasteful.&lt;/p&gt;

&lt;p&gt;Current form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;decoded_values_t = R^-1(z_t)
o = sum_t a_t decoded_values_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because &lt;code&gt;R^-1&lt;/code&gt; is linear:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum_t a_t R^-1(z_t) = R^-1(sum_t a_t z_t)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So instead of inverse-rotating every historical token value, I can first compute the weighted sum in rotated space and inverse-rotate once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z_weighted = sum_t a_t z_t
o = R^-1(z_weighted)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exact relative to the current value codec.&lt;/p&gt;
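&lt;p&gt;The linearity identity can be checked numerically. Here a random orthogonal matrix stands in for the value rotation, and the shapes are toy values:&lt;/p&gt;

```python
import numpy as np

# Check: inverse-rotating every token value and then mixing equals mixing in
# rotated space and inverse-rotating once.
rng = np.random.default_rng(1)
T, d = 32, 16
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthogonal rotation R
R_inv = R.T                                       # inverse of an orthogonal matrix
a = rng.random(T)
a /= a.sum()                                      # attention weights a_t
Z = rng.standard_normal((T, d))                   # rotated-space values z_t

per_token = sum(a[t] * (R_inv @ Z[t]) for t in range(T))  # T inverse rotations
once = R_inv @ (Z.T @ a)                                  # one inverse rotation
```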

&lt;p&gt;It is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a new heuristic&lt;/li&gt;
&lt;li&gt;a new approximation&lt;/li&gt;
&lt;li&gt;another active-background branch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a baseline cleanup.&lt;/p&gt;

&lt;p&gt;For SmolLM2-135M at &lt;code&gt;T = 2048&lt;/code&gt;, the rough inverse-rotation count changes from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;H_kv * T = 3 * 2048 = 6144
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;H_kv * G * Q = 3 * 3 * 1 = 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;for the output rotations.&lt;/p&gt;

&lt;p&gt;Here &lt;code&gt;H_kv&lt;/code&gt; is the number of KV heads, &lt;code&gt;G&lt;/code&gt; is the number of query groups per KV head, and &lt;code&gt;Q&lt;/code&gt; is the number of query positions in a decode step.&lt;/p&gt;
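&lt;p&gt;As a sanity check of that arithmetic, using the SmolLM2-135M figures from the post (3 KV heads, 3 query groups per KV head, 1 query position per decode step, &lt;code&gt;T = 2048&lt;/code&gt;):&lt;/p&gt;

```python
# Rough inverse-rotation counts before and after the exact cleanup.

def per_token_rotations(h_kv, t):
    return h_kv * t                 # inverse-rotate every historical value

def per_output_rotations(h_kv, g, q):
    return h_kv * g * q             # inverse-rotate only the mixed outputs

before = per_token_rotations(3, 2048)   # 6144
after = per_output_rotations(3, 3, 1)   # 9
```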

&lt;p&gt;That is why it deserved a quick proof.&lt;/p&gt;

&lt;p&gt;If that exact cleanup did not move the profile enough, then the only remaining TurboQuant-family question was lower-level:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;can a fused/Triton/CUDA-style compressed-attention primitive beat dense attention by a large enough margin to justify integration?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a different proof level. It is no longer a cache-API experiment; it is a primitive benchmark against dense GPU attention/logits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lessons worth keeping
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Storage compression and runtime execution must be measured separately.&lt;/li&gt;
&lt;li&gt;A stable baseline is more valuable than a pile of incomparable branches.&lt;/li&gt;
&lt;li&gt;Every approximation needs a runtime proof obligation.&lt;/li&gt;
&lt;li&gt;Python loops in decode hot paths are presumed guilty until proven otherwise.&lt;/li&gt;
&lt;li&gt;Active fraction, cache size, and runtime are different metrics.&lt;/li&gt;
&lt;li&gt;A failed eager implementation does not automatically disprove the math.&lt;/li&gt;
&lt;li&gt;A failed hard gate should stop the branch, not invite endless tuning.&lt;/li&gt;
&lt;li&gt;The next proof should be exact if possible, primitive-level if necessary, and killed quickly if weak.&lt;/li&gt;
&lt;li&gt;Benchmark-only experiment names should stay separate from public API promises.&lt;/li&gt;
&lt;li&gt;Every serious run should emit enough configuration and profiling data to explain the result later.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;This value-path work did not produce a production-ready faster cache path.&lt;/p&gt;

&lt;p&gt;But it did produce a useful map:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exact recent/sink values were defensible but not enough&lt;/li&gt;
&lt;li&gt;summary-only background was fast because it lost too much information&lt;/li&gt;
&lt;li&gt;active chunks had a reasonable approximation argument and a bad hot path&lt;/li&gt;
&lt;li&gt;vectorization removed the catastrophic loop shape but still missed the gate&lt;/li&gt;
&lt;li&gt;active fraction, cache size, and runtime were different metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the part worth keeping.&lt;/p&gt;

&lt;p&gt;Not because the final result won. It did not.&lt;/p&gt;

&lt;p&gt;Because now I know what not to try again at the eager value-path level.&lt;/p&gt;

&lt;p&gt;The remaining question was lower-level:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if the compressed path only failed because eager execution was the wrong implementation level?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question needed a different kind of proof: a primitive-level comparison against dense GPU attention/logits, not another eager cache variant.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>performance</category>
      <category>research</category>
    </item>
    <item>
      <title>A Smaller KV Cache Did Not Make Transformers Faster</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sun, 26 Apr 2026 07:22:26 +0000</pubDate>
      <link>https://forem.com/alankritverma/a-smaller-kv-cache-did-not-make-transformers-faster-2j</link>
      <guid>https://forem.com/alankritverma/a-smaller-kv-cache-did-not-make-transformers-faster-2j</guid>
      <description>&lt;p&gt;Long-context generation makes the KV cache hard to ignore.&lt;/p&gt;

&lt;p&gt;I wanted to answer one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why can a KV cache become much smaller while generation gets slower?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The short answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;storage compression and attention execution are different problems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I measured KV-cache compression as a systems problem, not just a storage problem.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;quanto&lt;/code&gt; cut the cache footprint from &lt;code&gt;50.911 MiB&lt;/code&gt; to &lt;code&gt;0.913 MiB&lt;/code&gt;, but generation latency increased from &lt;code&gt;2.250 s&lt;/code&gt; to &lt;code&gt;3.912 s&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;That result was useful: it separated storage compression from execution compression.&lt;/li&gt;
&lt;li&gt;The rest of the work followed from that distinction. If attention still consumes dense tensors, smaller cache storage alone will not make decode faster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evidence
&lt;/h2&gt;

&lt;p&gt;I put the detailed benchmark notes in a public evidence repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public evidence repo: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Results ledger: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Baseline storage-vs-latency summary: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/pre-turboquant-quantized-cache-report.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/pre-turboquant-quantized-cache-report.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The first trap: storage is not execution
&lt;/h2&gt;

&lt;p&gt;The first hypothesis sounded reasonable:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;fewer cache bytes should mean faster generation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But that bundles two different claims together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The cache stores fewer bytes.&lt;/li&gt;
&lt;li&gt;The attention step does less work.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those are not the same claim.&lt;/p&gt;

&lt;p&gt;The better engineering question was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;does the attention hot path consume less work, or did I only store the same work in a smaller format?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the lens for the rest of the post.&lt;/p&gt;

&lt;p&gt;Every generated token reuses keys and values from previous tokens. As the context grows, those cached tensors grow with it. So the natural first idea is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Compress the KV cache, store fewer bytes, and get faster generation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I tested that idea while exploring TurboQuant-style cache compression in a Hugging Face &lt;code&gt;transformers&lt;/code&gt; fork.&lt;/p&gt;

&lt;p&gt;Important scope note:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is not a claim that the official TurboQuant research idea "does not work."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The external context is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Research introduced TurboQuant as a compression method for extreme KV-cache and vector compression: &lt;a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/" rel="noopener noreferrer"&gt;https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The TurboQuant paper describes an online vector quantization approach with residual correction for inner-product preservation: &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2504.19874&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hugging Face &lt;code&gt;transformers&lt;/code&gt; exposes several cache strategies, including dynamic and quantized caches: &lt;a href="https://huggingface.co/docs/transformers/en/kv_cache" rel="noopener noreferrer"&gt;https://huggingface.co/docs/transformers/en/kv_cache&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I tested was narrower:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can I make a TurboQuant-style compressed-attention path useful inside a local eager &lt;code&gt;transformers&lt;/code&gt; implementation?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first useful result was not that a particular backend won.&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Storage compression and attention execution are different problems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A cache can become dramatically smaller while generation gets slower.&lt;/p&gt;

&lt;p&gt;That single distinction changed the rest of the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mental model
&lt;/h2&gt;

&lt;p&gt;In decoder-only generation, each new token uses cached keys and values from previous tokens.&lt;/p&gt;

&lt;p&gt;Simplified for one attention head:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = softmax(q K^T)
o = a V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;q&lt;/code&gt; is the current query.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;K&lt;/code&gt; is the historical key cache.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V&lt;/code&gt; is the historical value cache.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;a&lt;/code&gt; is the attention distribution.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;o&lt;/code&gt; is the output contribution from history.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keys decide where to attend. Values provide the information that gets mixed.&lt;/p&gt;

&lt;p&gt;When context length grows, both &lt;code&gt;K&lt;/code&gt; and &lt;code&gt;V&lt;/code&gt; grow.&lt;/p&gt;

&lt;p&gt;So compression can target at least two different things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Store the cache in fewer bytes.&lt;/li&gt;
&lt;li&gt;Execute attention without reconstructing dense historical tensors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those sound related. In practice, they are different engineering targets.&lt;/p&gt;
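&lt;p&gt;The one-head decode step above can be sketched directly. Shapes are toy values, and the usual &lt;code&gt;1/sqrt(d)&lt;/code&gt; scaling, elided in the simplified formula, is included here:&lt;/p&gt;

```python
import numpy as np

# One decode step for a single attention head over a cached history.
rng = np.random.default_rng(2)
T, d = 128, 64
q = rng.standard_normal(d)            # current query
K = rng.standard_normal((T, d))       # historical key cache
V = rng.standard_normal((T, d))       # historical value cache

logits = K @ q / np.sqrt(d)           # scaled dot products against history
a = np.exp(logits - logits.max())
a /= a.sum()                          # softmax attention distribution
o = a @ V                             # output contribution from history
```

&lt;p&gt;Both &lt;code&gt;K @ q&lt;/code&gt; and &lt;code&gt;a @ V&lt;/code&gt; scale with &lt;code&gt;T&lt;/code&gt;, which is why both the key path and the value path matter.&lt;/p&gt;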

&lt;h2&gt;
  
  
  The first measurement
&lt;/h2&gt;

&lt;p&gt;I started with existing cache behavior in &lt;code&gt;transformers&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The baselines were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;DynamicCache&lt;/code&gt;: dense eager execution.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;quanto&lt;/code&gt;: a strong storage-compression baseline.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hqq&lt;/code&gt;: another quantized-cache baseline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmark below used &lt;code&gt;HuggingFaceTB/SmolLM2-135M-Instruct&lt;/code&gt; in a roughly 2048-token context generation case.&lt;/p&gt;

&lt;p&gt;I measured more than just stored bytes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generation latency&lt;/li&gt;
&lt;li&gt;stored cache footprint&lt;/li&gt;
&lt;li&gt;cache bytes per token&lt;/li&gt;
&lt;li&gt;sampled runtime memory&lt;/li&gt;
&lt;li&gt;whether generated outputs matched the dense baseline in simple cases&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;What It Represents&lt;/th&gt;
&lt;th&gt;Mean Latency&lt;/th&gt;
&lt;th&gt;Cache Footprint&lt;/th&gt;
&lt;th&gt;Cache Bytes / Token&lt;/th&gt;
&lt;th&gt;Runtime Delta Peak&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dynamic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;dense eager baseline&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2.250 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;50.911 MiB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;23040.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.102 GB&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;quanto&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;strong storage-compression baseline&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3.912 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.913 MiB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;413.3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.048 GB&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hqq&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;alternative quantized-cache baseline&lt;/td&gt;
&lt;td&gt;&lt;code&gt;9.770 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;19.133 MiB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8658.6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.040 GB&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The important row is &lt;code&gt;quanto&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It reduced stored cache footprint from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;50.911 MiB -&amp;gt; 0.913 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is an excellent cache-size result.&lt;/p&gt;

&lt;p&gt;But latency went from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.250 s -&amp;gt; 3.912 s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So cache storage got much smaller, while generation got slower.&lt;/p&gt;

&lt;p&gt;That is not a paradox. It shows what the backend is optimizing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why smaller storage did not mean faster attention
&lt;/h2&gt;

&lt;p&gt;The current generic quantized-cache shape in &lt;code&gt;transformers&lt;/code&gt; is roughly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Produce new dense keys and values.&lt;/li&gt;
&lt;li&gt;Quantize them for storage.&lt;/li&gt;
&lt;li&gt;Keep compressed tensors in the cache.&lt;/li&gt;
&lt;li&gt;Later dequantize cached tensors.&lt;/li&gt;
&lt;li&gt;Return dense keys and values to normal attention.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So the attention implementation still consumes dense tensors.&lt;/p&gt;
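&lt;p&gt;That five-step shape can be sketched with a toy per-tensor int8 scheme (the real backends use finer-grained quantization; this only shows the storage-versus-execution split):&lt;/p&gt;

```python
import numpy as np

# "Compressed storage + dense execution": values are quantized to int8 for
# storage, then dequantized back to dense float before attention, so the
# attention matmul does exactly the same amount of work.
rng = np.random.default_rng(3)
T, d = 256, 64
V = rng.standard_normal((T, d)).astype(np.float32)

scale = np.abs(V).max() / 127.0
V_q = np.round(V / scale).astype(np.int8)       # stored cache: 1 byte/element
V_dense = V_q.astype(np.float32) * scale        # reconstructed before attention

stored_bytes = V_q.nbytes                       # 4x smaller than V.nbytes
attended = rng.random(T).astype(np.float32) @ V_dense   # dense work, unchanged
```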

&lt;p&gt;That means the architecture is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;compressed storage + dense execution&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;compressed attention&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flccxpbclgygoftb3b8lt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flccxpbclgygoftb3b8lt.png" alt="Storage compression versus execution compression" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first design can save cache bytes.&lt;/p&gt;

&lt;p&gt;The second design is needed if the goal is to make attention itself faster.&lt;/p&gt;

&lt;p&gt;This distinction became the first real output of the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I still looked at compressed attention
&lt;/h2&gt;

&lt;p&gt;TurboQuant-style work was interesting because the bigger promise is not simply "store the KV cache with fewer bits."&lt;/p&gt;

&lt;p&gt;The stronger target is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;store historical keys in a compressed representation&lt;/li&gt;
&lt;li&gt;compute attention logits using that compressed representation&lt;/li&gt;
&lt;li&gt;avoid reconstructing every dense historical key each decode step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ordinary dense key path computes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logits_t = q . k_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;for every historical token &lt;code&gt;t&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The compressed-key target is closer to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logits_t ~= compressed_dot(q, code(k_t), residual(k_t))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;without materializing every full &lt;code&gt;k_t&lt;/code&gt;.&lt;/p&gt;
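&lt;p&gt;A toy version of that target: each key stores a codebook id plus a residual, &lt;code&gt;q&lt;/code&gt; is dotted with the small codebook once per step, and per-token logits come from a table lookup plus a residual correction rather than a reconstructed dense &lt;code&gt;k_t&lt;/code&gt;. The codebook size and full-precision residual here are illustrative simplifications, not the TurboQuant scheme itself:&lt;/p&gt;

```python
import numpy as np

# Compressed-key logit computation: table lookup + residual correction.
rng = np.random.default_rng(4)
T, d, C = 512, 32, 16
codebook = rng.standard_normal((C, d))
K = rng.standard_normal((T, d))

# Assign each key its nearest codebook entry; store id + residual.
codes = np.argmin(((K[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
residual = K - codebook[codes]          # stored at low precision in practice

q = rng.standard_normal(d)
table = codebook @ q                    # C dot products, once per decode step
logits = table[codes] + residual @ q    # per-token lookup + correction

logits_exact = K @ q                    # dense reference
```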

&lt;p&gt;That is an execution-path change.&lt;/p&gt;

&lt;p&gt;It requires a different shape than a normal storage-only &lt;code&gt;QuantizedCache&lt;/code&gt; backend.&lt;/p&gt;

&lt;p&gt;That is why the project became less about "add another cache backend" and more about "change what attention actually consumes."&lt;/p&gt;

&lt;h2&gt;
  
  
  The stable compressed-key baseline
&lt;/h2&gt;

&lt;p&gt;I built a stable compressed-key baseline to test that direction.&lt;/p&gt;

&lt;p&gt;Internally, I called it &lt;code&gt;reference&lt;/code&gt;. For a public reader, the better name is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the stable compressed-key baseline&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Its job was not to be the final optimized system. Its job was to prove that an end-to-end compressed-key attention path could exist in a Llama-style eager stack and provide a consistent comparison point for later experiments.&lt;/p&gt;

&lt;p&gt;It kept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compressed historical keys&lt;/li&gt;
&lt;li&gt;compressed-key attention-logit computation&lt;/li&gt;
&lt;li&gt;residual correction behavior&lt;/li&gt;
&lt;li&gt;a full value path so correctness and fidelity stayed interpretable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That baseline survived the project better than the later value-path experiments.&lt;/p&gt;

&lt;p&gt;The key lesson was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The compressed-key path was not where most failures came from.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The failures came from values.&lt;/p&gt;

&lt;p&gt;I also saw some directional evidence that compressed-key work might become more interesting as model/context size changes. But that evidence was not clean enough to be the headline result. The safe claim was narrower:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;keep the compressed-key baseline as an internal anchor, but do not call it the final system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why values became the hard part
&lt;/h2&gt;

&lt;p&gt;Attention has two major pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compute attention weights from keys.&lt;/li&gt;
&lt;li&gt;Mix values using those weights.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even if keys are compressed, the output still requires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o = sum_t a_t v_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the implementation still reconstructs or processes values across most of history, the value path remains expensive.&lt;/p&gt;

&lt;p&gt;That is exactly what happened.&lt;/p&gt;

&lt;p&gt;The project shifted from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can I compress the cache?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can I keep the compressed-key path and make historical value participation structurally cheaper?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question led to the second half of the work: multiple value-path approximations, most of which failed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;This is the architecture lesson that shaped the rest of the work.&lt;/p&gt;

&lt;p&gt;I learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Existing quantized cache backends can be very good at reducing stored cache footprint.&lt;/li&gt;
&lt;li&gt;Stored-cache size is not the same as runtime attention cost.&lt;/li&gt;
&lt;li&gt;Dense eager execution is a serious baseline because it has a simple hot path.&lt;/li&gt;
&lt;li&gt;TurboQuant-style compressed-key attention is a different target from storage-only cache compression.&lt;/li&gt;
&lt;li&gt;The stable compressed-key path was useful enough to keep as an internal baseline.&lt;/li&gt;
&lt;li&gt;The next bottleneck was historical value mixing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where the next technical question came from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can I make historical value mixing cheaper without destroying quality?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question is more brutal than cache compression, because it is no longer enough to store fewer bytes. The compressed representation also has to be cheap to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scope
&lt;/h2&gt;

&lt;p&gt;These measurements came from one local fork, one benchmark setup, and a small-model-first workflow. The goal was not to claim universal results for every model and GPU.&lt;/p&gt;

&lt;p&gt;The goal was to answer a systems question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Am I actually reducing attention execution cost, or only cache storage?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For this phase, the answer was clear.&lt;/p&gt;

&lt;p&gt;I had reduced storage.&lt;/p&gt;

&lt;p&gt;I had not yet won execution.&lt;/p&gt;

&lt;p&gt;This distinction changed the next question. Once the key path had a stable compressed baseline, the remaining bottleneck was not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;can I store fewer bytes?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;can I mix historical values cheaply enough without breaking quality?&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>performance</category>
      <category>research</category>
    </item>
    <item>
      <title>Synthetic Population Testing for Recommendation Systems</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sat, 04 Apr 2026 02:04:50 +0000</pubDate>
      <link>https://forem.com/alankritverma/synthetic-population-testing-for-recommendation-systems-58f5</link>
      <guid>https://forem.com/alankritverma/synthetic-population-testing-for-recommendation-systems-58f5</guid>
      <description>&lt;p&gt;&lt;em&gt;Offline evaluation is necessary for recommender systems. It is also not a full test of recommender quality. The missing layer is not only better aggregate metrics, but better ways to test how a model behaves for different kinds of users before launch.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;In the last post, I argued that offline evaluation is useful but incomplete for recommendation systems.&lt;/li&gt;
&lt;li&gt;After that, I built a small public artifact to make the gap concrete.&lt;/li&gt;
&lt;li&gt;In the canonical MovieLens comparison, the popularity baseline wins &lt;code&gt;Recall@10&lt;/code&gt; and &lt;code&gt;NDCG@10&lt;/code&gt;, but the candidate model does much better for Explorer and Niche-interest users and creates a very different behavioral profile.&lt;/li&gt;
&lt;li&gt;I do not think this means “offline evaluation is wrong.”&lt;/li&gt;
&lt;li&gt;I think it means a better pre-launch evaluation stack should include some form of synthetic population testing: explicit behavioral lenses, trajectory-aware diagnostics, and tests that make hidden tradeoffs visible before launch.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Comes After “Offline Evaluation Is Not Enough”?
&lt;/h2&gt;

&lt;p&gt;In the first post, I made a narrow claim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;offline evaluation is useful, but incomplete, because recommendation systems are interactive systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That argument matters, but by itself it leaves an obvious next question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if aggregate offline metrics are not enough, what should be added to the evaluation stack?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I do not think the answer starts with a giant platform or a perfect user simulator.&lt;/p&gt;

&lt;p&gt;I think the more practical place to start is smaller:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;take the same baseline-vs-candidate comparison and test it through multiple behavioral lenses, not just one aggregate average.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is what I built next.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Artifact
&lt;/h2&gt;

&lt;p&gt;The current artifact is a small public recommender behavior QA harness.&lt;/p&gt;

&lt;p&gt;It compares:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one baseline recommender&lt;/li&gt;
&lt;li&gt;one candidate recommender&lt;/li&gt;
&lt;li&gt;one fixed evaluation setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it produces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;standard offline ranking metrics&lt;/li&gt;
&lt;li&gt;bucket-level utility&lt;/li&gt;
&lt;li&gt;behavioral diagnostics such as novelty, repetition, and catalog concentration&lt;/li&gt;
&lt;li&gt;short trajectory traces that make model behavior easier to inspect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The canonical public run is intentionally narrow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MovieLens 100K&lt;/li&gt;
&lt;li&gt;Model A: popularity baseline&lt;/li&gt;
&lt;li&gt;Model B: genre-profile recommender with a popularity prior&lt;/li&gt;
&lt;li&gt;4 fixed buckets&lt;/li&gt;
&lt;li&gt;one frozen report bundle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point is not to claim that these two models define recommender evaluation. The point is to create one clean, reproducible proof that aggregate offline metrics can hide useful pre-launch information.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Canonical Result
&lt;/h2&gt;

&lt;p&gt;The canonical MovieLens run shows the core value in one comparison.&lt;/p&gt;

&lt;p&gt;On aggregate offline ranking metrics, the popularity baseline wins:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Recall@10&lt;/th&gt;
&lt;th&gt;NDCG@10&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model A&lt;/td&gt;
&lt;td&gt;0.088&lt;/td&gt;
&lt;td&gt;0.057&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model B&lt;/td&gt;
&lt;td&gt;0.058&lt;/td&gt;
&lt;td&gt;0.036&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
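
&lt;p&gt;For readers who want the two metrics pinned down, here is a hedged sketch of how &lt;code&gt;Recall@10&lt;/code&gt; and &lt;code&gt;NDCG@10&lt;/code&gt; are typically computed per user with binary relevance. This is my own illustrative implementation, not the harness's code, which may handle ties and edge cases differently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math

def recall_at_k(ranked, relevant, k=10):
    # fraction of this user's held-out positives recovered in the top k
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    # discounted gain of hits, normalized by the best achievable ordering
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in rel)
    ideal = sum(1.0 / math.log2(pos + 2)
                for pos in range(min(len(rel), k)))
    return dcg / ideal if ideal else 0.0

# One user: two held-out positives, one recovered at rank 2.
print(recall_at_k(["m1", "m2", "m3"], ["m2", "m9"]))   # 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table values are averages of such per-user scores, and that averaging is exactly the compression the bucketed view undoes.&lt;/p&gt;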

&lt;p&gt;If we stopped there, the conclusion would be straightforward: Model A looks better.&lt;/p&gt;

&lt;p&gt;But the bucketed view tells a different story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bucket&lt;/th&gt;
&lt;th&gt;Model A&lt;/th&gt;
&lt;th&gt;Model B&lt;/th&gt;
&lt;th&gt;Delta (B-A)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Conservative mainstream&lt;/td&gt;
&lt;td&gt;0.519&lt;/td&gt;
&lt;td&gt;0.532&lt;/td&gt;
&lt;td&gt;0.012&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explorer / novelty-seeking&lt;/td&gt;
&lt;td&gt;0.339&lt;/td&gt;
&lt;td&gt;0.523&lt;/td&gt;
&lt;td&gt;0.184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Niche-interest&lt;/td&gt;
&lt;td&gt;0.443&lt;/td&gt;
&lt;td&gt;0.722&lt;/td&gt;
&lt;td&gt;0.279&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-patience&lt;/td&gt;
&lt;td&gt;0.321&lt;/td&gt;
&lt;td&gt;0.364&lt;/td&gt;
&lt;td&gt;0.043&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mxcw3czl29t6fzv48hj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mxcw3czl29t6fzv48hj.png" alt="What offline metrics missed" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is the point.&lt;/p&gt;

&lt;p&gt;Aggregate offline metrics say one thing. The segment-aware view says something more useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the baseline is better at recovering held-out positives&lt;/li&gt;
&lt;li&gt;the candidate is much stronger for important user lenses&lt;/li&gt;
&lt;li&gt;the behavioral profile of the system changes in ways the aggregate view compresses away&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The behavioral diagnostics make that even clearer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Novelty&lt;/th&gt;
&lt;th&gt;Repetition&lt;/th&gt;
&lt;th&gt;Catalog concentration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model A&lt;/td&gt;
&lt;td&gt;0.395&lt;/td&gt;
&lt;td&gt;0.279&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model B&lt;/td&gt;
&lt;td&gt;0.678&lt;/td&gt;
&lt;td&gt;0.664&lt;/td&gt;
&lt;td&gt;0.717&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is worth pausing on, because not every behavioral metric moves in the same direction.&lt;/p&gt;

&lt;p&gt;Model B is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more novel&lt;/li&gt;
&lt;li&gt;less catalog-concentrated&lt;/li&gt;
&lt;li&gt;but also more repetitive in this diagnostic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not a bug in the framework. It is part of the point. Different recommendation strategies produce different behavioral signatures, and pre-launch evaluation should help make those signatures visible instead of collapsing everything into one average.&lt;/p&gt;
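
&lt;p&gt;To make the diagnostics above less abstract, here is one common way such signals can be computed from recommendation lists. These are illustrative definitions of my own, not necessarily the harness's exact formulas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def repetition_rate(trajectory):
    # fraction of recommendation slots that repeat an earlier item
    seen, repeats = set(), 0
    for item in trajectory:
        if item in seen:
            repeats += 1
        seen.add(item)
    return repeats / len(trajectory) if trajectory else 0.0

def catalog_concentration(rec_lists, k):
    # 1.0 when every user receives the same k items;
    # falls toward k / catalog_size as lists diversify
    distinct = len(set(item for recs in rec_lists for item in recs))
    return k / distinct

def novelty(rec_lists, popularity):
    # mean unpopularity of recommended items, popularity scores in [0, 1]
    items = [item for recs in rec_lists for item in recs]
    return sum(1.0 - popularity[item] for item in items) / len(items)

# A baseline that shows everyone the same two head items scores
# exactly 1.0 on concentration, matching Model A above.
print(catalog_concentration([["a", "b"], ["a", "b"]], k=2))   # 1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

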

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0042vaqhregokfcsmnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0042vaqhregokfcsmnz.png" alt="Bucket utility comparison" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What “Synthetic Population Testing” Means Here
&lt;/h2&gt;

&lt;p&gt;It is important to be precise about this phrase.&lt;/p&gt;

&lt;p&gt;What I have today is &lt;strong&gt;not&lt;/strong&gt; a rich simulation of realistic synthetic humans. There are no agent conversations, no generated personas with biographies, and no claim that the current system faithfully reproduces real user psychology.&lt;/p&gt;

&lt;p&gt;What the artifact does have is a simpler and more controlled version of the same idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fixed behavioral lenses&lt;/li&gt;
&lt;li&gt;explicit utility assumptions&lt;/li&gt;
&lt;li&gt;short trajectory simulation under those assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The four v1 buckets are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conservative mainstream&lt;/li&gt;
&lt;li&gt;Explorer / novelty-seeking&lt;/li&gt;
&lt;li&gt;Niche-interest&lt;/li&gt;
&lt;li&gt;Low-patience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each bucket values recommendation behavior differently. The evaluation then asks how the same two models behave when the user lens changes.&lt;/p&gt;
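
&lt;p&gt;One way to read “each bucket values recommendation behavior differently” is as an explicit per-bucket utility function. The weights below are hypothetical, purely to show the shape of the idea; the artifact's actual utility definitions live in its report bundle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical weights over three behavioral signals, each in [0, 1].
BUCKET_WEIGHTS = {
    "conservative_mainstream": {"accuracy": 0.7, "novelty": 0.1, "variety": 0.2},
    "explorer":                {"accuracy": 0.3, "novelty": 0.5, "variety": 0.2},
    "niche_interest":          {"accuracy": 0.4, "novelty": 0.2, "variety": 0.4},
    "low_patience":            {"accuracy": 0.8, "novelty": 0.1, "variety": 0.1},
}

def bucket_utility(bucket, signals):
    # weighted combination of per-list behavioral signals
    weights = BUCKET_WEIGHTS[bucket]
    return sum(weights[name] * signals[name] for name in weights)

# The same model output scores differently under different lenses.
signals = {"accuracy": 0.4, "novelty": 0.9, "variety": 0.6}
print(round(bucket_utility("explorer", signals), 2))                  # 0.69
print(round(bucket_utility("conservative_mainstream", signals), 2))   # 0.49
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

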

&lt;p&gt;So when I say &lt;strong&gt;synthetic population testing&lt;/strong&gt; here, I mean:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;an early, lightweight form of synthetic population testing built from fixed behavioral lenses, not full synthetic-user simulation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I think that still matters. It turns vague product intuition like “some users may prefer this model more than others” into an explicit, reproducible pre-launch test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fomv81qvtmthfzd5xiz2j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fomv81qvtmthfzd5xiz2j.png" alt="What synthetic means here" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is Better Than Another Aggregate Metric
&lt;/h2&gt;

&lt;p&gt;A natural response to the first post is to ask whether we simply need better aggregate metrics.&lt;/p&gt;

&lt;p&gt;I do not think that is enough.&lt;/p&gt;

&lt;p&gt;The problem is not only that a metric is imperfect. The deeper problem is that recommender quality is heterogeneous.&lt;/p&gt;

&lt;p&gt;Different users are helped by different behaviors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;some want safer, familiar, high-exposure items&lt;/li&gt;
&lt;li&gt;some benefit from more novelty and more variety&lt;/li&gt;
&lt;li&gt;some have narrower tastes that require stronger matching to long-tail pockets&lt;/li&gt;
&lt;li&gt;some degrade faster when sequences become stale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single global score cannot represent all of that well.&lt;/p&gt;

&lt;p&gt;That is why I think the next useful layer should look more like testing against a small synthetic population than inventing one more scalar.&lt;/p&gt;

&lt;p&gt;Instead of asking only:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;which model wins on average?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;we should also ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;which model wins for which behavioral lens?&lt;/p&gt;

&lt;p&gt;where do the models differ most?&lt;/p&gt;

&lt;p&gt;what kind of trajectory does each model produce?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This does not mean the current bucket lenses are perfect. It means they are often more informative than a single collapsed average.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Short Trajectory Example
&lt;/h2&gt;

&lt;p&gt;The trajectory view matters because recommendation quality is not only one-step.&lt;/p&gt;

&lt;p&gt;Here is one Explorer / novelty-seeking comparison from the canonical run:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model A&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raiders of the Lost Ark -&amp;gt; Fargo -&amp;gt; Toy Story -&amp;gt; Return of the Jedi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Model B&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prophecy, The -&amp;gt; Cat People -&amp;gt; Wes Craven's New Nightmare -&amp;gt; Relic, The
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first sequence stays much closer to familiar, high-exposure titles. The second is much more tailored to a narrower taste profile and much more novel.&lt;/p&gt;

&lt;p&gt;This is exactly the kind of difference that disappears when evaluation is reduced to one aggregate ranking score.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kzh5vaksyaf7drogtvz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kzh5vaksyaf7drogtvz.png" alt="Explorer trace comparison" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Before Launch
&lt;/h2&gt;

&lt;p&gt;Pre-launch evaluation is about decisions, not just measurements.&lt;/p&gt;

&lt;p&gt;If a team is deciding whether to ship a new recommender, the real question is usually not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;did one mean score go up?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is closer to this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;who gets a better experience?&lt;/li&gt;
&lt;li&gt;who gets a worse one?&lt;/li&gt;
&lt;li&gt;does the candidate become more repetitive?&lt;/li&gt;
&lt;li&gt;does it collapse toward head items?&lt;/li&gt;
&lt;li&gt;does it create a healthier exploration profile?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are product and system questions, not only ranking-metric questions.&lt;/p&gt;

&lt;p&gt;That is why I like this framing. It stays honest about what the artifact is doing. It is not trying to predict the full online future. It is trying to make hidden tradeoffs visible earlier, with a tool that is still small enough to run, inspect, and reason about.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Is, And What It Is Not
&lt;/h2&gt;

&lt;p&gt;I think the strongest version of this argument is the honest one.&lt;/p&gt;

&lt;p&gt;This artifact &lt;strong&gt;is&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a small public proof&lt;/li&gt;
&lt;li&gt;a recommender-specific evaluation layer&lt;/li&gt;
&lt;li&gt;a way to make segment-level and trajectory-level tradeoffs visible&lt;/li&gt;
&lt;li&gt;a first wedge into broader testing for interactive systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This artifact is &lt;strong&gt;not&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a proof that the candidate model is globally better&lt;/li&gt;
&lt;li&gt;a replacement for offline evaluation&lt;/li&gt;
&lt;li&gt;a replacement for online experiments&lt;/li&gt;
&lt;li&gt;a full synthetic-human simulation framework&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction matters. If this work is useful, it will be useful because it is clear about what it adds, not because it overclaims.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Better Evaluation Stack
&lt;/h2&gt;

&lt;p&gt;The long-term picture I have in mind looks something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Standard offline evaluation remains the first layer.&lt;/li&gt;
&lt;li&gt;Segment-aware and trajectory-aware diagnostics become the second layer.&lt;/li&gt;
&lt;li&gt;Richer synthetic population testing may become the next layer after that.&lt;/li&gt;
&lt;li&gt;Online experiments still remain necessary for final validation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is a much more realistic stack than pretending a single aggregate metric can do the whole job.&lt;/p&gt;

&lt;p&gt;In that stack, the current artifact sits at layer two. It adds explicit behavioral lenses and short trajectory diagnostics to the familiar offline comparison workflow.&lt;/p&gt;

&lt;p&gt;That is why I think it matters, even in its current limited form.&lt;/p&gt;

&lt;p&gt;It is not the final answer.&lt;/p&gt;

&lt;p&gt;It is the first concrete artifact of the missing layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx46l3fmz2thhy9zsdzw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx46l3fmz2thhy9zsdzw2.png" alt="Canonical result snapshot" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The first post argued that offline evaluation is not enough for recommendation systems.&lt;/p&gt;

&lt;p&gt;This artifact is my first practical answer to what should come next.&lt;/p&gt;

&lt;p&gt;Not a giant platform. Not a perfect simulation. Not a replacement for offline evaluation.&lt;/p&gt;

&lt;p&gt;Just a small, reproducible evaluation harness that compares a baseline and a candidate through multiple behavioral lenses and shows tradeoffs that aggregate metrics compress away.&lt;/p&gt;

&lt;p&gt;If offline evaluation is the first screen, then synthetic population testing, in some form, may be one of the next useful layers.&lt;/p&gt;

&lt;p&gt;This v1 is a lightweight version of that idea.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you want to see the public artifact, the canonical MovieLens demo lives in the &lt;code&gt;limitation&lt;/code&gt; repo as a report, JSON result bundle, and supporting visuals.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>algorithms</category>
      <category>research</category>
    </item>
    <item>
      <title>Why Offline Evaluation Is Not Enough for Recommendation Systems?</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sun, 29 Mar 2026 16:40:52 +0000</pubDate>
      <link>https://forem.com/alankritverma/why-offline-evaluation-is-not-enough-for-recommendation-systems-15ii</link>
      <guid>https://forem.com/alankritverma/why-offline-evaluation-is-not-enough-for-recommendation-systems-15ii</guid>
      <description>&lt;h2&gt;
  
  
  Why Offline Evaluation Is Not Enough for Recommendation Systems
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Offline evaluation is essential for recommender systems. It is also easy to mistake for a fuller measure of quality than it really is.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Offline evaluation is useful, fast, and necessary for recommender systems.&lt;/li&gt;
&lt;li&gt;But it is built on logged behavior generated under older exposure policies.&lt;/li&gt;
&lt;li&gt;That makes it weak at judging policy shifts, novel items, cold start behavior, and longer interaction trajectories.&lt;/li&gt;
&lt;li&gt;In a small MovieLens demo, the popularity baseline wins on aggregate offline ranking metrics, while a more personalized model does better for explorer, niche-interest, and low-patience user buckets.&lt;/li&gt;
&lt;li&gt;The practical conclusion is not to replace offline evaluation, but to stop treating it as a full test of recommender quality.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Recommendation systems are interactive systems, but offline evaluation often treats them like static predictors.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  1. The Testing Gap
&lt;/h3&gt;

&lt;p&gt;We know how to test deterministic software. We are much less certain about how to test systems that influence the behavior they later observe.&lt;/p&gt;

&lt;p&gt;Recommendation systems sit squarely in that second category. They do not just estimate what a user might click, watch, or purchase. They decide what the user gets a chance to see, and that choice helps shape the data that will later be treated as evidence.&lt;/p&gt;

&lt;p&gt;Offline evaluation is one of the standard tools in recommender systems for good reason. It is practical, fast, and often highly informative. A team can compare candidate models on historical interaction data long before it is ready to send live traffic to a new ranking policy.&lt;/p&gt;

&lt;p&gt;That usefulness, however, can make offline evaluation easy to over-interpret. A strong offline result often sounds like a strong statement about real recommendation quality. Sometimes it is. But the conclusion is narrower than it first appears.&lt;/p&gt;

&lt;p&gt;Historical interaction logs are not simply records of user preference. They are records of user preference under a particular pattern of exposure. They reflect what earlier systems chose to rank, recommend, and repeat. In that sense, the data is policy-dependent from the beginning.&lt;/p&gt;

&lt;p&gt;This matters because recommendation quality is not only about matching a fixed label. A recommender is an interactive system. Its outputs affect future inputs. Change the policy, and over time you may change what users discover, what they come to trust, what they ignore, and what they eventually consume.&lt;/p&gt;

&lt;p&gt;Consider a movie recommender. One model may reliably surface popular, familiar titles. Another may be more personal and more willing to introduce niche films that fit a specific user's taste. If the historical logs were generated under a system that already emphasized mainstream titles, those logs may be much richer in evidence for the first model's choices than for the second model's.&lt;/p&gt;

&lt;p&gt;That does not make offline evaluation wrong. It does mean the object being measured is more limited than many teams would like. Offline evaluation is useful, but insufficient.&lt;/p&gt;

&lt;p&gt;The point of this article is narrow. It is not that offline evaluation should be discarded, and it is not a general argument about all machine learning systems. The claim is simpler: recommendation systems are interactive systems, and that fact places real limits on what historical replay can tell us.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. What Offline Evaluation Is
&lt;/h3&gt;

&lt;p&gt;Offline evaluation, in the recommender setting, means evaluating a model on historical logged interactions rather than on live user traffic. The usual pattern is straightforward: train on past user-item behavior, hold out a later slice of interactions, and ask whether the model ranks the held-out items highly for the relevant users.&lt;/p&gt;

&lt;p&gt;In a movie recommendation system, the data might include watches, clicks, ratings, or add-to-list events. A model is trained on part of that history and then evaluated on interactions that were not shown during training. If a user later watched a particular film, one basic offline question is whether that film would have appeared near the top of the model's ranked list.&lt;/p&gt;

&lt;p&gt;This setup supports the ranking-style metrics commonly used in recommender systems. Teams may report measures such as Recall@K, hit rate, or NDCG to summarize how well a model recovers held-out interactions. The exact metric matters, but the general logic is the same: use historical behavior as a proxy for whether the recommendations were good.&lt;/p&gt;

&lt;p&gt;That approach is attractive because it gives a concrete and reproducible testing loop. Candidate models can be compared against the same held-out data. Regressions can be caught before launch. Incremental improvements can be measured without the cost and risk of online experimentation.&lt;/p&gt;

&lt;p&gt;It is also important to be precise about what this evaluation is actually saying. Offline evaluation does not directly measure how users would respond to a new policy in a live environment. It measures how well a model aligns with historical interactions recorded under earlier exposure conditions.&lt;/p&gt;

&lt;p&gt;That distinction is easy to blur because the workflow looks so familiar. We have training data, a test set, and a metric. But in recommendation systems, the labels are not independent of the system that helped generate them. The held-out watch or click is not just a fact about the user. It is also a fact about what the user was shown.&lt;/p&gt;

&lt;p&gt;For now, that is enough of a working definition. Offline evaluation is historical replay over logged interactions, typically framed as a ranking problem, and used as a proxy for recommendation quality under observed conditions. It is a very useful proxy. The rest of the article asks where its boundaries are.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Where Offline Evaluation Breaks
&lt;/h3&gt;

&lt;p&gt;The limitations of offline evaluation do not come from a single bad metric or a single avoidable mistake. They come from a more basic fact about recommender data: the data is generated under a policy. What users do in the logs depends in part on what earlier systems chose to show them.&lt;/p&gt;

&lt;p&gt;That sounds obvious when stated directly. But it has deeper consequences than it first appears. If the evidence used for evaluation is itself shaped by older recommendation decisions, then offline evaluation is not observing some neutral ground truth about relevance. It is observing relevance through the filter of past exposure.&lt;/p&gt;

&lt;p&gt;In a static prediction task, that distinction is often less severe. In recommendation, it sits near the center of the problem. A new recommender is rarely judged against untouched labels. It is judged against behavior recorded under an older recommender, with its own ranking habits, popularity biases, and coverage patterns.&lt;/p&gt;

&lt;p&gt;We can state the issue in simple notation. Let &lt;code&gt;pi_0&lt;/code&gt; be the logging policy that generated the historical data, and let &lt;code&gt;pi_1&lt;/code&gt; be the new policy we want to evaluate. Offline replay uses observations gathered under &lt;code&gt;pi_0&lt;/code&gt; to estimate the quality of &lt;code&gt;pi_1&lt;/code&gt;. If &lt;code&gt;pi_1&lt;/code&gt; behaves much like &lt;code&gt;pi_0&lt;/code&gt;, that may be informative. If it changes exposure materially, the estimate becomes much less complete.&lt;/p&gt;

&lt;p&gt;This is the core mismatch. The quantity we want is user response under the candidate policy. The quantity we usually observe is user response under the previous policy. The two overlap, but they are not the same object.&lt;/p&gt;
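
&lt;p&gt;A toy sketch makes the mismatch mechanical. Everything here is synthetic; it only illustrates that replay can score &lt;code&gt;pi_1&lt;/code&gt; on the interactions &lt;code&gt;pi_0&lt;/code&gt; actually created, and nowhere else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Synthetic catalog; the logging policy pi_0 exposed only the head.
catalog = list(range(20))
pi_0_slate = catalog[:5]       # items 0-4 were shown, so logs exist for them
pi_1_slate = catalog[3:8]      # candidate pi_1 shifts exposure toward the tail

# Interactions can only be logged for items pi_0 exposed.
logged = set(pi_0_slate)

# Replay can only judge the candidate where the logs have support.
overlap = [item for item in pi_1_slate if item in logged]
coverage = len(overlap) / len(pi_1_slate)

print(overlap)    # [3, 4]
print(coverage)   # 0.4: most of the candidate slate is unobservable in replay
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The more &lt;code&gt;pi_1&lt;/code&gt; departs from &lt;code&gt;pi_0&lt;/code&gt;, the smaller this overlap becomes.&lt;/p&gt;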

&lt;h4&gt;
  
  
  3.1 Exposure Bias
&lt;/h4&gt;

&lt;p&gt;The first break is exposure bias. Users can only react to items they were actually shown.&lt;/p&gt;

&lt;p&gt;That means an interaction log is not just a record of what users preferred. It is also a record of what the system made available. When an item receives no click, no watch, or no rating, that absence does not cleanly mean the item was irrelevant. In many cases it means the item was never placed in front of the user at all.&lt;/p&gt;

&lt;p&gt;This matters immediately for offline evaluation. Suppose a movie platform has historically given heavy exposure to well-known studio releases and much lighter exposure to niche films. The resulting data will contain dense evidence for how users responded to the mainstream catalog and sparse evidence for how they would have responded to more specialized titles.&lt;/p&gt;

&lt;p&gt;The bias here is structural rather than anecdotal. If observed feedback only exists for exposed items, then the support of the evaluation data is concentrated where the logging policy chose to spend attention. In compact form, observed reward is only available where &lt;code&gt;pi_0(i | u, c)&lt;/code&gt; is nontrivial for user &lt;code&gt;u&lt;/code&gt;, item &lt;code&gt;i&lt;/code&gt;, and context &lt;code&gt;c&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is why historical replay is partial. It is not sampling uniformly from all relevant user-item pairs. It is sampling from the subset that earlier policies made visible. In a movie recommender, this can make “popular” look easier to measure than “personally relevant,” even when the latter is closer to the product goal.&lt;/p&gt;

&lt;h4&gt;
  
  
  3.2 Old-Policy Lock-In
&lt;/h4&gt;

&lt;p&gt;Exposure bias becomes more consequential when a new policy differs from the old one in systematic ways. This is where old-policy lock-in appears.&lt;/p&gt;

&lt;p&gt;In most offline evaluations, the labels used to assess a candidate model were generated under a different ranking policy. A held-out watch event looks like a simple target, but it is downstream of earlier recommendation decisions. The new model is therefore being judged with evidence produced by the system it may be trying to replace.&lt;/p&gt;

&lt;p&gt;This creates an asymmetry. Models that resemble the old policy often enjoy richer and cleaner evidence in the historical logs. Models that shift probability mass toward less exposed regions of the catalog are evaluated in the parts of the space where the logs are thinnest.&lt;/p&gt;

&lt;p&gt;Return to the movie example. If the old system strongly favored familiar blockbusters, then the held-out data will naturally contain many interactions with those titles. A candidate model that continues to rank them highly will line up well with the log. Another model that is more willing to surface quieter but well-matched films may look weaker offline, not necessarily because users dislike those recommendations, but because the old system rarely created opportunities to observe that preference.&lt;/p&gt;

&lt;p&gt;This is one reason a better recommender can look worse offline. The issue is not only model accuracy. It is evaluation support. When performance is estimated on outcomes generated under &lt;code&gt;pi_0&lt;/code&gt;, the comparison can systematically favor policies that stay close to &lt;code&gt;pi_0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That does not make all offline comparisons invalid. If two models differ only slightly, offline evaluation can still be highly useful. But when a candidate policy changes exposure patterns in meaningful ways, offline results should be read with more caution than the metric alone suggests.&lt;/p&gt;
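&lt;p&gt;A tiny hand-constructed sketch (hypothetical users, items, and rankings, not real log data) makes the asymmetry concrete. Two candidates with identical true quality receive very different offline scores, because held-out positives only exist for items &lt;code&gt;pi_0&lt;/code&gt; exposed:&lt;/p&gt;

```python
# Hypothetical example: pi_0 only ever exposed items 0-4, so logged
# positives can never include the long-tail items 5-9, even when users
# would have loved them.
logged_positives = {          # held-out watches observed under pi_0
    "u1": {0, 2}, "u2": {1, 3}, "u3": {0, 4},
}
true_positives = {            # what users actually like (never fully logged)
    "u1": {0, 2, 7}, "u2": {1, 3, 8}, "u3": {0, 4, 9},
}

# Candidate A mimics pi_0's head-heavy exposure; candidate B trades one
# head slot for the long-tail match pi_0 never showed.
rank_a = {"u1": [0, 2, 1], "u2": [1, 3, 0], "u3": [0, 4, 1]}
rank_b = {"u1": [7, 0, 5], "u2": [8, 1, 5], "u3": [9, 0, 5]}

def recall_at_k(ranks, positives, k=3):
    hits = total = 0
    for user, pos in positives.items():
        hits += len(set(ranks[user][:k]) & pos)
        total += len(pos)
    return hits / total

print(recall_at_k(rank_a, logged_positives))  # 1.0  -> A looks perfect on replay
print(recall_at_k(rank_b, logged_positives))  # 0.5  -> B looks much worse
print(recall_at_k(rank_a, true_positives))    # ~0.67 -> but against true tastes
print(recall_at_k(rank_b, true_positives))    # ~0.67 -> the two are identical
```

&lt;p&gt;Replay against the log rewards the policy that resembles &lt;code&gt;pi_0&lt;/code&gt;, even though both candidates recover users' true preferences equally well.&lt;/p&gt;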

&lt;h4&gt;
  
  
  3.3 Novel Items and Cold Start
&lt;/h4&gt;

&lt;p&gt;The same logic becomes even sharper for new or rarely exposed content.&lt;/p&gt;

&lt;p&gt;Offline evaluation is strongest where historical evidence is plentiful. It is weakest where exposure has been limited, recent, or absent. Unfortunately, those are often exactly the regions where recommendation systems are asked to do something valuable: introduce new items, expand coverage, and connect users to parts of the catalog they would not have reached on their own.&lt;/p&gt;

&lt;p&gt;In a movie platform, consider a newly added independent film with very little interaction history. A model may have good reasons to recommend it to a narrow set of users based on metadata, embeddings, or nearby behavioral signals. But if the film barely appeared under the previous policy, then historical logs offer limited evidence for how good that recommendation would actually be.&lt;/p&gt;

&lt;p&gt;The problem is not only that the item is new. The deeper issue is that offline replay inherits the conservatism of past exposure. It is much easier to validate recommendations for already visible inventory than for inventory the old policy neglected.&lt;/p&gt;

&lt;p&gt;This creates a subtle but important pressure. Systems that stay near the historically exposed core of the catalog are easier to justify with offline evidence. Systems that broaden exposure toward the tail are often evaluated precisely where the data is least informative. Over time, that can make conservative recommendation strategies look more reliable than they really are, and exploratory strategies look less supported than they might deserve.&lt;/p&gt;

&lt;p&gt;The claim is not that offline evaluation fails in every cold-start setting. It is that historical replay is structurally weak exactly where a recommender tries to broaden exposure. For recommenders, novelty is often where the evidence is thinnest.&lt;/p&gt;

&lt;h4&gt;
  
  
  3.4 Trajectory Blindness
&lt;/h4&gt;

&lt;p&gt;Even if the exposure problem disappeared, there would still be another limitation. Recommendation quality is not purely one-step.&lt;/p&gt;

&lt;p&gt;Most offline metrics compress evaluation into local ranking success. Did the model place the held-out item near the top? Did it recover the next watch? Did it improve a ranking score on observed interactions? Those are reasonable questions, but they are mostly questions about immediate alignment with historical events.&lt;/p&gt;
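&lt;p&gt;For concreteness, here is a minimal sketch of the two most common one-step metrics, Recall@K and binary-relevance NDCG@K (the titles are illustrative placeholders):&lt;/p&gt;

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of held-out relevant items recovered in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Position-discounted hits in the top k, normalized by the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

ranked = ["blockbuster", "thriller", "indie", "documentary"]
print(recall_at_k(ranked, ["indie"], 3))  # 1.0: the held-out watch is in the top 3
print(ndcg_at_k(ranked, ["indie"], 3))    # 0.5: but it sits at rank 3, not rank 1
```

&lt;p&gt;Both numbers answer the same kind of question: did the model place one logged event near the top of one list. Nothing in them looks across sessions.&lt;/p&gt;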

&lt;p&gt;Users, however, experience recommendation systems as sequences. They return across sessions. They compare one recommendation to the previous one. They notice repetition. They develop trust or impatience. They learn whether the system helps them explore or merely loops them through slight variations of what it already knows how to sell.&lt;/p&gt;

&lt;p&gt;This is where trajectory blindness enters. A recommender can look strong on one-step relevance and still create a poor multi-step experience.&lt;/p&gt;

&lt;p&gt;Imagine a movie recommender that repeatedly serves highly similar popular thrillers because those titles have strong historical watch signals. In a one-step offline evaluation, this may look sensible. The recommendations are close to what users have previously consumed, and the metrics may reward that closeness. But over several sessions the user may experience the system as narrow, repetitive, and increasingly unhelpful.&lt;/p&gt;

&lt;p&gt;Another model might trade a small amount of one-step certainty for a better sequence. It may alternate between reliable choices and occasional high-fit long-tail discoveries. That kind of quality often lives in the trajectory rather than in any single ranking event.&lt;/p&gt;

&lt;p&gt;In notation, many offline metrics focus on something close to the quality of &lt;code&gt;r_t&lt;/code&gt; at a single step. But recommender quality often depends on properties of the sequence &lt;code&gt;(a_1, r_1), ..., (a_T, r_T)&lt;/code&gt;: how concentrated the recommendations are, whether novelty appears at the right rate, whether boredom accumulates, and whether the system adapts well after earlier choices.&lt;/p&gt;
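&lt;p&gt;Sequence-level properties like these are straightforward to measure once sessions are kept together. A minimal sketch, using hypothetical session data, of two such diagnostics: repetition across sessions, and concentration in an overexposed head of the catalog:&lt;/p&gt;

```python
def repetition_rate(sessions):
    """Share of recommendations that repeat an item shown in an earlier session."""
    seen, repeats, total = set(), 0, 0
    for session in sessions:
        for item in session:
            repeats += item in seen
            total += 1
        seen.update(session)
    return repeats / total

def head_concentration(sessions, head):
    """Share of recommendations drawn from the popular 'head' of the catalog."""
    recs = [item for session in sessions for item in session]
    return sum(item in head for item in recs) / len(recs)

head = {"t1", "t2", "t3"}                        # historically overexposed titles
narrow = [["t1", "t2"], ["t1", "t3"], ["t2", "t1"]]
broad  = [["t1", "n1"], ["n2", "n3"], ["t2", "n4"]]

print(repetition_rate(narrow), head_concentration(narrow, head))  # 0.5 1.0
print(repetition_rate(broad), head_concentration(broad, head))    # 0.0 ~0.33
```

&lt;p&gt;Neither diagnostic appears in a one-step ranking score, yet both separate the narrow trajectory from the broad one immediately.&lt;/p&gt;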

&lt;p&gt;This is not an argument against ranking metrics. It is an argument about what they leave out. They summarize one-step fit to logged behavior. They do not, by themselves, tell us whether the interaction over time becomes richer, narrower, more repetitive, or more satisfying.&lt;/p&gt;

&lt;h4&gt;
  
  
  3.5 What This Means
&lt;/h4&gt;

&lt;p&gt;Taken together, these limitations point to a single conclusion. Offline evaluation often treats recommendation as if it were a static prediction problem with fixed labels. In practice, recommendation is an interactive system problem.&lt;/p&gt;

&lt;p&gt;The system chooses what to expose. Exposure shapes what users can respond to. Those responses become the data for future training and evaluation. Change the policy, and you may change the distribution of behavior itself.&lt;/p&gt;

&lt;p&gt;Once that is clear, the goal of evaluation also becomes clearer. The question is not only whether a model can replay the past. It is whether it can support good interaction under a changed policy. Historical replay helps answer that question, but only in part.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Why It Still Matters
&lt;/h3&gt;

&lt;p&gt;None of these limitations make offline evaluation disposable. They define its scope. That distinction matters.&lt;/p&gt;

&lt;p&gt;Recommendation teams rely on offline evaluation because it solves real engineering problems well. It is fast, reproducible, and comparatively cheap. It allows model changes to be screened before they reach users. It supports regression testing, debugging, ablation work, and benchmarking across candidate approaches. In most practical settings, there is no credible evaluation stack that excludes it.&lt;/p&gt;

&lt;p&gt;That remains true even after the critique above. A recommender team still needs a way to reject clearly weak models, validate implementation changes, and compare alternatives under a common protocol. Offline evaluation is often the first place where obvious failures become visible. If a ranking model cannot perform competitively in historical replay, it is usually hard to justify sending it to live traffic.&lt;/p&gt;

&lt;p&gt;This is especially important because online tests are expensive in more than one sense. They consume time, user attention, and organizational focus. They are also constrained by risk. A platform may be willing to test a modest ranking change online, but not a model that already appears unstable or uncompetitive offline. Historical evaluation remains the practical filter through which many candidate models must pass.&lt;/p&gt;

&lt;p&gt;The right conclusion, then, is not that offline evaluation should be replaced. It is that offline evaluation should be placed correctly. It is a strong tool for iteration and a weak tool for making broad claims about full recommender quality under changed exposure.&lt;/p&gt;

&lt;p&gt;In other words, the critique is intentional. Offline evaluation is widely used because it earns its place. The mistake is not using it. The mistake is mistaking it for a complete test.&lt;/p&gt;

&lt;p&gt;One compact way to summarize that balance is to separate what offline replay usually measures well from what it tends to leave undermeasured.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Evaluation aspect&lt;/th&gt;
&lt;th&gt;What offline replay usually captures&lt;/th&gt;
&lt;th&gt;What it tends to miss or undermeasure&lt;/th&gt;
&lt;th&gt;Movie recommender example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Immediate relevance under existing exposure&lt;/td&gt;
&lt;td&gt;Whether held-out watched items appear near the top of the ranked list&lt;/td&gt;
&lt;td&gt;Whether that ranking would still look good under a materially different exposure policy&lt;/td&gt;
&lt;td&gt;A familiar blockbuster appears in the top &lt;code&gt;K&lt;/code&gt; because it was already heavily exposed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance under policy shift&lt;/td&gt;
&lt;td&gt;Small improvements that stay near the old policy&lt;/td&gt;
&lt;td&gt;Quality of recommendations in regions where the candidate policy differs most&lt;/td&gt;
&lt;td&gt;A model that surfaces more niche dramas has little historical support where it differs from the old system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Novel or underexposed items&lt;/td&gt;
&lt;td&gt;Some signal for items with enough prior exposure&lt;/td&gt;
&lt;td&gt;Items that were new, rare, or historically under-shown&lt;/td&gt;
&lt;td&gt;A newly added indie film receives little offline credit even if it fits the user well&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold start behavior&lt;/td&gt;
&lt;td&gt;Very coarse performance on sparse users or items&lt;/td&gt;
&lt;td&gt;Early recommendation quality when interaction history is thin&lt;/td&gt;
&lt;td&gt;A new documentary enters the catalog with too little evidence for replay to judge it fairly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repetition over sessions&lt;/td&gt;
&lt;td&gt;Little, unless explicitly measured&lt;/td&gt;
&lt;td&gt;Accumulated sameness across repeated visits&lt;/td&gt;
&lt;td&gt;The recommender keeps offering slight variations of the same thriller over multiple sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Novelty and exploration&lt;/td&gt;
&lt;td&gt;Limited signal through held-out interactions&lt;/td&gt;
&lt;td&gt;Whether the system introduces useful discovery at the right rate&lt;/td&gt;
&lt;td&gt;A long-tail science-fiction recommendation may be good, but the old logs barely contain exposure to it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Segment-level differences&lt;/td&gt;
&lt;td&gt;Aggregate averages over the evaluation set&lt;/td&gt;
&lt;td&gt;Which user groups are helped or hurt by the new policy&lt;/td&gt;
&lt;td&gt;Mainstream users may do well under Model A while exploration-seeking users do better under Model B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trajectory-level user experience&lt;/td&gt;
&lt;td&gt;Almost nothing in standard one-step metrics&lt;/td&gt;
&lt;td&gt;Trust, boredom, fatigue, and satisfaction over sequences&lt;/td&gt;
&lt;td&gt;A user keeps getting acceptable next picks but gradually disengages from repetition&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  5. Running Example: Model A vs. Model B
&lt;/h3&gt;

&lt;p&gt;The structural issues above become easier to see with a simple running example. Consider a movie recommendation system with two candidate rankers.&lt;/p&gt;

&lt;p&gt;Model A is conservative. It leans toward popular, broadly watched titles and tends to recommend within the historically dominant regions of the catalog. It is usually safe, usually familiar, and often repetitive.&lt;/p&gt;

&lt;p&gt;Model B is more personalized. It still recommends mainstream films when they fit, but it is more willing to surface niche titles, less obvious matches, and items from thinner parts of the catalog when the user profile suggests they are a good fit.&lt;/p&gt;

&lt;p&gt;Suppose the historical logs were generated under an earlier recommendation policy that behaved more like Model A. Popular titles received heavy exposure. Niche titles were shown less often. Over time, that policy produced abundant feedback on the mainstream catalog and much weaker evidence on long-tail items.&lt;/p&gt;

&lt;p&gt;Now evaluate both models offline on held-out interactions from those logs.&lt;/p&gt;

&lt;p&gt;Model A will often look strong for a simple reason: it aligns well with the exposure pattern that helped generate the data. It ranks many of the same kinds of items the old system already showed, so the held-out interactions contain ample opportunities to reward it.&lt;/p&gt;

&lt;p&gt;Model B may be better calibrated to particular users, especially users with narrower tastes or stronger appetite for discovery. But if many of its most valuable recommendations lie in regions of the catalog that were rarely exposed before, the offline log may not give it much credit. The evidence needed to validate those choices was never fully collected.&lt;/p&gt;

&lt;p&gt;This does not mean Model B is necessarily better overall. Some users may indeed prefer the safer behavior of Model A. That is part of the point. Recommendation quality is heterogeneous across users and across sessions, and a single aggregate score can hide that heterogeneity.&lt;/p&gt;

&lt;p&gt;The difference becomes clearer over repeated interaction. Model A may continue to produce acceptable next-item recommendations while gradually narrowing the user's experience into a small, overexposed slice of the catalog. Model B may produce a slightly noisier immediate ranking while creating a better long-run sequence for users who value novelty or have specialized tastes.&lt;/p&gt;

&lt;p&gt;This is the kind of divergence a later demo can make visible. Two models may look similar on an aggregate offline metric and still differ meaningfully in repetition, novelty, and which user groups they serve well.&lt;/p&gt;

&lt;h4&gt;
  
  
  A Small MovieLens Demo
&lt;/h4&gt;

&lt;p&gt;To make that less abstract, I built a small comparison on MovieLens 100K. The setup is intentionally simple. Model A is a popularity baseline. Model B is a lightweight personalized recommender built from user genre profiles with a modest popularity prior. The point is not to produce the strongest possible recommender. The point is to see what different layers of evaluation say about the same pair of systems.&lt;/p&gt;
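&lt;p&gt;In spirit, the two models look something like this. This is a simplified sketch with made-up titles and counts, not the demo code itself:&lt;/p&gt;

```python
from collections import Counter, defaultdict

# Hypothetical interaction log: (user, movie); genres live in the catalog.
log = [("u1", "Star Wars"), ("u1", "Alien"),
       ("u2", "Star Wars"), ("u2", "Toy Story"), ("u3", "Toy Story")]
catalog = {
    "Star Wars": {"sci-fi", "action"}, "Toy Story": {"animation", "family"},
    "Alien": {"sci-fi", "horror"}, "Cat People": {"horror"},
    "Fargo": {"crime", "drama"},
}

popularity = Counter(movie for _, movie in log)

def model_a(user, k=2):
    """Popularity baseline: the same head-of-catalog list for everyone."""
    return [movie for movie, _ in popularity.most_common(k)]

# User genre profiles: how often each genre appears in the user's history.
profiles = defaultdict(Counter)
for user, movie in log:
    profiles[user].update(catalog[movie])

def model_b(user, k=2, prior=0.1):
    """Genre-profile match plus a modest popularity prior."""
    seen = {m for u, m in log if u == user}
    def score(movie):
        return (sum(profiles[user][g] for g in catalog[movie])
                + prior * popularity[movie])
    return sorted((m for m in catalog if m not in seen),
                  key=score, reverse=True)[:k]

print(model_a("u1"))  # ['Star Wars', 'Toy Story'] -- pure popularity
print(model_b("u1"))  # ['Cat People', 'Toy Story'] -- genre fit beats raw popularity
```

&lt;p&gt;Model A ignores the user entirely; Model B lets a strong genre match outrank a popular title while the prior keeps it from drifting too far from the mainstream.&lt;/p&gt;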

&lt;p&gt;&lt;strong&gt;Aggregate view:&lt;/strong&gt; on standard offline ranking metrics, Model A looks better.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Recall@10&lt;/th&gt;
&lt;th&gt;NDCG@10&lt;/th&gt;
&lt;th&gt;Novelty&lt;/th&gt;
&lt;th&gt;Repetition&lt;/th&gt;
&lt;th&gt;Catalog concentration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model A&lt;/td&gt;
&lt;td&gt;0.088&lt;/td&gt;
&lt;td&gt;0.057&lt;/td&gt;
&lt;td&gt;0.395&lt;/td&gt;
&lt;td&gt;0.675&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model B&lt;/td&gt;
&lt;td&gt;0.058&lt;/td&gt;
&lt;td&gt;0.036&lt;/td&gt;
&lt;td&gt;0.678&lt;/td&gt;
&lt;td&gt;0.693&lt;/td&gt;
&lt;td&gt;0.717&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If we stopped there, the conclusion would be straightforward: the popularity baseline wins offline.&lt;/p&gt;

&lt;p&gt;But that is exactly the point of the article. Once the evaluation is widened beyond a single aggregate view, the picture changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bucketed view:&lt;/strong&gt; the same two models look quite different once we ask who is being served well.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bucket&lt;/th&gt;
&lt;th&gt;Model A utility&lt;/th&gt;
&lt;th&gt;Model B utility&lt;/th&gt;
&lt;th&gt;Delta (B-A)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Conservative mainstream&lt;/td&gt;
&lt;td&gt;0.519&lt;/td&gt;
&lt;td&gt;0.532&lt;/td&gt;
&lt;td&gt;0.012&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explorer / novelty-seeking&lt;/td&gt;
&lt;td&gt;0.339&lt;/td&gt;
&lt;td&gt;0.523&lt;/td&gt;
&lt;td&gt;0.184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Niche-interest&lt;/td&gt;
&lt;td&gt;0.443&lt;/td&gt;
&lt;td&gt;0.722&lt;/td&gt;
&lt;td&gt;0.279&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-patience&lt;/td&gt;
&lt;td&gt;0.321&lt;/td&gt;
&lt;td&gt;0.364&lt;/td&gt;
&lt;td&gt;0.043&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bucketed results are more revealing than the aggregate ones. Explorer users and niche-interest users benefit much more from Model B. Low-patience users also do slightly better under Model B in the short-session simulation, even though the aggregate offline ranking metrics still prefer Model A.&lt;/p&gt;

&lt;p&gt;The behavior diagnostics tell a related story. Model B is substantially more novel and much less concentrated in the most popular slice of the catalog. For explorer users, bucket-level novelty rises from &lt;code&gt;0.405&lt;/code&gt; under Model A to &lt;code&gt;0.808&lt;/code&gt; under Model B. For niche-interest users, mean bucket utility rises by &lt;code&gt;0.279&lt;/code&gt;. That is not a rounding error. It is a segment-level change that the aggregate offline metrics compress away.&lt;/p&gt;
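&lt;p&gt;The demo's exact formulas are not reproduced here, but one representative novelty measure (an assumption for illustration, not necessarily the definition used above) is the mean self-information of recommended items under the logged popularity distribution, rescaled to &lt;code&gt;[0, 1]&lt;/code&gt;:&lt;/p&gt;

```python
import math

# Hypothetical play counts standing in for logged exposure.
play_counts = {"Star Wars": 64, "Fargo": 16, "Cat People": 1, "Relic": 1}
total_plays = sum(play_counts.values())  # 82

def novelty(recommended):
    """Mean self-information of recommended items, scaled so the rarest item -> 1."""
    max_info = math.log2(total_plays)  # the rarest possible item has count 1
    info = [-math.log2(play_counts[m] / total_plays) for m in recommended]
    return sum(info) / (len(info) * max_info)

print(novelty(["Star Wars"]))   # ~0.06: recommending the head scores low
print(novelty(["Cat People"]))  # ~1.0: a one-play title is maximally novel
```

&lt;p&gt;Under any measure of this shape, a ranker that keeps surfacing the overexposed head will sit near the bottom of the scale, which is exactly the gap the bucket-level numbers above describe.&lt;/p&gt;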

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkffspju06ezccanmrb2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkffspju06ezccanmrb2o.png" alt="Bucket-level utility comparison from the MovieLens demo" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the demo says in one glance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate offline metrics favor Model A.&lt;/li&gt;
&lt;li&gt;Explorer, niche-interest, and low-patience buckets do better under Model B.&lt;/li&gt;
&lt;li&gt;Model B is much more novel and less concentrated in the most popular slice of the catalog.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Two short traces make the difference more tangible.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explorer / novelty-seeking user&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model A: Raiders of the Lost Ark -&amp;gt; Fargo -&amp;gt; Toy Story -&amp;gt; Return of the Jedi
Model B: Prophecy, The -&amp;gt; Cat People -&amp;gt; Wes Craven's New Nightmare -&amp;gt; Relic, The
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first sequence stays close to familiar, high-exposure titles. The second is much more novel and much more tailored to a narrower taste profile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Low-patience user&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model A: Star Wars -&amp;gt; Fargo -&amp;gt; Return of the Jedi -&amp;gt; Toy Story
Model B: Monty Python and the Holy Grail -&amp;gt; Full Monty -&amp;gt; American President -&amp;gt; Truth About Cats &amp;amp; Dogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here the difference is not just novelty. The second sequence moves through a less concentrated slice of the catalog rather than repeatedly returning to the same mainstream core.&lt;/p&gt;

&lt;p&gt;This small demo does not prove that Model B is globally better. It does something more modest and more useful. It shows that the answer depends on what we mean by "better," which users we care about, and whether we look only at historical ranking recovery or also at the behavior a recommender produces over short trajectories.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. A Better Direction, Briefly
&lt;/h3&gt;

&lt;p&gt;If offline evaluation is necessary but incomplete, the natural response is not to discard it. The better response is to build a broader evaluation stack around it.&lt;/p&gt;

&lt;p&gt;That broader stack should start from the failure modes already discussed. If logged exposure is policy-dependent, then evaluation should be more explicit about where the evidence is strong and where it is weak. If quality emerges over time, then some part of evaluation should examine sequences rather than only one-step ranking recovery.&lt;/p&gt;

&lt;p&gt;In practice, this suggests a modest shift in emphasis. Instead of asking only for a single aggregate offline score, teams can also ask how models behave across user segments, how concentrated their recommendations become, how much novelty they introduce, and whether their behavior looks meaningfully different over short interaction traces.&lt;/p&gt;

&lt;p&gt;For the movie example, that might mean comparing Model A and Model B not only on Recall@K or NDCG, but also on repetition, tail exposure, and bucket-level outcomes for users with different appetites for familiarity or exploration. None of these measurements solves the full problem. They simply make the evaluation better matched to the system being evaluated.&lt;/p&gt;

&lt;p&gt;The same logic also motivates carefully designed simulated interaction or short trajectory-based testing. The point is not that such methods are already complete or universally trustworthy. The point is narrower: if recommenders shape future behavior, then some part of the evaluation stack should attempt to probe that interaction rather than treating historical replay as the whole story.&lt;/p&gt;

&lt;p&gt;This is best understood as complement, not replacement. Offline evaluation remains the fast and reliable first layer. But serious evaluation of recommender quality likely needs additional layers that are more sensitive to exposure shifts, segment differences, and longer-run experience.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. Conclusion
&lt;/h3&gt;

&lt;p&gt;Offline evaluation remains one of the most useful tools in recommender systems. It is fast, practical, and deeply embedded in how teams iterate on models.&lt;/p&gt;

&lt;p&gt;Its limitation is structural rather than procedural. The data it relies on is constrained by prior exposure and generated under earlier policies, so it provides only a partial test of recommender quality.&lt;/p&gt;

&lt;p&gt;That matters most when a model changes what gets shown, expands beyond historically overexposed items, or affects the experience over repeated interaction. In those settings, replaying the past is not the same as evaluating the new system on its own terms.&lt;/p&gt;

&lt;p&gt;Offline evaluation is indispensable, but it is not the whole test. Recommendation systems shape the behavior they later observe, so any serious evaluation stack should measure interaction, not just replay the past.&lt;/p&gt;

&lt;p&gt;This demo is illustrative rather than definitive; its value is in showing how aggregate offline results can hide segment-level differences.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>architecture</category>
      <category>research</category>
    </item>
    <item>
      <title>How GenAI Genesis Began</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sat, 07 Mar 2026 05:12:44 +0000</pubDate>
      <link>https://forem.com/alankritverma/how-genai-genesis-began-523b</link>
      <guid>https://forem.com/alankritverma/how-genai-genesis-began-523b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alankrit Verma&lt;/strong&gt; came to the University of Toronto as a shy, math-driven student on scholarship who felt a deep responsibility to give back.&lt;/p&gt;

&lt;p&gt;That instinct led him into student leadership through &lt;strong&gt;AMACSS&lt;/strong&gt;, where he helped build a small experiment called &lt;strong&gt;AI Olympics&lt;/strong&gt; with &lt;strong&gt;39 participants&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That experiment revealed something bigger: students wanted a serious space to build, learn, and belong in AI.&lt;/p&gt;

&lt;p&gt;So Alankrit and his co-founder &lt;strong&gt;Adib Fallahpour&lt;/strong&gt; scaled that spark into &lt;strong&gt;GenAI Genesis&lt;/strong&gt; — first as a cross-campus student hackathon, and eventually into one of Canada’s largest student AI hackathons.&lt;/p&gt;

&lt;p&gt;Along the way, Alankrit helped lead the vision, website, sponsorships, partnerships, and long-term structure behind the event, including helping establish the &lt;strong&gt;GenAI Genesis Foundation&lt;/strong&gt; so the mission could be sustained beyond a single organizing cycle.&lt;/p&gt;

&lt;p&gt;And now, in &lt;strong&gt;2026&lt;/strong&gt;, GenAI Genesis is entering its biggest and most ambitious chapter yet.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  From a 39-person experiment to one of Canada’s largest student AI hackathons
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;“Some communities are joined. Others are built because you cannot stop thinking about the version that should exist.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnzvkfycj5rvl2riwe9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnzvkfycj5rvl2riwe9f.png" alt="GenAI Genesis team with a celebration cake"&gt;&lt;/a&gt;&lt;br&gt;GenAI Genesis team with the cake. Surprise cake courtesy of Hasleen Kaur (Head of Finance 2025, Co-Chair 2026) and Ivan Semenov (Head of Operations 2025, Co-Chair 2026).
  &lt;/p&gt;




&lt;p&gt;There are some things you plan carefully.&lt;/p&gt;

&lt;p&gt;And then there are some things that begin so quietly, so casually, that you do not realize until much later that you were standing at the start of something much bigger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GenAI Genesis&lt;/strong&gt; was one of those things.&lt;/p&gt;

&lt;p&gt;When I joined the University of Toronto, I was, in many ways, still a shy person.&lt;/p&gt;

&lt;p&gt;I was not the loudest voice in every room. I was still figuring myself out, still trying to understand what kind of life I wanted to build, and what kind of contribution I wanted to make.&lt;/p&gt;

&lt;p&gt;But I did know one thing with complete clarity: I had been given a rare opportunity, and I did not want to waste it.&lt;/p&gt;

&lt;p&gt;Coming to this country and this university on a scholarship meant a lot to me. It gave me the ability to study freely, dream more freely, and imagine a future I may not otherwise have had. And from the beginning, that created a very deep feeling in me: &lt;strong&gt;I had to give back to the community that had given so much to me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At that time, I cared about many things at once.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I cared about &lt;strong&gt;math&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;I cared about &lt;strong&gt;building projects&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;I cared about &lt;strong&gt;recognition&lt;/strong&gt;, yes — but not just for ego. I wanted to build things that mattered.&lt;/li&gt;
&lt;li&gt;I cared about &lt;strong&gt;real-world impact&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;And I cared, very deeply, about &lt;strong&gt;community&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Math had been a big part of my identity for a long time. I had prepared seriously for the &lt;strong&gt;Euclid Mathematics Contest&lt;/strong&gt; and scored &lt;strong&gt;90/100&lt;/strong&gt;, and that experience mattered to me for more than just the number. Euclid is one of those milestones that gives you credibility, but more importantly, it gave me confidence. It made me more ambitious. It made me believe that I could build something meaningful. And it made me want to create spaces where other students could feel that same sense of challenge, excitement, and possibility.&lt;/p&gt;

&lt;p&gt;So when I came to &lt;strong&gt;U of T Scarborough&lt;/strong&gt;, I started looking around and asking myself a simple question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where is that energy here?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And honestly, at the time, I did not see enough of it.&lt;/p&gt;

&lt;p&gt;Especially in the computer science space, the student community was not really booming. This was the period after COVID, when many campus communities still felt quiet, fragmented, and difficult to revive. There was talent, but not enough momentum. Curiosity, but not enough structure. Ambition, but not enough spaces for it to gather.&lt;/p&gt;

&lt;p&gt;And somewhere along the way, I quietly made it my mission to help fix that.&lt;/p&gt;

&lt;p&gt;Not alone, of course. Communities are never built alone. But I wanted to be one of the people pushing hard in that direction.&lt;/p&gt;

&lt;p&gt;That instinct led me to &lt;strong&gt;AMACSS&lt;/strong&gt; — the &lt;strong&gt;Association of Mathematical and Computer Science Students&lt;/strong&gt;, the Departmental Student Association for the CMS department at the University of Toronto Scarborough.&lt;/p&gt;

&lt;p&gt;In my first year, I joined as a &lt;strong&gt;First-Year Representative Coordinator&lt;/strong&gt;, where I represented first-year computer science and math students to the association, and the association back to them. I also coordinated a team of &lt;strong&gt;seven people&lt;/strong&gt;, which turned out to be one of my first real lessons in leadership.&lt;/p&gt;

&lt;p&gt;Leadership, I learned very quickly, is not just about taking initiative. It is about understanding people. It is about assigning responsibility thoughtfully. It is about getting buy-in. It is about leading with grace when everyone has different levels of energy, skill, confidence, and commitment.&lt;/p&gt;

&lt;p&gt;I had always been someone who liked taking initiative, but AMACSS sharpened that instinct into something more deliberate.&lt;/p&gt;

&lt;p&gt;And in that chapter of my life, the first version of GenAI Genesis quietly appeared.&lt;/p&gt;

&lt;p&gt;Not as GenAI Genesis.&lt;/p&gt;

&lt;p&gt;Not yet.&lt;/p&gt;

&lt;p&gt;It started as something called &lt;strong&gt;AI Olympics&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Before Genesis, there was AI Olympics
&lt;/h2&gt;

&lt;p&gt;AI Olympics was the first real experiment.&lt;/p&gt;

&lt;p&gt;The original idea came from a mix of inspirations.&lt;/p&gt;

&lt;p&gt;Part of it came from my love for mathematics competitions and the kind of intellectual excitement they create. Part of it came from online hackathons I had participated in, where I had seen how energizing it could be when people come together to build under time pressure. I remember thinking again and again:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Why do we not have something like this at our university too?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At one point, I was brainstorming with &lt;strong&gt;Katrina Best&lt;/strong&gt;, who was the president at the time, about what might make for a strong event for first-year students. At first, we thought about doing something closer to a math contest. Then the idea evolved. I brainstormed with my team as well. Slowly, the concept shifted from Olympiad to something more build-oriented, more alive, more experimental.&lt;/p&gt;

&lt;p&gt;That is where &lt;strong&gt;AI Olympics&lt;/strong&gt; was born.&lt;/p&gt;

&lt;p&gt;The name came from that same spirit. We wanted something that felt like an Olympiad, but more modern, more hands-on, and more builder-focused. “AI Olympics” felt close enough to that energy, and at the time, it captured exactly what we were trying to do.&lt;/p&gt;

&lt;p&gt;It was a smaller hackathon-style event, around &lt;strong&gt;six to seven hours long&lt;/strong&gt;, with &lt;strong&gt;39 participants&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It was essentially the first-year team’s event through AMACSS, and leading it as First-Year Representative Coordinator made it feel especially personal.&lt;/p&gt;

&lt;p&gt;Many of the participants were beginners. The vibe in the room was not “elite competition” in the intimidating sense. It was much more like collective learning. People were curious. People were experimenting. People were just starting to understand what they could build.&lt;/p&gt;

&lt;p&gt;We taught participants how to use the tools. We gave them a website template they could plug their work into so they could build faster. We wanted to reduce friction and maximize momentum. We wanted them to feel like they could actually make something, even if they were just getting started.&lt;/p&gt;

&lt;p&gt;And maybe one of my favorite little memories from that day is how we kept ordering coffee from Tim Hortons — not once, not twice, but three times — because people kept wanting more, and apparently everyone had collectively decided that vanilla was the flavor of innovation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u70vfxz05as93xfbc06.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u70vfxz05as93xfbc06.jpeg" alt="AI Olympics"&gt;&lt;/a&gt;&lt;br&gt;AI Olympics
  &lt;/p&gt;

&lt;p&gt;Looking back, AI Olympics was small.&lt;/p&gt;

&lt;p&gt;But it was not small in meaning.&lt;/p&gt;

&lt;p&gt;Because it showed us something important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;People wanted a space to &lt;strong&gt;learn&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;People wanted a space to &lt;strong&gt;build&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;People wanted a space where AI felt &lt;strong&gt;exciting, approachable, social, and full of possibility&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The feedback made that obvious. People were interested in doing this again. They wanted to keep learning. They wanted to keep contributing. They wanted to build in public. They wanted more.&lt;/p&gt;

&lt;p&gt;And that was the moment the idea stopped feeling like a one-off event and started feeling like the beginning of a much larger mission.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Olympics was the spark.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GenAI Genesis was the system we built around that spark.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The moment it stopped being small
&lt;/h2&gt;

&lt;p&gt;Around that time, I was also working very closely with &lt;strong&gt;Adib Fallahpour&lt;/strong&gt;, who is not just my co-founder, but also a good friend of mine.&lt;/p&gt;

&lt;p&gt;I had first worked with Adib through my first-year team, and over time it became very clear that we were on the same wavelength in a lot of ways. He is a very kind person, a big thinker, and someone with strong vision. We both cared deeply about scaling this beyond its first version. We both felt that it should not remain a small campus event that people vaguely remembered. We wanted it to become something real.&lt;/p&gt;

&lt;p&gt;I still remember a moment from second year when Adib and I were housemates. He came into my room, and we started discussing what this thing could actually become. Not just another event. Not just another student initiative. But a serious hackathon. Something with real scale. Something that could create a home for people interested in AI, machine learning, software, and ambitious building more broadly.&lt;/p&gt;

&lt;p&gt;That conversation stayed with me.&lt;/p&gt;

&lt;p&gt;Because from that point onward, this stopped being a nice idea and started becoming a serious project.&lt;/p&gt;

&lt;p&gt;Like most ambitious student things, it began with a lot of conversations, a lot of hustle, and a slightly unreasonable amount of belief.&lt;/p&gt;

&lt;p&gt;We first tried to define the idea on paper:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What exactly was &lt;strong&gt;GenAI Genesis&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;What would it look like at scale?&lt;/li&gt;
&lt;li&gt;What kind of experience were we trying to create?&lt;/li&gt;
&lt;li&gt;What problem were we solving?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The problem, at least to us, felt clear.&lt;/p&gt;

&lt;p&gt;At the time, there was not enough community in Toronto around this space — not the kind of entrepreneurial, energetic, builder-first AI ecosystem we wanted to see among students. There was talent, but not enough connected ambition. There were students interested in AI and ML, but not enough platforms bringing them together in a serious way.&lt;/p&gt;

&lt;p&gt;So we decided to build one.&lt;/p&gt;

&lt;p&gt;The name came together surprisingly quickly. We broke it into two parts: &lt;strong&gt;GenAI&lt;/strong&gt; and &lt;strong&gt;Genesis&lt;/strong&gt;. “Genesis” suggested beginning, emergence, evolution. And at the time, “GenAI” was the word in the air. The name reflected the moment, but the mission was always broader than just generative AI — it was about AI, machine learning, software, and the community around building them. Put together, it felt like a beginning worth naming.&lt;/p&gt;

&lt;p&gt;We did not have a full team immediately. At first, we were figuring it out from scratch. Both Adib and I were part of &lt;strong&gt;Google Developer Student Club&lt;/strong&gt;, and that gave us one starting point. We knew we could bring in people from there. Then we looked beyond Scarborough and started reaching across campuses, especially to St. George, where some of the strongest technical student communities already existed.&lt;/p&gt;

&lt;p&gt;That is how collaborations started taking shape with groups like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GDG&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UTMIST&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UofT AI&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;and later, &lt;strong&gt;CSSU&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I want to be careful and clear here, because this matters to the story.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Important distinction:&lt;/strong&gt; GenAI Genesis was not a club-created event that I happened to be involved in.

&lt;p&gt;Those groups mattered enormously, and their support helped the vision scale far beyond what we could have done alone in the early stages. They brought expertise, reach, operational support, and legitimacy. But the mission itself — the initial push, the insistence that this had to exist — came from us.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;p&gt;That distinction matters because founder stories can get flattened over time into partnerships, logos, and sponsor lists. But the truth is usually more human than that. It begins with a few people seeing a gap and deciding they are not willing to leave it empty.&lt;/p&gt;




&lt;h2&gt;
  
  
  The people who helped us scale it
&lt;/h2&gt;

&lt;p&gt;Some of the most important early support came from people who believed in the idea and helped us take it seriously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Richard&lt;/strong&gt;, from &lt;strong&gt;UTMIST&lt;/strong&gt;, played a crucial role in 2024. He was senior to us and incredibly strong operationally. He helped us understand what it means to run something at scale, what it means to think through logistics properly, and how to turn energy into structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nimit&lt;/strong&gt;, through &lt;strong&gt;UofT AI&lt;/strong&gt;, also played a very important role in helping the initiative come together. Both Richard and Nimit helped us build the cross-campus support that allowed GenAI Genesis to grow beyond its first form.&lt;/p&gt;

&lt;p&gt;These collaborations mattered a lot. Not because the event “belonged” to those communities, but because they helped us bring the mission to the scale it deserved.&lt;/p&gt;

&lt;p&gt;Sometimes scaling an idea is not about finding people who will take it over.&lt;/p&gt;

&lt;p&gt;It is about finding people who understand it enough to help it rise.&lt;/p&gt;




&lt;h2&gt;
  
  
  A quick timeline
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;th&gt;Why it mattered&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Winter 2023&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We ran &lt;strong&gt;AI Olympics&lt;/strong&gt; through AMACSS with &lt;strong&gt;39 participants&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;It proved there was real demand for a build-first AI space&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Winter 2024&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We launched the first large-scale &lt;strong&gt;GenAI Genesis&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;The experiment became a serious institution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2025&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We scaled dramatically with more sponsors, more prizes, and many more submissions&lt;/td&gt;
&lt;td&gt;The hackathon became a recognized force in the student AI ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We are taking it to our biggest scale yet&lt;/td&gt;
&lt;td&gt;Bigger footprint, bigger ambition, bigger future&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;
  &lt;strong&gt;A tiny behind-the-scenes truth:&lt;/strong&gt;&lt;br&gt;
Every row in that table was held together by a lot of invisible work: outreach, relationship management, budget stress, website iterations, venue uncertainty, and a hundred tiny decisions that never show up in a recap post.
&lt;/p&gt;




&lt;h2&gt;
  
  
  2024: when the idea met reality
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo95hb4v5y5z5p667kxfn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo95hb4v5y5z5p667kxfn.png" alt="Participants and organizers during GenAI Genesis 2024"&gt;&lt;/a&gt;&lt;br&gt;Getting Started with GenAI Genesis 2024
  &lt;/p&gt;

&lt;p&gt;The 2024 edition was the moment things started to feel very real.&lt;/p&gt;

&lt;p&gt;In winter 2024, we launched the first large-scale GenAI Genesis in downtown Toronto.&lt;/p&gt;

&lt;p&gt;Up until that point, the idea had energy. It had promise. It had momentum. But 2024 was when it had to survive the test every ambitious student project eventually faces:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Could we actually execute this at scale?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That year taught me a lot.&lt;/p&gt;

&lt;p&gt;And by “a lot,” I mean the kind of lessons that only appear when vision collides with logistics.&lt;/p&gt;

&lt;p&gt;We had to learn how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;work with a much larger team&lt;/li&gt;
&lt;li&gt;coordinate across campuses&lt;/li&gt;
&lt;li&gt;lead people with different styles, strengths, and expectations&lt;/li&gt;
&lt;li&gt;manage conflict and disagreement without letting it fracture the mission&lt;/li&gt;
&lt;li&gt;build trust with sponsors&lt;/li&gt;
&lt;li&gt;make big promises responsibly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And in the middle of all that, I was deeply involved in the work itself.&lt;/p&gt;

&lt;p&gt;From the very beginning until now, I have led the &lt;strong&gt;website side&lt;/strong&gt; of GenAI Genesis. Tech has always been one of the areas I stayed especially close to. I was also heavily involved in &lt;strong&gt;sponsorships and partnerships&lt;/strong&gt; — doing cold outreach, talking to organizations, building those relationships, and helping create the external support system that made the event possible.&lt;/p&gt;

&lt;p&gt;One of the most memorable parts of that journey was our connection with &lt;strong&gt;Google&lt;/strong&gt;, and how that relationship went from something that initially felt surreal to something that became a meaningful long-term thread in the GenAI Genesis story. There is a strange feeling when big names start trusting something you built. It is exciting, but it is also sobering. It makes you realize the stakes are now real.&lt;/p&gt;

&lt;p&gt;The 2024 edition brought in support from names including &lt;strong&gt;Google, Knockri, Wombo, Vector Institute, the Academic Advising &amp;amp; Career Centre at UTSC, and the Rotman School of Management&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We had &lt;strong&gt;254 participants submit a project&lt;/strong&gt; and awarded roughly &lt;strong&gt;$3,000 in prizes&lt;/strong&gt;.&lt;br&gt;
But what I remember most is not just the numbers.&lt;/p&gt;

&lt;p&gt;I remember how much we had to figure out on the fly.&lt;/p&gt;

&lt;p&gt;Venue booking was a huge hassle. A lot of things were fragile. &lt;strong&gt;Judging&lt;/strong&gt;, especially, was something we did not have perfect prior experience with at that scale. And yet, when the time came, the team handled it with surprising grace. We made last-minute changes to make sure the judging process was fair, thoughtful, and well run. That moment stayed with me because it showed me something essential: even if we were new to this scale, we were capable of rising to it.&lt;/p&gt;

&lt;p&gt;That was the year GenAI Genesis stopped feeling like a hopeful experiment.&lt;/p&gt;

&lt;p&gt;It felt real.&lt;/p&gt;


  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq2v3jbys38ol98qr0gr.png" alt="Participants and organizers during GenAI Genesis 2024"&gt;GenAI Genesis 2024
  



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;genesis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

  &lt;span class="na"&gt;v0&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Olympics"&lt;/span&gt;
  &lt;span class="na"&gt;participants&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;39&lt;/span&gt;
  &lt;span class="na"&gt;then&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;an experiment&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a room full of beginners&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a lot of coffee&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a lot of belief&lt;/span&gt;
  &lt;span class="na"&gt;now&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a cross-campus movement&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a large-scale AI hackathon&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a serious community&lt;/span&gt;
  &lt;span class="na"&gt;constant&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;vision&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;people&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;momentum&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2025: scale changes everything
&lt;/h2&gt;

&lt;p&gt;Then came &lt;strong&gt;2025&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And 2025 felt different.&lt;/p&gt;

&lt;p&gt;This was the year when GenAI Genesis started feeling less like an event and more like an ecosystem.&lt;/p&gt;

&lt;p&gt;By then, we were no longer operating entirely from instinct. We had learned processes. We had built systems. We had a better understanding of what worked, what broke, what participants valued, and what scale actually requires. We planned earlier. We moved more formally. We operated with more clarity.&lt;/p&gt;

&lt;p&gt;The leadership structure also evolved.&lt;/p&gt;

&lt;p&gt;In the earlier chapter, the co-chair structure included &lt;strong&gt;me, Adib, Nimit, and Richard&lt;/strong&gt;. By 2025, the co-chairs were &lt;strong&gt;me, Adib, and Matthew Tamura&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Matthew had already been involved as a strong contributor in 2024 through UTMIST and was someone I deeply appreciated — thoughtful, visionary, and strong at leading a team properly. In 2025, he stepped into a bigger leadership role with us, and that made a real difference.&lt;/p&gt;

&lt;p&gt;We also worked hard to improve the participant experience in ways that went beyond the surface.&lt;/p&gt;

&lt;p&gt;We brought in more sponsors.&lt;br&gt;
We created more networking opportunities.&lt;br&gt;
We designed stronger supporting events during the hackathon.&lt;br&gt;
We sharpened logistics.&lt;br&gt;
We elevated the experience.&lt;/p&gt;

&lt;p&gt;And the scale reflected that.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;2025&lt;/strong&gt;, we had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;around &lt;strong&gt;$15,000 in awards&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;621 participants submitted a project&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;backing from &lt;strong&gt;Google, BWC, Cohere, AMD, CGI, RBC, Northeastern University, Edge.io Solutions, the Academic Advising &amp;amp; Career Centre, and the University of Toronto&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;support from partners including the &lt;strong&gt;United Nations Association in Canada, One Degree Cooler, Vector Institute, and Hack Canada&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the most exciting moments that year was when &lt;strong&gt;AMD&lt;/strong&gt; came in and supported us in a way that allowed participants to run more complex machine learning workloads on an AMD GPU-backed local cluster. That felt genuinely wild. It was one of those moments where you step back and realize the hackathon is not just getting bigger in numbers — it is getting more technically meaningful too.&lt;/p&gt;

&lt;p&gt;From the outside, growth can look glamorous.&lt;/p&gt;

&lt;p&gt;From the inside, it often looks like spreadsheets, calls, follow-ups, contingency planning, team alignment, venue negotiations, technical troubleshooting, partnership mapping, and a hundred open loops in your head at once.&lt;/p&gt;

&lt;p&gt;People usually see the lights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Founders remember the wiring.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And by 2025, I was exhausted. Truly.&lt;/p&gt;

&lt;p&gt;But it was also the kind of exhaustion that comes from building something you care about so deeply that you keep choosing it, again and again, even when it would be easier not to.&lt;/p&gt;

&lt;p&gt;There were many moments in those years where I could have spent my time doing something else for my résumé — some other project, some other opportunity, some other clean, convenient line on paper.&lt;/p&gt;

&lt;p&gt;And again and again, I chose GenAI Genesis.&lt;/p&gt;

&lt;p&gt;Because by then it was not just a project.&lt;/p&gt;

&lt;p&gt;It was a commitment.&lt;/p&gt;


 &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmayw9gih1uzvqhly8jd0.jpg" alt="Some of the winners at GenAI Genesis 2025"&gt;Some of the winners at GenAI Genesis 2025
  



&lt;h2&gt;
  
  
  What people see vs. what it takes
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;People usually experience a hackathon at the moment it becomes exciting.&lt;/p&gt;

&lt;p&gt;Founders experience it in the months before that, when it is still fragile.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What goes into building a hackathon like this?&lt;/p&gt;

&lt;p&gt;Not just posters and prize money.&lt;/p&gt;

&lt;p&gt;It looks more like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sponsor outreach and partnership management&lt;/li&gt;
&lt;li&gt;website design and technical infrastructure&lt;/li&gt;
&lt;li&gt;judge and mentor coordination&lt;/li&gt;
&lt;li&gt;cross-campus relationship building&lt;/li&gt;
&lt;li&gt;team alignment across different working styles&lt;/li&gt;
&lt;li&gt;planning future editions before the current one is even over&lt;/li&gt;
&lt;li&gt;making sure the vision survives internal complexity&lt;/li&gt;
&lt;li&gt;solving ten operational problems before breakfast&lt;/li&gt;
&lt;li&gt;keeping something founder-led while still making it collaborative&lt;/li&gt;
&lt;li&gt;doing a lot of invisible thinking about what the next step even is&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A polished event always has a chaotic prequel.&lt;/p&gt;

&lt;p&gt;And a surprising amount of inner work goes into making sure the chaos does not win.&lt;/p&gt;


&lt;h2&gt;
  
  
  What GenAI Genesis has meant to me
&lt;/h2&gt;

&lt;p&gt;At one level, GenAI Genesis is about &lt;strong&gt;AI and machine learning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But if I am being honest, it has never only been about AI.&lt;/p&gt;

&lt;p&gt;It is about &lt;strong&gt;belonging&lt;/strong&gt;.&lt;br&gt;
It is about &lt;strong&gt;ambition&lt;/strong&gt;.&lt;br&gt;
It is about &lt;strong&gt;opportunity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is about building the kind of space I wish existed more abundantly when I first arrived.&lt;/p&gt;

&lt;p&gt;A place where students do not just come to listen, collect swag, and leave. A place where they come to make things. To meet each other. To stretch. To take themselves seriously. To find their people. To realize that they are more capable than they thought.&lt;/p&gt;

&lt;p&gt;That is what I wanted to create.&lt;/p&gt;

&lt;p&gt;And I think that is why this has become more than just a hackathon to me.&lt;/p&gt;

&lt;p&gt;It has become a community, a signal, a platform, and in some ways, living proof that if you build the right room, the right people will find each other inside it.&lt;/p&gt;

&lt;p&gt;I have also learned a lot about myself through this.&lt;/p&gt;

&lt;p&gt;I learned how passionate I am about the things I truly care about. I learned how much I care about my team. I learned that leadership is not something you perform; it is something you practice. I learned how much invisible thinking goes into visible outcomes. I learned that building something meaningful costs time, energy, sleep, and sometimes other opportunities.&lt;/p&gt;

&lt;p&gt;But I also learned that some things are worth choosing repeatedly.&lt;/p&gt;

&lt;p&gt;And this was one of them.&lt;/p&gt;


&lt;h2&gt;
  
  
  The foundation behind the future
&lt;/h2&gt;

&lt;p&gt;As GenAI Genesis grew, it became increasingly important to make sure it could sustain itself beyond just the intensity of one year, one organizing cycle, or one group of students.&lt;/p&gt;

&lt;p&gt;That is a big part of why I helped establish the &lt;strong&gt;GenAI Genesis Foundation&lt;/strong&gt; as an NGO, along with four other directors.&lt;/p&gt;

&lt;p&gt;That step mattered deeply to me.&lt;/p&gt;

&lt;p&gt;Because if GenAI Genesis was going to keep growing properly, it needed more than momentum.&lt;/p&gt;

&lt;p&gt;It needed structure.&lt;br&gt;
It needed continuity.&lt;br&gt;
It needed a long-term home.&lt;/p&gt;

&lt;p&gt;Founding the Foundation was part of making sure that what we built would not just peak.&lt;/p&gt;

&lt;p&gt;It would endure.&lt;/p&gt;

&lt;p&gt;And I am very proud of that.&lt;/p&gt;


&lt;h2&gt;
  
  
  People I want to thank
&lt;/h2&gt;

&lt;p&gt;No founder story is ever truly solo.&lt;/p&gt;

&lt;p&gt;And GenAI Genesis certainly was not.&lt;/p&gt;

&lt;p&gt;I started this with &lt;strong&gt;Adib Fallahpour&lt;/strong&gt;, my co-founder, and I want to begin there. Thank you, Adib, for building this vision with me from the early days, for dreaming big, for caring deeply, and for helping turn a small experiment into something much larger than either of us could have reached alone.&lt;/p&gt;

&lt;p&gt;I want to thank &lt;strong&gt;Richard&lt;/strong&gt;, who helped us significantly in 2024 through &lt;strong&gt;UTMIST&lt;/strong&gt;. Richard brought strong operational guidance at a time when we were still learning how to scale properly, and his support played an important role in helping us bring the hackathon to life at a bigger level.&lt;/p&gt;

&lt;p&gt;I also want to thank &lt;strong&gt;Nimit&lt;/strong&gt;, who helped us through &lt;strong&gt;UofT AI&lt;/strong&gt; and contributed meaningfully to the growth of the initiative in its earlier large-scale chapter. Cross-campus support mattered a lot, and Nimit was part of that story.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;2025&lt;/strong&gt;, I want to thank &lt;strong&gt;Matthew Tamura&lt;/strong&gt;, who joined me and Adib as a co-chair in 2025. Matthew brought a lot of clarity, vision, and leadership to that year, and I deeply appreciated building that edition alongside him.&lt;/p&gt;

&lt;p&gt;And for &lt;strong&gt;2026&lt;/strong&gt;, I want to thank &lt;strong&gt;Hasleen Kaur&lt;/strong&gt; and &lt;strong&gt;Ivan Semenov&lt;/strong&gt;, who are co-chairing this year alongside me. Both are wonderful people and wonderful friends, and I am genuinely grateful to be building this chapter with them.&lt;/p&gt;

&lt;p&gt;There are many people behind the scenes who have contributed to GenAI Genesis over the years — teammates, sponsors, organizers, mentors, judges, friends, and supporters — and I carry a lot of gratitude for all of them.&lt;/p&gt;

&lt;p&gt;Communities may remember the banner.&lt;/p&gt;

&lt;p&gt;But founders remember the people who helped hold it up.&lt;/p&gt;


&lt;h2&gt;
  
  
  And now, 2026
&lt;/h2&gt;

&lt;p&gt;And now we arrive here.&lt;/p&gt;

&lt;p&gt;What started as a 39-person experiment has grown into something far bigger, and in &lt;strong&gt;2026&lt;/strong&gt;, we are taking GenAI Genesis to its biggest scale yet.&lt;/p&gt;

&lt;p&gt;This year, we are going much, much bigger.&lt;/p&gt;

&lt;p&gt;We are preparing to bring together &lt;strong&gt;close to 1,000 people in person&lt;/strong&gt;. We are building across &lt;strong&gt;three major spaces at the University of Toronto&lt;/strong&gt; — &lt;strong&gt;Convocation Hall, Bahen, and Myhal&lt;/strong&gt; — to create an experience that is bigger not just in attendance, but in ambition, energy, and depth.&lt;/p&gt;

&lt;p&gt;This year feels different.&lt;/p&gt;

&lt;p&gt;Not because the mission has changed, but because the scale has finally caught up to the size of the vision.&lt;/p&gt;

&lt;p&gt;We are crossing into four digits.&lt;br&gt;
We are building across multiple buildings.&lt;br&gt;
We are thinking bigger than ever before.&lt;/p&gt;

&lt;p&gt;And for me personally, this year is meaningful in another way too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2026 will be my last year serving as Co-Chair of the hackathon.&lt;/strong&gt; After this, I will be moving into more of an advisory role.&lt;/p&gt;

&lt;p&gt;There is something beautiful about that.&lt;/p&gt;

&lt;p&gt;Because one of the deepest measures of building something well is whether it can continue growing beyond the chapter where you are the one carrying it most directly.&lt;/p&gt;

&lt;p&gt;That is what I want for GenAI Genesis.&lt;/p&gt;

&lt;p&gt;I want it to outgrow any one person, any one year, any one team.&lt;/p&gt;

&lt;p&gt;I want it to keep becoming a place where ambitious students find each other, where builders take themselves seriously, where new ideas are given room to breathe, and where community feels like a force multiplier rather than just a word on a poster.&lt;/p&gt;

&lt;p&gt;So if you have been watching from the sidelines, this is your sign.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join us on March 13, 14, and 15, 2026.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Come build with us.&lt;br&gt;
Come meet the people shaping what comes next.&lt;br&gt;
Come be part of something that started small, but refused to stay small.&lt;/p&gt;

&lt;p&gt;And when you do, I hope you feel what I felt at the beginning of this whole journey:&lt;/p&gt;

&lt;p&gt;That strange, beautiful energy that appears when ambitious people gather around an idea and decide to make it real.&lt;/p&gt;

&lt;p&gt;That, in the end, is what GenAI Genesis has always been about.&lt;/p&gt;


&lt;h2&gt;
  
  
  Connect with me
&lt;/h2&gt;

&lt;p&gt;If this story resonated with you, feel free to connect with me online, follow GenAI Genesis, or reach out.&lt;/p&gt;

&lt;p&gt;I always love meeting people who care deeply about building communities, technology, and meaningful things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://genaigenesis.ca/" rel="noopener noreferrer"&gt;Official GenAI Genesis Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://genai-genesis-2025.devpost.com/" rel="noopener noreferrer"&gt;GenAI Genesis 2025 Devpost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://alankrit.me/" rel="noopener noreferrer"&gt;My Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/genaigenesis/" rel="noopener noreferrer"&gt;Follow GenAI Genesis on Instagram&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/alankritverma/" rel="noopener noreferrer"&gt;Connect with me on LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://genaigenesis.ca/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;See the 2026 event, follow the journey, or reach out if you want to build something meaningful together.&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>ai</category>
      <category>hackathon</category>
      <category>community</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
