<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alankrit Verma</title>
    <description>The latest articles on Forem by Alankrit Verma (@alankritverma).</description>
    <link>https://forem.com/alankritverma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3810413%2F534cb6dc-3366-4fa4-b44a-49ba12793a1b.jpg</url>
      <title>Forem: Alankrit Verma</title>
      <link>https://forem.com/alankritverma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alankritverma"/>
    <language>en</language>
    <item>
      <title>The Last Pivot: Why Quality Gates Killed My Final KV-Cache Speedup</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Mon, 27 Apr 2026 04:40:54 +0000</pubDate>
      <link>https://forem.com/alankritverma/the-last-pivot-why-quality-gates-killed-my-final-kv-cache-speedup-3m0f</link>
      <guid>https://forem.com/alankritverma/the-last-pivot-why-quality-gates-killed-my-final-kv-cache-speedup-3m0f</guid>
      <description>&lt;p&gt;I wanted to answer one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;After packed-codebook TurboQuant failed, was there still a credible latency path?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The short answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;there was a real speed ceiling, but no stable quality-preserving implementation path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Hardware-friendly int4 K/V passed byte gates but failed real-KV logit quality.&lt;/li&gt;
&lt;li&gt;Qwen2.5-7B work reduction had a real speed ceiling: &lt;code&gt;p_attn=0.334&lt;/code&gt;, with &lt;code&gt;1.20x&lt;/code&gt; to &lt;code&gt;1.21x&lt;/code&gt; projected speedup at 5% selector overhead.&lt;/li&gt;
&lt;li&gt;Oracle quality failed anyway: no implementable selector passed all 4 decode steps.&lt;/li&gt;
&lt;li&gt;The lesson was strict: a speed ceiling is only permission to run a quality gate, not permission to implement.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evidence
&lt;/h2&gt;

&lt;p&gt;I put the detailed benchmark notes in the public evidence repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Results ledger: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hardware-friendly int4 K/V quality probe: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/hfkv-quality-k0-prep-summary.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/hfkv-quality-k0-prep-summary.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Work-reduction speed and oracle-quality probe: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/work-reduction-oracle-summary.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/work-reduction-oracle-summary.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The rule after the pivot
&lt;/h2&gt;

&lt;p&gt;At this point, another TurboQuant variant would have been circular.&lt;/p&gt;

&lt;p&gt;So the rule changed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;no implementation before a speed ceiling and an oracle-quality gate pass on the same target.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That rule matters because each partial result can otherwise be overinterpreted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;byte compression can pass while quality fails&lt;/li&gt;
&lt;li&gt;synthetic quality can pass while real-KV quality fails&lt;/li&gt;
&lt;li&gt;attention-only speed can pass while full decode speed cannot move enough&lt;/li&gt;
&lt;li&gt;a speed ceiling can pass while no stable selector exists&lt;/li&gt;
&lt;li&gt;a row-level oracle pass can hide step-to-step instability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, this became the gate-discipline part of the series: deciding when not to build.&lt;/p&gt;

&lt;p&gt;The setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;storage compression had already separated itself from latency&lt;/li&gt;
&lt;li&gt;eager value-path approximations had failed&lt;/li&gt;
&lt;li&gt;fused packed-codebook logits had not beaten dense logits by enough&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By this point, the original packed-codebook TurboQuant latency path was closed.&lt;/p&gt;

&lt;p&gt;The evidence was not subtle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eager value-path variants failed&lt;/li&gt;
&lt;li&gt;exact cleanup did not move latency enough&lt;/li&gt;
&lt;li&gt;primitive feasibility failed badly&lt;/li&gt;
&lt;li&gt;fused packed-codebook logits beat eager but missed dense-speed bars&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the next move could not be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;one more TurboQuant variant&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It had to change the hypothesis.&lt;/p&gt;

&lt;p&gt;I tested two pivots:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;hardware-friendly int4 K/V&lt;/li&gt;
&lt;li&gt;long-context work reduction&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both taught useful things.&lt;/p&gt;

&lt;p&gt;Neither justified a runtime latency implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhot5ccus61zt6znf7l7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhot5ccus61zt6znf7l7.png" alt="Final pivot gates: HFKV failed quality, work reduction passed speed ceiling, oracle quality failed" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pivot 1: hardware-friendly int4 K/V
&lt;/h2&gt;

&lt;p&gt;The packed-codebook representation was expensive to consume.&lt;/p&gt;

&lt;p&gt;So the next idea was simpler:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if the representation is less clever but more hardware-friendly?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of codebooks, rotations, and residual machinery, use blockwise int4 K/V.&lt;/p&gt;

&lt;p&gt;Two formats were tested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;symmetric int4&lt;/li&gt;
&lt;li&gt;affine int4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both quantized over last-dimension blocks with &lt;code&gt;block_size=32&lt;/code&gt;.&lt;/p&gt;
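&lt;p&gt;For illustration only (this is not the probe's code), both flavors can be sketched in a few lines of NumPy: one scale per 32-element block, and for affine an additional per-block zero point:&lt;/p&gt;

```python
import numpy as np

BLOCK = 32  # block_size over the last dimension

def quant_sym_int4(x):
    # Symmetric int4: one scale per block, codes in [-7, 7].
    xb = x.reshape(-1, BLOCK)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)
    codes = np.clip(np.rint(xb / scale), -7, 7).astype(np.int8)
    return codes, scale

def quant_aff_int4(x):
    # Affine int4: per-block scale plus zero point, codes in [0, 15].
    xb = x.reshape(-1, BLOCK)
    lo = xb.min(axis=1, keepdims=True)
    hi = xb.max(axis=1, keepdims=True)
    scale = np.where(hi == lo, 1.0, (hi - lo) / 15.0)
    codes = np.clip(np.rint((xb - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequant(codes, scale, shift=None):
    # Symmetric dequant is one multiply; affine adds one shift.
    out = codes.astype(np.float32) * scale
    if shift is not None:
        out = out + shift
    return out
```

&lt;p&gt;The dequant path is the point: a multiply, optionally a shift, nothing exotic. That simplicity was the appeal of the pivot.&lt;/p&gt;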

&lt;p&gt;The hope was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simpler unpack/dequant path&lt;/li&gt;
&lt;li&gt;predictable memory layout&lt;/li&gt;
&lt;li&gt;fewer exotic operations&lt;/li&gt;
&lt;li&gt;easier future kernel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was not integrated into the public cache API.&lt;/p&gt;

&lt;p&gt;It was a quality and K0-prep probe only.&lt;/p&gt;

&lt;p&gt;The rule was strict:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if real model K/V quality fails, do not write the kernel.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  HFKV passed bytes and failed quality
&lt;/h2&gt;

&lt;p&gt;On the real-KV check with &lt;code&gt;HuggingFaceTB/SmolLM2-135M-Instruct&lt;/code&gt;, both formats compressed KV substantially.&lt;/p&gt;

&lt;p&gt;They also preserved next-token argmax.&lt;/p&gt;

&lt;p&gt;But both failed the hard decode-logit MSE gate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Top-k@10&lt;/th&gt;
&lt;th&gt;Argmax&lt;/th&gt;
&lt;th&gt;Decode-Logit MSE&lt;/th&gt;
&lt;th&gt;Required MSE&lt;/th&gt;
&lt;th&gt;KV Ratio&lt;/th&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;symmetric int4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.800&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.739284&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;=0.25&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3.56x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;affine int4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.800&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.360282&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;=0.25&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3.20x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tempting interpretation would be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;argmax survived, so maybe this is fine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is not a good enough bar.&lt;/p&gt;

&lt;p&gt;Top-k overlap only just hit the minimum bar of &lt;code&gt;0.800&lt;/code&gt;, and decode-logit MSE landed at roughly 5x to 7x the allowed ceiling.&lt;/p&gt;

&lt;p&gt;The correct decision was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;do not build HFKV-K0.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The reusable lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;byte compression is not quality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The synthetic K0-prep numbers looked fine, but synthetic random tensors were not predictive enough. Real model K/V was the gate, and it failed.&lt;/p&gt;
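&lt;p&gt;The gate itself is mechanical. A minimal sketch of the kind of check involved (hypothetical helper; thresholds taken from the table above):&lt;/p&gt;

```python
import numpy as np

def quality_gate(logits_ref, logits_q, k=10, mse_max=0.25, topk_min=0.8):
    # Hard decode-logit gate: MSE against the dense reference,
    # plus top-k overlap and argmax agreement.
    mse = float(np.mean((logits_ref - logits_q) ** 2))
    top_ref = set(np.argsort(logits_ref)[-k:].tolist())
    top_q = set(np.argsort(logits_q)[-k:].tolist())
    overlap = len(top_ref.intersection(top_q)) / k
    argmax_ok = int(np.argmax(logits_ref)) == int(np.argmax(logits_q))
    passed = np.less_equal(mse, mse_max) and np.greater_equal(overlap, topk_min) and argmax_ok
    return bool(passed), mse, overlap
```

&lt;p&gt;With the measured numbers, overlap and argmax pass while MSE sits at 1.36 to 1.74 against a ceiling of 0.25, so the gate fails. All three conditions have to hold at once.&lt;/p&gt;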

&lt;h2&gt;
  
  
  Pivot 2: work reduction
&lt;/h2&gt;

&lt;p&gt;After that, I stopped asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;can I compress the historical values?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;does the model actually need all historical tokens for stable decode?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a different latency hypothesis.&lt;/p&gt;

&lt;p&gt;It is not cache compression.&lt;/p&gt;

&lt;p&gt;It is dense-attention work reduction.&lt;/p&gt;

&lt;p&gt;The idea is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;full attention over history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;attention over a selected subset of history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If only a fraction &lt;code&gt;f&lt;/code&gt; of historical tokens are active, the idealized attention work shrinks.&lt;/p&gt;

&lt;p&gt;But this only matters if attention is a large enough part of full decode.&lt;/p&gt;

&lt;p&gt;So the first gate was a speed ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed ceiling math
&lt;/h2&gt;

&lt;p&gt;Let:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;p_attn&lt;/code&gt; be the fraction of decode time spent in attention&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;f&lt;/code&gt; be the active historical fraction&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;h&lt;/code&gt; be selector/masking overhead as a fraction of the original decode step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rough projected speedup is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;speedup = 1 / ((1 - p_attn) + p_attn * f + h)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This equation is intentionally simple.&lt;/p&gt;

&lt;p&gt;It asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;even if the selector existed, is there enough attention work to remove?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On small models, the answer was mostly no.&lt;/p&gt;

&lt;p&gt;For SmolLM2-135M, attention was not a large enough share of full decode. The quality signal was real, but the latency ceiling was too low.&lt;/p&gt;

&lt;p&gt;So I moved to a larger real target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Qwen/Qwen2.5-7B-Instruct
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;at roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;8192 prompt tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Qwen had a real speed ceiling
&lt;/h2&gt;

&lt;p&gt;The Qwen2.5-7B speed-ceiling result was the strongest latency signal in the whole project.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;bfloat16&lt;/code&gt;, the dense decode step was about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;20.447 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDPA real-model projection estimated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;p_attn = 0.334
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Projected speedups:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Active Fraction&lt;/th&gt;
&lt;th&gt;Projected Speedup, 0% Overhead&lt;/th&gt;
&lt;th&gt;Projected Speedup, 5% Overhead&lt;/th&gt;
&lt;th&gt;Projected Speedup, 10% Overhead&lt;/th&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0.337&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.28x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.21x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.14x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;pass at 5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0.350&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.28x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.20x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.13x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;pass at 5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0.376&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.26x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.19x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.12x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;near miss&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
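&lt;p&gt;The table is just the ceiling formula evaluated at the measured &lt;code&gt;p_attn = 0.334&lt;/code&gt;, which makes it easy to re-derive:&lt;/p&gt;

```python
def projected_speedup(p_attn, f, h):
    # speedup = 1 / ((1 - p_attn) + p_attn * f + h)
    return 1.0 / ((1.0 - p_attn) + p_attn * f + h)

P_ATTN = 0.334  # measured SDPA attention share for Qwen2.5-7B at 8192 context

for f in (0.337, 0.350, 0.376):
    row = [round(projected_speedup(P_ATTN, f, h), 2) for h in (0.0, 0.05, 0.10)]
    print(f, row)
```

&lt;p&gt;Rounding to two decimals reproduces the table, including the 1.28x / 1.21x / 1.14x row for &lt;code&gt;f=0.337&lt;/code&gt;.&lt;/p&gt;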

&lt;p&gt;This was not a fake result.&lt;/p&gt;

&lt;p&gt;There was real room.&lt;/p&gt;

&lt;p&gt;But speed ceiling is only half the story.&lt;/p&gt;

&lt;p&gt;The next question was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;can an implementable selector preserve quality while keeping only about 34-38% of history?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Oracle quality failed
&lt;/h2&gt;

&lt;p&gt;The official quality gate used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model: &lt;code&gt;Qwen/Qwen2.5-7B-Instruct&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;context: &lt;code&gt;8192&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;decode steps: &lt;code&gt;4&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;dtype: &lt;code&gt;bfloat16&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;active fraction gate: &lt;code&gt;0.376&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dense reference logits were finite (no NaN or Inf) on all steps.&lt;/p&gt;

&lt;p&gt;So this was a valid quality run.&lt;/p&gt;

&lt;p&gt;The headline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;15 selector-step rows passed,
but 0 selector configurations passed all 4 decode steps.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;A row-level pass says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this selector worked on this step.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An implementation pass needs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this selector family worked consistently across decode steps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No implementable selector did.&lt;/p&gt;
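&lt;p&gt;In code, the gap between those two claims is a single &lt;code&gt;all(...)&lt;/code&gt;. A sketch using per-step outcomes transcribed from the table below (&lt;code&gt;True&lt;/code&gt; means the step passed its gate):&lt;/p&gt;

```python
# Per-step gate outcomes for each selector config, steps 0..3.
step_results = {
    "global_block_mass:f=0.3760:b=16": [True, True, False, True],   # failed step 2
    "global_block_mass:f=0.3500:b=16": [True, False, False, True],  # failed steps 1, 2
    "global_block_mass:f=0.3370:b=16": [True, False, True, True],   # failed step 1
    "recent_sink:sink=4:recent=3072": [False, False, False, False],
}

# Row-level passes accumulate per step; a config passes only if every step passes.
row_passes = sum(sum(steps) for steps in step_results.values())
config_passes = [name for name, steps in step_results.items() if all(steps)]
```

&lt;p&gt;Individual step rows keep passing while &lt;code&gt;config_passes&lt;/code&gt; stays empty. That is the row-level trap.&lt;/p&gt;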

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Selector&lt;/th&gt;
&lt;th&gt;Active Hist Fraction&lt;/th&gt;
&lt;th&gt;Passed Steps&lt;/th&gt;
&lt;th&gt;Failed Steps&lt;/th&gt;
&lt;th&gt;Max MSE&lt;/th&gt;
&lt;th&gt;Main Failure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;global_block_mass:f=0.3760:b=16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.374&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0,1,3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.585789&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;step &lt;code&gt;2&lt;/code&gt; MSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;global_block_mass:f=0.3500:b=16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.348&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0,3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1,2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.900147&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;steps &lt;code&gt;1,2&lt;/code&gt; MSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;global_block_mass:f=0.3370:b=16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.335&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0,2,3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.919634&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;step &lt;code&gt;1&lt;/code&gt; MSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;recent_sink:sink=4:recent=3072&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.375&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0,1,2,3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.163465&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;layer-local relative L2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
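&lt;p&gt;The &lt;code&gt;recent_sink&lt;/code&gt; family at the bottom is the simplest to sketch: keep the first &lt;code&gt;sink&lt;/code&gt; tokens plus the trailing &lt;code&gt;recent&lt;/code&gt; tokens (a minimal illustration, not the probe's code):&lt;/p&gt;

```python
def recent_sink_keep(history_len, sink=4, recent=3072):
    # Keep-mask over history: attention sinks at the front
    # plus a sliding window of the most recent tokens.
    keep = [False] * history_len
    for i in range(min(sink, history_len)):
        keep[i] = True
    start = max(0, history_len - recent)
    for i in range(start, history_len):
        keep[i] = True
    return keep
```

&lt;p&gt;At 8192 tokens of history this keeps 3076 tokens, an active fraction of about 0.375, matching the table row.&lt;/p&gt;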

&lt;p&gt;The most tempting result was &lt;code&gt;recent_sink:sink=4:recent=3072&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;max decode-logit MSE &lt;code&gt;0.163465&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;top-k overlap at least &lt;code&gt;0.8&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;stable argmax&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it failed every step on layer-local median post-&lt;code&gt;o_proj&lt;/code&gt; relative L2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.1394 &amp;gt; 0.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, the dangerous move would be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;relax the quality gate because the result is close.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is how projects go in circles.&lt;/p&gt;

&lt;p&gt;The gate existed before the result.&lt;/p&gt;

&lt;p&gt;The result failed the gate.&lt;/p&gt;

&lt;p&gt;So the decision was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;do not build the runtime selector.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What survived
&lt;/h2&gt;

&lt;p&gt;The final result is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;everything was useless.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The surviving lessons are more precise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dense GPU decode is a serious baseline
&lt;/h3&gt;

&lt;p&gt;Dense attention is not naive.&lt;/p&gt;

&lt;p&gt;It has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simple data layout&lt;/li&gt;
&lt;li&gt;optimized kernels&lt;/li&gt;
&lt;li&gt;clean tensor operations&lt;/li&gt;
&lt;li&gt;no unpack/reconstruct overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any compressed path has to beat that, not just beat its own prototype.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory and latency are separate scorecards
&lt;/h3&gt;

&lt;p&gt;Cache compression can be valuable even if it does not reduce latency.&lt;/p&gt;

&lt;p&gt;The right memory/capacity metrics are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cache bytes per token&lt;/li&gt;
&lt;li&gt;maximum context before OOM&lt;/li&gt;
&lt;li&gt;batch size at fixed VRAM&lt;/li&gt;
&lt;li&gt;throughput under memory pressure&lt;/li&gt;
&lt;li&gt;quality at long context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a different product goal from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;faster decode when dense already fits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Speed ceiling is necessary but not sufficient
&lt;/h3&gt;

&lt;p&gt;Qwen2.5-7B proved there can be enough attention share for work reduction to matter.&lt;/p&gt;

&lt;p&gt;But the selector also has to preserve quality.&lt;/p&gt;

&lt;p&gt;The oracle could not find one stable implementable selector under the hard gate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Paper claims and integration claims are different
&lt;/h3&gt;

&lt;p&gt;A paper can make a valid primitive or memory claim.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;transformers&lt;/code&gt; integration needs a different proof:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;full decode timing&lt;/li&gt;
&lt;li&gt;real model quality&lt;/li&gt;
&lt;li&gt;update cost&lt;/li&gt;
&lt;li&gt;value path&lt;/li&gt;
&lt;li&gt;generation overhead&lt;/li&gt;
&lt;li&gt;target hardware&lt;/li&gt;
&lt;li&gt;target dtype baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not interchangeable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final state
&lt;/h2&gt;

&lt;p&gt;The current honest state is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No active GPU decode-latency implementation path.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Closed as latency paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eager TurboQuant-family variants&lt;/li&gt;
&lt;li&gt;packed-codebook fused K1/residual/value integration&lt;/li&gt;
&lt;li&gt;hardware-friendly int4 K/V kernel work&lt;/li&gt;
&lt;li&gt;Qwen work-reduction selector implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still potentially useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;storage/capacity compression&lt;/li&gt;
&lt;li&gt;exact cleanup as baseline hygiene&lt;/li&gt;
&lt;li&gt;the measurement discipline&lt;/li&gt;
&lt;li&gt;the failure map&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next reasonable project, if the goal continues, is not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;one more latency variant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a separate memory/capacity plan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;with its own scorecard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The whole project started with a simple hope:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;smaller KV cache, faster transformers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The actual lesson was harder:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;compression only speeds up decode if the compressed representation is cheap to consume on the target execution path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That condition failed repeatedly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;in eager value approximations&lt;/li&gt;
&lt;li&gt;in packed-codebook primitive timing&lt;/li&gt;
&lt;li&gt;in fused logits upper bounds&lt;/li&gt;
&lt;li&gt;in simple int4 K/V quality&lt;/li&gt;
&lt;li&gt;in long-context work reduction quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not wasted work.&lt;/p&gt;

&lt;p&gt;It is a map of where the obvious traps are.&lt;/p&gt;

&lt;p&gt;And for performance engineering, a good negative map is often the thing that prevents the next six months of bad work.&lt;/p&gt;

&lt;p&gt;The final lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A speed ceiling is only permission to run a quality gate. It is not permission to implement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Qwen2.5-7B had enough attention share for work reduction to matter. The oracle still failed to find one stable implementable selector. That is why the latency path stopped.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>research</category>
      <category>benchmarking</category>
    </item>
    <item>
      <title>Beating Eager TurboQuant Was Not Enough: Why Dense GPU Attention Still Won</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Mon, 27 Apr 2026 04:37:16 +0000</pubDate>
      <link>https://forem.com/alankritverma/beating-eager-turboquant-was-not-enough-why-dense-gpu-attention-still-won-adn</link>
      <guid>https://forem.com/alankritverma/beating-eager-turboquant-was-not-enough-why-dense-gpu-attention-still-won-adn</guid>
      <description>&lt;p&gt;I wanted to answer one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If I remove eager overhead, can a TurboQuant-style compressed primitive beat dense GPU logits?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The short answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;it beat eager TurboQuant, but it did not beat dense FP16 logits by enough.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Exact weighted value decode was mathematically clean, but only improved &lt;code&gt;value_decode_sec&lt;/code&gt; by about &lt;code&gt;2.9%&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A fused packed-codebook logits kernel removed most eager overhead and beat eager TurboQuant main logits by &lt;code&gt;7x&lt;/code&gt; to &lt;code&gt;18x&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It still missed the dense FP16 logits gate: the best K0.2 result was &lt;code&gt;1.56x&lt;/code&gt; at 8192 and &lt;code&gt;1.99x&lt;/code&gt; at 16384, where the gate required &lt;code&gt;&amp;gt;=2x&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The lesson was strict: beating your own eager prototype is not enough. The compressed path must beat the real dense baseline with room left for softmax, values, residuals, and updates.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evidence
&lt;/h2&gt;

&lt;p&gt;I put the detailed benchmark notes in the public evidence repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Results ledger: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Exact reference value-decode rewrite: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/exact-reference-value-decode-summary.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/exact-reference-value-decode-summary.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Primitive feasibility probe: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/primitive-feasibility-summary.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/primitive-feasibility-summary.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fused packed-codebook proof: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/fused-kernel-proof-summary.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/fused-kernel-proof-summary.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The question after eager failed
&lt;/h2&gt;

&lt;p&gt;After the eager value-path failure, there were two possible interpretations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The compressed-attention idea was weak.&lt;/li&gt;
&lt;li&gt;The eager implementation level was weak.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those are different claims, so they needed different tests.&lt;/p&gt;

&lt;p&gt;The first test was exact cleanup: remove algebraic waste without changing the algorithm. If that produced a large win, I could keep improving the stable path.&lt;/p&gt;

&lt;p&gt;It did not.&lt;/p&gt;

&lt;p&gt;The second test was a primitive upper bound: remove most eager overhead and ask whether the compressed representation could beat dense GPU logits before adding softmax, values, residuals, and model integration.&lt;/p&gt;

&lt;p&gt;That is why the fused proof was intentionally narrow. A narrow upper bound is useful because it can kill a bad integration path before the integration work starts.&lt;/p&gt;

&lt;p&gt;So the next question was narrower:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If I remove the obvious eager overhead, can the compressed representation beat dense attention primitives by enough to justify real integration?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer was still no.&lt;/p&gt;

&lt;p&gt;But the reason is more interesting than "the kernel was slow."&lt;/p&gt;

&lt;p&gt;The fused path got much faster than eager TurboQuant. It just did not get fast enough versus dense GPU logits.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, an exact cleanup
&lt;/h2&gt;

&lt;p&gt;Before kernel work, there was one exact mathematical cleanup to test.&lt;/p&gt;

&lt;p&gt;The stable compressed-key baseline had a value path that effectively did this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;decoded_values_t = R^-1(z_t)
o = sum_t a_t decoded_values_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;z_t&lt;/code&gt; is the value representation in rotated space.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;R^-1&lt;/code&gt; is the inverse rotation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;a_t&lt;/code&gt; is the attention weight for token &lt;code&gt;t&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;o&lt;/code&gt; is the weighted value output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because &lt;code&gt;R^-1&lt;/code&gt; is linear:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum_t a_t R^-1(z_t) = R^-1(sum_t a_t z_t)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So I can first compute the weighted sum in rotated space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z_weighted = sum_t a_t z_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and then inverse-rotate once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o = R^-1(z_weighted)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a heuristic.&lt;/p&gt;

&lt;p&gt;It is exactly equivalent to the existing codec math, up to normal floating-point accumulation details.&lt;/p&gt;
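&lt;p&gt;The identity is cheap to verify numerically with a random orthogonal rotation (a standalone sanity check, not the project's codec):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 64, 128
R = np.linalg.qr(rng.standard_normal((d, d)))[0]  # random orthogonal rotation
R_inv = R.T
z = rng.standard_normal((T, d))                   # rotated-space values z_t
a = rng.random(T)
a = a / a.sum()                                   # attention weights

per_token = (a[:, None] * (z @ R_inv.T)).sum(axis=0)  # sum_t a_t R^-1(z_t)
fused = (a @ z) @ R_inv.T                             # R^-1(sum_t a_t z_t)
assert np.allclose(per_token, fused)
```

&lt;p&gt;Orthogonality is not even required here; any linear &lt;code&gt;R^-1&lt;/code&gt; commutes with the weighted sum.&lt;/p&gt;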

&lt;p&gt;For a decode step, this changes the rough output-side rotation count from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;H_kv * T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;H_kv * G * Q
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;H_kv&lt;/code&gt; is the number of key/value heads.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T&lt;/code&gt; is the historical context length.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;G&lt;/code&gt; is the number of query groups per KV head.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Q&lt;/code&gt; is the number of query positions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a single-token decode with long history, that is a large algebraic reduction.&lt;/p&gt;
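&lt;p&gt;As a hedged illustration of the scale of that reduction, here is the arithmetic with assumed Llama-70B-style GQA numbers (the shapes are illustrative, not measured config values):&lt;/p&gt;

```python
# Rough output-side rotation counts for one decode step.
H_kv = 8       # key/value heads (assumed GQA shape)
T = 16384      # historical context length
G = 8          # query groups per KV head
Q = 1          # single-token decode

before = H_kv * T      # one inverse rotation per (kv head, history token)
after = H_kv * G * Q   # one inverse rotation per (kv head, group, query)

print(before, after, before // after)  # 131072 64 2048
```

&lt;p&gt;A &lt;code&gt;2048x&lt;/code&gt; reduction in rotation count on paper, which is why the cleanup looked promising before the benchmark said otherwise.&lt;/p&gt;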

&lt;p&gt;It was worth doing.&lt;/p&gt;

&lt;p&gt;It was not enough.&lt;/p&gt;

&lt;p&gt;The cleanup passed correctness and made the value decode path slightly cleaner, but it did not become a real latency win. In the focused benchmark, &lt;code&gt;value_decode_sec&lt;/code&gt; improved by only about &lt;code&gt;2.9%&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;exact algebraic cleanup is good engineering, but it is not automatically a product-level speedup.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After that, the only serious TurboQuant-family latency question left was a primitive question.&lt;/p&gt;

&lt;h2&gt;
  
  
  The primitive question
&lt;/h2&gt;

&lt;p&gt;The dense key-logit computation is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = Q K^T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For decode, this means comparing the current query against all historical keys.&lt;/p&gt;

&lt;p&gt;The compressed-key hope is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L ~= compressed_logits(Q, codes(K), scales(K), residuals(K))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;without materializing full dense historical keys.&lt;/p&gt;

&lt;p&gt;This is the part that sounds like the headline TurboQuant promise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;store fewer key bytes&lt;/li&gt;
&lt;li&gt;compute attention logits from the compressed representation&lt;/li&gt;
&lt;li&gt;avoid dense historical key reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the compressed representation has its own costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unpacking low-bit codes&lt;/li&gt;
&lt;li&gt;codebook lookup&lt;/li&gt;
&lt;li&gt;radius or scale multiplication&lt;/li&gt;
&lt;li&gt;residual correction&lt;/li&gt;
&lt;li&gt;query rotation or transformed query math&lt;/li&gt;
&lt;li&gt;extra metadata reads&lt;/li&gt;
&lt;/ul&gt;
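&lt;p&gt;A toy sketch of those per-key costs, assuming a hypothetical 16-entry scalar codebook and per-token scale (this is an illustration of the consumption steps, not the repo's actual codec):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
codebook = rng.standard_normal(16)   # hypothetical 16-entry scalar codebook

# One key stored as packed int4 codes plus a per-token scale ("radius").
codes = rng.integers(0, 16, size=d).astype(np.uint8)
packed = (codes[0::2] << 4) | codes[1::2]   # two 4-bit codes per byte
scale = 0.37

# Consuming the compressed key takes several steps before any math happens:
hi = packed >> 4
lo = packed & 0x0F
unpacked = np.empty(d, dtype=np.uint8)
unpacked[0::2] = hi
unpacked[1::2] = lo                    # 1) unpack low-bit codes
k_hat = codebook[unpacked] * scale     # 2) codebook lookup, 3) scale multiply

q = rng.standard_normal(d)
logit = q @ k_hat                      # only now the actual dot product
```

&lt;p&gt;Dense FP16 keys skip straight to the dot product. Every step before it is pure interpretation overhead.&lt;/p&gt;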

&lt;p&gt;So the real primitive question was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Is the compressed representation cheaper to consume than dense FP16 keys?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not "is it smaller?"&lt;/p&gt;

&lt;p&gt;Not "is it faster than my Python/eager version?"&lt;/p&gt;

&lt;p&gt;The target was dense GPU logits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eager primitive feasibility failed hard
&lt;/h2&gt;

&lt;p&gt;The first primitive feasibility check used synthetic large GQA-style shapes.&lt;/p&gt;

&lt;p&gt;It measured whether the current compressed primitive could compete with dense attention/logits at the operation level.&lt;/p&gt;

&lt;p&gt;The result was not close.&lt;/p&gt;

&lt;p&gt;The compressed reference primitive was roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;12x to 24x slower than dense
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And encode-one-token cost alone was around:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.65 ms to 0.84 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That killed the idea that more eager PyTorch variants would solve the problem.&lt;/p&gt;

&lt;p&gt;But it still left one fair objection:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Of course eager PyTorch lost. What if a fused kernel removes the unpack and codebook overhead?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That objection was valid.&lt;/p&gt;

&lt;p&gt;So I wrote the fused proof.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fused K0 proof
&lt;/h2&gt;

&lt;p&gt;The fused proof intentionally started small.&lt;/p&gt;

&lt;p&gt;It did not implement full attention.&lt;/p&gt;

&lt;p&gt;It did not implement values.&lt;/p&gt;

&lt;p&gt;It did not implement residual correction.&lt;/p&gt;

&lt;p&gt;It did not integrate into model generation.&lt;/p&gt;

&lt;p&gt;It only implemented packed-codebook main logits.&lt;/p&gt;

&lt;p&gt;That made it an upper-bound test.&lt;/p&gt;

&lt;p&gt;If main logits alone could not beat dense logits by enough, then full attention would not have enough room either.&lt;/p&gt;

&lt;p&gt;The kernel did:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;packed int4 unpack&lt;/li&gt;
&lt;li&gt;codebook lookup&lt;/li&gt;
&lt;li&gt;radius multiplication&lt;/li&gt;
&lt;li&gt;dot products against query groups&lt;/li&gt;
&lt;li&gt;output logits&lt;/li&gt;
&lt;/ul&gt;
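&lt;p&gt;A batched numpy reference for what the kernel fuses into one pass (toy shapes, hypothetical scalar codebook; the real kernel operates on tiles in Triton):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, G = 8, 64, 4                    # head dim, history length, query group size (toy)
codebook = rng.standard_normal(16)    # hypothetical scalar codebook
codes = rng.integers(0, 16, size=(T, d)).astype(np.uint8)
packed = (codes[:, 0::2] << 4) | codes[:, 1::2]
radius = rng.random(T) + 0.5          # per-token scale
q = rng.standard_normal((G, d))       # query group

# The fused kernel's work, as one reference pass:
unpacked = np.empty((T, d), dtype=np.uint8)
unpacked[:, 0::2] = packed >> 4                 # packed int4 unpack
unpacked[:, 1::2] = packed & 0x0F
k_hat = codebook[unpacked] * radius[:, None]    # codebook lookup + radius multiply
logits = q @ k_hat.T                            # dot products -> (G, T) main logits
```

&lt;p&gt;The fused kernel does this without materializing &lt;code&gt;k_hat&lt;/code&gt; in global memory, which is where the win over the eager version comes from.&lt;/p&gt;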

&lt;p&gt;The benchmark compared:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dense logits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;against:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fused packed-codebook main logits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important target was not eager TurboQuant.&lt;/p&gt;

&lt;p&gt;The important target was dense logits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The kernel worked, but the path still failed
&lt;/h2&gt;

&lt;p&gt;The fused kernel did remove a lot of eager overhead.&lt;/p&gt;

&lt;p&gt;On the main-logits operation, it was roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;7x to 18x faster than eager TurboQuant main logits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a real engineering improvement.&lt;/p&gt;

&lt;p&gt;But the dense baseline was extremely strong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3bayl9q7avlrp6mu0ji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3bayl9q7avlrp6mu0ji.png" alt="Fused kernel proof: fused main logits beat eager TurboQuant but missed the dense-speed gate" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the &lt;code&gt;llama70b_gqa&lt;/code&gt; synthetic profile, the corrected &lt;code&gt;float16&lt;/code&gt; output K0.2 sweep looked like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Block Tokens&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Dense Logits&lt;/th&gt;
&lt;th&gt;Fused Main Logits&lt;/th&gt;
&lt;th&gt;Fused vs Dense&lt;/th&gt;
&lt;th&gt;Required&lt;/th&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;32&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8192&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.061 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.048 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.27x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;=2.0x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;32&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16384&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.093 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.049 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.90x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;=2.0x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;64&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8192&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.060 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.039 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.56x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;=2.0x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;64&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16384&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.092 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.046 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.99x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;=2.0x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;near miss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;128&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8192&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.058 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.108 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.54x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;=2.0x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;256&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8192&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.059 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.043 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.38x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;=2.0x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The best tile was &lt;code&gt;block_tokens=64&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It nearly reached &lt;code&gt;2x&lt;/code&gt; at &lt;code&gt;16k&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It did not clear the bar at &lt;code&gt;8k&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And this was still only main logits.&lt;/p&gt;

&lt;p&gt;Full attention would add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;online softmax&lt;/li&gt;
&lt;li&gt;value accumulation&lt;/li&gt;
&lt;li&gt;residual correction&lt;/li&gt;
&lt;li&gt;cache update cost&lt;/li&gt;
&lt;li&gt;model integration overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So a near miss on logits-only at &lt;code&gt;16k&lt;/code&gt; was not enough.&lt;/p&gt;
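&lt;p&gt;A hedged back-of-envelope makes the margin problem concrete. The logits timings are the &lt;code&gt;16k&lt;/code&gt; row from the table above; the downstream cost is an assumed illustrative number, not a measurement:&lt;/p&gt;

```python
# Why a ~2x logits-only result leaves too little room.
dense_logits = 0.092   # ms, dense logits at 16k (table above)
fused_logits = 0.046   # ms, fused main logits at 16k (table above)
extra = 0.030          # ms, ASSUMED shared cost: softmax + values + cache update

# Both paths still pay the downstream work, so the ratio shrinks.
dense_total = dense_logits + extra
fused_total = fused_logits + extra

print(round(dense_total / fused_total, 2))  # 1.61
```

&lt;p&gt;Any residual correction or integration overhead unique to the compressed path shrinks the ratio further, which is why the &lt;code&gt;&amp;gt;=2.0x&lt;/code&gt; logits-only gate existed in the first place.&lt;/p&gt;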

&lt;p&gt;The disciplined decision was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;stop the packed-codebook fused-kernel path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why dense logits were so hard to beat
&lt;/h2&gt;

&lt;p&gt;Dense FP16 logits have a boring shape, and that is exactly why they are strong.&lt;/p&gt;

&lt;p&gt;They use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;contiguous FP16 data&lt;/li&gt;
&lt;li&gt;regular tensor operations&lt;/li&gt;
&lt;li&gt;optimized GPU paths&lt;/li&gt;
&lt;li&gt;no unpacking&lt;/li&gt;
&lt;li&gt;no codebook lookup&lt;/li&gt;
&lt;li&gt;no token-specific scale reconstruction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The packed-codebook path stores fewer bytes, but those bytes are not immediately usable math.&lt;/p&gt;

&lt;p&gt;It has to be unpacked and interpreted.&lt;/p&gt;

&lt;p&gt;That interpretation cost is the whole fight.&lt;/p&gt;

&lt;p&gt;In my kernel, there was another practical issue: query-group padding.&lt;/p&gt;

&lt;p&gt;The kernel used &lt;code&gt;tl.dot&lt;/code&gt;, which wants a tile dimension of at least &lt;code&gt;16&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For GQA shapes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama70b_gqa&lt;/code&gt; had &lt;code&gt;G=8&lt;/code&gt;, so half the tile was padding.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama8b_gqa&lt;/code&gt; had &lt;code&gt;G=4&lt;/code&gt;, so three quarters of the tile was padding.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The lower-level kernel removed Python overhead, but it did not change the fact that the dense baseline had a cleaner GPU execution shape.&lt;/p&gt;
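&lt;p&gt;The padding cost is easy to quantify. Assuming the &lt;code&gt;tl.dot&lt;/code&gt; minimum tile dimension of &lt;code&gt;16&lt;/code&gt;:&lt;/p&gt;

```python
# Fraction of the query-side tile that is padding when G < 16.
MIN_TILE = 16

def padding_fraction(G: int) -> float:
    padded = max(G, MIN_TILE)  # tile is padded up to the minimum dot dimension
    return (padded - G) / padded

assert padding_fraction(8) == 0.5    # llama70b_gqa: half the tile is padding
assert padding_fraction(4) == 0.75   # llama8b_gqa: three quarters is padding
```

&lt;p&gt;Half to three quarters of every &lt;code&gt;tl.dot&lt;/code&gt; tile was wasted work the dense baseline never had to do.&lt;/p&gt;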

&lt;h2&gt;
  
  
  What about the TurboQuant claim?
&lt;/h2&gt;

&lt;p&gt;This is where benchmark language matters.&lt;/p&gt;

&lt;p&gt;The external TurboQuant claim is not the same as what I was trying to ship.&lt;/p&gt;

&lt;p&gt;For context, the public references are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Research blog: &lt;a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/" rel="noopener noreferrer"&gt;https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TurboQuant paper: &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2504.19874&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The public TurboQuant framing includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extreme KV/vector compression&lt;/li&gt;
&lt;li&gt;quality retention at low bits&lt;/li&gt;
&lt;li&gt;faster attention-logit computation under a specialized setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not identical to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;drop a cache implementation into Hugging Face generate()
and beat dense FP16/BF16 end-to-end decode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those are different bars.&lt;/p&gt;

&lt;p&gt;In particular:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;attention logits are only one part of full decode&lt;/li&gt;
&lt;li&gt;a primitive result does not include cache update, value path, model layers, or generation overhead&lt;/li&gt;
&lt;li&gt;comparing against FP32 unquantized keys is not the same as comparing against an optimized FP16/BF16 dense path&lt;/li&gt;
&lt;li&gt;H100/JAX-style specialized kernels are not the same environment as a general local &lt;code&gt;transformers&lt;/code&gt; fork&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So this work does not prove:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TurboQuant is wrong.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It proves:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;my packed-codebook TurboQuant-family path did not have enough room to become a full decode-latency win in this repo.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction is important.&lt;/p&gt;

&lt;p&gt;It is also why the failure was useful.&lt;/p&gt;

&lt;p&gt;It stopped me from turning a near-miss primitive into months of residual/value/model integration work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The reusable lesson
&lt;/h2&gt;

&lt;p&gt;The main lesson from this stage was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Beating your own unoptimized implementation is not enough.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A compressed path has to beat the real target baseline.&lt;/p&gt;

&lt;p&gt;For GPU decode, the real target baseline is dense optimized attention/logits, not the first eager prototype.&lt;/p&gt;

&lt;p&gt;The second lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A logits-only win needs enough margin to pay for the rest of attention.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the best logits-only result barely reaches the required threshold at one context and misses at another, full attention is not going to improve the situation for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this left the project
&lt;/h2&gt;

&lt;p&gt;After this stage, the packed-codebook TurboQuant latency path was closed.&lt;/p&gt;

&lt;p&gt;Not because compression was fake.&lt;/p&gt;

&lt;p&gt;Not because the math was useless.&lt;/p&gt;

&lt;p&gt;Because the compressed representation was not cheap enough to consume compared with dense FP16/BF16 GPU math.&lt;/p&gt;

&lt;p&gt;That left two possible pivots:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Try a more hardware-friendly KV representation.&lt;/li&gt;
&lt;li&gt;Stop compressing values and instead reduce dense attention work.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those became the final experiments.&lt;/p&gt;

&lt;p&gt;They also had to pass the same discipline: bytes, speed ceilings, and quality are separate gates.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;If you only remember one thing from this post, it should be this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A compressed kernel has to beat the real dense baseline, not just the unoptimized compressed prototype.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The fused logits kernel was useful because it answered that question before I spent time on residual correction, value accumulation, and model integration.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>gpu</category>
      <category>research</category>
      <category>transformers</category>
    </item>
    <item>
      <title>When A Good Approximation Still Loses</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sun, 26 Apr 2026 07:29:57 +0000</pubDate>
      <link>https://forem.com/alankritverma/when-a-good-approximation-still-loses-29c7</link>
      <guid>https://forem.com/alankritverma/when-a-good-approximation-still-loses-29c7</guid>
      <description>&lt;p&gt;I wanted to answer one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why did a mathematically reasonable value approximation still fail as a runtime optimization?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The short answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;active fraction is not runtime.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I tried to make historical value mixing cheaper after compressed-key attention was stable.&lt;/li&gt;
&lt;li&gt;Chunk summaries were cheap because they removed information; active chunks had a plausible error argument but terrible eager runtime.&lt;/li&gt;
&lt;li&gt;The final vectorized eager path kept active fraction low, but value decode got worse: &lt;code&gt;11.906 ms&lt;/code&gt; became about &lt;code&gt;26 ms&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The lesson was concrete: active token fraction is not runtime. Count gathers, decodes, reductions, scatters, and bookkeeping.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evidence
&lt;/h2&gt;

&lt;p&gt;I put the detailed benchmark notes in the public evidence repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Results ledger: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Active-chunk postmortem: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/active-chunk-background-postmortem.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/active-chunk-background-postmortem.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Final vectorized eager spike summary: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/spike-a-vectorized-active-background-summary.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/spike-a-vectorized-active-background-summary.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  My value-side bet
&lt;/h2&gt;

&lt;p&gt;My working hypothesis was not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;randomly approximate values and hope generation survives.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was more structured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Keep the compressed-key path stable.&lt;/li&gt;
&lt;li&gt;Preserve recent and sink values exactly because they are high-risk.&lt;/li&gt;
&lt;li&gt;Summarize low-risk historical values only if the quality loss is bounded.&lt;/li&gt;
&lt;li&gt;Select active historical chunks only when the approximation risk is high.&lt;/li&gt;
&lt;li&gt;Require the implementation to reduce real hot-path work, not just theoretical token count.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The mistake was that step &lt;code&gt;5&lt;/code&gt; was weaker than steps &lt;code&gt;1-4&lt;/code&gt; for too long.&lt;/p&gt;

&lt;p&gt;That is why this post is useful beyond this specific cache implementation. It is a case study in the difference between:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a reasonable approximation argument&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a runtime shape that the GPU and framework can execute cheaply.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a smaller KV cache is not automatically a faster attention path&lt;/li&gt;
&lt;li&gt;compressed keys can reduce part of the problem&lt;/li&gt;
&lt;li&gt;values still have to be mixed across history&lt;/li&gt;
&lt;li&gt;that value path became the bottleneck&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The earlier architecture lesson was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A smaller KV cache is not automatically a faster attention path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That pushed me toward compressed-attention execution instead of storage-only cache compression.&lt;/p&gt;

&lt;p&gt;I built a stable compressed-key baseline. It was not fast enough to be the final answer, but it was coherent. It gave me a way to separate the key side from the value side.&lt;/p&gt;

&lt;p&gt;This distinction matters for the rest of the post:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The experiments below are not a verdict on the official TurboQuant paper or every possible fused implementation. They are a verdict on the eager value-path family I built in this &lt;code&gt;transformers&lt;/code&gt; fork.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then the real problem became clear:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;even with compressed keys, the model still has to mix historical values.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The rest of this piece follows the experiments that tried to make that value path cheaper.&lt;/p&gt;

&lt;p&gt;None became the final answer.&lt;/p&gt;

&lt;p&gt;But each one taught something useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The value-side problem
&lt;/h2&gt;

&lt;p&gt;After attention weights are computed, the model still needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o = sum_t a_t v_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expensive part is that &lt;code&gt;t&lt;/code&gt; ranges over the history.&lt;/p&gt;

&lt;p&gt;If the implementation still decodes or processes most historical values every step, then it has not really escaped the long-context cost.&lt;/p&gt;

&lt;p&gt;So the value-path goal was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;keep enough value information for quality, but stop paying full historical value cost everywhere.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The hard part is that "enough" is query-dependent, head-dependent, and quality-sensitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  What survived early: exact recent and sink values
&lt;/h2&gt;

&lt;p&gt;One value-side ingredient consistently made sense:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;keep a small exact window for the most recent tokens and a few sink tokens.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Internally this was called &lt;code&gt;exact_recent_sink&lt;/code&gt;. The reader-facing name is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;exact recent/sink value window&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The reason is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;recent tokens often matter a lot&lt;/li&gt;
&lt;li&gt;sink tokens can have special attention behavior&lt;/li&gt;
&lt;li&gt;preserving them exactly helps stability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was not a speed breakthrough in eager execution, but it was the only value-side ingredient that kept looking defensible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What failed early: more hybrid logic
&lt;/h2&gt;

&lt;p&gt;The first broad hybrid value experiment mixed several ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exact recent/sink windows&lt;/li&gt;
&lt;li&gt;delayed value quantization&lt;/li&gt;
&lt;li&gt;saliency bookkeeping&lt;/li&gt;
&lt;li&gt;anchors&lt;/li&gt;
&lt;li&gt;selective residual logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Internally this was &lt;code&gt;legacy_hybrid_fast&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It lost.&lt;/p&gt;

&lt;p&gt;It added complexity without removing enough dominant work. It was slower than the stable compressed-key baseline and not cleanly better on fidelity.&lt;/p&gt;

&lt;p&gt;The reusable lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;hybrid logic must remove dominant work, not just add selective corrections.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I also tested saliency-heavy background variants and anchors.&lt;/p&gt;

&lt;p&gt;Those did not become forward paths either.&lt;/p&gt;

&lt;p&gt;The pattern was the same:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more bookkeeping&lt;/li&gt;
&lt;li&gt;more branches&lt;/li&gt;
&lt;li&gt;unclear runtime win&lt;/li&gt;
&lt;li&gt;not enough quality/runtime payoff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These branches are part of the work, but they do not each need their own long section. Their shared lesson was the same, and the later chunk experiments explain it more cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chunk summaries were fast because they were wrong
&lt;/h2&gt;

&lt;p&gt;The next idea was to summarize historical values by chunks.&lt;/p&gt;

&lt;p&gt;For each chunk &lt;code&gt;C_j&lt;/code&gt;, define a mean value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mu_j = (1 / c) sum_(t in C_j) v_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where &lt;code&gt;c&lt;/code&gt; is the chunk size.&lt;/p&gt;

&lt;p&gt;And define the attention mass on that chunk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M_j = sum_(t in C_j) a_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then approximate the chunk contribution as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M_j mu_j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum_(t in C_j) a_t v_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is attractive because it replaces many token-level value contributions with one chunk-level contribution.&lt;/p&gt;
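&lt;p&gt;A small numpy sketch shows both the appeal and the failure mode (toy shapes; the error behavior, not the numbers, is the point):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(3)
c, d = 32, 8                      # chunk size, head dim (toy)
a = rng.random(c); a /= a.sum()   # attention weights inside one chunk
v = rng.standard_normal((c, d))   # chunk values

exact = a @ v           # sum_t a_t v_t
M = a.sum()             # chunk attention mass M_j
mu = v.mean(axis=0)     # chunk mean value mu_j
approx = M * mu         # one contribution instead of c of them

err = np.linalg.norm(exact - approx)   # nonzero whenever values spread

# The approximation is exact only for a zero-spread chunk:
v_flat = np.ones((c, d))
assert np.allclose(a @ v_flat, a.sum() * v_flat.mean(axis=0))
```

&lt;p&gt;The summary is exact when the chunk's values are identical and degrades as their spread grows, which is precisely the information the blanket version threw away.&lt;/p&gt;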

&lt;p&gt;But the blanket summary version lost too much information.&lt;/p&gt;

&lt;p&gt;It could be faster, but it was fast for the wrong reason: it threw away too much of the information the task actually needed.&lt;/p&gt;

&lt;p&gt;Reusable lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a speedup that comes from destroying quality is not an optimization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Active chunks had better math and terrible runtime
&lt;/h2&gt;

&lt;p&gt;The next idea tried to keep the good part of chunk summaries without summarizing everything.&lt;/p&gt;

&lt;p&gt;Instead of treating every historical chunk the same:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keep important chunks active and exact&lt;/li&gt;
&lt;li&gt;summarize the inactive chunks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For inactive chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o_j_hat = M_j mu_j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For active chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o_j = sum_(t in C_j) a_t v_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To choose active chunks, I used a score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score_j = mass_j * spread_j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The intuition was reasonable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;high attention mass means the chunk matters&lt;/li&gt;
&lt;li&gt;high value spread means the mean may be a bad summary&lt;/li&gt;
&lt;li&gt;high &lt;code&gt;mass * spread&lt;/code&gt; means the chunk is risky to approximate&lt;/li&gt;
&lt;/ul&gt;
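&lt;p&gt;A minimal sketch of that selection rule, with an assumed active-chunk budget (toy shapes; &lt;code&gt;spread&lt;/code&gt; here is a plain standard deviation standing in for whatever spread statistic the real selector used):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(4)
n_chunks, c, d = 8, 16, 4
a = rng.random((n_chunks, c)); a /= a.sum()   # attention weights per chunk
v = rng.standard_normal((n_chunks, c, d))     # values per chunk

mass = a.sum(axis=1)          # M_j: attention mass per chunk
spread = v.std(axis=(1, 2))   # how badly a mean would summarize the chunk
score = mass * spread         # risk of approximating the chunk

k = 2                                # assumed active-chunk budget
active = np.argsort(score)[-k:]      # keep the riskiest chunks exact
```

&lt;p&gt;The scoring itself is cheap. The expensive part, as the benchmark showed, was everything done per selected chunk afterwards.&lt;/p&gt;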

&lt;p&gt;This was a real approximation argument.&lt;/p&gt;

&lt;p&gt;It was not a runtime argument.&lt;/p&gt;

&lt;p&gt;The eager implementation was disastrous.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Decode 2048&lt;/th&gt;
&lt;th&gt;Memory Context 2048&lt;/th&gt;
&lt;th&gt;Long Next-Token MSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;stable compressed-key baseline&lt;/td&gt;
&lt;td&gt;&lt;code&gt;54.607 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;13.717 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.349&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;exact recent/sink window&lt;/td&gt;
&lt;td&gt;&lt;code&gt;70.205 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;14.485 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.326&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;active-chunk value approximation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1148.843 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;311.692 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.494&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa08rv0uvnpp23o10llv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa08rv0uvnpp23o10llv7.png" alt="Active chunk approximation runtime blow-up" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The profile explained the failure.&lt;/p&gt;

&lt;p&gt;The active-chunk implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;selected active chunks&lt;/li&gt;
&lt;li&gt;looped over those chunks in Python&lt;/li&gt;
&lt;li&gt;decoded tiny slices repeatedly&lt;/li&gt;
&lt;li&gt;scattered small contributions back repeatedly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At decode &lt;code&gt;2048&lt;/code&gt;, value decode time went from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;9.080 ms -&amp;gt; 1079.421 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is not a small miss. That is an implementation shape failure.&lt;/p&gt;

&lt;p&gt;The major lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I had proved an approximation bound, not a runtime win.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The process lesson was just as important:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before a large run, every approximation needs a written hot-path shape: how many gathers, decodes, reductions, scatters, and Python-level loops are actually in the decode step.&lt;/p&gt;
&lt;/blockquote&gt;
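&lt;p&gt;That discipline can be as simple as a counting function. The &lt;code&gt;hot_path_ops&lt;/code&gt; helper below is hypothetical, and its per-chunk accounting is an assumption, but it captures why the Python-looped shape exploded:&lt;/p&gt;

```python
# Count per-decode-step operations implied by an approximation BEFORE benchmarking it.
def hot_path_ops(active_chunks: int, python_looped: bool) -> dict:
    # Assumed accounting: each active chunk needs one gather, one slice decode,
    # one reduction, and one scatter back into the output.
    per_chunk = {"gathers": 1, "decodes": 1, "reductions": 1, "scatters": 1}
    ops = {k: v * active_chunks for k, v in per_chunk.items()}
    # A Python loop launches every one of these as its own tiny kernel;
    # a vectorized path batches each kind into a single launch.
    ops["kernel_launches"] = sum(ops.values()) if python_looped else len(per_chunk)
    return ops

looped = hot_path_ops(active_chunks=64, python_looped=True)
vectorized = hot_path_ops(active_chunks=64, python_looped=False)
print(looped["kernel_launches"], vectorized["kernel_launches"])  # 256 4
```

&lt;p&gt;Same math, same active fraction, two orders of magnitude apart in launch count. That is the gap the profile exposed.&lt;/p&gt;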

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtqtsvw3rvvyvq1q9mai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtqtsvw3rvvyvq1q9mai.png" alt="Experiment timeline" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The final eager test: vectorize active chunks
&lt;/h2&gt;

&lt;p&gt;After that failure, the next question was narrow:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Was the active-chunk idea bad, or was the eager implementation shape bad?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I ran one final eager viability test.&lt;/p&gt;

&lt;p&gt;Internal name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vectorized_active_background
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reader-facing name:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;final vectorized active-background value test&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The goal was to keep the same compressed-key baseline and redesign only historical value participation.&lt;/p&gt;

&lt;p&gt;The design used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exact sink tokens&lt;/li&gt;
&lt;li&gt;exact recent tokens&lt;/li&gt;
&lt;li&gt;a dense staging buffer for values that had aged out of the recent window&lt;/li&gt;
&lt;li&gt;fixed-size historical chunks&lt;/li&gt;
&lt;li&gt;active exact chunks selected by attention mass and spread&lt;/li&gt;
&lt;li&gt;inactive chunk summaries&lt;/li&gt;
&lt;li&gt;vectorized gather/decode/reduction instead of Python loops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The intended output decomposition was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o = o_sink + o_recent + o_staging + o_history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For historical chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o_history ~= sum_(j in A) sum_(t in C_j) a_t v_t
           + sum_(j not in A) M_j mu_j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where &lt;code&gt;A&lt;/code&gt; is the active chunk set.&lt;/p&gt;
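&lt;p&gt;The decomposition above can be sketched numerically. The chunk size, active count, and tensor shapes below are toy values, not the real configuration, and the per-chunk loop is written for clarity; the actual test vectorized the gather/decode/reduce steps:&lt;/p&gt;

```python
import numpy as np

# Active-background value decomposition: exact mixing for active chunks,
# summary M_j * mu_j (attention mass times mean value) for inactive chunks.
rng = np.random.default_rng(0)
T, d, chunk = 64, 8, 16
a = rng.random(T)
a /= a.sum()                            # attention weights over history
V = rng.standard_normal((T, d))         # historical values

n_chunks = T // chunk
mass = a.reshape(n_chunks, chunk).sum(axis=1)   # attention mass per chunk
active = set(np.argsort(mass)[-2:])             # top chunks by mass

o = np.zeros(d)
for j in range(n_chunks):
    sl = slice(j * chunk, (j + 1) * chunk)
    if j in active:
        o += a[sl] @ V[sl]                      # exact contribution
    else:
        o += mass[j] * V[sl].mean(axis=0)       # summary: M_j * mu_j

o_exact = a @ V                                 # dense reference for comparison
```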

&lt;p&gt;This test had a hard benchmark gate.&lt;/p&gt;

&lt;p&gt;It needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;at least &lt;code&gt;20%&lt;/code&gt; better decode step latency than the stable compressed-key baseline&lt;/li&gt;
&lt;li&gt;at least &lt;code&gt;40%&lt;/code&gt; lower value-decode time&lt;/li&gt;
&lt;li&gt;active token fraction at most &lt;code&gt;0.25&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;at least &lt;code&gt;20%&lt;/code&gt; better long-context latency&lt;/li&gt;
&lt;li&gt;no material fidelity collapse&lt;/li&gt;
&lt;/ul&gt;
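&lt;p&gt;The speed gates above can be written as explicit checks. The thresholds come from the gate list; the metric names are illustrative, and the fidelity gate is omitted here for brevity. Plugging in the real baseline and the first candidate from the tables below shows how the gate fires:&lt;/p&gt;

```python
# Hard benchmark gate as explicit checks: return the list of failed gate
# names (an empty list means the candidate passes).

def gate_failures(m, base):
    fails = []
    if m["decode_ms"] > 0.80 * base["decode_ms"]:
        fails.append("decode 20% faster")
    if m["value_decode_ms"] > 0.60 * base["value_decode_ms"]:
        fails.append("value decode 40% lower")
    if m["active_fraction"] > 0.25:
        fails.append("active fraction at most 0.25")
    if m["long_context_s"] > 0.80 * base["long_context_s"]:
        fails.append("long context 20% faster")
    return fails

baseline = {"decode_ms": 70.604, "value_decode_ms": 11.906, "long_context_s": 14.212}
candidate = {"decode_ms": 68.726, "value_decode_ms": 26.461,
             "active_fraction": 0.125, "long_context_s": 17.756}
failed = gate_failures(candidate, baseline)  # three speed gates fail
```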

&lt;p&gt;The benchmark grid was intentionally small:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reader Label&lt;/th&gt;
&lt;th&gt;Summary Chunk Size&lt;/th&gt;
&lt;th&gt;Active Chunk Ratio&lt;/th&gt;
&lt;th&gt;Min Active Chunks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chunk 16, 12.5% active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.125&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chunk 16, 25% active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.25&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chunk 32, 12.5% active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;32&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.125&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This was not a sweep for the sake of sweeping. It was a falsification test.&lt;/p&gt;

&lt;h2&gt;
  
  
  The result: active fraction was not runtime
&lt;/h2&gt;

&lt;p&gt;The stable compressed-key baseline for this run was:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decode &lt;code&gt;2048&lt;/code&gt; step latency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;70.604 ms&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode &lt;code&gt;2048&lt;/code&gt; value-decode time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;11.906 ms&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefill &lt;code&gt;2048&lt;/code&gt; latency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;58.128 ms&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-context latency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;14.212 s&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fidelity MSE vs dense baseline&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.2326&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fidelity top-k overlap&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The candidates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reader Label&lt;/th&gt;
&lt;th&gt;Decode Step&lt;/th&gt;
&lt;th&gt;Decode Improvement vs Baseline&lt;/th&gt;
&lt;th&gt;Value Decode&lt;/th&gt;
&lt;th&gt;Active Token Fraction&lt;/th&gt;
&lt;th&gt;Long-Context Latency&lt;/th&gt;
&lt;th&gt;Fidelity MSE&lt;/th&gt;
&lt;th&gt;Top-k&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chunk 16, 12.5% active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;68.726 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+2.7%&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;26.461 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.125&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;17.756 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.3115&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chunk 16, 25% active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;68.768 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+2.6%&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;26.149 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.25&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;17.884 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.3314&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.7&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chunk 32, 12.5% active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;68.386 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+3.1%&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;26.341 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.1333&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;17.780 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2.1598&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.4&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No configuration passed.&lt;/p&gt;

&lt;p&gt;The active token fraction stayed within budget.&lt;/p&gt;

&lt;p&gt;But the actual value-decode time got much worse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;11.906 ms -&amp;gt; about 26 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Long-context latency got about &lt;code&gt;25%&lt;/code&gt; worse.&lt;/p&gt;

&lt;p&gt;Fidelity regressed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugdzn49it1en04twa081.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugdzn49it1en04twa081.png" alt="Final eager gate results" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;active token fraction is not runtime.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The runtime is closer to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total_cost = selection
           + gather
           + active_decode
           + active_reduce
           + inactive_reduce
           + bookkeeping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total_cost = active_fraction * dense_cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final vectorized eager path removed the catastrophic Python loop, but it still did not produce a frontier-moving result.&lt;/p&gt;
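&lt;p&gt;The two cost models above can be compared with numbers. The per-component times below are purely illustrative, chosen only to show why the naive active-fraction model and the component model disagree so badly:&lt;/p&gt;

```python
# Naive model: runtime scales with active fraction alone.
def naive_cost(active_fraction, dense_cost_ms):
    return active_fraction * dense_cost_ms

# Component model: runtime is the sum of the real hot-path pieces.
def component_cost(c):
    return (c["selection"] + c["gather"] + c["active_decode"]
            + c["active_reduce"] + c["inactive_reduce"] + c["bookkeeping"])

dense_value_ms = 11.906           # baseline value-decode time from the post
components = {"selection": 3.0, "gather": 6.0, "active_decode": 8.0,
              "active_reduce": 4.0, "inactive_reduce": 3.0, "bookkeeping": 2.0}

print(naive_cost(0.125, dense_value_ms))   # about 1.5 ms: far too optimistic
print(component_cost(components))          # 26.0 ms: the shape of the observed result
```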

&lt;p&gt;So the disciplined decision was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;freeze eager variants in this compressed-attention family.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The test sequence was deliberately not broad:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;focused correctness tests&lt;/li&gt;
&lt;li&gt;decode at roughly 2048 prompt tokens&lt;/li&gt;
&lt;li&gt;prefill at roughly 2048 prompt tokens&lt;/li&gt;
&lt;li&gt;long-context memory/latency at roughly 2048 prompt tokens&lt;/li&gt;
&lt;li&gt;next-token logit fidelity&lt;/li&gt;
&lt;li&gt;profile counters for active gather/decode/reduce, inactive summary work, and staging flushes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That matters because this was a kill-gate, not a hyperparameter search.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this does and does not prove
&lt;/h2&gt;

&lt;p&gt;It does prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this eager value-path family should not keep getting new variants&lt;/li&gt;
&lt;li&gt;vectorizing a bad workload is not automatically enough&lt;/li&gt;
&lt;li&gt;runtime-cost modeling must happen before broad benchmarks&lt;/li&gt;
&lt;li&gt;quality gates matter as much as speed gates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does not prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;all compressed attention is useless&lt;/li&gt;
&lt;li&gt;TurboQuant-style key ideas are worthless&lt;/li&gt;
&lt;li&gt;lower-level fused implementations cannot work&lt;/li&gt;
&lt;li&gt;that the official TurboQuant result is invalid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The correct conclusion is narrower:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the eager implementation level failed for this family.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is still a useful result. It prevents the next round of work from being "one more eager variant with one more heuristic."&lt;/p&gt;

&lt;h2&gt;
  
  
  What came next
&lt;/h2&gt;

&lt;p&gt;At this point, the eager value-path family was frozen.&lt;/p&gt;

&lt;p&gt;But that did not yet answer a lower-level question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Was the compressed-attention idea weak, or was the eager implementation level weak?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before doing kernel work, there was one exact cleanup worth proving.&lt;/p&gt;

&lt;p&gt;The stable compressed-key baseline currently reconstructs rotated value information in a way that appears algebraically wasteful.&lt;/p&gt;

&lt;p&gt;Current form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;decoded_values_t = R^-1(z_t)
o = sum_t a_t decoded_values_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because &lt;code&gt;R^-1&lt;/code&gt; is linear:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum_t a_t R^-1(z_t) = R^-1(sum_t a_t z_t)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So instead of inverse-rotating every historical token value, I can first compute the weighted sum in rotated space and inverse-rotate once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z_weighted = sum_t a_t z_t
o = R^-1(z_weighted)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exact relative to the current value codec.&lt;/p&gt;
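&lt;p&gt;The linearity identity can be checked numerically. Here a random orthogonal matrix stands in for the value rotation, and the shapes are toy values:&lt;/p&gt;

```python
import numpy as np

# Check: inverse-rotating every token value and then mixing equals mixing in
# rotated space and inverse-rotating once.
rng = np.random.default_rng(1)
T, d = 32, 16
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthogonal rotation R
R_inv = R.T                                       # inverse of an orthogonal matrix
a = rng.random(T)
a /= a.sum()                                      # attention weights a_t
Z = rng.standard_normal((T, d))                   # rotated-space values z_t

per_token = sum(a[t] * (R_inv @ Z[t]) for t in range(T))  # T inverse rotations
once = R_inv @ (Z.T @ a)                                  # one inverse rotation
```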

&lt;p&gt;It is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a new heuristic&lt;/li&gt;
&lt;li&gt;a new approximation&lt;/li&gt;
&lt;li&gt;another active-background branch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a baseline cleanup.&lt;/p&gt;

&lt;p&gt;For SmolLM2-135M at &lt;code&gt;T = 2048&lt;/code&gt;, the rough inverse-rotation count changes from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;H_kv * T = 3 * 2048 = 6144
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;H_kv * G * Q = 3 * 3 * 1 = 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;for the output rotations.&lt;/p&gt;

&lt;p&gt;Here &lt;code&gt;H_kv&lt;/code&gt; is the number of KV heads, &lt;code&gt;G&lt;/code&gt; is the number of query groups per KV head, and &lt;code&gt;Q&lt;/code&gt; is the number of query positions in a decode step.&lt;/p&gt;
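&lt;p&gt;As a sanity check of that arithmetic, using the SmolLM2-135M figures from the post (3 KV heads, 3 query groups per KV head, 1 query position per decode step, &lt;code&gt;T = 2048&lt;/code&gt;):&lt;/p&gt;

```python
# Rough inverse-rotation counts before and after the exact cleanup.

def per_token_rotations(h_kv, t):
    return h_kv * t                 # inverse-rotate every historical value

def per_output_rotations(h_kv, g, q):
    return h_kv * g * q             # inverse-rotate only the mixed outputs

before = per_token_rotations(3, 2048)   # 6144
after = per_output_rotations(3, 3, 1)   # 9
```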

&lt;p&gt;That is why it deserved a quick proof.&lt;/p&gt;

&lt;p&gt;If that exact cleanup did not move the profile enough, then the only remaining TurboQuant-family question was lower-level:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;can a fused/Triton/CUDA-style compressed-attention primitive beat dense attention by a large enough margin to justify integration?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a different proof level. It is no longer a cache-API experiment; it is a primitive benchmark against dense GPU attention/logits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lessons worth keeping
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Storage compression and runtime execution must be measured separately.&lt;/li&gt;
&lt;li&gt;A stable baseline is more valuable than a pile of incomparable branches.&lt;/li&gt;
&lt;li&gt;Every approximation needs a runtime proof obligation.&lt;/li&gt;
&lt;li&gt;Python loops in decode hot paths are presumed guilty until proven otherwise.&lt;/li&gt;
&lt;li&gt;Active fraction, cache size, and runtime are different metrics.&lt;/li&gt;
&lt;li&gt;A failed eager implementation does not automatically disprove the math.&lt;/li&gt;
&lt;li&gt;A failed hard gate should stop the branch, not invite endless tuning.&lt;/li&gt;
&lt;li&gt;The next proof should be exact if possible, primitive-level if necessary, and killed quickly if weak.&lt;/li&gt;
&lt;li&gt;Benchmark-only experiment names should stay separate from public API promises.&lt;/li&gt;
&lt;li&gt;Every serious run should emit enough configuration and profiling data to explain the result later.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;This value-path work did not produce a production-ready faster cache path.&lt;/p&gt;

&lt;p&gt;But it did produce a useful map:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exact recent/sink values were defensible but not enough&lt;/li&gt;
&lt;li&gt;summary-only background was fast because it lost too much information&lt;/li&gt;
&lt;li&gt;active chunks had a reasonable approximation argument and a bad hot path&lt;/li&gt;
&lt;li&gt;vectorization removed the catastrophic loop shape but still missed the gate&lt;/li&gt;
&lt;li&gt;active fraction, cache size, and runtime were different metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the part worth keeping.&lt;/p&gt;

&lt;p&gt;Not because the final result won. It did not.&lt;/p&gt;

&lt;p&gt;Because now I know what not to try again at the eager value-path level.&lt;/p&gt;

&lt;p&gt;The remaining question was lower-level:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if the compressed path only failed because eager execution was the wrong implementation level?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question needed a different kind of proof: a primitive-level comparison against dense GPU attention/logits, not another eager cache variant.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>performance</category>
      <category>research</category>
    </item>
    <item>
      <title>A Smaller KV Cache Did Not Make Transformers Faster</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sun, 26 Apr 2026 07:22:26 +0000</pubDate>
      <link>https://forem.com/alankritverma/a-smaller-kv-cache-did-not-make-transformers-faster-2j</link>
      <guid>https://forem.com/alankritverma/a-smaller-kv-cache-did-not-make-transformers-faster-2j</guid>
      <description>&lt;p&gt;Long-context generation makes the KV cache hard to ignore.&lt;/p&gt;

&lt;p&gt;I wanted to answer one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why can a KV cache become much smaller while generation gets slower?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The short answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;storage compression and attention execution are different problems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I measured KV-cache compression as a systems problem, not just a storage problem.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;quanto&lt;/code&gt; cut the cache footprint from &lt;code&gt;50.911 MiB&lt;/code&gt; to &lt;code&gt;0.913 MiB&lt;/code&gt;, but generation latency increased from &lt;code&gt;2.250 s&lt;/code&gt; to &lt;code&gt;3.912 s&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;That result was useful: it separated storage compression from execution compression.&lt;/li&gt;
&lt;li&gt;The rest of the work followed from that distinction. If attention still consumes dense tensors, smaller cache storage alone will not make decode faster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evidence
&lt;/h2&gt;

&lt;p&gt;I put the detailed benchmark notes in a public evidence repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public evidence repo: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Results ledger: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/results-ledger.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Baseline storage-vs-latency summary: &lt;a href="https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/pre-turboquant-quantized-cache-report.md" rel="noopener noreferrer"&gt;https://github.com/AlankritVerma01/turboquant-kv-cache-evidence/blob/main/evidence/benchmark-summaries/pre-turboquant-quantized-cache-report.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The first trap: storage is not execution
&lt;/h2&gt;

&lt;p&gt;The first hypothesis sounded reasonable:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;fewer cache bytes should mean faster generation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But that bundles two different claims together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The cache stores fewer bytes.&lt;/li&gt;
&lt;li&gt;The attention step does less work.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those are not the same claim.&lt;/p&gt;

&lt;p&gt;The better engineering question was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;does the attention hot path consume less work, or did I only store the same work in a smaller format?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the lens for the rest of the post.&lt;/p&gt;

&lt;p&gt;Every generated token reuses keys and values from previous tokens. As the context grows, those cached tensors grow with it. So the natural first idea is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Compress the KV cache, store fewer bytes, and get faster generation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I tested that idea while exploring TurboQuant-style cache compression in a Hugging Face &lt;code&gt;transformers&lt;/code&gt; fork.&lt;/p&gt;

&lt;p&gt;Important scope note:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is not a claim that the official TurboQuant research idea "does not work."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The external context is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Research introduced TurboQuant as a compression method for extreme KV-cache and vector compression: &lt;a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/" rel="noopener noreferrer"&gt;https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The TurboQuant paper describes an online vector quantization approach with residual correction for inner-product preservation: &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2504.19874&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hugging Face &lt;code&gt;transformers&lt;/code&gt; exposes several cache strategies, including dynamic and quantized caches: &lt;a href="https://huggingface.co/docs/transformers/en/kv_cache" rel="noopener noreferrer"&gt;https://huggingface.co/docs/transformers/en/kv_cache&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I tested was narrower:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can I make a TurboQuant-style compressed-attention path useful inside a local eager &lt;code&gt;transformers&lt;/code&gt; implementation?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first useful result was not that a particular backend won.&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Storage compression and attention execution are different problems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A cache can become dramatically smaller while generation gets slower.&lt;/p&gt;

&lt;p&gt;That single distinction changed the rest of the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mental model
&lt;/h2&gt;

&lt;p&gt;In decoder-only generation, each new token uses cached keys and values from previous tokens.&lt;/p&gt;

&lt;p&gt;Simplified for one attention head:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = softmax(q K^T)
o = a V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;q&lt;/code&gt; is the current query.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;K&lt;/code&gt; is the historical key cache.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V&lt;/code&gt; is the historical value cache.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;a&lt;/code&gt; is the attention distribution.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;o&lt;/code&gt; is the output contribution from history.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keys decide where to attend. Values provide the information that gets mixed.&lt;/p&gt;

&lt;p&gt;When context length grows, both &lt;code&gt;K&lt;/code&gt; and &lt;code&gt;V&lt;/code&gt; grow.&lt;/p&gt;

&lt;p&gt;So compression can target at least two different things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Store the cache in fewer bytes.&lt;/li&gt;
&lt;li&gt;Execute attention without reconstructing dense historical tensors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those sound related. In practice, they are different engineering targets.&lt;/p&gt;
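&lt;p&gt;The one-head decode step above can be sketched directly. Shapes are toy values, and the usual &lt;code&gt;1/sqrt(d)&lt;/code&gt; scaling, elided in the simplified formula, is included here:&lt;/p&gt;

```python
import numpy as np

# One decode step for a single attention head over a cached history.
rng = np.random.default_rng(2)
T, d = 128, 64
q = rng.standard_normal(d)            # current query
K = rng.standard_normal((T, d))       # historical key cache
V = rng.standard_normal((T, d))       # historical value cache

logits = K @ q / np.sqrt(d)           # scaled dot products against history
a = np.exp(logits - logits.max())
a /= a.sum()                          # softmax attention distribution
o = a @ V                             # output contribution from history
```

&lt;p&gt;Both &lt;code&gt;K @ q&lt;/code&gt; and &lt;code&gt;a @ V&lt;/code&gt; scale with &lt;code&gt;T&lt;/code&gt;, which is why both the key path and the value path matter.&lt;/p&gt;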

&lt;h2&gt;
  
  
  The first measurement
&lt;/h2&gt;

&lt;p&gt;I started with existing cache behavior in &lt;code&gt;transformers&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The baselines were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;DynamicCache&lt;/code&gt;: dense eager execution.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;quanto&lt;/code&gt;: a strong storage-compression baseline.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hqq&lt;/code&gt;: another quantized-cache baseline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmark below used &lt;code&gt;HuggingFaceTB/SmolLM2-135M-Instruct&lt;/code&gt; in a roughly 2048-token context generation case.&lt;/p&gt;

&lt;p&gt;I measured more than just stored bytes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generation latency&lt;/li&gt;
&lt;li&gt;stored cache footprint&lt;/li&gt;
&lt;li&gt;cache bytes per token&lt;/li&gt;
&lt;li&gt;sampled runtime memory&lt;/li&gt;
&lt;li&gt;whether generated outputs matched the dense baseline in simple cases&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;What It Represents&lt;/th&gt;
&lt;th&gt;Mean Latency&lt;/th&gt;
&lt;th&gt;Cache Footprint&lt;/th&gt;
&lt;th&gt;Cache Bytes / Token&lt;/th&gt;
&lt;th&gt;Runtime Delta Peak&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dynamic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;dense eager baseline&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2.250 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;50.911 MiB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;23040.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.102 GB&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;quanto&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;strong storage-compression baseline&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3.912 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.913 MiB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;413.3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.048 GB&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hqq&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;alternative quantized-cache baseline&lt;/td&gt;
&lt;td&gt;&lt;code&gt;9.770 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;19.133 MiB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8658.6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.040 GB&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The important row is &lt;code&gt;quanto&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It reduced stored cache footprint from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;50.911 MiB -&amp;gt; 0.913 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is an excellent cache-size result.&lt;/p&gt;

&lt;p&gt;But latency went from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.250 s -&amp;gt; 3.912 s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So cache storage got much smaller, while generation got slower.&lt;/p&gt;

&lt;p&gt;That is not a paradox. It shows what the backend is optimizing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why smaller storage did not mean faster attention
&lt;/h2&gt;

&lt;p&gt;The current generic quantized-cache shape in &lt;code&gt;transformers&lt;/code&gt; is roughly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Produce new dense keys and values.&lt;/li&gt;
&lt;li&gt;Quantize them for storage.&lt;/li&gt;
&lt;li&gt;Keep compressed tensors in the cache.&lt;/li&gt;
&lt;li&gt;Later dequantize cached tensors.&lt;/li&gt;
&lt;li&gt;Return dense keys and values to normal attention.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So the attention implementation still consumes dense tensors.&lt;/p&gt;
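&lt;p&gt;That five-step shape can be sketched with a toy per-tensor int8 scheme (the real backends use finer-grained quantization; this only shows the storage-versus-execution split):&lt;/p&gt;

```python
import numpy as np

# "Compressed storage + dense execution": values are quantized to int8 for
# storage, then dequantized back to dense float before attention, so the
# attention matmul does exactly the same amount of work.
rng = np.random.default_rng(3)
T, d = 256, 64
V = rng.standard_normal((T, d)).astype(np.float32)

scale = np.abs(V).max() / 127.0
V_q = np.round(V / scale).astype(np.int8)       # stored cache: 1 byte/element
V_dense = V_q.astype(np.float32) * scale        # reconstructed before attention

stored_bytes = V_q.nbytes                       # 4x smaller than V.nbytes
attended = rng.random(T).astype(np.float32) @ V_dense   # dense work, unchanged
```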

&lt;p&gt;That means the architecture is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;compressed storage + dense execution&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;compressed attention&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flccxpbclgygoftb3b8lt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flccxpbclgygoftb3b8lt.png" alt="Storage compression versus execution compression" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first design can save cache bytes.&lt;/p&gt;

&lt;p&gt;The second design is needed if the goal is to make attention itself faster.&lt;/p&gt;

&lt;p&gt;This distinction became the first real output of the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I still looked at compressed attention
&lt;/h2&gt;

&lt;p&gt;TurboQuant-style work was interesting because the bigger promise is not simply "store the KV cache with fewer bits."&lt;/p&gt;

&lt;p&gt;The stronger target is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;store historical keys in a compressed representation&lt;/li&gt;
&lt;li&gt;compute attention logits using that compressed representation&lt;/li&gt;
&lt;li&gt;avoid reconstructing every dense historical key each decode step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ordinary dense key path computes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logits_t = q . k_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;for every historical token &lt;code&gt;t&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The compressed-key target is closer to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logits_t ~= compressed_dot(q, code(k_t), residual(k_t))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;without materializing every full &lt;code&gt;k_t&lt;/code&gt;.&lt;/p&gt;
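&lt;p&gt;A toy version of that target: each key stores a codebook id plus a residual, &lt;code&gt;q&lt;/code&gt; is dotted with the small codebook once per step, and per-token logits come from a table lookup plus a residual correction rather than a reconstructed dense &lt;code&gt;k_t&lt;/code&gt;. The codebook size and full-precision residual here are illustrative simplifications, not the TurboQuant scheme itself:&lt;/p&gt;

```python
import numpy as np

# Compressed-key logit computation: table lookup + residual correction.
rng = np.random.default_rng(4)
T, d, C = 512, 32, 16
codebook = rng.standard_normal((C, d))
K = rng.standard_normal((T, d))

# Assign each key its nearest codebook entry; store id + residual.
codes = np.argmin(((K[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
residual = K - codebook[codes]          # stored at low precision in practice

q = rng.standard_normal(d)
table = codebook @ q                    # C dot products, once per decode step
logits = table[codes] + residual @ q    # per-token lookup + correction

logits_exact = K @ q                    # dense reference
```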

&lt;p&gt;That is an execution-path change.&lt;/p&gt;

&lt;p&gt;It requires a different shape than a normal storage-only &lt;code&gt;QuantizedCache&lt;/code&gt; backend.&lt;/p&gt;

&lt;p&gt;That is why the project became less about "add another cache backend" and more about "change what attention actually consumes."&lt;/p&gt;

&lt;h2&gt;
  
  
  The stable compressed-key baseline
&lt;/h2&gt;

&lt;p&gt;I built a stable compressed-key baseline to test that direction.&lt;/p&gt;

&lt;p&gt;Internally, I called it &lt;code&gt;reference&lt;/code&gt;. For a public reader, the better name is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the stable compressed-key baseline&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Its job was not to be the final optimized system. Its job was to prove that an end-to-end compressed-key attention path could exist in a Llama-style eager stack and provide a consistent comparison point for later experiments.&lt;/p&gt;

&lt;p&gt;It kept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compressed historical keys&lt;/li&gt;
&lt;li&gt;compressed-key attention-logit computation&lt;/li&gt;
&lt;li&gt;residual correction behavior&lt;/li&gt;
&lt;li&gt;a full value path so correctness and fidelity stayed interpretable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That baseline survived the project better than the later value-path experiments.&lt;/p&gt;

&lt;p&gt;The key lesson was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The compressed-key path was not where most failures came from.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The failures came from values.&lt;/p&gt;

&lt;p&gt;I also saw some directional evidence that compressed-key work might become more interesting as model/context size changes. But that evidence was not clean enough to be the headline result. The safe claim was narrower:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;keep the compressed-key baseline as an internal anchor, but do not call it the final system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why values became the hard part
&lt;/h2&gt;

&lt;p&gt;Attention has two major pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compute attention weights from keys.&lt;/li&gt;
&lt;li&gt;Mix values using those weights.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even if keys are compressed, the output still requires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;o = sum_t a_t v_t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the implementation still reconstructs or processes values across most of history, the value path remains expensive.&lt;/p&gt;

&lt;p&gt;That is exactly what happened.&lt;/p&gt;

&lt;p&gt;The project shifted from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can I compress the cache?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can I keep the compressed-key path and make historical value participation structurally cheaper?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question led to the second half of the work: multiple value-path approximations, most of which failed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;This is the architecture lesson that shaped the rest of the work.&lt;/p&gt;

&lt;p&gt;I learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Existing quantized cache backends can be very good at reducing stored cache footprint.&lt;/li&gt;
&lt;li&gt;Stored-cache size is not the same as runtime attention cost.&lt;/li&gt;
&lt;li&gt;Dense eager execution is a serious baseline because it has a simple hot path.&lt;/li&gt;
&lt;li&gt;TurboQuant-style compressed-key attention is a different target from storage-only cache compression.&lt;/li&gt;
&lt;li&gt;The stable compressed-key path was useful enough to keep as an internal baseline.&lt;/li&gt;
&lt;li&gt;The next bottleneck was historical value mixing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where the next technical question came from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can I make historical value mixing cheaper without destroying quality?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question is more brutal than cache compression, because it is no longer enough to store fewer bytes. The compressed representation also has to be cheap to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scope
&lt;/h2&gt;

&lt;p&gt;These measurements came from one local fork, one benchmark setup, and a small-model-first workflow. The goal was not to claim universal results for every model and GPU.&lt;/p&gt;

&lt;p&gt;The goal was to answer a systems question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Am I actually reducing attention execution cost, or only cache storage?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For this phase, the answer was clear.&lt;/p&gt;

&lt;p&gt;I had reduced storage.&lt;/p&gt;

&lt;p&gt;I had not yet won execution.&lt;/p&gt;

&lt;p&gt;This distinction changed the next question. Once the key path had a stable compressed baseline, the remaining bottleneck was not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;can I store fewer bytes?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;can I mix historical values cheaply enough without breaking quality?&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>performance</category>
      <category>research</category>
    </item>
    <item>
      <title>Synthetic Population Testing for Recommendation Systems</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sat, 04 Apr 2026 02:04:50 +0000</pubDate>
      <link>https://forem.com/alankritverma/synthetic-population-testing-for-recommendation-systems-58f5</link>
      <guid>https://forem.com/alankritverma/synthetic-population-testing-for-recommendation-systems-58f5</guid>
      <description>&lt;p&gt;&lt;em&gt;Offline evaluation is necessary for recommender systems. It is also not a full test of recommender quality. The missing layer is not only better aggregate metrics, but better ways to test how a model behaves for different kinds of users before launch.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;In the last post, I argued that offline evaluation is useful but incomplete for recommendation systems.&lt;/li&gt;
&lt;li&gt;After that, I built a small public artifact to make the gap concrete.&lt;/li&gt;
&lt;li&gt;In the canonical MovieLens comparison, the popularity baseline wins &lt;code&gt;Recall@10&lt;/code&gt; and &lt;code&gt;NDCG@10&lt;/code&gt;, but the candidate model does much better for Explorer and Niche-interest users and creates a very different behavioral profile.&lt;/li&gt;
&lt;li&gt;I do not think this means “offline evaluation is wrong.”&lt;/li&gt;
&lt;li&gt;I think it means a better pre-launch evaluation stack should include some form of synthetic population testing: explicit behavioral lenses, trajectory-aware diagnostics, and tests that make hidden tradeoffs visible before launch.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Comes After “Offline Evaluation Is Not Enough”?
&lt;/h2&gt;

&lt;p&gt;In the first post, I made a narrow claim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;offline evaluation is useful, but incomplete, because recommendation systems are interactive systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That argument matters, but by itself it leaves an obvious next question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if aggregate offline metrics are not enough, what should be added to the evaluation stack?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I do not think the answer starts with a giant platform or a perfect user simulator.&lt;/p&gt;

&lt;p&gt;I think the more practical place to start is smaller:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;take the same baseline-vs-candidate comparison and test it through multiple behavioral lenses, not just one aggregate average.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is what I built next.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Artifact
&lt;/h2&gt;

&lt;p&gt;The current artifact is a small public recommender behavior QA harness.&lt;/p&gt;

&lt;p&gt;It compares:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one baseline recommender&lt;/li&gt;
&lt;li&gt;one candidate recommender&lt;/li&gt;
&lt;li&gt;one fixed evaluation setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it produces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;standard offline ranking metrics&lt;/li&gt;
&lt;li&gt;bucket-level utility&lt;/li&gt;
&lt;li&gt;behavioral diagnostics such as novelty, repetition, and catalog concentration&lt;/li&gt;
&lt;li&gt;short trajectory traces that make model behavior easier to inspect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The canonical public run is intentionally narrow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MovieLens 100K&lt;/li&gt;
&lt;li&gt;Model A: popularity baseline&lt;/li&gt;
&lt;li&gt;Model B: genre-profile recommender with a popularity prior&lt;/li&gt;
&lt;li&gt;4 fixed buckets&lt;/li&gt;
&lt;li&gt;one frozen report bundle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point is not to claim that these two models define recommender evaluation. The point is to create one clean, reproducible proof that aggregate offline metrics can hide useful pre-launch information.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Canonical Result
&lt;/h2&gt;

&lt;p&gt;The canonical MovieLens run shows the core value in one comparison.&lt;/p&gt;

&lt;p&gt;On aggregate offline ranking metrics, the popularity baseline wins:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Recall@10&lt;/th&gt;
&lt;th&gt;NDCG@10&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model A&lt;/td&gt;
&lt;td&gt;0.088&lt;/td&gt;
&lt;td&gt;0.057&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model B&lt;/td&gt;
&lt;td&gt;0.058&lt;/td&gt;
&lt;td&gt;0.036&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
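
&lt;p&gt;For readers who want the two metrics pinned down, here is a hedged sketch of how &lt;code&gt;Recall@10&lt;/code&gt; and &lt;code&gt;NDCG@10&lt;/code&gt; are typically computed per user with binary relevance. This is my own illustrative implementation, not the harness's code, which may handle ties and edge cases differently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math

def recall_at_k(ranked, relevant, k=10):
    # fraction of this user's held-out positives recovered in the top k
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    # discounted gain of hits, normalized by the best achievable ordering
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in rel)
    ideal = sum(1.0 / math.log2(pos + 2)
                for pos in range(min(len(rel), k)))
    return dcg / ideal if ideal else 0.0

# One user: two held-out positives, one recovered at rank 2.
print(recall_at_k(["m1", "m2", "m3"], ["m2", "m9"]))   # 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table values are averages of such per-user scores, and that averaging is exactly the compression the bucketed view undoes.&lt;/p&gt;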

&lt;p&gt;If we stopped there, the conclusion would be straightforward: Model A looks better.&lt;/p&gt;

&lt;p&gt;But the bucketed view tells a different story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bucket&lt;/th&gt;
&lt;th&gt;Model A&lt;/th&gt;
&lt;th&gt;Model B&lt;/th&gt;
&lt;th&gt;Delta (B-A)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Conservative mainstream&lt;/td&gt;
&lt;td&gt;0.519&lt;/td&gt;
&lt;td&gt;0.532&lt;/td&gt;
&lt;td&gt;0.012&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explorer / novelty-seeking&lt;/td&gt;
&lt;td&gt;0.339&lt;/td&gt;
&lt;td&gt;0.523&lt;/td&gt;
&lt;td&gt;0.184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Niche-interest&lt;/td&gt;
&lt;td&gt;0.443&lt;/td&gt;
&lt;td&gt;0.722&lt;/td&gt;
&lt;td&gt;0.279&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-patience&lt;/td&gt;
&lt;td&gt;0.321&lt;/td&gt;
&lt;td&gt;0.364&lt;/td&gt;
&lt;td&gt;0.043&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mxcw3czl29t6fzv48hj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mxcw3czl29t6fzv48hj.png" alt="What offline metrics missed" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is the point.&lt;/p&gt;

&lt;p&gt;Aggregate offline metrics say one thing. The segment-aware view says something more useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the baseline is better at recovering held-out positives&lt;/li&gt;
&lt;li&gt;the candidate is much stronger for important user lenses&lt;/li&gt;
&lt;li&gt;the behavioral profile of the system changes in ways the aggregate view compresses away&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The behavioral diagnostics make that even clearer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Novelty&lt;/th&gt;
&lt;th&gt;Repetition&lt;/th&gt;
&lt;th&gt;Catalog concentration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model A&lt;/td&gt;
&lt;td&gt;0.395&lt;/td&gt;
&lt;td&gt;0.279&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model B&lt;/td&gt;
&lt;td&gt;0.678&lt;/td&gt;
&lt;td&gt;0.664&lt;/td&gt;
&lt;td&gt;0.717&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is worth pausing on, because not every behavioral metric moves in the same direction.&lt;/p&gt;

&lt;p&gt;Model B is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more novel&lt;/li&gt;
&lt;li&gt;less catalog-concentrated&lt;/li&gt;
&lt;li&gt;but also more repetitive in this diagnostic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not a bug in the framework. It is part of the point. Different recommendation strategies produce different behavioral signatures, and pre-launch evaluation should help make those signatures visible instead of collapsing everything into one average.&lt;/p&gt;
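
&lt;p&gt;To make the diagnostics above less abstract, here is one common way such signals can be computed from recommendation lists. These are illustrative definitions of my own, not necessarily the harness's exact formulas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def repetition_rate(trajectory):
    # fraction of recommendation slots that repeat an earlier item
    seen, repeats = set(), 0
    for item in trajectory:
        if item in seen:
            repeats += 1
        seen.add(item)
    return repeats / len(trajectory) if trajectory else 0.0

def catalog_concentration(rec_lists, k):
    # 1.0 when every user receives the same k items;
    # falls toward k / catalog_size as lists diversify
    distinct = len(set(item for recs in rec_lists for item in recs))
    return k / distinct

def novelty(rec_lists, popularity):
    # mean unpopularity of recommended items, popularity scores in [0, 1]
    items = [item for recs in rec_lists for item in recs]
    return sum(1.0 - popularity[item] for item in items) / len(items)

# A baseline that shows everyone the same two head items scores
# exactly 1.0 on concentration, matching Model A above.
print(catalog_concentration([["a", "b"], ["a", "b"]], k=2))   # 1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

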

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0042vaqhregokfcsmnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0042vaqhregokfcsmnz.png" alt="Bucket utility comparison" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What “Synthetic Population Testing” Means Here
&lt;/h2&gt;

&lt;p&gt;It is important to be precise about this phrase.&lt;/p&gt;

&lt;p&gt;What I have today is &lt;strong&gt;not&lt;/strong&gt; a rich simulation of realistic synthetic humans. There are no agent conversations, no generated personas with biographies, and no claim that the current system faithfully reproduces real user psychology.&lt;/p&gt;

&lt;p&gt;What the artifact does have is a simpler and more controlled version of the same idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fixed behavioral lenses&lt;/li&gt;
&lt;li&gt;explicit utility assumptions&lt;/li&gt;
&lt;li&gt;short trajectory simulation under those assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The four v1 buckets are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conservative mainstream&lt;/li&gt;
&lt;li&gt;Explorer / novelty-seeking&lt;/li&gt;
&lt;li&gt;Niche-interest&lt;/li&gt;
&lt;li&gt;Low-patience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each bucket values recommendation behavior differently. The evaluation then asks how the same two models behave when the user lens changes.&lt;/p&gt;
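
&lt;p&gt;One way to read “each bucket values recommendation behavior differently” is as an explicit per-bucket utility function. The weights below are hypothetical, purely to show the shape of the idea; the artifact's actual utility definitions live in its report bundle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical weights over three behavioral signals, each in [0, 1].
BUCKET_WEIGHTS = {
    "conservative_mainstream": {"accuracy": 0.7, "novelty": 0.1, "variety": 0.2},
    "explorer":                {"accuracy": 0.3, "novelty": 0.5, "variety": 0.2},
    "niche_interest":          {"accuracy": 0.4, "novelty": 0.2, "variety": 0.4},
    "low_patience":            {"accuracy": 0.8, "novelty": 0.1, "variety": 0.1},
}

def bucket_utility(bucket, signals):
    # weighted combination of per-list behavioral signals
    weights = BUCKET_WEIGHTS[bucket]
    return sum(weights[name] * signals[name] for name in weights)

# The same model output scores differently under different lenses.
signals = {"accuracy": 0.4, "novelty": 0.9, "variety": 0.6}
print(round(bucket_utility("explorer", signals), 2))                  # 0.69
print(round(bucket_utility("conservative_mainstream", signals), 2))   # 0.49
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

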

&lt;p&gt;So when I say &lt;strong&gt;synthetic population testing&lt;/strong&gt; here, I mean:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;an early, lightweight form of synthetic population testing built from fixed behavioral lenses, not full synthetic-user simulation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I think that still matters. It turns vague product intuition like “some users may prefer this model more than others” into an explicit, reproducible pre-launch test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fomv81qvtmthfzd5xiz2j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fomv81qvtmthfzd5xiz2j.png" alt="What synthetic means here" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is Better Than Another Aggregate Metric
&lt;/h2&gt;

&lt;p&gt;A natural response to the first post is to ask whether we simply need better aggregate metrics.&lt;/p&gt;

&lt;p&gt;I do not think that is enough.&lt;/p&gt;

&lt;p&gt;The problem is not only that a metric is imperfect. The deeper problem is that recommender quality is heterogeneous.&lt;/p&gt;

&lt;p&gt;Different users are helped by different behaviors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;some want safer, familiar, high-exposure items&lt;/li&gt;
&lt;li&gt;some benefit from more novelty and more variety&lt;/li&gt;
&lt;li&gt;some have narrower tastes that require stronger matching to long-tail pockets&lt;/li&gt;
&lt;li&gt;some degrade faster when sequences become stale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single global score cannot represent all of that well.&lt;/p&gt;

&lt;p&gt;That is why I think the next useful layer should look more like testing against a small synthetic population than inventing one more scalar.&lt;/p&gt;

&lt;p&gt;Instead of asking only:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;which model wins on average?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;we should also ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;which model wins for which behavioral lens?&lt;/p&gt;

&lt;p&gt;where do the models differ most?&lt;/p&gt;

&lt;p&gt;what kind of trajectory does each model produce?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This does not mean the current bucket lenses are perfect. It means they are often more informative than a single collapsed average.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Short Trajectory Example
&lt;/h2&gt;

&lt;p&gt;The trajectory view matters because recommendation quality is not only one-step.&lt;/p&gt;

&lt;p&gt;Here is one Explorer / novelty-seeking comparison from the canonical run:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model A&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raiders of the Lost Ark -&amp;gt; Fargo -&amp;gt; Toy Story -&amp;gt; Return of the Jedi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Model B&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prophecy, The -&amp;gt; Cat People -&amp;gt; Wes Craven's New Nightmare -&amp;gt; Relic, The
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first sequence stays much closer to familiar, high-exposure titles. The second is much more tailored to a narrower taste profile and much more novel.&lt;/p&gt;

&lt;p&gt;This is exactly the kind of difference that disappears when evaluation is reduced to one aggregate ranking score.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kzh5vaksyaf7drogtvz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kzh5vaksyaf7drogtvz.png" alt="Explorer trace comparison" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Before Launch
&lt;/h2&gt;

&lt;p&gt;Pre-launch evaluation is about decisions, not just measurements.&lt;/p&gt;

&lt;p&gt;If a team is deciding whether to ship a new recommender, the real question is usually not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;did one mean score go up?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is closer to this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;who gets a better experience?&lt;/li&gt;
&lt;li&gt;who gets a worse one?&lt;/li&gt;
&lt;li&gt;does the candidate become more repetitive?&lt;/li&gt;
&lt;li&gt;does it collapse toward head items?&lt;/li&gt;
&lt;li&gt;does it create a healthier exploration profile?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are product and system questions, not only ranking-metric questions.&lt;/p&gt;

&lt;p&gt;That is why I like this framing. It stays honest about what the artifact is doing. It is not trying to predict the full online future. It is trying to make hidden tradeoffs visible earlier, with a tool that is still small enough to run, inspect, and reason about.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Is, And What It Is Not
&lt;/h2&gt;

&lt;p&gt;I think the strongest version of this argument is the honest one.&lt;/p&gt;

&lt;p&gt;This artifact &lt;strong&gt;is&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a small public proof&lt;/li&gt;
&lt;li&gt;a recommender-specific evaluation layer&lt;/li&gt;
&lt;li&gt;a way to make segment-level and trajectory-level tradeoffs visible&lt;/li&gt;
&lt;li&gt;a first wedge into broader testing for interactive systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This artifact is &lt;strong&gt;not&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a proof that the candidate model is globally better&lt;/li&gt;
&lt;li&gt;a replacement for offline evaluation&lt;/li&gt;
&lt;li&gt;a replacement for online experiments&lt;/li&gt;
&lt;li&gt;a full synthetic-human simulation framework&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction matters. If this work is useful, it will be useful because it is clear about what it adds, not because it overclaims.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Better Evaluation Stack
&lt;/h2&gt;

&lt;p&gt;The long-term picture I have in mind looks something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Standard offline evaluation remains the first layer.&lt;/li&gt;
&lt;li&gt;Segment-aware and trajectory-aware diagnostics become the second layer.&lt;/li&gt;
&lt;li&gt;Richer synthetic population testing may become the next layer after that.&lt;/li&gt;
&lt;li&gt;Online experiments still remain necessary for final validation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is a much more realistic stack than pretending a single aggregate metric can do the whole job.&lt;/p&gt;

&lt;p&gt;In that stack, the current artifact sits at layer two. It adds explicit behavioral lenses and short trajectory diagnostics to the familiar offline comparison workflow.&lt;/p&gt;

&lt;p&gt;That is why I think it matters, even in its current limited form.&lt;/p&gt;

&lt;p&gt;It is not the final answer.&lt;/p&gt;

&lt;p&gt;It is the first concrete artifact of the missing layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx46l3fmz2thhy9zsdzw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx46l3fmz2thhy9zsdzw2.png" alt="Canonical result snapshot" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The first post argued that offline evaluation is not enough for recommendation systems.&lt;/p&gt;

&lt;p&gt;This artifact is my first practical answer to what should come next.&lt;/p&gt;

&lt;p&gt;Not a giant platform. Not a perfect simulation. Not a replacement for offline evaluation.&lt;/p&gt;

&lt;p&gt;Just a small, reproducible evaluation harness that compares a baseline and a candidate through multiple behavioral lenses and shows tradeoffs that aggregate metrics compress away.&lt;/p&gt;

&lt;p&gt;If offline evaluation is the first screen, then synthetic population testing, in some form, may be one of the next useful layers.&lt;/p&gt;

&lt;p&gt;This v1 is a lightweight version of that idea.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you want to see the public artifact, the canonical MovieLens demo lives in the &lt;code&gt;limitation&lt;/code&gt; repo as a report, JSON result bundle, and supporting visuals.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>algorithms</category>
      <category>research</category>
    </item>
    <item>
      <title>Why Offline Evaluation Is Not Enough for Recommendation Systems?</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sun, 29 Mar 2026 16:40:52 +0000</pubDate>
      <link>https://forem.com/alankritverma/why-offline-evaluation-is-not-enough-for-recommendation-systems-15ii</link>
      <guid>https://forem.com/alankritverma/why-offline-evaluation-is-not-enough-for-recommendation-systems-15ii</guid>
      <description>&lt;h2&gt;
  
  
  Why Offline Evaluation Is Not Enough for Recommendation Systems
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Offline evaluation is essential for recommender systems. It is also easy to mistake for a fuller measure of quality than it really is.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Offline evaluation is useful, fast, and necessary for recommender systems.&lt;/li&gt;
&lt;li&gt;But it is built on logged behavior generated under older exposure policies.&lt;/li&gt;
&lt;li&gt;That makes it weak at judging policy shifts, novel items, cold start behavior, and longer interaction trajectories.&lt;/li&gt;
&lt;li&gt;In a small MovieLens demo, the popularity baseline wins on aggregate offline ranking metrics, while a more personalized model does better for explorer, niche-interest, and low-patience user buckets.&lt;/li&gt;
&lt;li&gt;The practical conclusion is not to replace offline evaluation, but to stop treating it as a full test of recommender quality.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Recommendation systems are interactive systems, but offline evaluation often treats them like static predictors.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  1. The Testing Gap
&lt;/h3&gt;

&lt;p&gt;We know how to test deterministic software. We are much less certain about how to test systems that influence the behavior they later observe.&lt;/p&gt;

&lt;p&gt;Recommendation systems sit squarely in that second category. They do not just estimate what a user might click, watch, or purchase. They decide what the user gets a chance to see, and that choice helps shape the data that will later be treated as evidence.&lt;/p&gt;

&lt;p&gt;Offline evaluation is one of the standard tools in recommender systems for good reason. It is practical, fast, and often highly informative. A team can compare candidate models on historical interaction data long before it is ready to send live traffic to a new ranking policy.&lt;/p&gt;

&lt;p&gt;That usefulness, however, can make offline evaluation easy to over-interpret. A strong offline result often sounds like a strong statement about real recommendation quality. Sometimes it is. But the conclusion is narrower than it first appears.&lt;/p&gt;

&lt;p&gt;Historical interaction logs are not simply records of user preference. They are records of user preference under a particular pattern of exposure. They reflect what earlier systems chose to rank, recommend, and repeat. In that sense, the data is policy-dependent from the beginning.&lt;/p&gt;

&lt;p&gt;This matters because recommendation quality is not only about matching a fixed label. A recommender is an interactive system. Its outputs affect future inputs. Change the policy, and over time you may change what users discover, what they come to trust, what they ignore, and what they eventually consume.&lt;/p&gt;

&lt;p&gt;Consider a movie recommender. One model may reliably surface popular, familiar titles. Another may be more personal and more willing to introduce niche films that fit a specific user's taste. If the historical logs were generated under a system that already emphasized mainstream titles, those logs may be much richer in evidence for the first model's choices than for the second model's.&lt;/p&gt;

&lt;p&gt;That does not make offline evaluation wrong. It does mean the object being measured is more limited than many teams would like. Offline evaluation is useful, but insufficient.&lt;/p&gt;

&lt;p&gt;The point of this article is narrow. It is not that offline evaluation should be discarded, and it is not a general argument about all machine learning systems. The claim is simpler: recommendation systems are interactive systems, and that fact places real limits on what historical replay can tell us.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. What Offline Evaluation Is
&lt;/h3&gt;

&lt;p&gt;Offline evaluation, in the recommender setting, means evaluating a model on historical logged interactions rather than on live user traffic. The usual pattern is straightforward: train on past user-item behavior, hold out a later slice of interactions, and ask whether the model ranks the held-out items highly for the relevant users.&lt;/p&gt;

&lt;p&gt;In a movie recommendation system, the data might include watches, clicks, ratings, or add-to-list events. A model is trained on part of that history and then evaluated on interactions that were not shown during training. If a user later watched a particular film, one basic offline question is whether that film would have appeared near the top of the model's ranked list.&lt;/p&gt;

&lt;p&gt;This setup supports the ranking-style metrics commonly used in recommender systems. Teams may report measures such as Recall@K, hit rate, or NDCG to summarize how well a model recovers held-out interactions. The exact metric matters, but the general logic is the same: use historical behavior as a proxy for whether the recommendations were good.&lt;/p&gt;

&lt;p&gt;That approach is attractive because it gives a concrete and reproducible testing loop. Candidate models can be compared against the same held-out data. Regressions can be caught before launch. Incremental improvements can be measured without the cost and risk of online experimentation.&lt;/p&gt;

&lt;p&gt;It is also important to be precise about what this evaluation is actually saying. Offline evaluation does not directly measure how users would respond to a new policy in a live environment. It measures how well a model aligns with historical interactions recorded under earlier exposure conditions.&lt;/p&gt;

&lt;p&gt;That distinction is easy to blur because the workflow looks so familiar. We have training data, a test set, and a metric. But in recommendation systems, the labels are not independent of the system that helped generate them. The held-out watch or click is not just a fact about the user. It is also a fact about what the user was shown.&lt;/p&gt;

&lt;p&gt;For now, that is enough of a working definition. Offline evaluation is historical replay over logged interactions, typically framed as a ranking problem, and used as a proxy for recommendation quality under observed conditions. It is a very useful proxy. The rest of the article asks where its boundaries are.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Where Offline Evaluation Breaks
&lt;/h3&gt;

&lt;p&gt;The limitations of offline evaluation do not come from a single bad metric or a single avoidable mistake. They come from a more basic fact about recommender data: the data is generated under a policy. What users do in the logs depends in part on what earlier systems chose to show them.&lt;/p&gt;

&lt;p&gt;That sounds obvious when stated directly. But it has deeper consequences than it first appears. If the evidence used for evaluation is itself shaped by older recommendation decisions, then offline evaluation is not observing some neutral ground truth about relevance. It is observing relevance through the filter of past exposure.&lt;/p&gt;

&lt;p&gt;In a static prediction task, that distinction is often less severe. In recommendation, it sits near the center of the problem. A new recommender is rarely judged against untouched labels. It is judged against behavior recorded under an older recommender, with its own ranking habits, popularity biases, and coverage patterns.&lt;/p&gt;

&lt;p&gt;We can state the issue in simple notation. Let &lt;code&gt;pi_0&lt;/code&gt; be the logging policy that generated the historical data, and let &lt;code&gt;pi_1&lt;/code&gt; be the new policy we want to evaluate. Offline replay uses observations gathered under &lt;code&gt;pi_0&lt;/code&gt; to estimate the quality of &lt;code&gt;pi_1&lt;/code&gt;. If &lt;code&gt;pi_1&lt;/code&gt; behaves much like &lt;code&gt;pi_0&lt;/code&gt;, that may be informative. If it changes exposure materially, the estimate becomes much less complete.&lt;/p&gt;

&lt;p&gt;This is the core mismatch. The quantity we want is user response under the candidate policy. The quantity we usually observe is user response under the previous policy. The two overlap, but they are not the same object.&lt;/p&gt;
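
&lt;p&gt;A toy sketch makes the mismatch mechanical. Everything here is synthetic; it only illustrates that replay can score &lt;code&gt;pi_1&lt;/code&gt; on the interactions &lt;code&gt;pi_0&lt;/code&gt; actually created, and nowhere else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Synthetic catalog; the logging policy pi_0 exposed only the head.
catalog = list(range(20))
pi_0_slate = catalog[:5]       # items 0-4 were shown, so logs exist for them
pi_1_slate = catalog[3:8]      # candidate pi_1 shifts exposure toward the tail

# Interactions can only be logged for items pi_0 exposed.
logged = set(pi_0_slate)

# Replay can only judge the candidate where the logs have support.
overlap = [item for item in pi_1_slate if item in logged]
coverage = len(overlap) / len(pi_1_slate)

print(overlap)    # [3, 4]
print(coverage)   # 0.4: most of the candidate slate is unobservable in replay
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The more &lt;code&gt;pi_1&lt;/code&gt; departs from &lt;code&gt;pi_0&lt;/code&gt;, the smaller this overlap becomes.&lt;/p&gt;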

&lt;h4&gt;
  
  
  3.1 Exposure Bias
&lt;/h4&gt;

&lt;p&gt;The first break is exposure bias. Users can only react to items they were actually shown.&lt;/p&gt;

&lt;p&gt;That means an interaction log is not just a record of what users preferred. It is also a record of what the system made available. When an item receives no click, no watch, or no rating, that absence does not cleanly mean the item was irrelevant. In many cases it means the item was never placed in front of the user at all.&lt;/p&gt;

&lt;p&gt;This matters immediately for offline evaluation. Suppose a movie platform has historically given heavy exposure to well-known studio releases and much lighter exposure to niche films. The resulting data will contain dense evidence for how users responded to the mainstream catalog and sparse evidence for how they would have responded to more specialized titles.&lt;/p&gt;

&lt;p&gt;The bias here is structural rather than anecdotal. If observed feedback only exists for exposed items, then the support of the evaluation data is concentrated where the logging policy chose to spend attention. In compact form, observed reward is only available where &lt;code&gt;pi_0(i | u, c)&lt;/code&gt; is nontrivial for user &lt;code&gt;u&lt;/code&gt;, item &lt;code&gt;i&lt;/code&gt;, and context &lt;code&gt;c&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is why historical replay is partial. It is not sampling uniformly from all relevant user-item pairs. It is sampling from the subset that earlier policies made visible. In a movie recommender, this can make “popular” look easier to measure than “personally relevant,” even when the latter is closer to the product goal.&lt;/p&gt;

&lt;h4&gt;
  
  
  3.2 Old-Policy Lock-In
&lt;/h4&gt;

&lt;p&gt;Exposure bias becomes more consequential when a new policy differs from the old one in systematic ways. This is where old-policy lock-in appears.&lt;/p&gt;

&lt;p&gt;In most offline evaluations, the labels used to assess a candidate model were generated under a different ranking policy. A held-out watch event looks like a simple target, but it is downstream of earlier recommendation decisions. The new model is therefore being judged with evidence produced by the system it may be trying to replace.&lt;/p&gt;

&lt;p&gt;This creates an asymmetry. Models that resemble the old policy often enjoy richer and cleaner evidence in the historical logs. Models that shift probability mass toward less exposed regions of the catalog are evaluated in the parts of the space where the logs are thinnest.&lt;/p&gt;

&lt;p&gt;Return to the movie example. If the old system strongly favored familiar blockbusters, then the held-out data will naturally contain many interactions with those titles. A candidate model that continues to rank them highly will line up well with the log. Another model that is more willing to surface quieter but well-matched films may look weaker offline, not necessarily because users dislike those recommendations, but because the old system rarely created opportunities to observe that preference.&lt;/p&gt;

&lt;p&gt;This is one reason a better recommender can look worse offline. The issue is not only model accuracy. It is evaluation support. When performance is estimated on outcomes generated under &lt;code&gt;pi_0&lt;/code&gt;, the comparison can systematically favor policies that stay close to &lt;code&gt;pi_0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That does not make all offline comparisons invalid. If two models differ only slightly, offline evaluation can still be highly useful. But when a candidate policy changes exposure patterns in meaningful ways, offline results should be read with more caution than the metric alone suggests.&lt;/p&gt;
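&lt;p&gt;A tiny hand-constructed sketch (hypothetical users, items, and rankings, not real log data) makes the asymmetry concrete. Two candidates with identical true quality receive very different offline scores, because held-out positives only exist for items &lt;code&gt;pi_0&lt;/code&gt; exposed:&lt;/p&gt;

```python
# Hypothetical example: pi_0 only ever exposed items 0-4, so logged
# positives can never include the long-tail items 5-9, even when users
# would have loved them.
logged_positives = {          # held-out watches observed under pi_0
    "u1": {0, 2}, "u2": {1, 3}, "u3": {0, 4},
}
true_positives = {            # what users actually like (never fully logged)
    "u1": {0, 2, 7}, "u2": {1, 3, 8}, "u3": {0, 4, 9},
}

# Candidate A mimics pi_0's head-heavy exposure; candidate B trades one
# head slot for the long-tail match pi_0 never showed.
rank_a = {"u1": [0, 2, 1], "u2": [1, 3, 0], "u3": [0, 4, 1]}
rank_b = {"u1": [7, 0, 5], "u2": [8, 1, 5], "u3": [9, 0, 5]}

def recall_at_k(ranks, positives, k=3):
    hits = total = 0
    for user, pos in positives.items():
        hits += len(set(ranks[user][:k]) & pos)
        total += len(pos)
    return hits / total

print(recall_at_k(rank_a, logged_positives))  # 1.0  -> A looks perfect on replay
print(recall_at_k(rank_b, logged_positives))  # 0.5  -> B looks much worse
print(recall_at_k(rank_a, true_positives))    # ~0.67 -> but against true tastes
print(recall_at_k(rank_b, true_positives))    # ~0.67 -> the two are identical
```

&lt;p&gt;Replay against the log rewards the policy that resembles &lt;code&gt;pi_0&lt;/code&gt;, even though both candidates recover users' true preferences equally well.&lt;/p&gt;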

&lt;h4&gt;
  
  
  3.3 Novel Items and Cold Start
&lt;/h4&gt;

&lt;p&gt;The same logic becomes even sharper for new or rarely exposed content.&lt;/p&gt;

&lt;p&gt;Offline evaluation is strongest where historical evidence is plentiful. It is weakest where exposure has been limited, recent, or absent. Unfortunately, those are often exactly the regions where recommendation systems are asked to do something valuable: introduce new items, expand coverage, and connect users to parts of the catalog they would not have reached on their own.&lt;/p&gt;

&lt;p&gt;In a movie platform, consider a newly added independent film with very little interaction history. A model may have good reasons to recommend it to a narrow set of users based on metadata, embeddings, or nearby behavioral signals. But if the film barely appeared under the previous policy, then historical logs offer limited evidence for how good that recommendation would actually be.&lt;/p&gt;

&lt;p&gt;The problem is not only that the item is new. The deeper issue is that offline replay inherits the conservatism of past exposure. It is much easier to validate recommendations for already visible inventory than for inventory the old policy neglected.&lt;/p&gt;

&lt;p&gt;This creates a subtle but important pressure. Systems that stay near the historically exposed core of the catalog are easier to justify with offline evidence. Systems that broaden exposure toward the tail are often evaluated precisely where the data is least informative. Over time, that can make conservative recommendation strategies look more reliable than they really are, and exploratory strategies look less supported than they might deserve.&lt;/p&gt;

&lt;p&gt;The claim is not that offline evaluation fails in every cold-start setting. It is that historical replay is structurally weak exactly where a recommender tries to broaden exposure. For recommenders, novelty is often where the evidence is thinnest.&lt;/p&gt;

&lt;h4&gt;
  
  
  3.4 Trajectory Blindness
&lt;/h4&gt;

&lt;p&gt;Even if the exposure problem disappeared, there would still be another limitation. Recommendation quality is not purely one-step.&lt;/p&gt;

&lt;p&gt;Most offline metrics compress evaluation into local ranking success. Did the model place the held-out item near the top? Did it recover the next watch? Did it improve a ranking score on observed interactions? Those are reasonable questions, but they are mostly questions about immediate alignment with historical events.&lt;/p&gt;
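&lt;p&gt;For concreteness, here is a minimal sketch of the two most common one-step metrics, Recall@K and binary-relevance NDCG@K (the titles are illustrative placeholders):&lt;/p&gt;

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of held-out relevant items recovered in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Position-discounted hits in the top k, normalized by the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

ranked = ["blockbuster", "thriller", "indie", "documentary"]
print(recall_at_k(ranked, ["indie"], 3))  # 1.0: the held-out watch is in the top 3
print(ndcg_at_k(ranked, ["indie"], 3))    # 0.5: but it sits at rank 3, not rank 1
```

&lt;p&gt;Both numbers answer the same kind of question: did the model place one logged event near the top of one list. Nothing in them looks across sessions.&lt;/p&gt;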

&lt;p&gt;Users, however, experience recommendation systems as sequences. They return across sessions. They compare one recommendation to the previous one. They notice repetition. They develop trust or impatience. They learn whether the system helps them explore or merely loops them through slight variations of what it already knows how to sell.&lt;/p&gt;

&lt;p&gt;This is where trajectory blindness enters. A recommender can look strong on one-step relevance and still create a poor multi-step experience.&lt;/p&gt;

&lt;p&gt;Imagine a movie recommender that repeatedly serves highly similar popular thrillers because those titles have strong historical watch signals. In a one-step offline evaluation, this may look sensible. The recommendations are close to what users have previously consumed, and the metrics may reward that closeness. But over several sessions the user may experience the system as narrow, repetitive, and increasingly unhelpful.&lt;/p&gt;

&lt;p&gt;Another model might trade a small amount of one-step certainty for a better sequence. It may alternate between reliable choices and occasional high-fit long-tail discoveries. That kind of quality often lives in the trajectory rather than in any single ranking event.&lt;/p&gt;

&lt;p&gt;In notation, many offline metrics focus on something close to the quality of &lt;code&gt;r_t&lt;/code&gt; at a single step. But recommender quality often depends on properties of the sequence &lt;code&gt;(a_1, r_1), ..., (a_T, r_T)&lt;/code&gt;: how concentrated the recommendations are, whether novelty appears at the right rate, whether boredom accumulates, and whether the system adapts well after earlier choices.&lt;/p&gt;
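&lt;p&gt;Sequence-level properties like these are straightforward to measure once sessions are kept together. A minimal sketch, using hypothetical session data, of two such diagnostics: repetition across sessions, and concentration in an overexposed head of the catalog:&lt;/p&gt;

```python
def repetition_rate(sessions):
    """Share of recommendations that repeat an item shown in an earlier session."""
    seen, repeats, total = set(), 0, 0
    for session in sessions:
        for item in session:
            repeats += item in seen
            total += 1
        seen.update(session)
    return repeats / total

def head_concentration(sessions, head):
    """Share of recommendations drawn from the popular 'head' of the catalog."""
    recs = [item for session in sessions for item in session]
    return sum(item in head for item in recs) / len(recs)

head = {"t1", "t2", "t3"}                        # historically overexposed titles
narrow = [["t1", "t2"], ["t1", "t3"], ["t2", "t1"]]
broad  = [["t1", "n1"], ["n2", "n3"], ["t2", "n4"]]

print(repetition_rate(narrow), head_concentration(narrow, head))  # 0.5 1.0
print(repetition_rate(broad), head_concentration(broad, head))    # 0.0 ~0.33
```

&lt;p&gt;Neither diagnostic appears in a one-step ranking score, yet both separate the narrow trajectory from the broad one immediately.&lt;/p&gt;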

&lt;p&gt;This is not an argument against ranking metrics. It is an argument about what they leave out. They summarize one-step fit to logged behavior. They do not, by themselves, tell us whether the interaction over time becomes richer, narrower, more repetitive, or more satisfying.&lt;/p&gt;

&lt;h4&gt;
  
  
  3.5 What This Means
&lt;/h4&gt;

&lt;p&gt;Taken together, these limitations point to a single conclusion. Offline evaluation often treats recommendation as if it were a static prediction problem with fixed labels. In practice, recommendation is an interactive system problem.&lt;/p&gt;

&lt;p&gt;The system chooses what to expose. Exposure shapes what users can respond to. Those responses become the data for future training and evaluation. Change the policy, and you may change the distribution of behavior itself.&lt;/p&gt;

&lt;p&gt;Once that is clear, the goal of evaluation also becomes clearer. The question is not only whether a model can replay the past. It is whether it can support good interaction under a changed policy. Historical replay helps answer that question, but only in part.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Why It Still Matters
&lt;/h3&gt;

&lt;p&gt;None of these limitations make offline evaluation disposable. They define its scope. That distinction matters.&lt;/p&gt;

&lt;p&gt;Recommendation teams rely on offline evaluation because it solves real engineering problems well. It is fast, reproducible, and comparatively cheap. It allows model changes to be screened before they reach users. It supports regression testing, debugging, ablation work, and benchmarking across candidate approaches. In most practical settings, there is no credible evaluation stack that excludes it.&lt;/p&gt;

&lt;p&gt;That remains true even after the critique above. A recommender team still needs a way to reject clearly weak models, validate implementation changes, and compare alternatives under a common protocol. Offline evaluation is often the first place where obvious failures become visible. If a ranking model cannot perform competitively in historical replay, it is usually hard to justify sending it to live traffic.&lt;/p&gt;

&lt;p&gt;This is especially important because online tests are expensive in more than one sense. They consume time, user attention, and organizational focus. They are also constrained by risk. A platform may be willing to test a modest ranking change online, but not a model that already appears unstable or uncompetitive offline. Historical evaluation remains the practical filter through which many candidate models must pass.&lt;/p&gt;

&lt;p&gt;The right conclusion, then, is not that offline evaluation should be replaced. It is that offline evaluation should be placed correctly. It is a strong tool for iteration and a weak tool for making broad claims about full recommender quality under changed exposure.&lt;/p&gt;

&lt;p&gt;In other words, the critique is intentional. Offline evaluation is widely used because it earns its place. The mistake is not using it. The mistake is mistaking it for a complete test.&lt;/p&gt;

&lt;p&gt;One compact way to summarize that balance is to separate what offline replay usually measures well from what it tends to leave undermeasured.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Evaluation aspect&lt;/th&gt;
&lt;th&gt;What offline replay usually captures&lt;/th&gt;
&lt;th&gt;What it tends to miss or undermeasure&lt;/th&gt;
&lt;th&gt;Movie recommender example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Immediate relevance under existing exposure&lt;/td&gt;
&lt;td&gt;Whether held-out watched items appear near the top of the ranked list&lt;/td&gt;
&lt;td&gt;Whether that ranking would still look good under a materially different exposure policy&lt;/td&gt;
&lt;td&gt;A familiar blockbuster appears in the top &lt;code&gt;K&lt;/code&gt; because it was already heavily exposed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance under policy shift&lt;/td&gt;
&lt;td&gt;Small improvements that stay near the old policy&lt;/td&gt;
&lt;td&gt;Quality of recommendations in regions where the candidate policy differs most&lt;/td&gt;
&lt;td&gt;A model that surfaces more niche dramas has little historical support where it differs from the old system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Novel or underexposed items&lt;/td&gt;
&lt;td&gt;Some signal for items with enough prior exposure&lt;/td&gt;
&lt;td&gt;Items that were new, rare, or historically under-shown&lt;/td&gt;
&lt;td&gt;A newly added indie film receives little offline credit even if it fits the user well&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold start behavior&lt;/td&gt;
&lt;td&gt;Very coarse performance on sparse users or items&lt;/td&gt;
&lt;td&gt;Early recommendation quality when interaction history is thin&lt;/td&gt;
&lt;td&gt;A new documentary enters the catalog with too little evidence for replay to judge it fairly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repetition over sessions&lt;/td&gt;
&lt;td&gt;Little, unless explicitly measured&lt;/td&gt;
&lt;td&gt;Accumulated sameness across repeated visits&lt;/td&gt;
&lt;td&gt;The recommender keeps offering slight variations of the same thriller over multiple sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Novelty and exploration&lt;/td&gt;
&lt;td&gt;Limited signal through held-out interactions&lt;/td&gt;
&lt;td&gt;Whether the system introduces useful discovery at the right rate&lt;/td&gt;
&lt;td&gt;A long-tail science-fiction recommendation may be good, but the old logs barely contain exposure to it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Segment-level differences&lt;/td&gt;
&lt;td&gt;Aggregate averages over the evaluation set&lt;/td&gt;
&lt;td&gt;Which user groups are helped or hurt by the new policy&lt;/td&gt;
&lt;td&gt;Mainstream users may do well under Model A while exploration-seeking users do better under Model B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trajectory-level user experience&lt;/td&gt;
&lt;td&gt;Almost nothing in standard one-step metrics&lt;/td&gt;
&lt;td&gt;Trust, boredom, fatigue, and satisfaction over sequences&lt;/td&gt;
&lt;td&gt;A user keeps getting acceptable next picks but gradually disengages from repetition&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  5. Running Example: Model A vs. Model B
&lt;/h3&gt;

&lt;p&gt;The structural issues above become easier to see with a simple running example. Consider a movie recommendation system with two candidate rankers.&lt;/p&gt;

&lt;p&gt;Model A is conservative. It leans toward popular, broadly watched titles and tends to recommend within the historically dominant regions of the catalog. It is usually safe, usually familiar, and often repetitive.&lt;/p&gt;

&lt;p&gt;Model B is more personalized. It still recommends mainstream films when they fit, but it is more willing to surface niche titles, less obvious matches, and items from thinner parts of the catalog when the user profile suggests they are a good fit.&lt;/p&gt;

&lt;p&gt;Suppose the historical logs were generated under an earlier recommendation policy that behaved more like Model A. Popular titles received heavy exposure. Niche titles were shown less often. Over time, that policy produced abundant feedback on the mainstream catalog and much weaker evidence on long-tail items.&lt;/p&gt;

&lt;p&gt;Now evaluate both models offline on held-out interactions from those logs.&lt;/p&gt;

&lt;p&gt;Model A will often look strong for a simple reason: it aligns well with the exposure pattern that helped generate the data. It ranks many of the same kinds of items the old system already showed, so the held-out interactions contain ample opportunities to reward it.&lt;/p&gt;

&lt;p&gt;Model B may be better calibrated to particular users, especially users with narrower tastes or stronger appetite for discovery. But if many of its most valuable recommendations lie in regions of the catalog that were rarely exposed before, the offline log may not give it much credit. The evidence needed to validate those choices was never fully collected.&lt;/p&gt;

&lt;p&gt;This does not mean Model B is necessarily better overall. Some users may indeed prefer the safer behavior of Model A. That is part of the point. Recommendation quality is heterogeneous across users and across sessions, and a single aggregate score can hide that heterogeneity.&lt;/p&gt;

&lt;p&gt;The difference becomes clearer over repeated interaction. Model A may continue to produce acceptable next-item recommendations while gradually narrowing the user's experience into a small, overexposed slice of the catalog. Model B may produce a slightly noisier immediate ranking while creating a better long-run sequence for users who value novelty or have specialized tastes.&lt;/p&gt;

&lt;p&gt;This is the kind of divergence a later demo can make visible. Two models may look similar on an aggregate offline metric and still differ meaningfully in repetition, novelty, and which user groups they serve well.&lt;/p&gt;

&lt;h4&gt;
  
  
  A Small MovieLens Demo
&lt;/h4&gt;

&lt;p&gt;To make that less abstract, I built a small comparison on MovieLens 100K. The setup is intentionally simple. Model A is a popularity baseline. Model B is a lightweight personalized recommender built from user genre profiles with a modest popularity prior. The point is not to produce the strongest possible recommender. The point is to see what different layers of evaluation say about the same pair of systems.&lt;/p&gt;
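&lt;p&gt;In spirit, the two models look something like this. This is a simplified sketch with made-up titles and counts, not the demo code itself:&lt;/p&gt;

```python
from collections import Counter, defaultdict

# Hypothetical interaction log: (user, movie); genres live in the catalog.
log = [("u1", "Star Wars"), ("u1", "Alien"),
       ("u2", "Star Wars"), ("u2", "Toy Story"), ("u3", "Toy Story")]
catalog = {
    "Star Wars": {"sci-fi", "action"}, "Toy Story": {"animation", "family"},
    "Alien": {"sci-fi", "horror"}, "Cat People": {"horror"},
    "Fargo": {"crime", "drama"},
}

popularity = Counter(movie for _, movie in log)

def model_a(user, k=2):
    """Popularity baseline: the same head-of-catalog list for everyone."""
    return [movie for movie, _ in popularity.most_common(k)]

# User genre profiles: how often each genre appears in the user's history.
profiles = defaultdict(Counter)
for user, movie in log:
    profiles[user].update(catalog[movie])

def model_b(user, k=2, prior=0.1):
    """Genre-profile match plus a modest popularity prior."""
    seen = {m for u, m in log if u == user}
    def score(movie):
        return (sum(profiles[user][g] for g in catalog[movie])
                + prior * popularity[movie])
    return sorted((m for m in catalog if m not in seen),
                  key=score, reverse=True)[:k]

print(model_a("u1"))  # ['Star Wars', 'Toy Story'] -- pure popularity
print(model_b("u1"))  # ['Cat People', 'Toy Story'] -- genre fit beats raw popularity
```

&lt;p&gt;Model A ignores the user entirely; Model B lets a strong genre match outrank a popular title while the prior keeps it from drifting too far from the mainstream.&lt;/p&gt;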

&lt;p&gt;&lt;strong&gt;Aggregate view:&lt;/strong&gt; on standard offline ranking metrics, Model A looks better.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Recall@10&lt;/th&gt;
&lt;th&gt;NDCG@10&lt;/th&gt;
&lt;th&gt;Novelty&lt;/th&gt;
&lt;th&gt;Repetition&lt;/th&gt;
&lt;th&gt;Catalog concentration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model A&lt;/td&gt;
&lt;td&gt;0.088&lt;/td&gt;
&lt;td&gt;0.057&lt;/td&gt;
&lt;td&gt;0.395&lt;/td&gt;
&lt;td&gt;0.675&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model B&lt;/td&gt;
&lt;td&gt;0.058&lt;/td&gt;
&lt;td&gt;0.036&lt;/td&gt;
&lt;td&gt;0.678&lt;/td&gt;
&lt;td&gt;0.693&lt;/td&gt;
&lt;td&gt;0.717&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If we stopped there, the conclusion would be straightforward: the popularity baseline wins offline.&lt;/p&gt;

&lt;p&gt;But that is exactly the point of the article. Once the evaluation is widened beyond a single aggregate view, the picture changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bucketed view:&lt;/strong&gt; the same two models look quite different once we ask who is being served well.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bucket&lt;/th&gt;
&lt;th&gt;Model A utility&lt;/th&gt;
&lt;th&gt;Model B utility&lt;/th&gt;
&lt;th&gt;Delta (B-A)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Conservative mainstream&lt;/td&gt;
&lt;td&gt;0.519&lt;/td&gt;
&lt;td&gt;0.532&lt;/td&gt;
&lt;td&gt;0.012&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explorer / novelty-seeking&lt;/td&gt;
&lt;td&gt;0.339&lt;/td&gt;
&lt;td&gt;0.523&lt;/td&gt;
&lt;td&gt;0.184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Niche-interest&lt;/td&gt;
&lt;td&gt;0.443&lt;/td&gt;
&lt;td&gt;0.722&lt;/td&gt;
&lt;td&gt;0.279&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-patience&lt;/td&gt;
&lt;td&gt;0.321&lt;/td&gt;
&lt;td&gt;0.364&lt;/td&gt;
&lt;td&gt;0.043&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bucketed results are more revealing than the aggregate ones. Explorer users and niche-interest users benefit much more from Model B. Low-patience users also do slightly better under Model B in the short-session simulation, even though the aggregate offline ranking metrics still prefer Model A.&lt;/p&gt;

&lt;p&gt;The behavior diagnostics tell a related story. Model B is substantially more novel and much less concentrated in the most popular slice of the catalog. For explorer users, bucket-level novelty rises from &lt;code&gt;0.405&lt;/code&gt; under Model A to &lt;code&gt;0.808&lt;/code&gt; under Model B. For niche-interest users, mean bucket utility rises by &lt;code&gt;0.279&lt;/code&gt;. That is not a rounding error. It is a segment-level change that the aggregate offline metrics compress away.&lt;/p&gt;
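&lt;p&gt;The demo's exact formulas are not reproduced here, but one representative novelty measure (an assumption for illustration, not necessarily the definition used above) is the mean self-information of recommended items under the logged popularity distribution, rescaled to &lt;code&gt;[0, 1]&lt;/code&gt;:&lt;/p&gt;

```python
import math

# Hypothetical play counts standing in for logged exposure.
play_counts = {"Star Wars": 64, "Fargo": 16, "Cat People": 1, "Relic": 1}
total_plays = sum(play_counts.values())  # 82

def novelty(recommended):
    """Mean self-information of recommended items, scaled so the rarest item -> 1."""
    max_info = math.log2(total_plays)  # the rarest possible item has count 1
    info = [-math.log2(play_counts[m] / total_plays) for m in recommended]
    return sum(info) / (len(info) * max_info)

print(novelty(["Star Wars"]))   # ~0.06: recommending the head scores low
print(novelty(["Cat People"]))  # ~1.0: a one-play title is maximally novel
```

&lt;p&gt;Under any measure of this shape, a ranker that keeps surfacing the overexposed head will sit near the bottom of the scale, which is exactly the gap the bucket-level numbers above describe.&lt;/p&gt;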

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkffspju06ezccanmrb2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkffspju06ezccanmrb2o.png" alt="Bucket-level utility comparison from the MovieLens demo" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the demo says in one glance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate offline metrics favor Model A.&lt;/li&gt;
&lt;li&gt;Explorer, niche-interest, and low-patience buckets do better under Model B.&lt;/li&gt;
&lt;li&gt;Model B is much more novel and less concentrated in the most popular slice of the catalog.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Two short traces make the difference more tangible.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explorer / novelty-seeking user&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model A: Raiders of the Lost Ark -&amp;gt; Fargo -&amp;gt; Toy Story -&amp;gt; Return of the Jedi
Model B: Prophecy, The -&amp;gt; Cat People -&amp;gt; Wes Craven's New Nightmare -&amp;gt; Relic, The
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first sequence stays close to familiar, high-exposure titles. The second is much more novel and much more tailored to a narrower taste profile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Low-patience user&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model A: Star Wars -&amp;gt; Fargo -&amp;gt; Return of the Jedi -&amp;gt; Toy Story
Model B: Monty Python and the Holy Grail -&amp;gt; Full Monty -&amp;gt; American President -&amp;gt; Truth About Cats &amp;amp; Dogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here the difference is not just novelty. The second sequence moves through a less concentrated slice of the catalog rather than repeatedly returning to the same mainstream core.&lt;/p&gt;

&lt;p&gt;This small demo does not prove that Model B is globally better. It does something more modest and more useful. It shows that the answer depends on what we mean by "better," which users we care about, and whether we look only at historical ranking recovery or also at the behavior a recommender produces over short trajectories.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. A Better Direction, Briefly
&lt;/h3&gt;

&lt;p&gt;If offline evaluation is necessary but incomplete, the natural response is not to discard it. The better response is to build a broader evaluation stack around it.&lt;/p&gt;

&lt;p&gt;That broader stack should start from the failure modes already discussed. If logged exposure is policy-dependent, then evaluation should be more explicit about where the evidence is strong and where it is weak. If quality emerges over time, then some part of evaluation should examine sequences rather than only one-step ranking recovery.&lt;/p&gt;

&lt;p&gt;In practice, this suggests a modest shift in emphasis. Instead of asking only for a single aggregate offline score, teams can also ask how models behave across user segments, how concentrated their recommendations become, how much novelty they introduce, and whether their behavior looks meaningfully different over short interaction traces.&lt;/p&gt;

&lt;p&gt;For the movie example, that might mean comparing Model A and Model B not only on Recall@K or NDCG, but also on repetition, tail exposure, and bucket-level outcomes for users with different appetites for familiarity or exploration. None of these measurements solves the full problem. They simply make the evaluation better matched to the system being evaluated.&lt;/p&gt;

&lt;p&gt;The same logic also motivates carefully designed simulated interaction or short trajectory-based testing. The point is not that such methods are already complete or universally trustworthy. The point is narrower: if recommenders shape future behavior, then some part of the evaluation stack should attempt to probe that interaction rather than treating historical replay as the whole story.&lt;/p&gt;

&lt;p&gt;This is best understood as complement, not replacement. Offline evaluation remains the fast and reliable first layer. But serious evaluation of recommender quality likely needs additional layers that are more sensitive to exposure shifts, segment differences, and longer-run experience.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. Conclusion
&lt;/h3&gt;

&lt;p&gt;Offline evaluation remains one of the most useful tools in recommender systems. It is fast, practical, and deeply embedded in how teams iterate on models.&lt;/p&gt;

&lt;p&gt;Its limitation is structural rather than procedural. The data it relies on is constrained by prior exposure and generated under earlier policies, so it provides only a partial test of recommender quality.&lt;/p&gt;

&lt;p&gt;That matters most when a model changes what gets shown, expands beyond historically overexposed items, or affects the experience over repeated interaction. In those settings, replaying the past is not the same as evaluating the new system on its own terms.&lt;/p&gt;

&lt;p&gt;Offline evaluation is indispensable, but it is not the whole test. Recommendation systems shape the behavior they later observe, so any serious evaluation stack should measure interaction, not just replay the past.&lt;/p&gt;

&lt;p&gt;This demo is illustrative rather than definitive; its value is in showing how aggregate offline results can hide segment-level differences.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>architecture</category>
      <category>research</category>
    </item>
    <item>
      <title>How GenAI Genesis Began</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sat, 07 Mar 2026 05:12:44 +0000</pubDate>
      <link>https://forem.com/alankritverma/how-genai-genesis-began-523b</link>
      <guid>https://forem.com/alankritverma/how-genai-genesis-began-523b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alankrit Verma&lt;/strong&gt; came to the University of Toronto as a shy, math-driven student on scholarship who felt a deep responsibility to give back.&lt;/p&gt;

&lt;p&gt;That instinct led him into student leadership through &lt;strong&gt;AMACSS&lt;/strong&gt;, where he helped build a small experiment called &lt;strong&gt;AI Olympics&lt;/strong&gt; with &lt;strong&gt;39 participants&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That experiment revealed something bigger: students wanted a serious space to build, learn, and belong in AI.&lt;/p&gt;

&lt;p&gt;So Alankrit and his co-founder &lt;strong&gt;Adib Fallahpour&lt;/strong&gt; scaled that spark into &lt;strong&gt;GenAI Genesis&lt;/strong&gt; — first as a cross-campus student hackathon, and eventually into one of Canada’s largest student AI hackathons.&lt;/p&gt;

&lt;p&gt;Along the way, Alankrit helped lead the vision, website, sponsorships, partnerships, and long-term structure behind the event, including helping establish the &lt;strong&gt;GenAI Genesis Foundation&lt;/strong&gt; so the mission could be sustained beyond a single organizing cycle.&lt;/p&gt;

&lt;p&gt;And now, in &lt;strong&gt;2026&lt;/strong&gt;, GenAI Genesis is entering its biggest and most ambitious chapter yet.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  From a 39-person experiment to one of Canada’s largest student AI hackathons
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;“Some communities are joined. Others are built because you cannot stop thinking about the version that should exist.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnzvkfycj5rvl2riwe9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnzvkfycj5rvl2riwe9f.png" alt="GenAI Genesis team with a celebration cake"&gt;&lt;/a&gt;&lt;br&gt;GenAI Genesis team with the cake. Surprise cake courtesy of Hasleen Kaur (Head of Finance 2025, Co-Chair 2026) and Ivan Semenov (Head of Operations 2025, Co-Chair 2026).
  &lt;/p&gt;




&lt;p&gt;There are some things you plan carefully.&lt;/p&gt;

&lt;p&gt;And then there are some things that begin so quietly, so casually, that you do not realize until much later that you were standing at the start of something much bigger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GenAI Genesis&lt;/strong&gt; was one of those things.&lt;/p&gt;

&lt;p&gt;When I joined the University of Toronto, I was, in many ways, still a shy person.&lt;/p&gt;

&lt;p&gt;I was not the loudest voice in every room. I was still figuring myself out, still trying to understand what kind of life I wanted to build, and what kind of contribution I wanted to make.&lt;/p&gt;

&lt;p&gt;But I did know one thing with complete clarity: I had been given a rare opportunity, and I did not want to waste it.&lt;/p&gt;

&lt;p&gt;Coming to this country and this university on a scholarship meant a lot to me. It gave me the ability to study freely, dream more freely, and imagine a future I may not otherwise have had. And from the beginning, that created a very deep feeling in me: &lt;strong&gt;I had to give back to the community that had given so much to me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At that time, I cared about many things at once.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I cared about &lt;strong&gt;math&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;I cared about &lt;strong&gt;building projects&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;I cared about &lt;strong&gt;recognition&lt;/strong&gt;, yes — but not just for ego. I wanted to build things that mattered.&lt;/li&gt;
&lt;li&gt;I cared about &lt;strong&gt;real-world impact&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;And I cared, very deeply, about &lt;strong&gt;community&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Math had been a big part of my identity for a long time. I had prepared seriously for the &lt;strong&gt;Euclid Mathematics Contest&lt;/strong&gt; and scored &lt;strong&gt;90/100&lt;/strong&gt;, and that experience mattered to me for more than just the number. Euclid is one of those milestones that gives you credibility, but more importantly, it gave me confidence. It made me more ambitious. It made me believe that I could build something meaningful. And it made me want to create spaces where other students could feel that same sense of challenge, excitement, and possibility.&lt;/p&gt;

&lt;p&gt;So when I came to &lt;strong&gt;U of T Scarborough&lt;/strong&gt;, I started looking around and asking myself a simple question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where is that energy here?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And honestly, at the time, I did not see enough of it.&lt;/p&gt;

&lt;p&gt;Especially in the computer science space, the student community was not really booming. This was the period after COVID, when many campus communities still felt quiet, fragmented, and difficult to revive. There was talent, but not enough momentum. Curiosity, but not enough structure. Ambition, but not enough spaces for it to gather.&lt;/p&gt;

&lt;p&gt;And somewhere along the way, I quietly made it my mission to help fix that.&lt;/p&gt;

&lt;p&gt;Not alone, of course. Communities are never built alone. But I wanted to be one of the people pushing hard in that direction.&lt;/p&gt;

&lt;p&gt;That instinct led me to &lt;strong&gt;AMACSS&lt;/strong&gt; — the &lt;strong&gt;Association of Mathematical and Computer Science Students&lt;/strong&gt;, the Departmental Student Association for the CMS department at the University of Toronto Scarborough.&lt;/p&gt;

&lt;p&gt;In my first year, I joined as a &lt;strong&gt;First-Year Representative Coordinator&lt;/strong&gt;, where I represented first-year computer science and math students to the association, and the association back to them. I also coordinated a team of &lt;strong&gt;seven people&lt;/strong&gt;, which turned out to be one of my first real lessons in leadership.&lt;/p&gt;

&lt;p&gt;Leadership, I learned very quickly, is not just about taking initiative. It is about understanding people. It is about assigning responsibility thoughtfully. It is about getting buy-in. It is about leading with grace when everyone has different levels of energy, skill, confidence, and commitment.&lt;/p&gt;

&lt;p&gt;I had always been someone who liked taking initiative, but AMACSS sharpened that instinct into something more deliberate.&lt;/p&gt;

&lt;p&gt;And in that chapter of my life, the first version of GenAI Genesis quietly appeared.&lt;/p&gt;

&lt;p&gt;Not as GenAI Genesis.&lt;/p&gt;

&lt;p&gt;Not yet.&lt;/p&gt;

&lt;p&gt;It started as something called &lt;strong&gt;AI Olympics&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Before Genesis, there was AI Olympics
&lt;/h2&gt;

&lt;p&gt;AI Olympics was the first real experiment.&lt;/p&gt;

&lt;p&gt;The original idea came from a mix of inspirations.&lt;/p&gt;

&lt;p&gt;Part of it came from my love for mathematics competitions and the kind of intellectual excitement they create. Part of it came from online hackathons I had participated in, where I had seen how energizing it could be when people come together to build under time pressure. I remember thinking again and again:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Why do we not have something like this at our university too?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At one point, I was brainstorming with &lt;strong&gt;Katrina Best&lt;/strong&gt;, who was the president at the time, about what might make for a strong event for first-year students. At first, we thought about doing something closer to a math contest. Then the idea evolved. I brainstormed with my team as well. Slowly, the concept shifted from Olympiad to something more build-oriented, more alive, more experimental.&lt;/p&gt;

&lt;p&gt;That is where &lt;strong&gt;AI Olympics&lt;/strong&gt; was born.&lt;/p&gt;

&lt;p&gt;The name came from that same spirit. We wanted something that felt like an Olympiad, but more modern, more hands-on, and more builder-focused. “AI Olympics” felt close enough to that energy, and at the time, it captured exactly what we were trying to do.&lt;/p&gt;

&lt;p&gt;It was a smaller hackathon-style event, around &lt;strong&gt;six to seven hours long&lt;/strong&gt;, with &lt;strong&gt;39 participants&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It was essentially the first-year team’s event through AMACSS, and leading it as First-Year Representative Coordinator made it feel especially personal.&lt;/p&gt;

&lt;p&gt;Many of the participants were beginners. The vibe in the room was not “elite competition” in the intimidating sense. It was much more like collective learning. People were curious. People were experimenting. People were just starting to understand what they could build.&lt;/p&gt;

&lt;p&gt;We taught participants how to use the tools. We gave them a website template they could plug their work into so they could build faster. We wanted to reduce friction and maximize momentum. We wanted them to feel like they could actually make something, even if they were just getting started.&lt;/p&gt;

&lt;p&gt;And maybe one of my favorite little memories from that day is how we kept ordering coffee from Tim Hortons — not once, not twice, but three times — because people kept wanting more, and apparently everyone had collectively decided that vanilla was the flavor of innovation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u70vfxz05as93xfbc06.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u70vfxz05as93xfbc06.jpeg" alt="AI Olympics"&gt;&lt;/a&gt;&lt;br&gt;AI Olympics
  &lt;/p&gt;

&lt;p&gt;Looking back, AI Olympics was small.&lt;/p&gt;

&lt;p&gt;But it was not small in meaning.&lt;/p&gt;

&lt;p&gt;Because it showed us something important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;People wanted a space to &lt;strong&gt;learn&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;People wanted a space to &lt;strong&gt;build&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;People wanted a space where AI felt &lt;strong&gt;exciting, approachable, social, and full of possibility&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The feedback made that obvious. People were interested in doing this again. They wanted to keep learning. They wanted to keep contributing. They wanted to build in public. They wanted more.&lt;/p&gt;

&lt;p&gt;And that was the moment the idea stopped feeling like a one-off event and started feeling like the beginning of a much larger mission.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Olympics was the spark.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GenAI Genesis was the system we built around that spark.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The moment it stopped being small
&lt;/h2&gt;

&lt;p&gt;Around that time, I was also working very closely with &lt;strong&gt;Adib Fallahpour&lt;/strong&gt;, who is not just my co-founder, but also a good friend of mine.&lt;/p&gt;

&lt;p&gt;I had first worked with Adib through my first-year team, and over time it became very clear that we were on the same wavelength in a lot of ways. He is a very kind person, a big thinker, and someone with strong vision. We both cared deeply about scaling this beyond its first version. We both felt that it should not remain a small campus event that people vaguely remembered. We wanted it to become something real.&lt;/p&gt;

&lt;p&gt;I still remember a moment from second year when Adib and I were housemates. He came into my room, and we started discussing what this thing could actually become. Not just another event. Not just another student initiative. But a serious hackathon. Something with real scale. Something that could create a home for people interested in AI, machine learning, software, and ambitious building more broadly.&lt;/p&gt;

&lt;p&gt;That conversation stayed with me.&lt;/p&gt;

&lt;p&gt;Because from that point onward, this stopped being a nice idea and started becoming a serious project.&lt;/p&gt;

&lt;p&gt;Like most ambitious student things, it began with a lot of conversations, a lot of hustle, and a slightly unreasonable amount of belief.&lt;/p&gt;

&lt;p&gt;We first tried to define the idea on paper:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What exactly was &lt;strong&gt;GenAI Genesis&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;What would it look like at scale?&lt;/li&gt;
&lt;li&gt;What kind of experience were we trying to create?&lt;/li&gt;
&lt;li&gt;What problem were we solving?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The problem, at least to us, felt clear.&lt;/p&gt;

&lt;p&gt;At the time, there was not enough community in Toronto around this space — not the kind of entrepreneurial, energetic, builder-first AI ecosystem we wanted to see among students. There was talent, but not enough connected ambition. There were students interested in AI and ML, but not enough platforms bringing them together in a serious way.&lt;/p&gt;

&lt;p&gt;So we decided to build one.&lt;/p&gt;

&lt;p&gt;The name came together surprisingly quickly. We broke it into two parts: &lt;strong&gt;GenAI&lt;/strong&gt; and &lt;strong&gt;Genesis&lt;/strong&gt;. “Genesis” suggested beginning, emergence, evolution. And at the time, “GenAI” was the word in the air. The name reflected the moment, but the mission was always broader than just generative AI — it was about AI, machine learning, software, and the community around building them. Put together, it felt like a beginning worth naming.&lt;/p&gt;

&lt;p&gt;We did not have a full team immediately. At first, we were figuring it out from scratch. Both Adib and I were part of &lt;strong&gt;Google Developer Student Club&lt;/strong&gt;, and that gave us one starting point. We knew we could bring in people from there. Then we looked beyond Scarborough and started reaching across campuses, especially to St. George, where some of the strongest technical student communities already existed.&lt;/p&gt;

&lt;p&gt;That is how collaborations started taking shape with groups like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GDG&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UTMIST&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UofT AI&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;and later, &lt;strong&gt;CSSU&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I want to be careful and clear here, because this matters to the story.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Important distinction:&lt;/strong&gt; GenAI Genesis was not a club-created event that I happened to be involved in.

&lt;p&gt;Those groups mattered enormously, and their support helped the vision scale far beyond what we could have done alone in the early stages. They brought expertise, reach, operational support, and legitimacy. But the mission itself — the initial push, the insistence that this had to exist — came from us.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;p&gt;That distinction matters because founder stories can get flattened over time into partnerships, logos, and sponsor lists. But the truth is usually more human than that. It begins with a few people seeing a gap and deciding they are not willing to leave it empty.&lt;/p&gt;




&lt;h2&gt;
  
  
  The people who helped us scale it
&lt;/h2&gt;

&lt;p&gt;Some of the most important early support came from people who believed in the idea and helped us take it seriously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Richard&lt;/strong&gt;, from &lt;strong&gt;UTMIST&lt;/strong&gt;, played a crucial role in 2024. He was senior to us and incredibly strong operationally. He helped us understand what it means to run something at scale, what it means to think through logistics properly, and how to turn energy into structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nimit&lt;/strong&gt;, through &lt;strong&gt;UofT AI&lt;/strong&gt;, also played a very important role in helping the initiative come together. Both Richard and Nimit helped us build the cross-campus support that allowed GenAI Genesis to grow beyond its first form.&lt;/p&gt;

&lt;p&gt;These collaborations mattered a lot. Not because the event “belonged” to those communities, but because they helped us bring the mission to the scale it deserved.&lt;/p&gt;

&lt;p&gt;Sometimes scaling an idea is not about finding people who will take it over.&lt;/p&gt;

&lt;p&gt;It is about finding people who understand it enough to help it rise.&lt;/p&gt;




&lt;h2&gt;
  
  
  A quick timeline
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;th&gt;Why it mattered&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Winter 2023&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We ran &lt;strong&gt;AI Olympics&lt;/strong&gt; through AMACSS with &lt;strong&gt;39 participants&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;It proved there was real demand for a build-first AI space&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Winter 2024&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We launched the first large-scale &lt;strong&gt;GenAI Genesis&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;The experiment became a serious institution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2025&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We scaled dramatically with more sponsors, more prizes, and many more submissions&lt;/td&gt;
&lt;td&gt;The hackathon became a recognized force in the student AI ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We are taking it to our biggest scale yet&lt;/td&gt;
&lt;td&gt;Bigger footprint, bigger ambition, bigger future&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;
  &lt;strong&gt;A tiny behind-the-scenes truth:&lt;/strong&gt;&lt;br&gt;
Every row in that table was held together by a lot of invisible work: outreach, relationship management, budget stress, website iterations, venue uncertainty, and a hundred tiny decisions that never show up in a recap post.
&lt;/p&gt;




&lt;h2&gt;
  
  
  2024: when the idea met reality
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo95hb4v5y5z5p667kxfn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo95hb4v5y5z5p667kxfn.png" alt="Participants and organizers during GenAI Genesis 2024"&gt;&lt;/a&gt;&lt;br&gt;Getting Started with GenAI Genesis 2024
  &lt;/p&gt;

&lt;p&gt;The 2024 edition was the moment things started to feel very real.&lt;/p&gt;

&lt;p&gt;In winter 2024, we launched the first large-scale GenAI Genesis in downtown Toronto.&lt;/p&gt;

&lt;p&gt;Up until that point, the idea had energy. It had promise. It had momentum. But 2024 was when it had to survive the test every ambitious student project eventually faces:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Could we actually execute this at scale?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That year taught me a lot.&lt;/p&gt;

&lt;p&gt;And by “a lot,” I mean the kind of lessons that only appear when vision collides with logistics.&lt;/p&gt;

&lt;p&gt;We had to learn how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;work with a much larger team&lt;/li&gt;
&lt;li&gt;coordinate across campuses&lt;/li&gt;
&lt;li&gt;lead people with different styles, strengths, and expectations&lt;/li&gt;
&lt;li&gt;manage conflict and disagreement without letting it fracture the mission&lt;/li&gt;
&lt;li&gt;build trust with sponsors&lt;/li&gt;
&lt;li&gt;make big promises responsibly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And in the middle of all that, I was deeply involved in the work itself.&lt;/p&gt;

&lt;p&gt;From the very beginning until now, I have led the &lt;strong&gt;website side&lt;/strong&gt; of GenAI Genesis. Tech has always been one of the areas I stayed especially close to. I was also heavily involved in &lt;strong&gt;sponsorships and partnerships&lt;/strong&gt; — doing cold outreach, talking to organizations, building those relationships, and helping create the external support system that made the event possible.&lt;/p&gt;

&lt;p&gt;One of the most memorable parts of that journey was our connection with &lt;strong&gt;Google&lt;/strong&gt;, and how that relationship went from something that initially felt surreal to something that became a meaningful long-term thread in the GenAI Genesis story. There is a strange feeling when big names start trusting something you built. It is exciting, but it is also sobering. It makes you realize the stakes are now real.&lt;/p&gt;

&lt;p&gt;The 2024 edition brought in support from names including &lt;strong&gt;Google, Knockri, Wombo, Vector Institute, the Academic Advising &amp;amp; Career Centre at UTSC, and the Rotman School of Management&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We had &lt;strong&gt;254 participants submit a project&lt;/strong&gt; and awarded roughly &lt;strong&gt;$3,000 in prizes&lt;/strong&gt;.&lt;br&gt;
But what I remember most is not just the numbers.&lt;/p&gt;

&lt;p&gt;I remember how much we had to figure out on the fly.&lt;/p&gt;

&lt;p&gt;Venue booking was a huge hassle. A lot of things were fragile. &lt;strong&gt;Judging&lt;/strong&gt;, especially, was something we did not have perfect prior experience with at that scale. And yet, when the time came, the team handled it with surprising grace. We made last-minute changes to make sure the judging process was fair, thoughtful, and well run. That moment stayed with me because it showed me something essential: even if we were new to this scale, we were capable of rising to it.&lt;/p&gt;

&lt;p&gt;That was the year GenAI Genesis stopped feeling like a hopeful experiment.&lt;/p&gt;

&lt;p&gt;It felt real.&lt;/p&gt;


  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq2v3jbys38ol98qr0gr.png" alt="Participants and organizers during GenAI Genesis 2024"&gt;GenAI Genesis 2024
  



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;genesis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

  &lt;span class="na"&gt;v0&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Olympics"&lt;/span&gt;
  &lt;span class="na"&gt;participants&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;39&lt;/span&gt;
  &lt;span class="na"&gt;then&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;an experiment&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a room full of beginners&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a lot of coffee&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a lot of belief&lt;/span&gt;
  &lt;span class="na"&gt;now&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a cross-campus movement&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a large-scale AI hackathon&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a serious community&lt;/span&gt;
  &lt;span class="na"&gt;constant&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;vision&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;people&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;momentum&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2025: scale changes everything
&lt;/h2&gt;

&lt;p&gt;Then came &lt;strong&gt;2025&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And 2025 felt different.&lt;/p&gt;

&lt;p&gt;This was the year when GenAI Genesis started feeling less like an event and more like an ecosystem.&lt;/p&gt;

&lt;p&gt;By then, we were no longer operating entirely from instinct. We had learned processes. We had built systems. We had a better understanding of what worked, what broke, what participants valued, and what scale actually requires. We planned earlier. We moved more formally. We operated with more clarity.&lt;/p&gt;

&lt;p&gt;The leadership structure also evolved.&lt;/p&gt;

&lt;p&gt;In the earlier chapter, the co-chair structure included &lt;strong&gt;me, Adib, Nimit, and Richard&lt;/strong&gt;. By 2025, the co-chairs were &lt;strong&gt;me, Adib, and Matthew Tamura&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Matthew had already been involved as a strong contributor in 2024 through UTMIST and was someone I deeply appreciated — thoughtful, visionary, and strong at leading a team properly. In 2025, he stepped into a bigger leadership role with us, and that made a real difference.&lt;/p&gt;

&lt;p&gt;We also worked hard to improve the participant experience in ways that went beyond the surface.&lt;/p&gt;

&lt;p&gt;We brought in more sponsors.&lt;br&gt;
We created more networking opportunities.&lt;br&gt;
We designed stronger supporting events during the hackathon.&lt;br&gt;
We sharpened logistics.&lt;br&gt;
We elevated the experience.&lt;/p&gt;

&lt;p&gt;And the scale reflected that.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;2025&lt;/strong&gt;, we had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;around &lt;strong&gt;$15,000 in awards&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;621 participants submitted a project&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;backing from &lt;strong&gt;Google, BWC, Cohere, AMD, CGI, RBC, Northeastern University, Edge.io Solutions, the Academic Advising &amp;amp; Career Centre, and the University of Toronto&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;support from partners including the &lt;strong&gt;United Nations Association in Canada, One Degree Cooler, Vector Institute, and Hack Canada&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the most exciting moments that year was when &lt;strong&gt;AMD&lt;/strong&gt; came in and supported us in a way that allowed participants to run more complex machine learning workloads on an AMD GPU-backed local cluster. That felt genuinely wild. It was one of those moments where you step back and realize the hackathon is not just getting bigger in numbers — it is getting more technically meaningful too.&lt;/p&gt;

&lt;p&gt;From the outside, growth can look glamorous.&lt;/p&gt;

&lt;p&gt;From the inside, it often looks like spreadsheets, calls, follow-ups, contingency planning, team alignment, venue negotiations, technical troubleshooting, partnership mapping, and a hundred open loops in your head at once.&lt;/p&gt;

&lt;p&gt;People usually see the lights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Founders remember the wiring.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And by 2025, I was exhausted. Truly.&lt;/p&gt;

&lt;p&gt;But it was also the kind of exhaustion that comes from building something you care about so deeply that you keep choosing it, again and again, even when it would be easier not to.&lt;/p&gt;

&lt;p&gt;There were many moments in those years where I could have spent my time doing something else for my résumé — some other project, some other opportunity, some other clean, convenient line on paper.&lt;/p&gt;

&lt;p&gt;And again and again, I chose GenAI Genesis.&lt;/p&gt;

&lt;p&gt;Because by then it was not just a project.&lt;/p&gt;

&lt;p&gt;It was a commitment.&lt;/p&gt;


 &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmayw9gih1uzvqhly8jd0.jpg" alt="Some of the winners at GenAI Genesis 2025"&gt;Some of the winners at GenAI Genesis 2025
  



&lt;h2&gt;
  
  
  What people see vs. what it takes
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;People usually experience a hackathon at the moment it becomes exciting.&lt;/p&gt;

&lt;p&gt;Founders experience it in the months before that, when it is still fragile.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What goes into building a hackathon like this?&lt;/p&gt;

&lt;p&gt;Not just posters and prize money.&lt;/p&gt;

&lt;p&gt;It looks more like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sponsor outreach and partnership management&lt;/li&gt;
&lt;li&gt;website design and technical infrastructure&lt;/li&gt;
&lt;li&gt;judge and mentor coordination&lt;/li&gt;
&lt;li&gt;cross-campus relationship building&lt;/li&gt;
&lt;li&gt;team alignment across different working styles&lt;/li&gt;
&lt;li&gt;planning future editions before the current one is even over&lt;/li&gt;
&lt;li&gt;making sure the vision survives internal complexity&lt;/li&gt;
&lt;li&gt;solving ten operational problems before breakfast&lt;/li&gt;
&lt;li&gt;keeping something founder-led while still making it collaborative&lt;/li&gt;
&lt;li&gt;doing a lot of invisible thinking about what the next step even is&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A polished event always has a chaotic prequel.&lt;/p&gt;

&lt;p&gt;And a surprising amount of inner work goes into making sure the chaos does not win.&lt;/p&gt;


&lt;h2&gt;
  
  
  What GenAI Genesis has meant to me
&lt;/h2&gt;

&lt;p&gt;At one level, GenAI Genesis is about &lt;strong&gt;AI and machine learning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But if I am being honest, it has never only been about AI.&lt;/p&gt;

&lt;p&gt;It is about &lt;strong&gt;belonging&lt;/strong&gt;.&lt;br&gt;
It is about &lt;strong&gt;ambition&lt;/strong&gt;.&lt;br&gt;
It is about &lt;strong&gt;opportunity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is about building the kind of space I wish existed more abundantly when I first arrived.&lt;/p&gt;

&lt;p&gt;A place where students do not just come to listen, collect swag, and leave. A place where they come to make things. To meet each other. To stretch. To take themselves seriously. To find their people. To realize that they are more capable than they thought.&lt;/p&gt;

&lt;p&gt;That is what I wanted to create.&lt;/p&gt;

&lt;p&gt;And I think that is why this has become more than just a hackathon to me.&lt;/p&gt;

&lt;p&gt;It has become a community, a signal, a platform, and in some ways, living proof that if you build the right room, the right people will find each other inside it.&lt;/p&gt;

&lt;p&gt;I have also learned a lot about myself through this.&lt;/p&gt;

&lt;p&gt;I learned how passionate I am about the things I truly care about. I learned how much I care about my team. I learned that leadership is not something you perform; it is something you practice. I learned how much invisible thinking goes into visible outcomes. I learned that building something meaningful costs time, energy, sleep, and sometimes other opportunities.&lt;/p&gt;

&lt;p&gt;But I also learned that some things are worth choosing repeatedly.&lt;/p&gt;

&lt;p&gt;And this was one of them.&lt;/p&gt;


&lt;h2&gt;
  
  
  The foundation behind the future
&lt;/h2&gt;

&lt;p&gt;As GenAI Genesis grew, it became increasingly important to make sure it could sustain itself beyond just the intensity of one year, one organizing cycle, or one group of students.&lt;/p&gt;

&lt;p&gt;That is a big part of why I helped establish the &lt;strong&gt;GenAI Genesis Foundation&lt;/strong&gt; as an NGO, along with four other directors.&lt;/p&gt;

&lt;p&gt;That step mattered deeply to me.&lt;/p&gt;

&lt;p&gt;Because if GenAI Genesis was going to keep growing properly, it needed more than momentum.&lt;/p&gt;

&lt;p&gt;It needed structure.&lt;br&gt;
It needed continuity.&lt;br&gt;
It needed a long-term home.&lt;/p&gt;

&lt;p&gt;Founding the Foundation was part of making sure that what we built would not just peak.&lt;/p&gt;

&lt;p&gt;It would endure.&lt;/p&gt;

&lt;p&gt;And I am very proud of that.&lt;/p&gt;


&lt;h2&gt;
  
  
  People I want to thank
&lt;/h2&gt;

&lt;p&gt;No founder story is ever truly solo.&lt;/p&gt;

&lt;p&gt;And GenAI Genesis certainly was not.&lt;/p&gt;

&lt;p&gt;I started this with &lt;strong&gt;Adib Fallahpour&lt;/strong&gt;, my co-founder, and I want to begin there. Thank you, Adib, for building this vision with me from the early days, for dreaming big, for caring deeply, and for helping turn a small experiment into something much larger than either of us could have reached alone.&lt;/p&gt;

&lt;p&gt;I want to thank &lt;strong&gt;Richard&lt;/strong&gt;, who helped us significantly in 2024 through &lt;strong&gt;UTMIST&lt;/strong&gt;. Richard brought strong operational guidance at a time when we were still learning how to scale properly, and his support played an important role in helping us bring the hackathon to life at a bigger level.&lt;/p&gt;

&lt;p&gt;I also want to thank &lt;strong&gt;Nimit&lt;/strong&gt;, who helped us through &lt;strong&gt;UofT AI&lt;/strong&gt; and contributed meaningfully to the growth of the initiative in its earlier large-scale chapter. Cross-campus support mattered a lot, and Nimit was part of that story.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;2025&lt;/strong&gt;, I want to thank &lt;strong&gt;Matthew Tamura&lt;/strong&gt;, who joined me and Adib as a co-chair in 2025. Matthew brought a lot of clarity, vision, and leadership to that year, and I deeply appreciated building that edition alongside him.&lt;/p&gt;

&lt;p&gt;And for &lt;strong&gt;2026&lt;/strong&gt;, I want to thank &lt;strong&gt;Hasleen Kaur&lt;/strong&gt; and &lt;strong&gt;Ivan Semenov&lt;/strong&gt;, who are co-chairing this year alongside me. Both are wonderful people and wonderful friends, and I am genuinely grateful to be building this chapter with them.&lt;/p&gt;

&lt;p&gt;There are many people behind the scenes who have contributed to GenAI Genesis over the years — teammates, sponsors, organizers, mentors, judges, friends, and supporters — and I carry a lot of gratitude for all of them.&lt;/p&gt;

&lt;p&gt;Communities may remember the banner.&lt;/p&gt;

&lt;p&gt;But founders remember the people who helped hold it up.&lt;/p&gt;


&lt;h2&gt;
  
  
  And now, 2026
&lt;/h2&gt;

&lt;p&gt;And now we arrive here.&lt;/p&gt;

&lt;p&gt;What started as a 39-person experiment has grown into something far bigger, and in &lt;strong&gt;2026&lt;/strong&gt;, we are taking GenAI Genesis to its biggest scale yet.&lt;/p&gt;

&lt;p&gt;This year, we are going much, much bigger.&lt;/p&gt;

&lt;p&gt;We are preparing to bring together &lt;strong&gt;close to 1,000 people in person&lt;/strong&gt;. We are building across &lt;strong&gt;three major spaces at the University of Toronto&lt;/strong&gt; — &lt;strong&gt;Convocation Hall, Bahen, and Myhal&lt;/strong&gt; — to create an experience that is bigger not just in attendance, but in ambition, energy, and depth.&lt;/p&gt;

&lt;p&gt;This year feels different.&lt;/p&gt;

&lt;p&gt;Not because the mission has changed, but because the scale has finally caught up to the size of the vision.&lt;/p&gt;

&lt;p&gt;We are crossing into four digits.&lt;br&gt;
We are building across multiple buildings.&lt;br&gt;
We are thinking bigger than ever before.&lt;/p&gt;

&lt;p&gt;And for me personally, this year is meaningful in another way too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2026 will be my last year serving as Co-Chair of the hackathon.&lt;/strong&gt; After this, I will be moving into more of an advisory role.&lt;/p&gt;

&lt;p&gt;There is something beautiful about that.&lt;/p&gt;

&lt;p&gt;Because one of the deepest measures of building something well is whether it can continue growing beyond the chapter where you are the one carrying it most directly.&lt;/p&gt;

&lt;p&gt;That is what I want for GenAI Genesis.&lt;/p&gt;

&lt;p&gt;I want it to outgrow any one person, any one year, any one team.&lt;/p&gt;

&lt;p&gt;I want it to keep becoming a place where ambitious students find each other, where builders take themselves seriously, where new ideas are given room to breathe, and where community feels like a force multiplier rather than just a word on a poster.&lt;/p&gt;

&lt;p&gt;So if you have been watching from the sidelines, this is your sign.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join us on March 13, 14, and 15, 2026.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Come build with us.&lt;br&gt;
Come meet the people shaping what comes next.&lt;br&gt;
Come be part of something that started small, but refused to stay small.&lt;/p&gt;

&lt;p&gt;And when you do, I hope you feel what I felt at the beginning of this whole journey:&lt;/p&gt;

&lt;p&gt;That strange, beautiful energy that appears when ambitious people gather around an idea and decide to make it real.&lt;/p&gt;

&lt;p&gt;That, in the end, is what GenAI Genesis has always been about.&lt;/p&gt;


&lt;h2&gt;
  
  
  Connect with me
&lt;/h2&gt;

&lt;p&gt;If this story resonated with you, feel free to connect with me online, follow GenAI Genesis, or reach out.&lt;/p&gt;

&lt;p&gt;I always love meeting people who care deeply about building communities, technology, and meaningful things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://genaigenesis.ca/" rel="noopener noreferrer"&gt;Official GenAI Genesis Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://genai-genesis-2025.devpost.com/" rel="noopener noreferrer"&gt;GenAI Genesis 2025 Devpost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://alankrit.me/" rel="noopener noreferrer"&gt;My Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/genaigenesis/" rel="noopener noreferrer"&gt;Follow GenAI Genesis on Instagram&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/alankritverma/" rel="noopener noreferrer"&gt;Connect with me on LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://genaigenesis.ca/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;See the 2026 event, follow the journey, or reach out if you want to build something meaningful together.&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>ai</category>
      <category>hackathon</category>
      <category>community</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
