<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alankrit Verma</title>
    <description>The latest articles on Forem by Alankrit Verma (@alankritverma).</description>
    <link>https://forem.com/alankritverma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3810413%2F534cb6dc-3366-4fa4-b44a-49ba12793a1b.jpg</url>
      <title>Forem: Alankrit Verma</title>
      <link>https://forem.com/alankritverma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alankritverma"/>
    <language>en</language>
    <item>
      <title>Synthetic Population Testing for Recommendation Systems</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sat, 04 Apr 2026 02:04:50 +0000</pubDate>
      <link>https://forem.com/alankritverma/synthetic-population-testing-for-recommendation-systems-58f5</link>
      <guid>https://forem.com/alankritverma/synthetic-population-testing-for-recommendation-systems-58f5</guid>
      <description>&lt;p&gt;&lt;em&gt;Offline evaluation is necessary for recommender systems. It is also not a full test of recommender quality. The missing layer is not only better aggregate metrics, but better ways to test how a model behaves for different kinds of users before launch.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;In the last post, I argued that offline evaluation is useful but incomplete for recommendation systems.&lt;/li&gt;
&lt;li&gt;After that, I built a small public artifact to make the gap concrete.&lt;/li&gt;
&lt;li&gt;In the canonical MovieLens comparison, the popularity baseline wins &lt;code&gt;Recall@10&lt;/code&gt; and &lt;code&gt;NDCG@10&lt;/code&gt;, but the candidate model does much better for Explorer and Niche-interest users and creates a very different behavioral profile.&lt;/li&gt;
&lt;li&gt;I do not think this means “offline evaluation is wrong.”&lt;/li&gt;
&lt;li&gt;I think it means a better pre-launch evaluation stack should include some form of synthetic population testing: explicit behavioral lenses, trajectory-aware diagnostics, and tests that make hidden tradeoffs visible before launch.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Comes After “Offline Evaluation Is Not Enough”?
&lt;/h2&gt;

&lt;p&gt;In the first post, I made a narrow claim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;offline evaluation is useful, but incomplete, because recommendation systems are interactive systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That argument matters, but by itself it leaves an obvious next question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if aggregate offline metrics are not enough, what should be added to the evaluation stack?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I do not think the answer starts with a giant platform or a perfect user simulator.&lt;/p&gt;

&lt;p&gt;I think the more practical place to start is smaller:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;take the same baseline-vs-candidate comparison and test it through multiple behavioral lenses, not just one aggregate average.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is what I built next.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Artifact
&lt;/h2&gt;

&lt;p&gt;The current artifact is a small public recommender behavior QA harness.&lt;/p&gt;

&lt;p&gt;It compares:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one baseline recommender&lt;/li&gt;
&lt;li&gt;one candidate recommender&lt;/li&gt;
&lt;li&gt;one fixed evaluation setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it produces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;standard offline ranking metrics&lt;/li&gt;
&lt;li&gt;bucket-level utility&lt;/li&gt;
&lt;li&gt;behavioral diagnostics such as novelty, repetition, and catalog concentration&lt;/li&gt;
&lt;li&gt;short trajectory traces that make model behavior easier to inspect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The canonical public run is intentionally narrow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MovieLens 100K&lt;/li&gt;
&lt;li&gt;Model A: popularity baseline&lt;/li&gt;
&lt;li&gt;Model B: genre-profile recommender with a popularity prior&lt;/li&gt;
&lt;li&gt;4 fixed buckets&lt;/li&gt;
&lt;li&gt;one frozen report bundle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point is not to claim that these two models define recommender evaluation. The point is to create one clean, reproducible proof that aggregate offline metrics can hide useful pre-launch information.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Canonical Result
&lt;/h2&gt;

&lt;p&gt;The canonical MovieLens run shows the core value in one comparison.&lt;/p&gt;

&lt;p&gt;On aggregate offline ranking metrics, the popularity baseline wins:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Recall@10&lt;/th&gt;
&lt;th&gt;NDCG@10&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model A&lt;/td&gt;
&lt;td&gt;0.088&lt;/td&gt;
&lt;td&gt;0.057&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model B&lt;/td&gt;
&lt;td&gt;0.058&lt;/td&gt;
&lt;td&gt;0.036&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If we stopped there, the conclusion would be straightforward: Model A looks better.&lt;/p&gt;

&lt;p&gt;But the bucketed view tells a different story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bucket&lt;/th&gt;
&lt;th&gt;Model A&lt;/th&gt;
&lt;th&gt;Model B&lt;/th&gt;
&lt;th&gt;Delta (B-A)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Conservative mainstream&lt;/td&gt;
&lt;td&gt;0.519&lt;/td&gt;
&lt;td&gt;0.532&lt;/td&gt;
&lt;td&gt;0.012&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explorer / novelty-seeking&lt;/td&gt;
&lt;td&gt;0.339&lt;/td&gt;
&lt;td&gt;0.523&lt;/td&gt;
&lt;td&gt;0.184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Niche-interest&lt;/td&gt;
&lt;td&gt;0.443&lt;/td&gt;
&lt;td&gt;0.722&lt;/td&gt;
&lt;td&gt;0.279&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-patience&lt;/td&gt;
&lt;td&gt;0.321&lt;/td&gt;
&lt;td&gt;0.364&lt;/td&gt;
&lt;td&gt;0.043&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mxcw3czl29t6fzv48hj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mxcw3czl29t6fzv48hj.png" alt="What offline metrics missed" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is the point.&lt;/p&gt;

&lt;p&gt;Aggregate offline metrics say one thing. The segment-aware view says something more useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the baseline is better at recovering held-out positives&lt;/li&gt;
&lt;li&gt;the candidate is much stronger for important user lenses&lt;/li&gt;
&lt;li&gt;the behavioral profile of the system changes in ways the aggregate view compresses away&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The behavioral diagnostics make that even clearer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Novelty&lt;/th&gt;
&lt;th&gt;Repetition&lt;/th&gt;
&lt;th&gt;Catalog concentration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model A&lt;/td&gt;
&lt;td&gt;0.395&lt;/td&gt;
&lt;td&gt;0.279&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model B&lt;/td&gt;
&lt;td&gt;0.678&lt;/td&gt;
&lt;td&gt;0.664&lt;/td&gt;
&lt;td&gt;0.717&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is worth pausing on, because not every behavioral metric moves in the same direction.&lt;/p&gt;

&lt;p&gt;Model B is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more novel&lt;/li&gt;
&lt;li&gt;less catalog-concentrated&lt;/li&gt;
&lt;li&gt;but also more repetitive in this diagnostic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not a bug in the framework. It is part of the point. Different recommendation strategies produce different behavioral signatures, and pre-launch evaluation should help make those signatures visible instead of collapsing everything into one average.&lt;/p&gt;
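
&lt;p&gt;For readers who want the shape of these diagnostics, here is a hedged sketch in Python. The three definitions below are plausible stand-ins for novelty, repetition, and catalog concentration, not necessarily the exact formulas used in the artifact, and the toy data is made up.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch: plausible stand-in definitions for the three
# behavioral diagnostics, not the artifact's exact formulas.

def novelty(slates, popularity):
    """Mean (1 - popularity) over all recommended items; popularity in [0, 1]."""
    items = [i for slate in slates for i in slate]
    return sum(1.0 - popularity[i] for i in items) / len(items)

def repetition(slates):
    """Mean Jaccard overlap between consecutive slates for one user."""
    pairs = list(zip(slates, slates[1:]))
    overlaps = [len(set(a).intersection(b)) / len(set(a).union(b)) for a, b in pairs]
    return sum(overlaps) / len(overlaps)

def concentration(slates):
    """1 minus the distinct-item share of all recommendation slots:
    values near 1.0 mean every slate repeats the same few items."""
    items = [i for slate in slates for i in slate]
    return 1.0 - len(set(items)) / len(items)

# Toy data for illustration only.
popularity = {"A": 0.9, "B": 0.8, "C": 0.1, "D": 0.05}
slates = [["A", "B"], ["B", "C"], ["C", "D"]]
print(novelty(slates, popularity), repetition(slates), concentration(slates))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under definitions like these, a popularity baseline that shows everyone the same head items tends toward a concentration near 1.0, while a more personalized model can be less concentrated and still more repetitive per user.&lt;/p&gt;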

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0042vaqhregokfcsmnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0042vaqhregokfcsmnz.png" alt="Bucket utility comparison" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What “Synthetic Population Testing” Means Here
&lt;/h2&gt;

&lt;p&gt;It is important to be precise about this phrase.&lt;/p&gt;

&lt;p&gt;What I have today is &lt;strong&gt;not&lt;/strong&gt; a rich simulation of realistic synthetic humans. There are no agent conversations, no generated personas with biographies, and no claim that the current system faithfully reproduces real user psychology.&lt;/p&gt;

&lt;p&gt;What the artifact does have is a simpler and more controlled version of the same idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fixed behavioral lenses&lt;/li&gt;
&lt;li&gt;explicit utility assumptions&lt;/li&gt;
&lt;li&gt;short trajectory simulation under those assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The four v1 buckets are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conservative mainstream&lt;/li&gt;
&lt;li&gt;Explorer / novelty-seeking&lt;/li&gt;
&lt;li&gt;Niche-interest&lt;/li&gt;
&lt;li&gt;Low-patience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each bucket values recommendation behavior differently. The evaluation then asks how the same two models behave when the user lens changes.&lt;/p&gt;
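
&lt;p&gt;To make the lens idea concrete, here is a minimal sketch of how a bucket-level utility could be computed. The bucket names come from this post; the weights and diagnostic names are illustrative assumptions, not the artifact's actual configuration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch: each behavioral lens weights per-list
# diagnostics differently. The weights below are made up.

BUCKET_WEIGHTS = {
    "conservative_mainstream": {"relevance": 0.8, "novelty": 0.1, "diversity": 0.1},
    "explorer":                {"relevance": 0.4, "novelty": 0.4, "diversity": 0.2},
    "niche_interest":          {"relevance": 0.6, "novelty": 0.1, "diversity": 0.3},
    "low_patience":            {"relevance": 0.9, "novelty": 0.05, "diversity": 0.05},
}

def bucket_utility(bucket, relevance, novelty, diversity):
    """Collapse per-list diagnostics into one utility under a bucket's lens."""
    w = BUCKET_WEIGHTS[bucket]
    return (w["relevance"] * relevance
            + w["novelty"] * novelty
            + w["diversity"] * diversity)

# The same recommendation list scores differently under different lenses.
scores = dict(relevance=0.5, novelty=0.8, diversity=0.6)
print(bucket_utility("explorer", **scores))
print(bucket_utility("conservative_mainstream", **scores))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The point of the sketch is only that the aggregation step is explicit: changing the lens changes the verdict without changing the models.&lt;/p&gt;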

&lt;p&gt;So when I say &lt;strong&gt;synthetic population testing&lt;/strong&gt; here, I mean:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;an early, lightweight form of synthetic population testing built from fixed behavioral lenses, not full synthetic-user simulation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I think that still matters. It turns vague product intuition like “some users may prefer this model more than others” into an explicit, reproducible pre-launch test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fomv81qvtmthfzd5xiz2j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fomv81qvtmthfzd5xiz2j.png" alt="What synthetic means here" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is Better Than Another Aggregate Metric
&lt;/h2&gt;

&lt;p&gt;A natural response to the first post is to ask whether we simply need better aggregate metrics.&lt;/p&gt;

&lt;p&gt;I do not think that is enough.&lt;/p&gt;

&lt;p&gt;The problem is not only that a metric is imperfect. The deeper problem is that recommender quality is heterogeneous.&lt;/p&gt;

&lt;p&gt;Different users are helped by different behaviors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;some want safer, familiar, high-exposure items&lt;/li&gt;
&lt;li&gt;some benefit from more novelty and more variety&lt;/li&gt;
&lt;li&gt;some have narrower tastes that require stronger matching to long-tail pockets&lt;/li&gt;
&lt;li&gt;some degrade faster when sequences become stale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single global score cannot represent all of that well.&lt;/p&gt;

&lt;p&gt;That is why I think the next useful layer should look more like testing against a small synthetic population than inventing one more scalar.&lt;/p&gt;

&lt;p&gt;Instead of asking only:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;which model wins on average?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;we should also ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;which model wins for which behavioral lens?&lt;/p&gt;

&lt;p&gt;where do the models differ most?&lt;/p&gt;

&lt;p&gt;what kind of trajectory does each model produce?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This does not mean the current bucket lenses are perfect. It means they are often more informative than one collapsed aggregate average.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Short Trajectory Example
&lt;/h2&gt;

&lt;p&gt;The trajectory view matters because recommendation quality is not only one-step.&lt;/p&gt;

&lt;p&gt;Here is one Explorer / novelty-seeking comparison from the canonical run:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model A&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raiders of the Lost Ark -&amp;gt; Fargo -&amp;gt; Toy Story -&amp;gt; Return of the Jedi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Model B&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prophecy, The -&amp;gt; Cat People -&amp;gt; Wes Craven's New Nightmare -&amp;gt; Relic, The
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first sequence stays much closer to familiar, high-exposure titles. The second is much more tailored to a narrower taste profile and much more novel.&lt;/p&gt;

&lt;p&gt;This is exactly the kind of difference that disappears when evaluation is reduced to one aggregate ranking score.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kzh5vaksyaf7drogtvz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kzh5vaksyaf7drogtvz.png" alt="Explorer trace comparison" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Before Launch
&lt;/h2&gt;

&lt;p&gt;Pre-launch evaluation is about decisions, not just measurements.&lt;/p&gt;

&lt;p&gt;If a team is deciding whether to ship a new recommender, the real question is usually not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;did one mean score go up?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is closer to this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;who gets a better experience?&lt;/li&gt;
&lt;li&gt;who gets a worse one?&lt;/li&gt;
&lt;li&gt;does the candidate become more repetitive?&lt;/li&gt;
&lt;li&gt;does it collapse toward head items?&lt;/li&gt;
&lt;li&gt;does it create a healthier exploration profile?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are product and system questions, not only ranking-metric questions.&lt;/p&gt;

&lt;p&gt;That is why I like this framing. It stays honest about what the artifact is doing. It is not trying to predict the full online future. It is trying to make hidden tradeoffs visible earlier, with a tool that is still small enough to run, inspect, and reason about.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Is, And What It Is Not
&lt;/h2&gt;

&lt;p&gt;I think the strongest version of this argument is the honest one.&lt;/p&gt;

&lt;p&gt;This artifact &lt;strong&gt;is&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a small public proof&lt;/li&gt;
&lt;li&gt;a recommender-specific evaluation layer&lt;/li&gt;
&lt;li&gt;a way to make segment-level and trajectory-level tradeoffs visible&lt;/li&gt;
&lt;li&gt;a first wedge into broader testing for interactive systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This artifact is &lt;strong&gt;not&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a proof that the candidate model is globally better&lt;/li&gt;
&lt;li&gt;a replacement for offline evaluation&lt;/li&gt;
&lt;li&gt;a replacement for online experiments&lt;/li&gt;
&lt;li&gt;a full synthetic-human simulation framework&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction matters. If this work is useful, it will be useful because it is clear about what it adds, not because it overclaims.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Better Evaluation Stack
&lt;/h2&gt;

&lt;p&gt;The long-term picture I have in mind looks something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Standard offline evaluation remains the first layer.&lt;/li&gt;
&lt;li&gt;Segment-aware and trajectory-aware diagnostics become the second layer.&lt;/li&gt;
&lt;li&gt;Richer synthetic population testing may become the next layer after that.&lt;/li&gt;
&lt;li&gt;Online experiments still remain necessary for final validation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is a much more realistic stack than pretending a single aggregate metric can do the whole job.&lt;/p&gt;

&lt;p&gt;In that stack, the current artifact sits at layer two. It adds explicit behavioral lenses and short trajectory diagnostics to the familiar offline comparison workflow.&lt;/p&gt;

&lt;p&gt;That is why I think it matters, even in its current limited form.&lt;/p&gt;

&lt;p&gt;It is not the final answer.&lt;/p&gt;

&lt;p&gt;It is the first concrete artifact of the missing layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx46l3fmz2thhy9zsdzw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx46l3fmz2thhy9zsdzw2.png" alt="Canonical result snapshot" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The first post argued that offline evaluation is not enough for recommendation systems.&lt;/p&gt;

&lt;p&gt;This artifact is my first practical answer to what should come next.&lt;/p&gt;

&lt;p&gt;Not a giant platform. Not a perfect simulation. Not a replacement for offline evaluation.&lt;/p&gt;

&lt;p&gt;Just a small, reproducible evaluation harness that compares a baseline and a candidate through multiple behavioral lenses and shows tradeoffs that aggregate metrics compress away.&lt;/p&gt;

&lt;p&gt;If offline evaluation is the first screen, then synthetic population testing, in some form, may be one of the next useful layers.&lt;/p&gt;

&lt;p&gt;This v1 is a lightweight version of that idea.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you want to see the public artifact, the canonical MovieLens demo lives in the &lt;code&gt;limitation&lt;/code&gt; repo as a report, JSON result bundle, and supporting visuals.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>learning</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Why Offline Evaluation Is Not Enough for Recommendation Systems</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sun, 29 Mar 2026 16:40:52 +0000</pubDate>
      <link>https://forem.com/alankritverma/why-offline-evaluation-is-not-enough-for-recommendation-systems-15ii</link>
      <guid>https://forem.com/alankritverma/why-offline-evaluation-is-not-enough-for-recommendation-systems-15ii</guid>
      <description>&lt;h2&gt;
  
  
  Why Offline Evaluation Is Not Enough for Recommendation Systems
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Offline evaluation is essential for recommender systems. It is also easy to mistake for a fuller measure of quality than it really is.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Offline evaluation is useful, fast, and necessary for recommender systems.&lt;/li&gt;
&lt;li&gt;But it is built on logged behavior generated under older exposure policies.&lt;/li&gt;
&lt;li&gt;That makes it weak at judging policy shifts, novel items, cold start behavior, and longer interaction trajectories.&lt;/li&gt;
&lt;li&gt;In a small MovieLens demo, the popularity baseline wins on aggregate offline ranking metrics, while a more personalized model does better for explorer, niche-interest, and low-patience user buckets.&lt;/li&gt;
&lt;li&gt;The practical conclusion is not to replace offline evaluation, but to stop treating it as a full test of recommender quality.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Recommendation systems are interactive systems, but offline evaluation often treats them like static predictors.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  1. The Testing Gap
&lt;/h3&gt;

&lt;p&gt;We know how to test deterministic software. We are much less certain about how to test systems that influence the behavior they later observe.&lt;/p&gt;

&lt;p&gt;Recommendation systems sit squarely in that second category. They do not just estimate what a user might click, watch, or purchase. They decide what the user gets a chance to see, and that choice helps shape the data that will later be treated as evidence.&lt;/p&gt;

&lt;p&gt;Offline evaluation is one of the standard tools in recommender systems for good reason. It is practical, fast, and often highly informative. A team can compare candidate models on historical interaction data long before it is ready to send live traffic to a new ranking policy.&lt;/p&gt;

&lt;p&gt;That usefulness, however, can make offline evaluation easy to over-interpret. A strong offline result often sounds like a strong statement about real recommendation quality. Sometimes it is. But the conclusion is narrower than it first appears.&lt;/p&gt;

&lt;p&gt;Historical interaction logs are not simply records of user preference. They are records of user preference under a particular pattern of exposure. They reflect what earlier systems chose to rank, recommend, and repeat. In that sense, the data is policy-dependent from the beginning.&lt;/p&gt;

&lt;p&gt;This matters because recommendation quality is not only about matching a fixed label. A recommender is an interactive system. Its outputs affect future inputs. Change the policy, and over time you may change what users discover, what they come to trust, what they ignore, and what they eventually consume.&lt;/p&gt;

&lt;p&gt;Consider a movie recommender. One model may reliably surface popular, familiar titles. Another may be more personal and more willing to introduce niche films that fit a specific user's taste. If the historical logs were generated under a system that already emphasized mainstream titles, those logs may be much richer in evidence for the first model's choices than for the second model's.&lt;/p&gt;

&lt;p&gt;That does not make offline evaluation wrong. It does mean the object being measured is more limited than many teams would like. Offline evaluation is useful, but insufficient.&lt;/p&gt;

&lt;p&gt;The point of this article is narrow. It is not that offline evaluation should be discarded, and it is not a general argument about all machine learning systems. The claim is simpler: recommendation systems are interactive systems, and that fact places real limits on what historical replay can tell us.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. What Offline Evaluation Is
&lt;/h3&gt;

&lt;p&gt;Offline evaluation, in the recommender setting, means evaluating a model on historical logged interactions rather than on live user traffic. The usual pattern is straightforward: train on past user-item behavior, hold out a later slice of interactions, and ask whether the model ranks the held-out items highly for the relevant users.&lt;/p&gt;

&lt;p&gt;In a movie recommendation system, the data might include watches, clicks, ratings, or add-to-list events. A model is trained on part of that history and then evaluated on interactions that were not shown during training. If a user later watched a particular film, one basic offline question is whether that film would have appeared near the top of the model's ranked list.&lt;/p&gt;

&lt;p&gt;This setup supports the ranking-style metrics commonly used in recommender systems. Teams may report measures such as Recall@K, hit rate, or NDCG to summarize how well a model recovers held-out interactions. The exact metric matters, but the general logic is the same: use historical behavior as a proxy for whether the recommendations were good.&lt;/p&gt;
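
&lt;p&gt;As a concrete reference, here is one common way to implement two of the metrics named above for a single user, using binary relevance. Conventions vary (especially for NDCG), so treat this as one reasonable variant rather than the definition.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of held-out relevant items recovered in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: DCG of the ranked list over DCG of an ideal list."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2)
                for pos in range(min(len(relevant), k)))
    return dcg / ideal

# One user: the model ranked "b" second; the held-out interactions were "b" and "d".
print(recall_at_k(["a", "b", "c"], {"b", "d"}, k=3))
print(ndcg_at_k(["a", "b", "c"], {"b", "d"}, k=3))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Averaging these per-user scores over all users yields the aggregate numbers teams report, which is exactly where the compression discussed below begins.&lt;/p&gt;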

&lt;p&gt;That approach is attractive because it gives a concrete and reproducible testing loop. Candidate models can be compared against the same held-out data. Regressions can be caught before launch. Incremental improvements can be measured without the cost and risk of online experimentation.&lt;/p&gt;

&lt;p&gt;It is also important to be precise about what this evaluation is actually saying. Offline evaluation does not directly measure how users would respond to a new policy in a live environment. It measures how well a model aligns with historical interactions recorded under earlier exposure conditions.&lt;/p&gt;

&lt;p&gt;That distinction is easy to blur because the workflow looks so familiar. We have training data, a test set, and a metric. But in recommendation systems, the labels are not independent of the system that helped generate them. The held-out watch or click is not just a fact about the user. It is also a fact about what the user was shown.&lt;/p&gt;

&lt;p&gt;For now, that is enough of a working definition. Offline evaluation is historical replay over logged interactions, typically framed as a ranking problem, and used as a proxy for recommendation quality under observed conditions. It is a very useful proxy. The rest of the article asks where its boundaries are.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Where Offline Evaluation Breaks
&lt;/h3&gt;

&lt;p&gt;The limitations of offline evaluation do not come from a single bad metric or a single avoidable mistake. They come from a more basic fact about recommender data: the data is generated under a policy. What users do in the logs depends in part on what earlier systems chose to show them.&lt;/p&gt;

&lt;p&gt;That sounds obvious when stated directly. But its consequences run deeper than they first appear. If the evidence used for evaluation is itself shaped by older recommendation decisions, then offline evaluation is not observing some neutral ground truth about relevance. It is observing relevance through the filter of past exposure.&lt;/p&gt;

&lt;p&gt;In a static prediction task, that distinction is often less severe. In recommendation, it sits near the center of the problem. A new recommender is rarely judged against untouched labels. It is judged against behavior recorded under an older recommender, with its own ranking habits, popularity biases, and coverage patterns.&lt;/p&gt;

&lt;p&gt;We can state the issue in simple notation. Let &lt;code&gt;pi_0&lt;/code&gt; be the logging policy that generated the historical data, and let &lt;code&gt;pi_1&lt;/code&gt; be the new policy we want to evaluate. Offline replay uses observations gathered under &lt;code&gt;pi_0&lt;/code&gt; to estimate the quality of &lt;code&gt;pi_1&lt;/code&gt;. If &lt;code&gt;pi_1&lt;/code&gt; behaves much like &lt;code&gt;pi_0&lt;/code&gt;, that may be informative. If it changes exposure materially, the estimate becomes much less complete.&lt;/p&gt;

&lt;p&gt;This is the core mismatch. The quantity we want is user response under the candidate policy. The quantity we usually observe is user response under the previous policy. The two overlap, but they are not the same object.&lt;/p&gt;

&lt;h4&gt;
  
  
  3.1 Exposure Bias
&lt;/h4&gt;

&lt;p&gt;The first break is exposure bias. Users can only react to items they were actually shown.&lt;/p&gt;

&lt;p&gt;That means an interaction log is not just a record of what users preferred. It is also a record of what the system made available. When an item receives no click, no watch, or no rating, that absence does not cleanly mean the item was irrelevant. In many cases it means the item was never placed in front of the user at all.&lt;/p&gt;

&lt;p&gt;This matters immediately for offline evaluation. Suppose a movie platform has historically given heavy exposure to well-known studio releases and much lighter exposure to niche films. The resulting data will contain dense evidence for how users responded to the mainstream catalog and sparse evidence for how they would have responded to more specialized titles.&lt;/p&gt;

&lt;p&gt;The bias here is structural rather than anecdotal. If observed feedback only exists for exposed items, then the support of the evaluation data is concentrated where the logging policy chose to spend attention. In compact form, observed reward is only available where &lt;code&gt;pi_0(i | u, c)&lt;/code&gt; is nontrivial for user &lt;code&gt;u&lt;/code&gt;, item &lt;code&gt;i&lt;/code&gt;, and context &lt;code&gt;c&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is why historical replay is partial. It is not sampling uniformly from all relevant user-item pairs. It is sampling from the subset that earlier policies made visible. In a movie recommender, this can make “popular” look easier to measure than “personally relevant,” even when the latter is closer to the product goal.&lt;/p&gt;
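
&lt;p&gt;A tiny, fully synthetic example makes the support problem visible. Everything below is made up for illustration; the point is only that a candidate shifting exposure toward the tail gets scored where the logs are empty.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy illustration of exposure bias. All item names and numbers are invented.
import random

random.seed(0)
catalog = [f"item_{i}" for i in range(100)]
head = catalog[:10]  # the items the logging policy pi_0 kept exposing

# Logged feedback exists only where pi_0 actually exposed items.
logs = {item: random.random() for item in head}

def offline_support(candidate_slate):
    """Share of a candidate's recommendations that have any logged feedback."""
    observed = [item for item in candidate_slate if item in logs]
    return len(observed) / len(candidate_slate)

head_like = catalog[:10]    # candidate that mimics pi_0
tail_like = catalog[50:60]  # candidate that shifts exposure to the tail

print(offline_support(head_like))  # 1.0: every recommendation has evidence
print(offline_support(tail_like))  # 0.0: no evidence either way
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An offline metric computed over these logs can only reward the head-like candidate. The tail-like candidate is not shown to be worse; it is simply unmeasured.&lt;/p&gt;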

&lt;h4&gt;
  
  
  3.2 Old-Policy Lock-In
&lt;/h4&gt;

&lt;p&gt;Exposure bias becomes more consequential when a new policy differs from the old one in systematic ways. This is where old-policy lock-in appears.&lt;/p&gt;

&lt;p&gt;In most offline evaluations, the labels used to assess a candidate model were generated under a different ranking policy. A held-out watch event looks like a simple target, but it is downstream of earlier recommendation decisions. The new model is therefore being judged with evidence produced by the system it may be trying to replace.&lt;/p&gt;

&lt;p&gt;This creates an asymmetry. Models that resemble the old policy often enjoy richer and cleaner evidence in the historical logs. Models that shift probability mass toward less exposed regions of the catalog are evaluated in the parts of the space where the logs are thinnest.&lt;/p&gt;

&lt;p&gt;Return to the movie example. If the old system strongly favored familiar blockbusters, then the held-out data will naturally contain many interactions with those titles. A candidate model that continues to rank them highly will line up well with the log. Another model that is more willing to surface quieter but well-matched films may look weaker offline, not necessarily because users dislike those recommendations, but because the old system rarely created opportunities to observe that preference.&lt;/p&gt;

&lt;p&gt;This is one reason a better recommender can look worse offline. The issue is not only model accuracy. It is evaluation support. When performance is estimated on outcomes generated under &lt;code&gt;pi_0&lt;/code&gt;, the comparison can systematically favor policies that stay close to &lt;code&gt;pi_0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That does not make all offline comparisons invalid. If two models differ only slightly, offline evaluation can still be highly useful. But when a candidate policy changes exposure patterns in meaningful ways, offline results should be read with more caution than the metric alone suggests.&lt;/p&gt;

&lt;h4&gt;
  
  
  3.3 Novel Items and Cold Start
&lt;/h4&gt;

&lt;p&gt;The same logic becomes even sharper for new or rarely exposed content.&lt;/p&gt;

&lt;p&gt;Offline evaluation is strongest where historical evidence is plentiful. It is weakest where exposure has been limited, recent, or absent. Unfortunately, those are often exactly the regions where recommendation systems are asked to do something valuable: introduce new items, expand coverage, and connect users to parts of the catalog they would not have reached on their own.&lt;/p&gt;

&lt;p&gt;In a movie platform, consider a newly added independent film with very little interaction history. A model may have good reasons to recommend it to a narrow set of users based on metadata, embeddings, or nearby behavioral signals. But if the film barely appeared under the previous policy, then historical logs offer limited evidence for how good that recommendation would actually be.&lt;/p&gt;

&lt;p&gt;The problem is not only that the item is new. The deeper issue is that offline replay inherits the conservatism of past exposure. It is much easier to validate recommendations for already visible inventory than for inventory the old policy neglected.&lt;/p&gt;

&lt;p&gt;This creates a subtle but important pressure. Systems that stay near the historically exposed core of the catalog are easier to justify with offline evidence. Systems that broaden exposure toward the tail are often evaluated precisely where the data is least informative. Over time, that can make conservative recommendation strategies look more reliable than they really are, and exploratory strategies look less supported than they might deserve.&lt;/p&gt;

&lt;p&gt;The claim is not that offline evaluation fails in every cold-start setting. It is that historical replay is structurally weak exactly where a recommender tries to broaden exposure. For recommenders, novelty is often where the evidence is thinnest.&lt;/p&gt;
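
&lt;p&gt;One practical response is to measure, per candidate slate, how much historical exposure actually backs each recommended item. A small sketch, with the function name and the support threshold both chosen arbitrarily for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter

def split_by_support(logged_impressions, slate, min_support=20):
    """Split a recommended slate by historical exposure under the old policy.

    Items whose impression count falls below min_support are exactly where
    offline replay says the least about recommendation quality.
    """
    counts = Counter(logged_impressions)
    # a count of at least min_support means min(count, min_support) equals it
    supported = [i for i in slate if min(counts[i], min_support) == min_support]
    thin = [i for i in slate if min(counts[i], min_support) != min_support]
    return supported, thin

log = ["blockbuster"] * 100 + ["indie_film"] * 2
ok, thin = split_by_support(log, ["blockbuster", "indie_film"])
# ok == ["blockbuster"], thin == ["indie_film"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Reporting offline metrics separately for the supported and thin slices does not fix the bias, but it makes visible how much of a candidate's behavior the logs can actually judge.&lt;/p&gt;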

&lt;h4&gt;
  
  
  3.4 Trajectory Blindness
&lt;/h4&gt;

&lt;p&gt;Even if the exposure problem disappeared, there would still be another limitation. Recommendation quality is not purely one-step.&lt;/p&gt;

&lt;p&gt;Most offline metrics compress evaluation into local ranking success. Did the model place the held-out item near the top? Did it recover the next watch? Did it improve a ranking score on observed interactions? Those are reasonable questions, but they are mostly questions about immediate alignment with historical events.&lt;/p&gt;

&lt;p&gt;Users, however, experience recommendation systems as sequences. They return across sessions. They compare one recommendation to the previous one. They notice repetition. They develop trust or impatience. They learn whether the system helps them explore or merely loops them through slight variations of what it already knows how to sell.&lt;/p&gt;

&lt;p&gt;This is where trajectory blindness enters. A recommender can look strong on one-step relevance and still create a poor multi-step experience.&lt;/p&gt;

&lt;p&gt;Imagine a movie recommender that repeatedly serves highly similar popular thrillers because those titles have strong historical watch signals. In a one-step offline evaluation, this may look sensible. The recommendations are close to what users have previously consumed, and the metrics may reward that closeness. But over several sessions the user may experience the system as narrow, repetitive, and increasingly unhelpful.&lt;/p&gt;

&lt;p&gt;Another model might trade a small amount of one-step certainty for a better sequence. It may alternate between reliable choices and occasional high-fit long-tail discoveries. That kind of quality often lives in the trajectory rather than in any single ranking event.&lt;/p&gt;

&lt;p&gt;In notation, many offline metrics focus on something close to the quality of &lt;code&gt;r_t&lt;/code&gt; at a single step. But recommender quality often depends on properties of the sequence &lt;code&gt;(a_1, r_1), ..., (a_T, r_T)&lt;/code&gt;: how concentrated the recommendations are, whether novelty appears at the right rate, whether boredom accumulates, and whether the system adapts well after earlier choices.&lt;/p&gt;

&lt;p&gt;This is not an argument against ranking metrics. It is an argument about what they leave out. They summarize one-step fit to logged behavior. They do not, by themselves, tell us whether the interaction over time becomes richer, narrower, more repetitive, or more satisfying.&lt;/p&gt;
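
&lt;p&gt;The sequence-level properties listed above are measurable. A minimal sketch of two of them, repetition and novelty, over one user's trajectory of slates; the popularity table here is an assumed input, not something a standard ranking metric provides:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def trajectory_diagnostics(slates, popularity):
    """Sequence-level diagnostics over one user's recommended slates.

    slates: list of per-session recommendation lists.
    popularity: dict mapping each item to its share of historical impressions.
    Returns repetition (share of recs already shown earlier in the
    trajectory) and novelty (mean of 1 - popularity over recommended items).
    """
    seen = set()
    shown, repeats, novelty_sum = 0, 0, 0.0
    for slate in slates:
        for item in slate:
            shown += 1
            if item in seen:
                repeats += 1
            seen.add(item)
            novelty_sum += 1.0 - popularity.get(item, 0.0)
    if not shown:
        return {"repetition": 0.0, "novelty": 0.0}
    return {"repetition": repeats / shown, "novelty": novelty_sum / shown}

pop = {"blockbuster": 0.9, "indie": 0.01}
slates = [["blockbuster", "indie"], ["blockbuster", "blockbuster"]]
print(trajectory_diagnostics(slates, pop))  # repetition 0.5, novelty near 0.32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Neither number appears in a one-step ranking metric, yet both describe the experience a user actually accumulates across sessions.&lt;/p&gt;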

&lt;h4&gt;
  
  
  3.5 What This Means
&lt;/h4&gt;

&lt;p&gt;Taken together, these limitations point to a single conclusion. Offline evaluation often treats recommendation as if it were a static prediction problem with fixed labels. In practice, recommendation is an interactive system problem.&lt;/p&gt;

&lt;p&gt;The system chooses what to expose. Exposure shapes what users can respond to. Those responses become the data for future training and evaluation. Change the policy, and you may change the distribution of behavior itself.&lt;/p&gt;

&lt;p&gt;Once that is clear, the goal of evaluation also becomes clearer. The question is not only whether a model can replay the past. It is whether it can support good interaction under a changed policy. Historical replay helps answer that question, but only in part.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Why It Still Matters
&lt;/h3&gt;

&lt;p&gt;None of these limitations make offline evaluation disposable. They define its scope. That distinction matters.&lt;/p&gt;

&lt;p&gt;Recommendation teams rely on offline evaluation because it solves real engineering problems well. It is fast, reproducible, and comparatively cheap. It allows model changes to be screened before they reach users. It supports regression testing, debugging, ablation work, and benchmarking across candidate approaches. In most practical settings, there is no credible evaluation stack that excludes it.&lt;/p&gt;

&lt;p&gt;That remains true even after the critique above. A recommender team still needs a way to reject clearly weak models, validate implementation changes, and compare alternatives under a common protocol. Offline evaluation is often the first place where obvious failures become visible. If a ranking model cannot perform competitively in historical replay, it is usually hard to justify sending it to live traffic.&lt;/p&gt;

&lt;p&gt;This is especially important because online tests are expensive in more than one sense. They consume time, user attention, and organizational focus. They are also constrained by risk. A platform may be willing to test a modest ranking change online, but not a model that already appears unstable or uncompetitive offline. Historical evaluation remains the practical filter through which many candidate models must pass.&lt;/p&gt;

&lt;p&gt;The right conclusion, then, is not that offline evaluation should be replaced. It is that offline evaluation should be placed correctly. It is a strong tool for iteration and a weak tool for making broad claims about full recommender quality under changed exposure.&lt;/p&gt;

&lt;p&gt;In other words, the critique is intentional. Offline evaluation is widely used because it earns its place. The mistake is not using it; the mistake is treating it as a complete test.&lt;/p&gt;

&lt;p&gt;One compact way to summarize that balance is to separate what offline replay usually measures well from what it tends to leave undermeasured.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Evaluation aspect&lt;/th&gt;
&lt;th&gt;What offline replay usually captures&lt;/th&gt;
&lt;th&gt;What it tends to miss or undermeasure&lt;/th&gt;
&lt;th&gt;Movie recommender example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Immediate relevance under existing exposure&lt;/td&gt;
&lt;td&gt;Whether held-out watched items appear near the top of the ranked list&lt;/td&gt;
&lt;td&gt;Whether that ranking would still look good under a materially different exposure policy&lt;/td&gt;
&lt;td&gt;A familiar blockbuster appears in the top &lt;code&gt;K&lt;/code&gt; because it was already heavily exposed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance under policy shift&lt;/td&gt;
&lt;td&gt;Small improvements that stay near the old policy&lt;/td&gt;
&lt;td&gt;Quality of recommendations in regions where the candidate policy differs most&lt;/td&gt;
&lt;td&gt;A model that surfaces more niche dramas has little historical support where it differs from the old system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Novel or underexposed items&lt;/td&gt;
&lt;td&gt;Some signal for items with enough prior exposure&lt;/td&gt;
&lt;td&gt;Items that were new, rare, or historically under-shown&lt;/td&gt;
&lt;td&gt;A newly added indie film receives little offline credit even if it fits the user well&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold start behavior&lt;/td&gt;
&lt;td&gt;Very coarse performance on sparse users or items&lt;/td&gt;
&lt;td&gt;Early recommendation quality when interaction history is thin&lt;/td&gt;
&lt;td&gt;A new documentary enters the catalog with too little evidence for replay to judge it fairly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repetition over sessions&lt;/td&gt;
&lt;td&gt;Little, unless explicitly measured&lt;/td&gt;
&lt;td&gt;Accumulated sameness across repeated visits&lt;/td&gt;
&lt;td&gt;The recommender keeps offering slight variations of the same thriller over multiple sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Novelty and exploration&lt;/td&gt;
&lt;td&gt;Limited signal through held-out interactions&lt;/td&gt;
&lt;td&gt;Whether the system introduces useful discovery at the right rate&lt;/td&gt;
&lt;td&gt;A long-tail science-fiction recommendation may be good, but the old logs barely contain exposure to it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Segment-level differences&lt;/td&gt;
&lt;td&gt;Aggregate averages over the evaluation set&lt;/td&gt;
&lt;td&gt;Which user groups are helped or hurt by the new policy&lt;/td&gt;
&lt;td&gt;Mainstream users may do well under Model A while exploration-seeking users do better under Model B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trajectory-level user experience&lt;/td&gt;
&lt;td&gt;Almost nothing in standard one-step metrics&lt;/td&gt;
&lt;td&gt;Trust, boredom, fatigue, and satisfaction over sequences&lt;/td&gt;
&lt;td&gt;A user keeps getting acceptable next picks but gradually disengages from repetition&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  5. Running Example: Model A vs. Model B
&lt;/h3&gt;

&lt;p&gt;The structural issues above become easier to see with a simple running example. Consider a movie recommendation system with two candidate rankers.&lt;/p&gt;

&lt;p&gt;Model A is conservative. It leans toward popular, broadly watched titles and tends to recommend within the historically dominant regions of the catalog. It is usually safe, usually familiar, and often repetitive.&lt;/p&gt;

&lt;p&gt;Model B is more personalized. It still recommends mainstream films when they fit, but it is more willing to surface niche titles, less obvious matches, and items from thinner parts of the catalog when the user profile suggests they are a good fit.&lt;/p&gt;

&lt;p&gt;Suppose the historical logs were generated under an earlier recommendation policy that behaved more like Model A. Popular titles received heavy exposure. Niche titles were shown less often. Over time, that policy produced abundant feedback on the mainstream catalog and much weaker evidence on long-tail items.&lt;/p&gt;

&lt;p&gt;Now evaluate both models offline on held-out interactions from those logs.&lt;/p&gt;

&lt;p&gt;Model A will often look strong for a simple reason: it aligns well with the exposure pattern that helped generate the data. It ranks many of the same kinds of items the old system already showed, so the held-out interactions contain ample opportunities to reward it.&lt;/p&gt;

&lt;p&gt;Model B may be better calibrated to particular users, especially users with narrower tastes or stronger appetite for discovery. But if many of its most valuable recommendations lie in regions of the catalog that were rarely exposed before, the offline log may not give it much credit. The evidence needed to validate those choices was never fully collected.&lt;/p&gt;

&lt;p&gt;This does not mean Model B is necessarily better overall. Some users may indeed prefer the safer behavior of Model A. That is part of the point. Recommendation quality is heterogeneous across users and across sessions, and a single aggregate score can hide that heterogeneity.&lt;/p&gt;

&lt;p&gt;The difference becomes clearer over repeated interaction. Model A may continue to produce acceptable next-item recommendations while gradually narrowing the user's experience into a small, overexposed slice of the catalog. Model B may produce a slightly noisier immediate ranking while creating a better long-run sequence for users who value novelty or have specialized tastes.&lt;/p&gt;

&lt;p&gt;This is the kind of divergence a later demo can make visible. Two models may look similar on an aggregate offline metric and still differ meaningfully in repetition, novelty, and which user groups they serve well.&lt;/p&gt;

&lt;h4&gt;
  
  
  A Small MovieLens Demo
&lt;/h4&gt;

&lt;p&gt;To make that less abstract, I built a small comparison on MovieLens 100K. The setup is intentionally simple. Model A is a popularity baseline. Model B is a lightweight personalized recommender built from user genre profiles with a modest popularity prior. The point is not to produce the strongest possible recommender. The point is to see what different layers of evaluation say about the same pair of systems.&lt;/p&gt;
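
&lt;p&gt;Before the numbers, here is roughly the shape of the two systems. This is an illustrative sketch, not the demo's actual code: the function names, the &lt;code&gt;(user, item, rating)&lt;/code&gt; interaction format, and the blend weight &lt;code&gt;alpha&lt;/code&gt; are all assumptions made for the example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter, defaultdict

def popularity_ranker(interactions, k=10):
    """Model A sketch: rank the globally most-watched items for every user."""
    counts = Counter(item for _, item, _ in interactions)
    top = [item for item, _ in counts.most_common(k)]
    return lambda user: top

def genre_profile_ranker(interactions, genres, alpha=0.2, k=10):
    """Model B sketch: score items by overlap with the user's genre profile,
    blended with a modest popularity prior (alpha is illustrative)."""
    counts = Counter(item for _, item, _ in interactions)
    max_count = max(counts.values())
    profiles = defaultdict(Counter)
    for user, item, _ in interactions:
        profiles[user].update(genres[item])
    def rank(user):
        prof = profiles[user]
        total = sum(prof.values()) or 1
        def score(item):
            fit = sum(prof[g] / total for g in genres[item])
            pop = counts[item] / max_count
            return (1 - alpha) * fit + alpha * pop
        return sorted(genres, key=score, reverse=True)[:k]
    return rank
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Even at this toy level, the structural difference is visible: Model A serves one global list, while Model B's ranking moves with each user's profile.&lt;/p&gt;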

&lt;p&gt;&lt;strong&gt;Aggregate view:&lt;/strong&gt; on standard offline ranking metrics, Model A looks better.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Recall@10&lt;/th&gt;
&lt;th&gt;NDCG@10&lt;/th&gt;
&lt;th&gt;Novelty&lt;/th&gt;
&lt;th&gt;Repetition&lt;/th&gt;
&lt;th&gt;Catalog concentration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model A&lt;/td&gt;
&lt;td&gt;0.088&lt;/td&gt;
&lt;td&gt;0.057&lt;/td&gt;
&lt;td&gt;0.395&lt;/td&gt;
&lt;td&gt;0.675&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model B&lt;/td&gt;
&lt;td&gt;0.058&lt;/td&gt;
&lt;td&gt;0.036&lt;/td&gt;
&lt;td&gt;0.678&lt;/td&gt;
&lt;td&gt;0.693&lt;/td&gt;
&lt;td&gt;0.717&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
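
&lt;p&gt;For reference, the two ranking metrics in the table can be computed as follows. This is the standard binary-relevance formulation, written from scratch for illustration rather than taken from the demo's code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math

def recall_at_k(ranked, relevant, k=10):
    """Share of a user's held-out items recovered in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: discounted gain over the ideal ordering."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2)
                for pos in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

print(recall_at_k(["a", "b", "c"], {"a", "d"}, k=3))  # 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both metrics reward recovering held-out interactions near the top of the list, which is precisely why they inherit the exposure pattern of the logs they are computed on.&lt;/p&gt;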

&lt;p&gt;If we stopped there, the conclusion would be straightforward: the popularity baseline wins offline.&lt;/p&gt;

&lt;p&gt;But that is exactly the point of the article. Once the evaluation is widened beyond a single aggregate view, the picture changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bucketed view:&lt;/strong&gt; the same two models look quite different once we ask who is being served well.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bucket&lt;/th&gt;
&lt;th&gt;Model A utility&lt;/th&gt;
&lt;th&gt;Model B utility&lt;/th&gt;
&lt;th&gt;Delta (B-A)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Conservative mainstream&lt;/td&gt;
&lt;td&gt;0.519&lt;/td&gt;
&lt;td&gt;0.532&lt;/td&gt;
&lt;td&gt;0.012&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explorer / novelty-seeking&lt;/td&gt;
&lt;td&gt;0.339&lt;/td&gt;
&lt;td&gt;0.523&lt;/td&gt;
&lt;td&gt;0.184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Niche-interest&lt;/td&gt;
&lt;td&gt;0.443&lt;/td&gt;
&lt;td&gt;0.722&lt;/td&gt;
&lt;td&gt;0.279&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-patience&lt;/td&gt;
&lt;td&gt;0.321&lt;/td&gt;
&lt;td&gt;0.364&lt;/td&gt;
&lt;td&gt;0.043&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bucketed results are more revealing than the aggregate ones. Explorer users and niche-interest users benefit much more from Model B. Low-patience users also do slightly better under Model B in the short-session simulation, even though the aggregate offline ranking metrics still prefer Model A.&lt;/p&gt;
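
&lt;p&gt;The bucketed view itself requires nothing exotic. A sketch of the aggregation, assuming per-user utilities for both models and a predefined bucket label per user (both inputs are assumptions of this example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import defaultdict

def bucketed_utility(per_user_utility, user_bucket):
    """Mean utility per user bucket, plus the B minus A delta.

    per_user_utility: dict mapping user to (utility_A, utility_B).
    user_bucket: dict mapping user to a segment label.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for user, (ua, ub) in per_user_utility.items():
        agg = sums[user_bucket[user]]
        agg[0] += ua
        agg[1] += ub
        agg[2] += 1
    return {bucket: {"A": sa / n, "B": sb / n, "delta": (sb - sa) / n}
            for bucket, (sa, sb, n) in sums.items()}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The hard part is not the arithmetic; it is deciding which buckets are worth looking at before the aggregate score averages them away.&lt;/p&gt;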

&lt;p&gt;The behavior diagnostics tell a related story. Model B is substantially more novel and much less concentrated in the most popular slice of the catalog. For explorer users, bucket-level novelty rises from &lt;code&gt;0.405&lt;/code&gt; under Model A to &lt;code&gt;0.808&lt;/code&gt; under Model B. For niche-interest users, mean bucket utility rises by &lt;code&gt;0.279&lt;/code&gt;. That is not a rounding error. It is a segment-level change that the aggregate offline metrics compress away.&lt;/p&gt;
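
&lt;p&gt;Catalog concentration, as used here, can be read as the share of served recommendations that fall in the historically most-exposed head of the catalog. A minimal sketch, with the head fraction chosen arbitrarily for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter

def catalog_concentration(all_recs, historical_impressions, head_fraction=0.1):
    """Share of served recommendations that land inside the historically
    most-exposed slice of the catalog (head_fraction is illustrative)."""
    counts = Counter(historical_impressions)
    ordered = [item for item, _ in counts.most_common()]
    head_size = max(1, int(len(ordered) * head_fraction))
    head = set(ordered[:head_size])
    if not all_recs:
        return 0.0
    return sum(1 for item in all_recs if item in head) / len(all_recs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A value near &lt;code&gt;1.0&lt;/code&gt;, like Model A's, means the recommender almost never leaves the head of the catalog, whatever its ranking metrics say.&lt;/p&gt;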

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkffspju06ezccanmrb2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkffspju06ezccanmrb2o.png" alt="Bucket-level utility comparison from the MovieLens demo" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the demo says in one glance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate offline metrics favor Model A.&lt;/li&gt;
&lt;li&gt;Explorer, niche-interest, and low-patience buckets do better under Model B.&lt;/li&gt;
&lt;li&gt;Model B is much more novel and less concentrated in the most popular slice of the catalog.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Two short traces make the difference more tangible.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explorer / novelty-seeking user&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model A: Raiders of the Lost Ark -&amp;gt; Fargo -&amp;gt; Toy Story -&amp;gt; Return of the Jedi
Model B: Prophecy, The -&amp;gt; Cat People -&amp;gt; Wes Craven's New Nightmare -&amp;gt; Relic, The
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first sequence stays close to familiar, high-exposure titles. The second is much more novel and much more tailored to a narrower taste profile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Low-patience user&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model A: Star Wars -&amp;gt; Fargo -&amp;gt; Return of the Jedi -&amp;gt; Toy Story
Model B: Monty Python and the Holy Grail -&amp;gt; Full Monty -&amp;gt; American President -&amp;gt; Truth About Cats &amp;amp; Dogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here the difference is not just novelty. The second sequence moves through a less concentrated slice of the catalog rather than repeatedly returning to the same mainstream core.&lt;/p&gt;

&lt;p&gt;This small demo does not prove that Model B is globally better. It does something more modest and more useful. It shows that the answer depends on what we mean by "better," which users we care about, and whether we look only at historical ranking recovery or also at the behavior a recommender produces over short trajectories.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. A Better Direction, Briefly
&lt;/h3&gt;

&lt;p&gt;If offline evaluation is necessary but incomplete, the natural response is not to discard it. The better response is to build a broader evaluation stack around it.&lt;/p&gt;

&lt;p&gt;That broader stack should start from the failure modes already discussed. If logged exposure is policy-dependent, then evaluation should be more explicit about where the evidence is strong and where it is weak. If quality emerges over time, then some part of evaluation should examine sequences rather than only one-step ranking recovery.&lt;/p&gt;

&lt;p&gt;In practice, this suggests a modest shift in emphasis. Instead of asking only for a single aggregate offline score, teams can also ask how models behave across user segments, how concentrated their recommendations become, how much novelty they introduce, and whether their behavior looks meaningfully different over short interaction traces.&lt;/p&gt;

&lt;p&gt;For the movie example, that might mean comparing Model A and Model B not only on Recall@K or NDCG@K, but also on repetition, tail exposure, and bucket-level outcomes for users with different appetites for familiarity or exploration. None of these measurements solves the full problem. They simply make the evaluation better matched to the system being evaluated.&lt;/p&gt;

&lt;p&gt;The same logic also motivates carefully designed simulated interaction or short trajectory-based testing. The point is not that such methods are already complete or universally trustworthy. The point is narrower: if recommenders shape future behavior, then some part of the evaluation stack should attempt to probe that interaction rather than treating historical replay as the whole story.&lt;/p&gt;
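
&lt;p&gt;As a flavor of what such a probe might look like, here is a deliberately toy simulated-user loop. The user model inside it is an assumption chosen for illustration, not a validated behavioral model; the only claim is that even a crude probe of this shape surfaces trajectory properties that one-step replay cannot.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def probe_trajectory(ranker, user, sessions=5, boredom=0.2):
    """Toy simulated-user probe: a synthetic user takes the top unseen item
    each session; sessions offering nothing new accumulate a fatigue score."""
    seen = set()
    fatigue = 0.0
    accepted = []
    for _ in range(sessions):
        slate = ranker(user)
        fresh = [item for item in slate if item not in seen]
        if fresh:
            accepted.append(fresh[0])
            seen.add(fresh[0])
        else:
            fatigue += boredom  # repetition with nothing new: boredom builds
    return {"accepted": accepted, "fatigue": round(fatigue, 3)}

static = lambda user: ["a", "b"]  # a ranker that never varies its slate
print(probe_trajectory(static, "u"))  # {'accepted': ['a', 'b'], 'fatigue': 0.6}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A static slate exhausts its value after two sessions; a ranker that refreshes its slate keeps the fatigue score at zero over the same horizon.&lt;/p&gt;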

&lt;p&gt;This is best understood as complement, not replacement. Offline evaluation remains the fast and reliable first layer. But serious evaluation of recommender quality likely needs additional layers that are more sensitive to exposure shifts, segment differences, and longer-run experience.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. Conclusion
&lt;/h3&gt;

&lt;p&gt;Offline evaluation remains one of the most useful tools in recommender systems. It is fast, practical, and deeply embedded in how teams iterate on models.&lt;/p&gt;

&lt;p&gt;Its limitation is structural rather than procedural. The data it relies on is constrained by prior exposure and generated under earlier policies, so it provides only a partial test of recommender quality.&lt;/p&gt;

&lt;p&gt;That matters most when a model changes what gets shown, expands beyond historically overexposed items, or affects the experience over repeated interaction. In those settings, replaying the past is not the same as evaluating the new system on its own terms.&lt;/p&gt;

&lt;p&gt;Offline evaluation is indispensable, but it is not the whole test. Recommendation systems shape the behavior they later observe, so any serious evaluation stack should measure interaction, not just replay the past.&lt;/p&gt;

&lt;p&gt;This demo is illustrative rather than definitive; its value is in showing how aggregate offline results can hide segment-level differences.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>architecture</category>
      <category>learning</category>
    </item>
    <item>
      <title>How GenAI Genesis Began</title>
      <dc:creator>Alankrit Verma</dc:creator>
      <pubDate>Sat, 07 Mar 2026 05:12:44 +0000</pubDate>
      <link>https://forem.com/alankritverma/how-genai-genesis-began-523b</link>
      <guid>https://forem.com/alankritverma/how-genai-genesis-began-523b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alankrit Verma&lt;/strong&gt; came to the University of Toronto as a shy, math-driven student on scholarship who felt a deep responsibility to give back.&lt;/p&gt;

&lt;p&gt;That instinct led him into student leadership through &lt;strong&gt;AMACSS&lt;/strong&gt;, where he helped build a small experiment called &lt;strong&gt;AI Olympics&lt;/strong&gt; with &lt;strong&gt;39 participants&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That experiment revealed something bigger: students wanted a serious space to build, learn, and belong in AI.&lt;/p&gt;

&lt;p&gt;So Alankrit and his co-founder &lt;strong&gt;Adib Fallahpour&lt;/strong&gt; scaled that spark into &lt;strong&gt;GenAI Genesis&lt;/strong&gt; — first as a cross-campus student hackathon, and eventually into one of Canada’s largest student AI hackathons.&lt;/p&gt;

&lt;p&gt;Along the way, Alankrit helped lead the vision, website, sponsorships, partnerships, and long-term structure behind the event, including helping establish the &lt;strong&gt;GenAI Genesis Foundation&lt;/strong&gt; so the mission could sustain beyond a single organizing cycle.&lt;/p&gt;

&lt;p&gt;And now, in &lt;strong&gt;2026&lt;/strong&gt;, GenAI Genesis is entering its biggest and most ambitious chapter yet.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  From a 39-person experiment to one of Canada’s largest student AI hackathons
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;“Some communities are joined. Others are built because you cannot stop thinking about the version that should exist.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnzvkfycj5rvl2riwe9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnzvkfycj5rvl2riwe9f.png" alt="GenAI Genesis team with a celebration cake"&gt;&lt;/a&gt;&lt;br&gt;GenAI Genesis team with the cake. Surprise cake courtesy of Hasleen Kaur (Head of Finance 2025, Co-Chair 2026) and Ivan Semenov (Head of Operations 2025, Co-Chair 2026).
  &lt;/p&gt;




&lt;p&gt;There are some things you plan carefully.&lt;/p&gt;

&lt;p&gt;And then there are some things that begin so quietly, so casually, that you do not realize until much later that you were standing at the start of something much bigger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GenAI Genesis&lt;/strong&gt; was one of those things.&lt;/p&gt;

&lt;p&gt;When I joined the University of Toronto, I was, in many ways, still a shy person.&lt;/p&gt;

&lt;p&gt;I was not the loudest voice in every room. I was still figuring myself out, still trying to understand what kind of life I wanted to build, and what kind of contribution I wanted to make.&lt;/p&gt;

&lt;p&gt;But I did know one thing with complete clarity: I had been given a rare opportunity, and I did not want to waste it.&lt;/p&gt;

&lt;p&gt;Coming to this country and this university on a scholarship meant a lot to me. It gave me the ability to study freely, dream more freely, and imagine a future I may not otherwise have had. And from the beginning, that created a very deep feeling in me: &lt;strong&gt;I had to give back to the community that had given so much to me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At that time, I cared about many things at once.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I cared about &lt;strong&gt;math&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;I cared about &lt;strong&gt;building projects&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;I cared about &lt;strong&gt;recognition&lt;/strong&gt;, yes — but not just for ego. I wanted to build things that mattered.&lt;/li&gt;
&lt;li&gt;I cared about &lt;strong&gt;real-world impact&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;And I cared, very deeply, about &lt;strong&gt;community&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Math had been a big part of my identity for a long time. I had prepared seriously for the &lt;strong&gt;Euclid Mathematics Contest&lt;/strong&gt; and scored &lt;strong&gt;90/100&lt;/strong&gt;, and that experience mattered to me for more than just the number. Euclid is one of those milestones that gives you credibility, but more importantly, it gave me confidence. It made me more ambitious. It made me believe that I could build something meaningful. And it made me want to create spaces where other students could feel that same sense of challenge, excitement, and possibility.&lt;/p&gt;

&lt;p&gt;So when I came to &lt;strong&gt;U of T Scarborough&lt;/strong&gt;, I started looking around and asking myself a simple question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where is that energy here?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And honestly, at the time, I did not see enough of it.&lt;/p&gt;

&lt;p&gt;Especially in the computer science space, the student community was not really booming. This was the period after COVID, when many campus communities still felt quiet, fragmented, and difficult to revive. There was talent, but not enough momentum. Curiosity, but not enough structure. Ambition, but not enough spaces for it to gather.&lt;/p&gt;

&lt;p&gt;And somewhere along the way, I quietly made it my mission to help fix that.&lt;/p&gt;

&lt;p&gt;Not alone, of course. Communities are never built alone. But I wanted to be one of the people pushing hard in that direction.&lt;/p&gt;

&lt;p&gt;That instinct led me to &lt;strong&gt;AMACSS&lt;/strong&gt; — the &lt;strong&gt;Association of Mathematical and Computer Science Students&lt;/strong&gt;, the Departmental Student Association for the CMS department at the University of Toronto Scarborough.&lt;/p&gt;

&lt;p&gt;In my first year, I joined as a &lt;strong&gt;First-Year Representative Coordinator&lt;/strong&gt;, where I represented first-year computer science and math students to the association, and the association back to them. I also coordinated a team of &lt;strong&gt;seven people&lt;/strong&gt;, which turned out to be one of my first real lessons in leadership.&lt;/p&gt;

&lt;p&gt;Leadership, I learned very quickly, is not just about taking initiative. It is about understanding people. It is about assigning responsibility thoughtfully. It is about getting buy-in. It is about leading with grace when everyone has different levels of energy, skill, confidence, and commitment.&lt;/p&gt;

&lt;p&gt;I had always been someone who liked taking initiative, but AMACSS sharpened that instinct into something more deliberate.&lt;/p&gt;

&lt;p&gt;And in that chapter of my life, the first version of GenAI Genesis quietly appeared.&lt;/p&gt;

&lt;p&gt;Not as GenAI Genesis.&lt;/p&gt;

&lt;p&gt;Not yet.&lt;/p&gt;

&lt;p&gt;It started as something called &lt;strong&gt;AI Olympics&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Before Genesis, there was AI Olympics
&lt;/h2&gt;

&lt;p&gt;AI Olympics was the first real experiment.&lt;/p&gt;

&lt;p&gt;The original idea came from a mix of inspirations.&lt;/p&gt;

&lt;p&gt;Part of it came from my love for mathematics competitions and the kind of intellectual excitement they create. Part of it came from online hackathons I had participated in, where I had seen how energizing it could be when people come together to build under time pressure. I remember thinking again and again:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Why do we not have something like this at our university too?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At one point, I was brainstorming with &lt;strong&gt;Katrina Best&lt;/strong&gt;, who was the president at the time, about what might make for a strong event for first-year students. At first, we thought about doing something closer to a math contest. Then the idea evolved. I brainstormed with my team as well. Slowly, the concept shifted from Olympiad to something more build-oriented, more alive, more experimental.&lt;/p&gt;

&lt;p&gt;That is where &lt;strong&gt;AI Olympics&lt;/strong&gt; was born.&lt;/p&gt;

&lt;p&gt;The name came from that same spirit. We wanted something that felt like an Olympiad, but more modern, more hands-on, and more builder-focused. “AI Olympics” felt close enough to that energy, and at the time, it captured exactly what we were trying to do.&lt;/p&gt;

&lt;p&gt;It was a smaller hackathon-style event, around &lt;strong&gt;six to seven hours long&lt;/strong&gt;, with &lt;strong&gt;39 participants&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It was essentially the first-year team’s event through AMACSS, and leading it as First-Year Representative Coordinator made it feel especially personal.&lt;/p&gt;

&lt;p&gt;Many of the participants were beginners. The vibe in the room was not “elite competition” in the intimidating sense. It was much more like collective learning. People were curious. People were experimenting. People were just starting to understand what they could build.&lt;/p&gt;

&lt;p&gt;We taught participants how to use the tools. We gave them a website template they could plug their work into so they could build faster. We wanted to reduce friction and maximize momentum. We wanted them to feel like they could actually make something, even if they were just getting started.&lt;/p&gt;

&lt;p&gt;And maybe one of my favorite little memories from that day is how we kept ordering coffee from Tim Hortons — not once, not twice, but three times — because people kept wanting more, and apparently everyone had collectively decided that vanilla was the flavor of innovation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u70vfxz05as93xfbc06.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u70vfxz05as93xfbc06.jpeg" alt="AI Olympics"&gt;&lt;/a&gt;&lt;br&gt;AI Olympics
  &lt;/p&gt;

&lt;p&gt;Looking back, AI Olympics was small.&lt;/p&gt;

&lt;p&gt;But it was not small in meaning.&lt;/p&gt;

&lt;p&gt;Because it showed us something important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;People wanted a space to &lt;strong&gt;learn&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;People wanted a space to &lt;strong&gt;build&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;People wanted a space where AI felt &lt;strong&gt;exciting, approachable, social, and full of possibility&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The feedback made that obvious. People were interested in doing this again. They wanted to keep learning. They wanted to keep contributing. They wanted to build in public. They wanted more.&lt;/p&gt;

&lt;p&gt;And that was the moment the idea stopped feeling like a one-off event and started feeling like the beginning of a much larger mission.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Olympics was the spark.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GenAI Genesis was the system we built around that spark.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The moment it stopped being small
&lt;/h2&gt;

&lt;p&gt;Around that time, I was also working very closely with &lt;strong&gt;Adib Fallahpour&lt;/strong&gt;, who is not just my co-founder, but also a good friend of mine.&lt;/p&gt;

&lt;p&gt;I had first worked with Adib through my first-year team, and over time it became very clear that we were on the same wavelength in a lot of ways. He is a very kind person, a big thinker, and someone with strong vision. We both cared deeply about scaling this beyond its first version. We both felt that it should not remain a small campus event that people vaguely remembered. We wanted it to become something real.&lt;/p&gt;

&lt;p&gt;I still remember a moment from second year when Adib and I were housemates. He came into my room, and we started discussing what this thing could actually become. Not just another event. Not just another student initiative. But a serious hackathon. Something with real scale. Something that could create a home for people interested in AI, machine learning, software, and ambitious building more broadly.&lt;/p&gt;

&lt;p&gt;That conversation stayed with me.&lt;/p&gt;

&lt;p&gt;Because from that point onward, this stopped being a nice idea and started becoming a serious project.&lt;/p&gt;

&lt;p&gt;Like most ambitious student things, it began with a lot of conversations, a lot of hustle, and a slightly unreasonable amount of belief.&lt;/p&gt;

&lt;p&gt;We first tried to define the idea on paper:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What exactly was &lt;strong&gt;GenAI Genesis&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;What would it look like at scale?&lt;/li&gt;
&lt;li&gt;What kind of experience were we trying to create?&lt;/li&gt;
&lt;li&gt;What problem were we solving?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The problem, at least to us, felt clear.&lt;/p&gt;

&lt;p&gt;At the time, there was not enough community in Toronto around this space — not the kind of entrepreneurial, energetic, builder-first AI ecosystem we wanted to see among students. There was talent, but not enough connected ambition. There were students interested in AI and ML, but not enough platforms bringing them together in a serious way.&lt;/p&gt;

&lt;p&gt;So we decided to build one.&lt;/p&gt;

&lt;p&gt;The name came together surprisingly quickly. We broke it into two parts: &lt;strong&gt;GenAI&lt;/strong&gt; and &lt;strong&gt;Genesis&lt;/strong&gt;. “Genesis” suggested beginning, emergence, evolution. And at the time, “GenAI” was the word in the air. The name reflected the moment, but the mission was always broader than just generative AI — it was about AI, machine learning, software, and the community around building them. Put together, it felt like a beginning worth naming.&lt;/p&gt;

&lt;p&gt;We did not have a full team immediately. At first, we were figuring it out from scratch. Both Adib and I were part of &lt;strong&gt;Google Developer Student Club&lt;/strong&gt;, and that gave us one starting point. We knew we could bring in people from there. Then we looked beyond Scarborough and started reaching across campuses, especially to St. George, where some of the strongest technical student communities already existed.&lt;/p&gt;

&lt;p&gt;That is how collaborations started taking shape with groups like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GDG&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UTMIST&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UofT AI&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;and later, &lt;strong&gt;CSSU&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I want to be careful and clear here, because this matters to the story.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Important distinction:&lt;/strong&gt; GenAI Genesis was not a club-created event that I happened to be involved in.

&lt;p&gt;Those groups mattered enormously, and their support helped the vision scale far beyond what we could have done alone in the early stages. They brought expertise, reach, operational support, and legitimacy. But the mission itself — the initial push, the insistence that this had to exist — came from us.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;p&gt;That distinction matters because founder stories can get flattened over time into partnerships, logos, and sponsor lists. But the truth is usually more human than that. It begins with a few people seeing a gap and deciding they are not willing to leave it empty.&lt;/p&gt;




&lt;h2&gt;
  
  
  The people who helped us scale it
&lt;/h2&gt;

&lt;p&gt;Some of the most important early support came from people who believed in the idea and helped us take it seriously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Richard&lt;/strong&gt;, from &lt;strong&gt;UTMIST&lt;/strong&gt;, played a crucial role in 2024. He was a senior to us and incredibly strong operationally. He helped us understand what it means to run something at scale, what it means to think through logistics properly, and how to turn energy into structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nimit&lt;/strong&gt;, through &lt;strong&gt;UofT AI&lt;/strong&gt;, also played a very important role in helping the initiative come together. Both Richard and Nimit helped us build the cross-campus support that allowed GenAI Genesis to grow beyond its first form.&lt;/p&gt;

&lt;p&gt;These collaborations mattered a lot. Not because the event “belonged” to those communities, but because they helped us bring the mission to the scale it deserved.&lt;/p&gt;

&lt;p&gt;Sometimes scaling an idea is not about finding people who will take it over.&lt;/p&gt;

&lt;p&gt;It is about finding people who understand it enough to help it rise.&lt;/p&gt;




&lt;h2&gt;
  
  
  A quick timeline
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;th&gt;Why it mattered&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Winter 2023&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We ran &lt;strong&gt;AI Olympics&lt;/strong&gt; through AMACSS with &lt;strong&gt;39 participants&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;It proved there was real demand for a build-first AI space&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Winter 2024&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We launched the first large-scale &lt;strong&gt;GenAI Genesis&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;The experiment became a serious institution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2025&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We scaled dramatically with more sponsors, more prizes, and many more submissions&lt;/td&gt;
&lt;td&gt;The hackathon became a recognized force in the student AI ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We are taking it to our biggest scale yet&lt;/td&gt;
&lt;td&gt;Bigger footprint, bigger ambition, bigger future&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;
  &lt;strong&gt;A tiny behind-the-scenes truth:&lt;/strong&gt;&lt;br&gt;
Every row in that table was held together by a lot of invisible work: outreach, relationship management, budget stress, website iterations, venue uncertainty, and a hundred tiny decisions that never show up in a recap post.
&lt;/p&gt;




&lt;h2&gt;
  
  
  2024: when the idea met reality
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo95hb4v5y5z5p667kxfn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo95hb4v5y5z5p667kxfn.png" alt="Participants and organizers during GenAI Genesis 2024"&gt;&lt;/a&gt;&lt;br&gt;Getting Started with GenAI Genesis 2024
  &lt;/p&gt;

&lt;p&gt;The 2024 edition was the moment things started to feel very real.&lt;/p&gt;

&lt;p&gt;In winter 2024, we launched the first large-scale GenAI Genesis in downtown Toronto.&lt;/p&gt;

&lt;p&gt;Up until that point, the idea had energy. It had promise. It had momentum. But 2024 was when it had to survive the test every ambitious student project eventually faces:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Could we actually execute this at scale?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That year taught me a lot.&lt;/p&gt;

&lt;p&gt;And by “a lot,” I mean the kind of lessons that only appear when vision collides with logistics.&lt;/p&gt;

&lt;p&gt;We had to learn how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;work with a much larger team&lt;/li&gt;
&lt;li&gt;coordinate across campuses&lt;/li&gt;
&lt;li&gt;lead people with different styles, strengths, and expectations&lt;/li&gt;
&lt;li&gt;manage conflict and disagreement without letting it fracture the mission&lt;/li&gt;
&lt;li&gt;build trust with sponsors&lt;/li&gt;
&lt;li&gt;make big promises responsibly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And in the middle of all that, I was deeply involved in the work itself.&lt;/p&gt;

&lt;p&gt;From the very beginning until now, I have led the &lt;strong&gt;website side&lt;/strong&gt; of GenAI Genesis. Tech has always been one of the areas I stayed especially close to. I was also heavily involved in &lt;strong&gt;sponsorships and partnerships&lt;/strong&gt; — doing cold outreach, talking to organizations, building those relationships, and helping create the external support system that made the event possible.&lt;/p&gt;

&lt;p&gt;One of the most memorable parts of that journey was our connection with &lt;strong&gt;Google&lt;/strong&gt;, and how that relationship went from something that initially felt surreal to something that became a meaningful long-term thread in the GenAI Genesis story. There is a strange feeling when big names start trusting something you built. It is exciting, but it is also sobering. It makes you realize the stakes are now real.&lt;/p&gt;

&lt;p&gt;The 2024 edition brought in support from names including &lt;strong&gt;Google, Knockri, Wombo, Vector Institute, the Academic Advising &amp;amp; Career Centre at UTSC, and the Rotman School of Management&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We had around &lt;strong&gt;254 participants submit a project&lt;/strong&gt; and awarded roughly &lt;strong&gt;$3,000 in prizes&lt;/strong&gt;.&lt;br&gt;
But what I remember most is not just the number.&lt;/p&gt;

&lt;p&gt;I remember how much we had to figure out on the fly.&lt;/p&gt;

&lt;p&gt;Venue booking was a huge hassle. A lot of things were fragile. &lt;strong&gt;Judging&lt;/strong&gt;, especially, was something we did not have perfect prior experience with at that scale. And yet, when the time came, the team handled it with surprising grace. We made last-minute changes to make sure the judging process was fair, thoughtful, and well run. That moment stayed with me because it showed me something essential: even if we were new to this scale, we were capable of rising to it.&lt;/p&gt;

&lt;p&gt;That was the year GenAI Genesis stopped feeling like a hopeful experiment.&lt;/p&gt;

&lt;p&gt;It felt real.&lt;/p&gt;


&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq2v3jbys38ol98qr0gr.png" alt="Participants and organizers during GenAI Genesis 2024"&gt;&lt;br&gt;GenAI Genesis 2024&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;genesis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;v0&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Olympics"&lt;/span&gt;
  &lt;span class="na"&gt;participants&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;39&lt;/span&gt;
  &lt;span class="na"&gt;then&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;an experiment&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a room full of beginners&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a lot of coffee&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a lot of belief&lt;/span&gt;
  &lt;span class="na"&gt;now&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a cross-campus movement&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a large-scale AI hackathon&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;a serious community&lt;/span&gt;
  &lt;span class="na"&gt;constant&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;vision&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;people&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;momentum&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2025: scale changes everything
&lt;/h2&gt;

&lt;p&gt;Then came &lt;strong&gt;2025&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And 2025 felt different.&lt;/p&gt;

&lt;p&gt;This was the year when GenAI Genesis started feeling less like an event and more like an ecosystem.&lt;/p&gt;

&lt;p&gt;By then, we were no longer operating entirely from instinct. We had learned processes. We had built systems. We had a better understanding of what worked, what broke, what participants valued, and what scale actually requires. We planned earlier. We moved more formally. We operated with more clarity.&lt;/p&gt;

&lt;p&gt;The leadership structure also evolved.&lt;/p&gt;

&lt;p&gt;In the earlier chapter, the co-chair structure included &lt;strong&gt;me, Adib, Nimit, and Richard&lt;/strong&gt;. By 2025, the co-chairs were &lt;strong&gt;me, Adib, and Matthew Tamura&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Matthew had already been involved as a strong contributor in 2024 through UTMIST and was someone I deeply appreciated — thoughtful, visionary, and strong at leading a team properly. In 2025, he stepped into a bigger leadership role with us, and that made a real difference.&lt;/p&gt;

&lt;p&gt;We also worked hard to improve the participant experience in ways that went beyond the surface.&lt;/p&gt;

&lt;p&gt;We brought in more sponsors.&lt;br&gt;
We created more networking opportunities.&lt;br&gt;
We designed stronger supporting events during the hackathon.&lt;br&gt;
We sharpened logistics.&lt;br&gt;
We elevated the experience.&lt;/p&gt;

&lt;p&gt;And the scale reflected that.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;2025&lt;/strong&gt;, we had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;around &lt;strong&gt;$15,000 in awards&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;project submissions from &lt;strong&gt;621 participants&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;backing from &lt;strong&gt;Google, BWC, Cohere, AMD, CGI, RBC, Northeastern University, Edge.io Solutions, the Academic Advising &amp;amp; Career Centre, and the University of Toronto&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;support from partners including the &lt;strong&gt;United Nations Association in Canada, One Degree Cooler, Vector Institute, and Hack Canada&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the most exciting moments that year was when &lt;strong&gt;AMD&lt;/strong&gt; came in and supported us in a way that allowed participants to run more complex machine learning workloads on an AMD GPU-backed local cluster. That felt genuinely wild. It was one of those moments where you step back and realize the hackathon is not just getting bigger in numbers — it is getting more technically meaningful too.&lt;/p&gt;

&lt;p&gt;From the outside, growth can look glamorous.&lt;/p&gt;

&lt;p&gt;From the inside, it often looks like spreadsheets, calls, follow-ups, contingency planning, team alignment, venue negotiations, technical troubleshooting, partnership mapping, and a hundred open loops in your head at once.&lt;/p&gt;

&lt;p&gt;People usually see the lights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Founders remember the wiring.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And by 2025, I was exhausted. Truly.&lt;/p&gt;

&lt;p&gt;But it was also the kind of exhaustion that comes from building something you care about so deeply that you keep choosing it, again and again, even when it would be easier not to.&lt;/p&gt;

&lt;p&gt;There were many moments in those years where I could have spent my time doing something else for my résumé — some other project, some other opportunity, some other clean, convenient line on paper.&lt;/p&gt;

&lt;p&gt;And again and again, I chose GenAI Genesis.&lt;/p&gt;

&lt;p&gt;Because by then it was not just a project.&lt;/p&gt;

&lt;p&gt;It was a commitment.&lt;/p&gt;


&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmayw9gih1uzvqhly8jd0.jpg" alt="Some of the winners at GenAI Genesis 2025"&gt;&lt;br&gt;Some of the winners at GenAI Genesis 2025&lt;/p&gt;



&lt;h2&gt;
  
  
  What people see vs. what it takes
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;People usually experience a hackathon at the moment it becomes exciting.&lt;/p&gt;

&lt;p&gt;Founders experience it in the months before that, when it is still fragile.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What goes into building a hackathon like this?&lt;/p&gt;

&lt;p&gt;Not just posters and prize money.&lt;/p&gt;

&lt;p&gt;It looks more like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sponsor outreach and partnership management&lt;/li&gt;
&lt;li&gt;website design and technical infrastructure&lt;/li&gt;
&lt;li&gt;judge and mentor coordination&lt;/li&gt;
&lt;li&gt;cross-campus relationship building&lt;/li&gt;
&lt;li&gt;team alignment across different working styles&lt;/li&gt;
&lt;li&gt;planning future editions before the current one is even over&lt;/li&gt;
&lt;li&gt;making sure the vision survives internal complexity&lt;/li&gt;
&lt;li&gt;solving ten operational problems before breakfast&lt;/li&gt;
&lt;li&gt;keeping something founder-led while still making it collaborative&lt;/li&gt;
&lt;li&gt;doing a lot of invisible thinking about what the next step even is&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A polished event always has a chaotic prequel.&lt;/p&gt;

&lt;p&gt;And a surprising amount of inner work goes into making sure the chaos does not win.&lt;/p&gt;


&lt;h2&gt;
  
  
  What GenAI Genesis has meant to me
&lt;/h2&gt;

&lt;p&gt;At one level, GenAI Genesis is about &lt;strong&gt;AI and machine learning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But if I am being honest, it has never only been about AI.&lt;/p&gt;

&lt;p&gt;It is about &lt;strong&gt;belonging&lt;/strong&gt;.&lt;br&gt;
It is about &lt;strong&gt;ambition&lt;/strong&gt;.&lt;br&gt;
It is about &lt;strong&gt;opportunity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is about building the kind of space I wish existed more abundantly when I first arrived.&lt;/p&gt;

&lt;p&gt;A place where students do not just come to listen, collect swag, and leave. A place where they come to make things. To meet each other. To stretch. To take themselves seriously. To find their people. To realize that they are more capable than they thought.&lt;/p&gt;

&lt;p&gt;That is what I wanted to create.&lt;/p&gt;

&lt;p&gt;And I think that is why this has become more than just a hackathon to me.&lt;/p&gt;

&lt;p&gt;It has become a community, a signal, a platform, and in some ways, living proof that if you build the right room, the right people will find each other inside it.&lt;/p&gt;

&lt;p&gt;I have also learned a lot about myself through this.&lt;/p&gt;

&lt;p&gt;I learned how passionate I am about the things I truly care about. I learned how much I care about my team. I learned that leadership is not something you perform; it is something you practice. I learned how much invisible thinking goes into visible outcomes. I learned that building something meaningful costs time, energy, sleep, and sometimes other opportunities.&lt;/p&gt;

&lt;p&gt;But I also learned that some things are worth choosing repeatedly.&lt;/p&gt;

&lt;p&gt;And this was one of them.&lt;/p&gt;


&lt;h2&gt;
  
  
  The foundation behind the future
&lt;/h2&gt;

&lt;p&gt;As GenAI Genesis grew, it became increasingly important to make sure it could sustain itself beyond just the intensity of one year, one organizing cycle, or one group of students.&lt;/p&gt;

&lt;p&gt;That is a big part of why I helped establish the &lt;strong&gt;GenAI Genesis Foundation&lt;/strong&gt; as an NGO, along with four other directors.&lt;/p&gt;

&lt;p&gt;That step mattered deeply to me.&lt;/p&gt;

&lt;p&gt;Because if GenAI Genesis was going to keep growing properly, it needed more than momentum.&lt;/p&gt;

&lt;p&gt;It needed structure.&lt;br&gt;
It needed continuity.&lt;br&gt;
It needed a long-term home.&lt;/p&gt;

&lt;p&gt;Founding the Foundation was part of making sure that what we built would not just peak.&lt;/p&gt;

&lt;p&gt;It would endure.&lt;/p&gt;

&lt;p&gt;And I am very proud of that.&lt;/p&gt;


&lt;h2&gt;
  
  
  People I want to thank
&lt;/h2&gt;

&lt;p&gt;No founder story is ever truly solo.&lt;/p&gt;

&lt;p&gt;And GenAI Genesis certainly was not.&lt;/p&gt;

&lt;p&gt;I started this with &lt;strong&gt;Adib Fallahpour&lt;/strong&gt;, my co-founder, and I want to begin there. Thank you, Adib, for building this vision with me from the early days, for dreaming big, for caring deeply, and for helping turn a small experiment into something much larger than either of us could have reached alone.&lt;/p&gt;

&lt;p&gt;I want to thank &lt;strong&gt;Richard&lt;/strong&gt;, who helped us significantly in 2024 through &lt;strong&gt;UTMIST&lt;/strong&gt;. Richard brought strong operational guidance at a time when we were still learning how to scale properly, and his support played an important role in helping us bring the hackathon to life at a bigger level.&lt;/p&gt;

&lt;p&gt;I also want to thank &lt;strong&gt;Nimit&lt;/strong&gt;, who helped us through &lt;strong&gt;UofT AI&lt;/strong&gt; and contributed meaningfully to the growth of the initiative in its earlier large-scale chapter. Cross-campus support mattered a lot, and Nimit was part of that story.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;2025&lt;/strong&gt;, I want to thank &lt;strong&gt;Matthew Tamura&lt;/strong&gt;, who joined me and Adib as a co-chair in 2025. Matthew brought a lot of clarity, vision, and leadership to that year, and I deeply appreciated building that edition alongside him.&lt;/p&gt;

&lt;p&gt;And for &lt;strong&gt;2026&lt;/strong&gt;, I want to thank &lt;strong&gt;Hasleen Kaur&lt;/strong&gt; and &lt;strong&gt;Ivan Semenov&lt;/strong&gt;, who are co-chairing this year alongside me. Both are wonderful people and wonderful friends, and I am genuinely grateful to be building this chapter with them.&lt;/p&gt;

&lt;p&gt;There are many people behind the scenes who have contributed to GenAI Genesis over the years — teammates, sponsors, organizers, mentors, judges, friends, and supporters — and I carry a lot of gratitude for all of them.&lt;/p&gt;

&lt;p&gt;Communities may remember the banner.&lt;/p&gt;

&lt;p&gt;But founders remember the people who helped hold it up.&lt;/p&gt;


&lt;h2&gt;
  
  
  And now, 2026
&lt;/h2&gt;

&lt;p&gt;And now we arrive here.&lt;/p&gt;

&lt;p&gt;What started as a 39-person experiment has grown into something far bigger, and in &lt;strong&gt;2026&lt;/strong&gt;, we are taking GenAI Genesis to its biggest scale yet.&lt;/p&gt;

&lt;p&gt;This year, we are going much, much bigger.&lt;/p&gt;

&lt;p&gt;We are preparing to bring together &lt;strong&gt;close to 1,000 people in person&lt;/strong&gt;. We are building across &lt;strong&gt;three major spaces at the University of Toronto&lt;/strong&gt; — &lt;strong&gt;Convocation Hall, Bahen, and Myhal&lt;/strong&gt; — to create an experience that is bigger not just in attendance, but in ambition, energy, and depth.&lt;/p&gt;

&lt;p&gt;This year feels different.&lt;/p&gt;

&lt;p&gt;Not because the mission has changed, but because the scale has finally caught up to the size of the vision.&lt;/p&gt;

&lt;p&gt;We are crossing into four digits.&lt;br&gt;
We are building across multiple buildings.&lt;br&gt;
We are thinking bigger than ever before.&lt;/p&gt;

&lt;p&gt;And for me personally, this year is meaningful in another way too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2026 will be my last year serving as Co-Chair of the hackathon.&lt;/strong&gt; After this, I will be moving into more of an advisory role.&lt;/p&gt;

&lt;p&gt;There is something beautiful about that.&lt;/p&gt;

&lt;p&gt;Because one of the deepest measures of building something well is whether it can continue growing beyond the chapter where you are the one carrying it most directly.&lt;/p&gt;

&lt;p&gt;That is what I want for GenAI Genesis.&lt;/p&gt;

&lt;p&gt;I want it to outgrow any one person, any one year, any one team.&lt;/p&gt;

&lt;p&gt;I want it to keep becoming a place where ambitious students find each other, where builders take themselves seriously, where new ideas are given room to breathe, and where community feels like a force multiplier rather than just a word on a poster.&lt;/p&gt;

&lt;p&gt;So if you have been watching from the sidelines, this is your sign.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join us on March 13, 14, and 15, 2026.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Come build with us.&lt;br&gt;
Come meet the people shaping what comes next.&lt;br&gt;
Come be part of something that started small, but refused to stay small.&lt;/p&gt;

&lt;p&gt;And when you do, I hope you feel what I felt at the beginning of this whole journey:&lt;/p&gt;

&lt;p&gt;That strange, beautiful energy that appears when ambitious people gather around an idea and decide to make it real.&lt;/p&gt;

&lt;p&gt;That, in the end, is what GenAI Genesis has always been about.&lt;/p&gt;


&lt;h2&gt;
  
  
  Connect with me
&lt;/h2&gt;

&lt;p&gt;If this story resonated with you, feel free to connect with me online, follow GenAI Genesis, or reach out.&lt;/p&gt;

&lt;p&gt;I always love meeting people who care deeply about building communities, technology, and meaningful things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://genaigenesis.ca/" rel="noopener noreferrer"&gt;Official GenAI Genesis Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://genai-genesis-2025.devpost.com/" rel="noopener noreferrer"&gt;GenAI Genesis 2025 Devpost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://alankrit.me/" rel="noopener noreferrer"&gt;My Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instagram.com/genaigenesis/" rel="noopener noreferrer"&gt;Follow GenAI Genesis on Instagram&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/alankritverma/" rel="noopener noreferrer"&gt;Connect with me on LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://genaigenesis.ca/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;See the 2026 event, follow the journey, or reach out if you want to build something meaningful together.&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>ai</category>
      <category>hackathon</category>
      <category>community</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
