<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: AI OpenFree</title>
    <description>The latest articles on Forem by AI OpenFree (@ai_openfree_b23025ef075cf).</description>
    <link>https://forem.com/ai_openfree_b23025ef075cf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817626%2Ffb38c741-1265-48db-882c-fdc7579d9ef2.webp</url>
      <title>Forem: AI OpenFree</title>
      <link>https://forem.com/ai_openfree_b23025ef075cf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ai_openfree_b23025ef075cf"/>
    <language>en</language>
    <item>
      <title>Darwin-27B-Opus: Surpassing the Foundation Model Without Training</title>
      <dc:creator>AI OpenFree</dc:creator>
      <pubDate>Mon, 13 Apr 2026 01:44:02 +0000</pubDate>
      <link>https://forem.com/ai_openfree_b23025ef075cf/darwin-27b-opus-surpassing-the-foundation-model-without-training-542i</link>
      <guid>https://forem.com/ai_openfree_b23025ef075cf/darwin-27b-opus-surpassing-the-foundation-model-without-training-542i</guid>
      <description>&lt;p&gt;&lt;strong&gt;Zero training. Zero data. Single GPU. Two hours. World 5th on GPQA Diamond.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On April 12, 2026, a 27-billion-parameter model that had never undergone a single gradient update surpassed its own foundation model on one of the most demanding scientific reasoning benchmarks in existence.&lt;/p&gt;

&lt;p&gt;Darwin-27B-Opus achieved &lt;strong&gt;86.9%&lt;/strong&gt; on GPQA Diamond — a graduate-level evaluation spanning physics, chemistry, and biology — placing it &lt;strong&gt;5th globally&lt;/strong&gt; on the HuggingFace leaderboard. This exceeds the original Qwen3.5-27B (85.5%), as well as GLM-5.1 (744B, 86.2%) and Qwen3.5-122B (86.6%).&lt;/p&gt;

&lt;p&gt;A 27B model outperforming a 744B model. Without training.&lt;/p&gt;

&lt;p&gt;This post explains how.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Premise: Models Already Know Enough
&lt;/h2&gt;

&lt;p&gt;The prevailing approach to improving language models follows a familiar trajectory: curate more data, allocate more GPUs, train for longer. This works, but at enormous cost — and with diminishing returns as models approach the frontiers of their architectural capacity.&lt;/p&gt;

&lt;p&gt;Darwin begins from a different premise:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The knowledge required for superior performance already exists within the pretrained model ecosystem. The bottleneck is not insufficient knowledge — it is suboptimal knowledge organization.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Consider two models trained on different data with different objectives. Model A excels at scientific reasoning but struggles with Korean. Model B demonstrates strong Korean cultural understanding but weaker logical inference. Neither is universally superior. Yet somewhere in the space spanned by their 27 billion parameters lies a configuration that combines the best of both.&lt;/p&gt;

&lt;p&gt;Darwin finds that configuration — automatically, without training.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mechanism: Evolutionary FFN Breeding
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two Components, Two Roles
&lt;/h3&gt;

&lt;p&gt;Transformer-based language models consist of two fundamental building blocks at every layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Attention&lt;/strong&gt; decides &lt;em&gt;what to focus on&lt;/em&gt;. It routes information, builds contextual relationships, and chains reasoning steps together. It is the model's cognitive architecture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feed-Forward Networks (FFN)&lt;/strong&gt; decide &lt;em&gt;what to know&lt;/em&gt;. They store factual knowledge, encode learned patterns, and perform feature transformations. They are the model's knowledge base.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This distinction is not merely conceptual. Recent theoretical work (&lt;a href="https://arxiv.org/abs/2501.00823" rel="noopener noreferrer"&gt;arXiv:2501.00823&lt;/a&gt;) demonstrates mathematically that FFN layers can be expressed as a specialized form of cross-attention — reinforcing their role as modular, semi-independent knowledge repositories.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Surgical Insight
&lt;/h3&gt;

&lt;p&gt;Darwin exploits this modularity with a precise rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FFN layers can be transplanted&lt;/strong&gt; between architecturally compatible models. Knowledge transfers. Reasoning remains intact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attention layers must not be touched.&lt;/strong&gt; In empirical ablation, blending attention layers between models caused GPQA Diamond scores to collapse from 60% to 10% — a catastrophic failure of the reasoning chain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This asymmetry is the foundation of everything Darwin does.&lt;/p&gt;
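&lt;p&gt;In code, the rule looks like this. A minimal sketch, with toy dictionaries standing in for real state dicts and an assumed "layers.N.ffn/attn" key naming scheme — illustrative, not Darwin's actual implementation:&lt;/p&gt;

```python
# Transplant FFN weights per layer; never touch attention.
def breed(father, mother, ffn_ratios):
    """Blend FFN weights layer by layer; copy attention solely from the father."""
    child = {}
    for name, w_father in father.items():
        if ".ffn." in name:
            layer = int(name.split(".")[1])
            r = ffn_ratios[layer]               # fraction taken from the mother
            child[name] = (1.0 - r) * w_father + r * mother[name]
        else:
            child[name] = w_father              # attention stays intact
    return child

father = {"layers.0.attn.q": 2.0, "layers.0.ffn.w": 1.0,
          "layers.1.attn.q": 4.0, "layers.1.ffn.w": 3.0}
mother = {"layers.0.attn.q": 9.0, "layers.0.ffn.w": 5.0,
          "layers.1.attn.q": 9.0, "layers.1.ffn.w": 7.0}
child = breed(father, mother, ffn_ratios=[0.25, 0.75])
```

&lt;p&gt;Here layer 0 keeps 75% of the father's FFN weight while layer 1 keeps only 25%; these per-layer ratios are exactly what the evolutionary search described next has to find.&lt;/p&gt;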

&lt;h3&gt;
  
  
  Evolutionary Search, Not Manual Tuning
&lt;/h3&gt;

&lt;p&gt;Given that FFN layers &lt;em&gt;can&lt;/em&gt; be transplanted, the remaining question is: in what proportions?&lt;/p&gt;

&lt;p&gt;Naive approaches fail. A uniform 50:50 blend of two models produces inferior results — the averaged weights cancel out specialized knowledge rather than combining it. The optimal ratio varies not just between models, but between individual layers within a model.&lt;/p&gt;

&lt;p&gt;Darwin delegates this decision to &lt;strong&gt;CMA-ES&lt;/strong&gt; (Covariance Matrix Adaptation Evolution Strategy), a derivative-free optimizer designed for high-dimensional, non-convex landscapes. The algorithm treats per-layer blending ratios as a genome and evolves a population of candidate offspring, selecting for fitness on target benchmarks.&lt;/p&gt;

&lt;p&gt;The result: layer-specific ratios that no human could identify through intuition or grid search.&lt;/p&gt;
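&lt;p&gt;A stripped-down sketch of the idea: a simple elitist evolution strategy standing in for CMA-ES, with a synthetic fitness function in place of real benchmark scores. Everything here is illustrative, not Darwin's code:&lt;/p&gt;

```python
import random

NUM_LAYERS = 14                      # genome: one blend ratio per layer group
TARGET = [0.9 if i % 2 else 0.1 for i in range(NUM_LAYERS)]  # hidden "optimum"

def fitness(genome):
    # Synthetic benchmark proxy: higher is better, peaks at TARGET.
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))

def clamp01(x):
    return max(0.0, min(1.0, x))

random.seed(0)
genome = [0.5] * NUM_LAYERS          # start from a naive uniform 50:50 blend
sigma = 0.2                          # mutation step size, annealed per generation

for generation in range(60):
    children = [[clamp01(g + random.gauss(0.0, sigma)) for g in genome]
                for _ in range(16)]
    # Elitist selection: keep the fittest of parent and children.
    genome = max([genome] + children, key=fitness)
    sigma *= 0.96

print(round(fitness(genome), 3))     # much closer to 0 than the 50:50 start
```

&lt;p&gt;CMA-ES improves on this sketch by additionally adapting a full covariance matrix over the 14 genes, so correlated layer ratios are explored jointly rather than independently.&lt;/p&gt;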




&lt;h2&gt;
  
  
  The Experiment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Parents
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Strength&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Father&lt;/td&gt;
&lt;td&gt;Qwen3.5-27B&lt;/td&gt;
&lt;td&gt;201-language foundation, native reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mother&lt;/td&gt;
&lt;td&gt;Claude 4.6 Opus Reasoning Distilled&lt;/td&gt;
&lt;td&gt;Structured chain-of-thought reasoning via SFT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both models share identical architecture (hidden_size=4096, 64 layers), ensuring full compatibility for FFN crossbreeding.&lt;/p&gt;
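&lt;p&gt;That compatibility requirement can be expressed as a simple pre-flight check (a sketch with illustrative config dictionaries; a real check would read each model's config file):&lt;/p&gt;

```python
# Gate crossbreeding on matching architectural dimensions.
def compatible(cfg_a, cfg_b, keys=("hidden_size", "num_hidden_layers")):
    """True when every structural key matches between the two parents."""
    return all(cfg_a[k] == cfg_b[k] for k in keys)

father_cfg = {"hidden_size": 4096, "num_hidden_layers": 64}
mother_cfg = {"hidden_size": 4096, "num_hidden_layers": 64}
stranger_cfg = {"hidden_size": 5120, "num_hidden_layers": 40}

print(compatible(father_cfg, mother_cfg))    # matching parents
print(compatible(father_cfg, stranger_cfg))  # incompatible: FFN shapes differ
```

&lt;p&gt;A production gate would also compare vocabulary size and attention-head counts, since FFN tensors must align shape-for-shape.&lt;/p&gt;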

&lt;h3&gt;
  
  
  Process
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Diagnostic Scan&lt;/strong&gt; — Darwin's Model MRI profiled every layer of both parents, mapping functional specialization across reasoning, knowledge, language, and mathematics domains.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Evolutionary Optimization&lt;/strong&gt; — CMA-ES searched across a 14-dimensional genome space, evaluating candidate offspring against Korean knowledge benchmarks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Health Verification&lt;/strong&gt; — Automated post-merge checks confirmed structural and functional integrity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Total wall-clock time: approximately 2 hours on a single H100 GPU.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No data was loaded. No gradients were computed. No loss function was minimized.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results: GPQA Diamond
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Evaluation Protocol
&lt;/h3&gt;

&lt;p&gt;We designed a two-pass evaluation methodology that balances thoroughness with transparency:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass 1 — Deterministic Baseline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All 198 GPQA Diamond questions were evaluated with greedy decoding (do_sample=False), using the Epoch AI standard prompt format. This establishes the model's floor — its performance under the most conservative inference conditions.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;148 / 198 = 74.7%&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass 2 — Selective Retry with Adjudication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 50 questions answered incorrectly in Pass 1 were re-evaluated with 8 independent stochastic generations per question (temperature=0.7). The majority answer was selected as the revised response.&lt;/p&gt;

&lt;p&gt;For questions where the vote margin was ≤ 1 (contested outcomes), a &lt;strong&gt;verification round&lt;/strong&gt; presented the top two candidates side-by-side for comparative analysis via deterministic decoding. This adjudication mechanism addresses a well-known limitation of majority voting: when a model confidently produces the same wrong answer across multiple samples, the minority answer — which may be correct — is suppressed.&lt;/p&gt;

&lt;p&gt;Of 19 questions triggering adjudication, &lt;strong&gt;12 were successfully corrected&lt;/strong&gt; (63.2%).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final result: 172 / 198 = 86.9%&lt;/strong&gt;&lt;/p&gt;
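&lt;p&gt;The Pass-2 selection rule can be sketched as follows. This is a simplified stand-in: the real verification round re-queries the model with the top two candidates, which is out of scope here, so the sketch only flags which questions would be escalated:&lt;/p&gt;

```python
from collections import Counter

def vote(samples):
    """Majority-vote one question's stochastic samples.

    Returns (winner, runner_up, contested): contested questions, i.e. those
    with a vote margin of at most 1, are escalated to the verification round.
    """
    ranked = Counter(samples).most_common()
    winner, top = ranked[0]
    runner_up, second = ranked[1] if len(ranked) - 1 else (None, 0)
    contested = (top - second) in (0, 1)  # margin of 0 or 1 triggers adjudication
    return winner, runner_up, contested

# Eight temperature-0.7 generations for one retried question:
winner, runner_up, contested = vote(["B", "B", "C", "B", "C", "C", "C", "A"])
print(winner, runner_up, contested)       # C edges out B by one vote: contested
```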

&lt;h3&gt;
  
  
  Leaderboard Position (April 12, 2026)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;TNSA/NGen-4-Pro&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;91.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;TNSA/NGen-4&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;90.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Qwen3.5-397B-A17B&lt;/td&gt;
&lt;td&gt;397B&lt;/td&gt;
&lt;td&gt;88.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Kimi-K2.5&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;87.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Darwin-27B-Opus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;27B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86.9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Qwen3.5-122B-A10B&lt;/td&gt;
&lt;td&gt;122B&lt;/td&gt;
&lt;td&gt;86.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;GLM-5.1&lt;/td&gt;
&lt;td&gt;744B&lt;/td&gt;
&lt;td&gt;86.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;GLM-5&lt;/td&gt;
&lt;td&gt;744B&lt;/td&gt;
&lt;td&gt;86.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;GLM-4.7&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;85.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Qwen3.5-27B&lt;/td&gt;
&lt;td&gt;27B&lt;/td&gt;
&lt;td&gt;85.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Shard-Level Consistency
&lt;/h3&gt;

&lt;p&gt;The evaluation was parallelized across three GPU shards:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Shard&lt;/th&gt;
&lt;th&gt;Greedy&lt;/th&gt;
&lt;th&gt;After Retry&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Shard 0&lt;/td&gt;
&lt;td&gt;48/66 (72.7%)&lt;/td&gt;
&lt;td&gt;58/66 (87.9%)&lt;/td&gt;
&lt;td&gt;+15.2pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shard 1&lt;/td&gt;
&lt;td&gt;49/66 (74.2%)&lt;/td&gt;
&lt;td&gt;57/66 (86.4%)&lt;/td&gt;
&lt;td&gt;+12.1pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shard 2&lt;/td&gt;
&lt;td&gt;51/66 (77.3%)&lt;/td&gt;
&lt;td&gt;57/66 (86.4%)&lt;/td&gt;
&lt;td&gt;+9.1pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;148/198 (74.7%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;172/198 (86.9%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+12.1pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The consistency across shards (86.4%–87.9%) suggests the model's true capability is robustly centered around 86–87%.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hybrid Vigor: Evidence from Korean Benchmarks
&lt;/h2&gt;

&lt;p&gt;GPQA Diamond evaluates English scientific reasoning. To test whether evolutionary crossbreeding induces &lt;strong&gt;hybrid vigor&lt;/strong&gt; — offspring superiority over both parents — across languages and domains, we conducted a second experiment on an entirely different axis: Korean cultural intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Darwin-27B-KR&lt;/strong&gt; was bred from Darwin-27B-Opus (father, strong reasoning) and a Korean-specialized Qwen3.5-27B derivative (mother, strong cultural knowledge). We then evaluated all four generations on CLIcK (Cultural and Linguistic Intelligence in Korean), a 1,995-question benchmark spanning Korean culture, history, law, politics, and linguistics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Four Generations, One Benchmark (200 questions, 0-shot)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;CLIcK&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ancestor&lt;/td&gt;
&lt;td&gt;Qwen3.5-27B&lt;/td&gt;
&lt;td&gt;69.52%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Father&lt;/td&gt;
&lt;td&gt;Darwin-27B-Opus&lt;/td&gt;
&lt;td&gt;70.19%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mother&lt;/td&gt;
&lt;td&gt;Korean-specialized SFT&lt;/td&gt;
&lt;td&gt;74.74%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Child&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Darwin-27B-KR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.59%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The child surpasses both parents — winning &lt;strong&gt;7 out of 11&lt;/strong&gt; evaluation categories, with gains as large as +9.5 percentage points in Law and +7.6pp in Functional Language.&lt;/p&gt;

&lt;p&gt;This is textbook heterosis. The CMA-ES optimizer, given no prior instruction about our FFN/Attention decomposition theory, independently assigned &lt;strong&gt;93.3% of FFN weights from the mother&lt;/strong&gt; (Korean knowledge) while preserving &lt;strong&gt;93.2% of attention weights from the father&lt;/strong&gt; (reasoning capability). The algorithm arrived at the same conclusion we did — through pure fitness optimization.&lt;/p&gt;

&lt;p&gt;Two generations of zero-training evolution achieved &lt;strong&gt;+6.07 percentage points&lt;/strong&gt; over the original Qwen3.5-27B foundation model.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Economics of Evolutionary Breeding
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Darwin-27B-Opus&lt;/th&gt;
&lt;th&gt;Conventional Fine-Tuning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hardware&lt;/td&gt;
&lt;td&gt;H100 × 1&lt;/td&gt;
&lt;td&gt;H100 × 8–64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duration&lt;/td&gt;
&lt;td&gt;~2 hours&lt;/td&gt;
&lt;td&gt;Days to weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training tokens consumed&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;10⁶–10⁹&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gradient computation&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Full backpropagation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resulting model size&lt;/td&gt;
&lt;td&gt;Identical to parent&lt;/td&gt;
&lt;td&gt;Identical to parent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference overhead&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The bred offspring is architecturally indistinguishable from the original model. Same parameter count, same inference speed, same deployment footprint. The crossbreeding process leaves no computational trace in the final artifact.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Does Not Mean
&lt;/h2&gt;

&lt;p&gt;Intellectual honesty demands several caveats:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is not a universal method.&lt;/strong&gt; Evolutionary crossbreeding requires structurally compatible parents — matching hidden dimensions at minimum. It amplifies complementary strengths that already exist. It does not conjure novel capabilities from nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation methodology matters.&lt;/strong&gt; Our two-pass protocol with selective retry and adjudication differs from single-pass evaluation. For full transparency we report both the conservative greedy baseline (74.7%) and the enhanced score (86.9%). The original Qwen3.5-27B score of 85.5% was likely obtained under its own separately optimized conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmarks are not the whole story.&lt;/strong&gt; GPQA Diamond and CLIcK measure specific competencies. Performance on these evaluations does not guarantee uniform superiority across all downstream tasks. Comprehensive multi-benchmark evaluation is ongoing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Broader Implications
&lt;/h2&gt;

&lt;p&gt;If this result generalizes — and our experiments at 4B scale (Darwin-4B-Genesis, CLIcK 92%) and 27B scale suggest it does — then the open-source model ecosystem contains far more latent value than is currently being extracted.&lt;/p&gt;

&lt;p&gt;Every fine-tuned model on HuggingFace represents a unique optimization trajectory through parameter space. Each has acquired knowledge the others lack. Darwin provides a mechanism to systematically harvest and recombine this distributed knowledge, producing offspring that no single training run could have produced.&lt;/p&gt;

&lt;p&gt;The implications extend beyond benchmarks. If knowledge can be combined across models without retraining, then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model development becomes compositional.&lt;/strong&gt; Specialized experts can be bred rather than trained from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute requirements decrease dramatically.&lt;/strong&gt; Two hours on one GPU versus weeks on a cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The community becomes the training set.&lt;/strong&gt; Every model uploaded to HuggingFace is a potential parent for the next generation.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Available Models
&lt;/h2&gt;

&lt;p&gt;All Darwin models are released under &lt;strong&gt;Apache 2.0&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Highlight&lt;/th&gt;
&lt;th&gt;Link&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Darwin-27B-Opus&lt;/td&gt;
&lt;td&gt;GPQA 86.9%, World #5&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/FINAL-Bench/Darwin-27B-Opus" rel="noopener noreferrer"&gt;Model&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Darwin-27B-KR&lt;/td&gt;
&lt;td&gt;Korean hybrid vigor, CLIcK 75.59%&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/FINAL-Bench/Darwin-27B-KR" rel="noopener noreferrer"&gt;Model&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Darwin-4B-Genesis&lt;/td&gt;
&lt;td&gt;First cross-architecture FFN breeding&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/FINAL-Bench/Darwin-4B-Genesis" rel="noopener noreferrer"&gt;Model&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Darwin Family&lt;/td&gt;
&lt;td&gt;Complete collection (30+ models)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/collections/FINAL-Bench/darwin-family" rel="noopener noreferrer"&gt;Collection&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Within 10 days of public release: 300+ community derivatives and 120,000+ downloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;K-AI Leaderboard&lt;/strong&gt; — official Korean government-certified AI evaluation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MMLU-Pro and AIME 2025&lt;/strong&gt; — extending benchmark coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-architecture breeding&lt;/strong&gt; — Transformer × Mamba at 27B scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-generational recursion&lt;/strong&gt; — breeding children of children, tracking knowledge inheritance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research paper&lt;/strong&gt; — formal analysis of knowledge flow in evolutionary model breeding&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing Thought
&lt;/h2&gt;

&lt;p&gt;The field of large language models has been defined by a singular equation: performance scales with compute. More data, more parameters, more training. This has been the recipe for every breakthrough from GPT-3 to Qwen3.5.&lt;/p&gt;

&lt;p&gt;Darwin offers an alternative. The models we have already trained contain immense reservoirs of specialized knowledge. What they lack is not more knowledge, but better arrangement of existing knowledge. Evolutionary crossbreeding — selecting the optimal FFN layers from complementary parents at algorithmically determined ratios — achieves what continued training pursues, at a vanishing fraction of the cost.&lt;/p&gt;

&lt;p&gt;If foundation models are raw ore, Darwin is the forge.&lt;/p&gt;

&lt;p&gt;We are just getting started.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Darwin is developed by &lt;a href="https://huggingface.co/FINAL-Bench" rel="noopener noreferrer"&gt;VIDRAFT&lt;/a&gt;. All models are released under Apache 2.0.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://huggingface.co/collections/FINAL-Bench/darwin-family" rel="noopener noreferrer"&gt;Darwin Family Collection&lt;/a&gt; · &lt;a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard" rel="noopener noreferrer"&gt;FINAL Bench Leaderboard&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight bibtex"&gt;&lt;code&gt;&lt;span class="nc"&gt;@misc&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;vidraft_darwin_27b_opus_2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{Darwin-27B-Opus: Surpassing the Foundation Model Without Training}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;author&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{VIDRAFT}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;year&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{2026}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;publisher&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{Hugging Face}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;howpublished&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{\url{https://huggingface.co/FINAL-Bench/Darwin-27B-Opus}}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssm6q49pbikdiqly0rzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssm6q49pbikdiqly0rzc.png" alt="info (2)" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hrxt17nuijuzhtr7z7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hrxt17nuijuzhtr7z7c.png" alt="s2" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yqq5si2nziy1bwbnwge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yqq5si2nziy1bwbnwge.png" alt="s1" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxg1vx4bz3yrwu1mcdo8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxg1vx4bz3yrwu1mcdo8.png" alt="parent_comparison (1)" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>news</category>
    </item>
    <item>
      <title>They Merged Two AI Models — The Child Came Out Smarter Than Both Parents</title>
      <dc:creator>AI OpenFree</dc:creator>
      <pubDate>Tue, 31 Mar 2026 18:04:41 +0000</pubDate>
      <link>https://forem.com/ai_openfree_b23025ef075cf/they-merged-two-ai-models-the-child-came-out-smarter-than-both-parents-jo</link>
      <guid>https://forem.com/ai_openfree_b23025ef075cf/they-merged-two-ai-models-the-child-came-out-smarter-than-both-parents-jo</guid>
      <description>&lt;p&gt;Darwin-35B-A3B-Opus&lt;br&gt;
Darwin-35B-A3B-Opus&lt;/p&gt;

&lt;p&gt;Model Space FINAL Bench ALL Bench&lt;/p&gt;

&lt;p&gt;"The child surpassed both parents — that is evolution."&lt;/p&gt;

&lt;p&gt;TL;DR: 35B MoE (3B active) | GPQA Diamond 90.0% (vs Father 84.2% &amp;amp; Mother 85.0%) | MMMLU 85.0% | Multimodal ✅ | 201 Languages | 262K Context | 147.8 tok/s | Apache 2.0&lt;/p&gt;

&lt;p&gt;Table of Contents&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why Darwin — The Child That Surpassed Both Parents&lt;/li&gt;
&lt;li&gt;Model Overview&lt;/li&gt;
&lt;li&gt;Parent Models&lt;/li&gt;
&lt;li&gt;Darwin V5 — Beyond Simple Merging&lt;/li&gt;
&lt;li&gt;Model MRI Scans — Parent Neural Anatomy&lt;/li&gt;
&lt;li&gt;Child Model Health Check — MRI Verification&lt;/li&gt;
&lt;li&gt;Inherited Capabilities&lt;/li&gt;
&lt;li&gt;Father's Official Benchmarks (Reference)&lt;/li&gt;
&lt;li&gt;Performance &amp;amp; Hardware Requirements&lt;/li&gt;
&lt;li&gt;Model Specifications&lt;/li&gt;
&lt;li&gt;Usage&lt;/li&gt;
&lt;li&gt;Built By&lt;/li&gt;
&lt;li&gt;FAQ&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Why Darwin — The Child That Surpassed Both Parents&lt;/h2&gt;

&lt;p&gt;There is a fundamental question at the heart of AI model merging: if the parent models already exist, why crossbreed at all?&lt;/p&gt;

&lt;p&gt;This model is the answer.&lt;/p&gt;

&lt;h3&gt;Benchmark Results: GPQA Diamond (198 Questions, Graduate-Level Reasoning)&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Multimodal&lt;/th&gt;
&lt;th&gt;Benchmark Published&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🧬 Darwin-35B-A3B-Opus (Child)&lt;/td&gt;
&lt;td&gt;90.0%&lt;/td&gt;
&lt;td&gt;✅ Image/Video&lt;/td&gt;
&lt;td&gt;✅ Fully Open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;👩 Mother — Jackrong Claude 4.6 Opus Distilled&lt;/td&gt;
&lt;td&gt;85.0%&lt;/td&gt;
&lt;td&gt;❌ Text-only&lt;/td&gt;
&lt;td&gt;❌ Not Published&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;👨 Father — Qwen3.5-35B-A3B (Official)&lt;/td&gt;
&lt;td&gt;84.2%&lt;/td&gt;
&lt;td&gt;✅ Image/Video&lt;/td&gt;
&lt;td&gt;✅ Official&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Evaluation: SGLang, context 32768, temperature 0, greedy decoding, official GPQA prompt format ("ANSWER: LETTER").&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;MMMLU (Multilingual Knowledge, 29 Languages)&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🧬 Darwin-35B-A3B-Opus (Child)&lt;/td&gt;
&lt;td&gt;85.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;👨 Father — Qwen3.5-35B-A3B (Official)&lt;/td&gt;
&lt;td&gt;85.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Darwin preserves Father-level multilingual knowledge while achieving decisively superior reasoning: the child outperformed both parents in reasoning and matched the Father in multilingual knowledge.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPQA vs Father: +6.9% relative improvement ((90.0 − 84.2) / 84.2)&lt;/li&gt;
&lt;li&gt;GPQA vs Mother: +5.9% relative improvement ((90.0 − 85.0) / 85.0)&lt;/li&gt;
&lt;li&gt;MMMLU: 85.0% — Father-level (85.2%) multilingual knowledge preserved&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Why Not Simply Use the Mother?&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Mother (Claude Distilled)&lt;/th&gt;
&lt;th&gt;Darwin (Child)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;Strong (85.0%)&lt;/td&gt;
&lt;td&gt;Stronger (90.0%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image/Video&lt;/td&gt;
&lt;td&gt;❌ Lost during text-only fine-tuning&lt;/td&gt;
&lt;td&gt;✅ Inherited from Father&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;201 Languages&lt;/td&gt;
&lt;td&gt;❌ Potentially degraded&lt;/td&gt;
&lt;td&gt;✅ Inherited from Father&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;262K Context&lt;/td&gt;
&lt;td&gt;Unverified&lt;/td&gt;
&lt;td&gt;✅ Father's architecture preserved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benchmark Transparency&lt;/td&gt;
&lt;td&gt;❌ No scores published&lt;/td&gt;
&lt;td&gt;✅ Fully open&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;Why Not Simply Use the Father?&lt;/h3&gt;

&lt;p&gt;The Father (Qwen3.5-35B-A3B) excels in versatility but plateaus at 84.2% on hard reasoning tasks. Darwin pushes reasoning to 90.0% while retaining Father-level multilingual knowledge (MMMLU 85.0% vs 85.2%) along with all general-purpose capabilities.&lt;/p&gt;

&lt;p&gt;Bottom line: Darwin is the only model that exceeds the Mother's reasoning, preserves the Father's multilingual knowledge, and retains full multimodal capability — all at once.&lt;/p&gt;

&lt;h2&gt;Model Overview&lt;/h2&gt;

&lt;p&gt;Darwin-35B-A3B-Opus is a next-generation reasoning-enhanced language model produced by VIDRAFT's Darwin V5 evolution engine.&lt;/p&gt;

&lt;p&gt;Darwin V5 fuses two key innovations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Evolutionary Merge&lt;/strong&gt; — applies natural selection to automatically discover optimal weight combinations across generations of candidates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model MRI Integration&lt;/strong&gt; — CT-scans each parent model layer by layer before merging, steering the evolutionary process with structural insight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If conventional merging is "mixing ingredients blindfolded," Darwin V5 is "precision surgery under X-ray guidance."&lt;/p&gt;

&lt;h2&gt;Parent Models&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Strengths&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;👨 Father&lt;/td&gt;
&lt;td&gt;Qwen/Qwen3.5-35B-A3B&lt;/td&gt;
&lt;td&gt;General knowledge, multimodal (image/video), coding, agents, 201 languages, 262K context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;👩 Mother&lt;/td&gt;
&lt;td&gt;Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled&lt;/td&gt;
&lt;td&gt;Claude 4.6 Opus CoT distillation, structured step-by-step reasoning, coding agent compatibility&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Darwin V5 — Beyond Simple Merging&lt;/h2&gt;

&lt;h3&gt;The Limitations of Conventional Merging&lt;/h3&gt;

&lt;p&gt;Traditional model merging requires humans to set hyperparameters — ratio, density, and the like — by intuition. You pick ratio=0.5, density=0.9, run the merge once, and hope for the best. The outcome hinges on luck, and applying a single ratio uniformly across billions of parameters ignores the distinct role each layer plays.&lt;/p&gt;

&lt;h3&gt;Darwin V4's Breakthrough&lt;/h3&gt;

&lt;p&gt;Darwin V4 addressed this with evolutionary algorithms — automatically exploring hundreds of parameter combinations and selecting survivors based on real benchmark scores. Yet V4 was still blind evolution: it had no understanding of what each layer actually does.&lt;/p&gt;

&lt;h3&gt;Darwin V5: Model MRI Opens the Eyes&lt;/h3&gt;

&lt;p&gt;V5 integrates Model MRI — a neural anatomy analyzer — to give the evolutionary process "sight":&lt;/p&gt;

&lt;p&gt;[Phase 0] Model MRI — CT-scan both parents, layer by layer&lt;br&gt;
    ↓  "Father's layers 15–25 concentrate multilingual knowledge"&lt;br&gt;
    ↓  "Mother's layers 30–40 concentrate reasoning patterns"&lt;br&gt;
    ↓&lt;br&gt;
[Phase 1] MRI-Guided Evolution — Begin from a scan-informed initial genome&lt;br&gt;
    ↓  Not random, but "initialized from CT findings"&lt;br&gt;
    ↓&lt;br&gt;
[Phase 2] mergekit real merge + benchmark-driven fitness selection&lt;br&gt;
    ↓  Faster convergence within the MRI-narrowed search space&lt;br&gt;
    ↓&lt;br&gt;
[Phase 3] MRI Health Check — CT-scan the child model&lt;br&gt;
    ↓  Detect interference and function loss&lt;br&gt;
    ↓  Prescribe layer-specific ratio adjustments&lt;br&gt;
    ↓&lt;br&gt;
[Final] Darwin-35B-A3B-Opus&lt;/p&gt;
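The Phase 1–2 loop (mutate a genome, run a real merge, score it on benchmarks, keep the fittest) can be sketched as a toy hill-climbing search. Everything here is illustrative: `fitness` is a synthetic stand-in for the expensive merge-plus-benchmark step, and the hyperparameter names are invented.

```python
import random

def evolve(fitness, init_genome, generations=5, pop_size=8, sigma=0.05, seed=0):
    """Toy evolutionary search over a merge genome (dict of floats in [0, 1]).

    `fitness` stands in for the real step: merge with these hyperparameters
    via mergekit, then benchmark the child model.
    """
    rng = random.Random(seed)
    best = dict(init_genome)
    best_score = fitness(best)
    for _ in range(generations):
        for _ in range(pop_size):
            # Mutate the current survivor with small Gaussian noise
            child = {k: min(1.0, max(0.0, v + rng.gauss(0, sigma)))
                     for k, v in best.items()}
            score = fitness(child)
            if score > best_score:  # selection: only a fitter child survives
                best, best_score = child, score
    return best, best_score

# Synthetic fitness peaking at ratio=0.8, density=0.8 (hypothetical optimum)
def fitness(g):
    return 1.0 - abs(g["ratio"] - 0.8) - abs(g["density"] - 0.8)

genome, score = evolve(fitness, {"ratio": 0.5, "density": 0.9})
```

An MRI-guided variant would differ only in `init_genome`: instead of a random or hand-picked starting point, Phase 0's scan supplies it.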

&lt;p&gt;V4 vs V5 at a Glance&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;Darwin V4&lt;/th&gt;&lt;th&gt;Darwin V5&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Analogy&lt;/td&gt;&lt;td&gt;Mixing ingredients blindfolded&lt;/td&gt;&lt;td&gt;Precision surgery under X-ray&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Initial genome&lt;/td&gt;&lt;td&gt;Random&lt;/td&gt;&lt;td&gt;MRI-guided&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Layer control&lt;/td&gt;&lt;td&gt;2 ratios (attn/ffn)&lt;/td&gt;&lt;td&gt;40 layers independently&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Pre-diagnosis&lt;/td&gt;&lt;td&gt;❌ None&lt;/td&gt;&lt;td&gt;✅ Phase 0 MRI scan&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Post-verification&lt;/td&gt;&lt;td&gt;Benchmark only&lt;/td&gt;&lt;td&gt;✅ Phase 3 health check&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Search efficiency&lt;/td&gt;&lt;td&gt;Broad, unfocused&lt;/td&gt;&lt;td&gt;Narrowed, guided search&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Failure diagnosis&lt;/td&gt;&lt;td&gt;Unknown "why"&lt;/td&gt;&lt;td&gt;Pinpoints the failing layer&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Darwin V4: Discovered Parameters (Blind Evolution)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Parameter&lt;/th&gt;&lt;th&gt;Value&lt;/th&gt;&lt;th&gt;Interpretation&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;ratio&lt;/td&gt;&lt;td&gt;0.481&lt;/td&gt;&lt;td&gt;Father 52% : Mother 48% — asymmetric blend&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;density_a&lt;/td&gt;&lt;td&gt;0.855&lt;/td&gt;&lt;td&gt;85.5% of Father's weights selected&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;density_b&lt;/td&gt;&lt;td&gt;0.971&lt;/td&gt;&lt;td&gt;97.1% of Mother's weights adopted&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;attn&lt;/td&gt;&lt;td&gt;0.168&lt;/td&gt;&lt;td&gt;Only 16.8% modification in attention layers&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;ffn&lt;/td&gt;&lt;td&gt;0.841&lt;/td&gt;&lt;td&gt;84.1% modification in FFN layers&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What this means: Attention patterns (determining what to focus on) are almost entirely preserved from the Father, while FFN layers (the knowledge store) are largely overwritten with the Mother's reasoning patterns.&lt;/p&gt;

&lt;p&gt;Discovering attn=0.168 alongside ffn=0.841 — this extreme asymmetry — is virtually impossible to arrive at through human intuition.&lt;/p&gt;
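Applying separate ratios to attention and FFN parameters, as in the V4-discovered genome, is a small refinement of the naive merge. This sketch covers only the ratio part and ignores density-based sparsification (which a TIES-style merge would add); the parameter names are invented, not the real Qwen3.5 module names:

```python
def typed_merge(state_a, state_b, attn_ratio=0.168, ffn_ratio=0.841):
    """Blend per parameter type: small ratio keeps A, large ratio adopts B."""
    merged = {}
    for name in state_a:
        r = attn_ratio if "attn" in name else ffn_ratio
        merged[name] = (1.0 - r) * state_a[name] + r * state_b[name]
    return merged

father = {"l0.attn.w": 1.0, "l0.ffn.w": 1.0}
mother = {"l0.attn.w": 0.0, "l0.ffn.w": 0.0}

m = typed_merge(father, mother)
# attention stays close to the Father (0.832); FFN moves close to the Mother (0.159)
```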

&lt;p&gt;Darwin V5: The MRI-Guided Merge Recipe&lt;br&gt;
After scanning both parents, Model MRI prescribed a fundamentally different recipe:&lt;/p&gt;

&lt;p&gt;MRI-Guided Genome&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Parameter&lt;/th&gt;&lt;th&gt;V4 (Blind)&lt;/th&gt;&lt;th&gt;V5 (MRI)&lt;/th&gt;&lt;th&gt;Shift&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;global_ratio&lt;/td&gt;&lt;td&gt;0.481&lt;/td&gt;&lt;td&gt;0.800&lt;/td&gt;&lt;td&gt;Mother weight ↑↑&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;attn_ratio&lt;/td&gt;&lt;td&gt;0.168&lt;/td&gt;&lt;td&gt;0.320&lt;/td&gt;&lt;td&gt;Attention also shifts toward Mother&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;ffn_ratio&lt;/td&gt;&lt;td&gt;0.841&lt;/td&gt;&lt;td&gt;0.590&lt;/td&gt;&lt;td&gt;FFN becomes more conservative&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;density_a&lt;/td&gt;&lt;td&gt;0.855&lt;/td&gt;&lt;td&gt;0.799&lt;/td&gt;&lt;td&gt;Similar&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;density_b&lt;/td&gt;&lt;td&gt;0.971&lt;/td&gt;&lt;td&gt;0.799&lt;/td&gt;&lt;td&gt;Mother density ↓ (Dead Expert compensation)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: MRI prescribed "draw more heavily from the Mother (ratio 0.8), but reduce density (0.799) because 50–65% of her experts are dead." V4, searching blindly, landed on ratio=0.481 — the opposite direction entirely.&lt;/p&gt;

&lt;p&gt;Layer-Wise Merge Strategy (3 Surgical Blocks)&lt;br&gt;
MRI did not prescribe uniform ratios. Instead, it partitioned all 40 layers into 3 distinct blocks:&lt;/p&gt;

&lt;p&gt;Merge Ratio + Parent Importance + MoE Health per Layer&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Block&lt;/th&gt;&lt;th&gt;Layers&lt;/th&gt;&lt;th&gt;t (Mother %)&lt;/th&gt;&lt;th&gt;Router Source&lt;/th&gt;&lt;th&gt;Rationale&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Block 1&lt;/td&gt;&lt;td&gt;L0–L37&lt;/td&gt;&lt;td&gt;59.9%&lt;/td&gt;&lt;td&gt;Mother&lt;/td&gt;&lt;td&gt;Reasoning pattern injection across the bulk of the network&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Block 2&lt;/td&gt;&lt;td&gt;L38&lt;/td&gt;&lt;td&gt;90.0%&lt;/td&gt;&lt;td&gt;Mother&lt;/td&gt;&lt;td&gt;Golden Layer — the Mother's core reasoning engine&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Block 3&lt;/td&gt;&lt;td&gt;L39&lt;/td&gt;&lt;td&gt;53.4%&lt;/td&gt;&lt;td&gt;Father&lt;/td&gt;&lt;td&gt;Output layer — Father's router preserves multimodal routing&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;L38 is the "Golden Layer": The Mother's MRI revealed peak cosine distance at L34–L38 (see Mother MRI below). Darwin V5 responded by assigning t=0.9 to L38 — transplanting the Mother's reasoning engine nearly in its entirety.&lt;/p&gt;
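The three blocks translate into a per-layer interpolation weight t (the Mother fraction). A minimal sketch of that schedule, using the t values quoted in this post:

```python
def layer_schedule(n_layers=40, golden=38,
                   t_bulk=0.5988, t_golden=0.9, t_out=0.5336):
    """Per-layer Mother fraction t for a 40-layer merge, in 3 blocks."""
    t = {}
    for layer in range(n_layers):
        if layer == golden:
            t[layer] = t_golden       # "Golden Layer" reasoning core
        elif layer == n_layers - 1:
            t[layer] = t_out          # output layer, Father's router
        else:
            t[layer] = t_bulk         # bulk of the network
    return t

sched = layer_schedule()
# sched[38] -> 0.9, sched[39] -> 0.5336, everything else -> 0.5988
```

In a real mergekit config this would appear as a layer-wise gradient of interpolation parameters rather than a Python dict; the dict form just makes the 3-block structure explicit.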

&lt;ol&gt;
&lt;li&gt;Model MRI Scans — Parent Neural Anatomy
Mother MRI: Claude 4.6 Opus Distilled
Mother Probe Cosine Distance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Probe-wise Layer Importance: Layers L34–L38 light up in intense red (high cosine distance) across the REASONING, CODE, and LOGIC probes — this is the Mother's reasoning engine.&lt;/p&gt;

&lt;p&gt;Mother MoE Health&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Status&lt;/th&gt;&lt;th&gt;Interpretation&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Router Entropy&lt;/td&gt;&lt;td&gt;✅ ~1.0 across all layers&lt;/td&gt;&lt;td&gt;Healthy — experts are evenly distributed&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Dead Expert %&lt;/td&gt;&lt;td&gt;🔴 50–65%&lt;/td&gt;&lt;td&gt;Critical — Claude distillation killed half the experts&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Expert Similarity&lt;/td&gt;&lt;td&gt;✅ 0.001–0.008&lt;/td&gt;&lt;td&gt;Healthy — surviving experts remain diverse&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A Dead Expert rate of 50–65% is the telltale fingerprint of Claude's text-only distillation. The fine-tuning process silenced multimodal and multilingual experts that were never activated during text-only training.&lt;/p&gt;
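A natural way to operationalize "dead expert": an expert the router never (or almost never) selects over a probe set. This sketch uses invented activation counts; the actual Model MRI criterion is not published:

```python
def dead_expert_pct(activation_counts, threshold=0):
    """Share of experts whose activation count never exceeds `threshold`."""
    dead = sum(1 for c in activation_counts if c <= threshold)
    return 100.0 * dead / len(activation_counts)

# 256 experts per layer; suppose 150 were never routed to during probing
counts = [0] * 150 + [42] * 106
pct = dead_expert_pct(counts)   # ~58.6%, inside the reported 50-65% band
```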

&lt;p&gt;Mother Expert Utilization Heatmap&lt;/p&gt;

&lt;p&gt;Expert Utilization Heatmap: The map is predominantly dark (inactive), with only sparse bright activations — the Claude reasoning pattern is concentrated in a small cluster of specialized experts.&lt;/p&gt;

&lt;p&gt;Father MRI: A Healthy Generalist (The Organ Donor)&lt;br&gt;
Father MoE Health&lt;/p&gt;

&lt;p&gt;Father Expert Utilization Heatmap&lt;/p&gt;

&lt;p&gt;Father Layer Importance by Probe&lt;/p&gt;

&lt;p&gt;The Father (Qwen3.5-35B-A3B) exhibits healthy, uniform expert activation across all 40 layers — a well-balanced generalist with every expert alive and contributing. He serves as the "organ donor" whose living experts revive the 50–65% of the Mother's experts that died during distillation.&lt;/p&gt;

&lt;p&gt;Parent Comparison: Layer Advantage Map&lt;br&gt;
Parent A vs B Layer Advantage&lt;/p&gt;

&lt;p&gt;Above zero (↑ A): Father is stronger — primarily L0–L5 (embedding and early layers)&lt;br&gt;
Below zero (↓ B): Mother is stronger — scattered but consistent from L5 through L35&lt;br&gt;
L34–L38: Mother shows her strongest advantage on the REASONING and CODE probes&lt;br&gt;
L39: Father recovers — the output layer favors Father's multimodal routing&lt;br&gt;
This advantage map directly informed the 3-block merge recipe: Mother dominates L0–L38, Father reclaims L39.&lt;/p&gt;
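The advantage map itself is a simple quantity: per-layer probe importance of Parent A (Father) minus Parent B (Mother), positive favoring A. The numbers below are illustrative, not the actual scan values:

```python
def advantage_map(importance_a, importance_b):
    """Per-layer advantage: > 0 favors parent A, < 0 favors parent B."""
    return [a - b for a, b in zip(importance_a, importance_b)]

# Toy 5-layer profile shaped like the chart: A strong early, B strong late
father_imp = [0.50, 0.40, 0.10, 0.12, 0.45]
mother_imp = [0.35, 0.30, 0.25, 0.60, 0.20]

adv = advantage_map(father_imp, mother_imp)
# adv[0] > 0: Father stronger early; adv[3] < 0: Mother stronger there
```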

&lt;p&gt;How GPQA 90% Was Achieved&lt;br&gt;
Mother L34–L38: reasoning engine (MRI red zone)&lt;br&gt;
    ↓ t=0.9 — transplanted nearly in full&lt;br&gt;
    +&lt;br&gt;
Father L39: output router (multimodal/multilingual expert activation)&lt;br&gt;
    ↓ t=0.53 — Father's routing preserved&lt;br&gt;
    +&lt;br&gt;
Dead Expert replacement → Father's living experts fill the Mother's dead slots&lt;br&gt;
    ↓&lt;br&gt;
= GPQA 90.0% (surpassing both parents)&lt;/p&gt;

&lt;p&gt;The Mother's "reasoning brain" was transplanted while her dead experts were replaced with the Father's living counterparts. Reasoning went up; versatility stayed intact.&lt;/p&gt;

&lt;p&gt;Evolution History&lt;br&gt;
Phase 1 → Phase 2 evolution complete&lt;br&gt;
Final real_score: 0.8405&lt;br&gt;
Merge time: 181.6 seconds&lt;br&gt;
Merge commit: 109838c2&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Child Model Health Check — MRI Verification
Darwin Health Check — Child vs Parents&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;✅ Verdict: Healthy — No issues detected.&lt;/p&gt;

&lt;p&gt;The chart above plots the layer-by-layer importance of the child (Darwin, green bars) against both parents (Father = blue dashed, Mother = red dashed). Key findings:&lt;/p&gt;

&lt;p&gt;Layer 0 (Embedding): The child's importance spikes to 0.42 — both parents exhibit similar peaks (~0.35–0.50). The child has successfully inherited the critical embedding layer from both parents with no interference.&lt;/p&gt;

&lt;p&gt;Layers 1–33 (Middle): Near-zero importance across all three models. This is expected — middle layers in MoE architectures process information incrementally, with no single layer acting as a bottleneck. The child tracks both parents precisely, confirming zero function loss across the bulk of the network.&lt;/p&gt;

&lt;p&gt;Layers 34–39 (Reasoning Engine): Importance rises sharply. This is the exact region where the Mother's MRI revealed intense reasoning activity (cosine distance &amp;gt; 0.6). The child's green bars match or exceed both parents — demonstrating that the Mother's reasoning patterns were successfully transplanted while the Father's output routing was preserved.&lt;/p&gt;

&lt;p&gt;Layer 39 (Output): The child peaks at ~0.48, closely tracking both parents. The final output layer is intact.&lt;/p&gt;
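The layer-by-layer comparison amounts to two tests per layer: flag interference when the child's importance far exceeds both parents, and function loss when both parents are important but the child collapses toward zero. This is a toy rendering with made-up thresholds and importance vectors, not the actual Model MRI check:

```python
def health_check(child, father, mother, boost=1.5, floor=0.05):
    """Return (layer, issue) pairs; an empty list means a healthy merge."""
    issues = []
    for i, (c, f, m) in enumerate(zip(child, father, mother)):
        if max(f, m) > floor and c > boost * max(f, m):
            issues.append((i, "interference"))    # child >> both parents
        if min(f, m) > 2 * floor and c < floor:
            issues.append((i, "function_loss"))   # parents high, child ~0
    return issues

# Toy profiles shaped like the chart: embedding spike, quiet middle,
# reasoning-engine rise at the end
child  = [0.42, 0.01, 0.30, 0.48]
father = [0.40, 0.01, 0.25, 0.50]
mother = [0.45, 0.02, 0.35, 0.45]

issues = health_check(child, father, mother)   # no issues detected
```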

&lt;p&gt;Why This Matters&lt;br&gt;
The MRI health check confirms three critical outcomes:&lt;/p&gt;

&lt;p&gt;No interference — There is no layer where the child's importance abnormally exceeds the parents' (which would signal weight conflict)&lt;br&gt;
No function loss — There is no layer where the parents had high importance but the child collapsed to zero&lt;br&gt;
Successful transplant — The L34–L39 reasoning engine from the Mother is fully operational in the child&lt;/p&gt;

&lt;p&gt;Darwin V5 MRI-Guided Merge Recipe&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# MRI-guided layer-wise merge (3 blocks)
# Genome: ratio=0.800 attn=0.320 ffn=0.590 density=0.799
L0–L37:  t=0.5988 (Mother 60%) — router from Mother
L38:     t=0.9000 (Mother 90%) — "Golden Layer" reasoning core
L39:     t=0.5336 (Father 47%) — router from Father (output routing)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Insight&lt;/th&gt;&lt;th&gt;Detail&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;L38 = "Golden Layer"&lt;/td&gt;&lt;td&gt;MRI identified L34–L38 as the Mother's reasoning core. Darwin assigned t=0.9 (90% Mother) to L38 specifically&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Router Strategy: B→B→A&lt;/td&gt;&lt;td&gt;Mother's router for the reasoning layers, Father's router for the final output — preserving both the reasoning pathways and multimodal routing&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Dead Expert Revival&lt;/td&gt;&lt;td&gt;The Mother's 50–65% dead experts (killed during text-only fine-tuning) were replaced with the Father's living experts — restoring multimodal and multilingual capabilities&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;📄 The full algorithm and technical details of the Darwin V5 evolution engine will be released alongside an upcoming paper.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inherited Capabilities
From the Father (Qwen3.5-35B-A3B)
Multimodal: Image and video understanding
201 Languages: Global linguistic coverage
262K Context: Native long-context support (extendable to 1M via YaRN)
Gated DeltaNet + MoE: Efficient hybrid architecture
Multi-Token Prediction: Improved inference throughput
From the Mother (Claude 4.6 Opus Distilled)
Structured Thinking: Systematic step-by-step reasoning inside the model's thinking tags
Efficient Reasoning: "Let me analyze this request carefully: 1… 2… 3…" pattern
Coding Agent Compatibility: Native "developer" role support for Claude Code and OpenCode
Tool Calling Stability: Consistent performance in tool-use scenarios
Autonomous Execution: Extended autonomous operation in agentic environments&lt;/li&gt;
&lt;li&gt;Father's Official Benchmarks (Reference)
Darwin is built on this architecture with enhanced reasoning:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Category&lt;/th&gt;&lt;th&gt;Benchmark&lt;/th&gt;&lt;th&gt;Father Official&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Knowledge&lt;/td&gt;&lt;td&gt;MMLU-Pro&lt;/td&gt;&lt;td&gt;85.3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Knowledge&lt;/td&gt;&lt;td&gt;MMLU-Redux&lt;/td&gt;&lt;td&gt;93.3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Reasoning&lt;/td&gt;&lt;td&gt;GPQA Diamond&lt;/td&gt;&lt;td&gt;84.2&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Reasoning&lt;/td&gt;&lt;td&gt;HLE w/ CoT&lt;/td&gt;&lt;td&gt;22.4&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Math&lt;/td&gt;&lt;td&gt;HMMT Feb 2025&lt;/td&gt;&lt;td&gt;89.0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Coding&lt;/td&gt;&lt;td&gt;SWE-bench Verified&lt;/td&gt;&lt;td&gt;69.2&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Coding&lt;/td&gt;&lt;td&gt;LiveCodeBench v6&lt;/td&gt;&lt;td&gt;74.6&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Agent&lt;/td&gt;&lt;td&gt;TAU2-Bench&lt;/td&gt;&lt;td&gt;81.2&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Agent&lt;/td&gt;&lt;td&gt;BFCL-V4 (Tool Use)&lt;/td&gt;&lt;td&gt;67.3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Instruction&lt;/td&gt;&lt;td&gt;IFEval&lt;/td&gt;&lt;td&gt;91.9&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Multilingual&lt;/td&gt;&lt;td&gt;MMMLU&lt;/td&gt;&lt;td&gt;85.2&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Agentic Search&lt;/td&gt;&lt;td&gt;BrowseComp&lt;/td&gt;&lt;td&gt;61.0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Performance &amp;amp; Hardware Requirements&lt;/p&gt;

&lt;p&gt;Inference Speed&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Value&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Generation Speed&lt;/td&gt;&lt;td&gt;147.8 tok/s&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Environment&lt;/td&gt;&lt;td&gt;Single NVIDIA H100 93GB NVL, SGLang, BF16&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Qwen Official API&lt;/td&gt;&lt;td&gt;162.8 tok/s (Alibaba Cloud)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Hardware Requirements&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Setup&lt;/th&gt;&lt;th&gt;VRAM&lt;/th&gt;&lt;th&gt;Status&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;BF16 (Full Precision)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;65.5 GiB&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Single H100 93GB NVL&lt;/td&gt;&lt;td&gt;93 GB&lt;/td&gt;&lt;td&gt;✅ Comfortable&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Single A100 80GB&lt;/td&gt;&lt;td&gt;80 GB&lt;/td&gt;&lt;td&gt;⚠️ Tight&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Single A100 40GB&lt;/td&gt;&lt;td&gt;40 GB&lt;/td&gt;&lt;td&gt;❌ Insufficient&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Q8 Quantized&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;~35 GiB&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Single A100 40GB&lt;/td&gt;&lt;td&gt;40 GB&lt;/td&gt;&lt;td&gt;✅ Feasible&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Q4_K_M Quantized&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;~18 GiB&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Single RTX 4090 24GB&lt;/td&gt;&lt;td&gt;24 GB&lt;/td&gt;&lt;td&gt;✅ Comfortable&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2× RTX 4090 (tp=2)&lt;/td&gt;&lt;td&gt;48 GB&lt;/td&gt;&lt;td&gt;✅ BF16 feasible&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As a Mixture-of-Experts model, only 3B parameters are active per token despite loading the full 35B. This sparsity means quantization has minimal impact on output quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Model Specifications&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Architecture&lt;/td&gt;&lt;td&gt;Qwen3.5 MoE (Gated DeltaNet + MoE)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Total Parameters&lt;/td&gt;&lt;td&gt;35B&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Active Parameters&lt;/td&gt;&lt;td&gt;3B per forward pass&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Hidden Dimension&lt;/td&gt;&lt;td&gt;2,048&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Layers&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Layer Layout&lt;/td&gt;&lt;td&gt;10 × (3 × GDN→MoE + 1 × Attention→MoE)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Experts&lt;/td&gt;&lt;td&gt;256 (8 routed + 1 shared active)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Expert Intermediate Dim&lt;/td&gt;&lt;td&gt;512&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Context Length&lt;/td&gt;&lt;td&gt;262,144 native (up to 1,010,000 via YaRN)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Languages&lt;/td&gt;&lt;td&gt;201&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Multimodal&lt;/td&gt;&lt;td&gt;✅ Image &amp;amp; Video input&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;License&lt;/td&gt;&lt;td&gt;Apache 2.0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Engine&lt;/td&gt;&lt;td&gt;Darwin V5 (Evolutionary Merge + Model MRI)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Evolution Phase&lt;/td&gt;&lt;td&gt;Phase 2, real_score 0.8405&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Merge Commit&lt;/td&gt;&lt;td&gt;109838c2&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Usage&lt;/p&gt;

&lt;p&gt;SGLang (Recommended)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python -m sglang.launch_server \
  --model-path FINAL-Bench/Darwin-35B-A3B-Opus \
  --tp 1 \
  --mem-fraction-static 0.90 \
  --context-length 32768 \
  --trust-remote-code
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;/ol&gt;
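The VRAM figures in the hardware table above are close to what a weights-only estimate predicts. This back-of-envelope sketch ignores KV cache and activation memory (so it is a lower bound), and the 4.5 bits/weight figure for Q4_K_M is an approximation, not an exact spec:

```python
# Approximate bytes per weight for each storage format
BYTES_PER_WEIGHT = {"bf16": 2.0, "q8": 1.0, "q4_k_m": 4.5 / 8}

def weight_gib(n_params_billion, fmt):
    """Weights-only memory footprint in GiB."""
    return n_params_billion * 1e9 * BYTES_PER_WEIGHT[fmt] / 2**30

bf16 = weight_gib(35, "bf16")     # ~65.2 GiB, near the 65.5 GiB listed
q8   = weight_gib(35, "q8")       # ~32.6 GiB, near the ~35 GiB listed
q4   = weight_gib(35, "q4_k_m")   # ~18.3 GiB, near the ~18 GiB listed
```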

&lt;p&gt;vLLM&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve FINAL-Bench/Darwin-35B-A3B-Opus \
  --trust-remote-code \
  --enforce-eager
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Transformers&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-35B-A3B-Opus", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-35B-A3B-Opus",
    dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Best Practices&lt;br&gt;
Use context ≥ 32K for reasoning tasks — the model leverages extended thinking&lt;br&gt;
For maximum reasoning quality, use thinking mode (default) with generous max_tokens (≥ 16384)&lt;br&gt;
The model emits explicit thinking blocks for internal reasoning; extract the final answer after the thinking block closes&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Built By
Developer   VIDRAFT
Evolution Engine    Darwin V5 (Evolutionary Merge + Model MRI)
Infrastructure  4 × NVIDIA H100 93GB NVL GPU
Merge Time  181.6 seconds
Shard Distribution  14 shards → GPU [1, 2, 3] round-robin
Acknowledgements
Korean Government — This research was supported by the Korean Government's 'GPU Support Program' research grant
Qwen Team — Qwen3.5-35B-A3B base architecture
Jackrong — Claude 4.6 Opus Reasoning Distilled model
nohurry, TeichAI — Distillation datasets
Citation
&lt;a class="mentioned-user" href="https://dev.to/misc"&gt;@misc&lt;/a&gt;{vidraft_darwin_35b_opus,
title        = {Darwin-35B-A3B-Opus: MRI-Guided Evolutionary Merge Beyond Both Parents},
author       = {VIDRAFT},
year         = {2026},
publisher    = {Hugging Face},
howpublished = {\url{&lt;a href="https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus" rel="noopener noreferrer"&gt;https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus&lt;/a&gt;}}
}&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Contact&lt;br&gt;
📧 &lt;a href="mailto:kkms1116@koreacu.ac.kr"&gt;kkms1116@koreacu.ac.kr&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;FAQ
What is Darwin-35B-A3B-Opus?
How does Darwin V5 differ from simple model merging?
What GPU do I need to run this model?
Does it support multimodal inputs (images/video)?
What languages does it support?
What is Model MRI?
What are "Dead Experts" and why do they matter?
Is this model open source?
#DarwinAI #EvolutionaryMerge #ModelMRI #DarwinV5 #GPQA90 #Qwen35 #MoE3B #Reasoning #Multimodal #201Languages #OpenSource #Apache2 #VIDRAFT #NaturalSelection #LayerWiseMerge #ClaudeOpus #ThinkingModel #CodingAgent #LongContext262K #BestOpenSourceLLM2026 #DeadExpertRevival #GoldenLayer #MoEMerge #NeuralAnatomy&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>World Models Can Render Anything. But Can They Think?</title>
      <dc:creator>AI OpenFree</dc:creator>
      <pubDate>Mon, 30 Mar 2026 05:24:45 +0000</pubDate>
      <link>https://forem.com/ai_openfree_b23025ef075cf/world-models-can-render-anything-but-can-they-think-3i9b</link>
      <guid>https://forem.com/ai_openfree_b23025ef075cf/world-models-can-render-anything-but-can-they-think-3i9b</guid>
      <description>&lt;h1&gt;
  
  
  Introducing WM Bench: A Benchmark for Cognitive Intelligence in World Models
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;FINAL Bench Family · March 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The field of world models has made remarkable progress. From NVIDIA Cosmos to Meta V-JEPA 2, from DeepMind Genie 3 to Physical Intelligence π0, the pace of development is extraordinary.&lt;/p&gt;

&lt;p&gt;Yet a question remains largely unanswered:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How do we measure whether a world model actually understands what is happening — not just renders it convincingly?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;FID tells us a model's output looks realistic. FVD tells us its videos flow naturally. HumanML3D and BABEL tell us its motions are human-like.&lt;/p&gt;

&lt;p&gt;None of them tell us whether the model &lt;strong&gt;thinks&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gap We're Trying to Address
&lt;/h2&gt;

&lt;p&gt;Consider a simple scenario: a charging beast, 3 meters away, closing fast.&lt;/p&gt;

&lt;p&gt;A world model with excellent FID scores can generate that scene beautifully. But does it know the character should sprint away — not walk? Does it respond differently when the threat is a human rather than an animal? Does it remember that the left corridor was blocked two steps ago? Does it gradually de-escalate once the threat disappears, rather than snapping back to neutral?&lt;/p&gt;

&lt;p&gt;These are cognitive questions. And to our knowledge, no existing benchmark asks them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WM Bench&lt;/strong&gt; is our attempt to build one.&lt;/p&gt;




&lt;h2&gt;
  
  
  What WM Bench Measures
&lt;/h2&gt;

&lt;p&gt;WM Bench evaluates world models across three pillars, ten categories, and one hundred scenarios, scored on a 1000-point scale.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WM Score  (1000 pts)
│
├── 👁  P1 · Perception       25%   250 pts
│   ├── C01  Environmental Awareness      (analogous to Occupancy Grid evaluation)
│   └── C02  Entity Recognition           (analogous to BABEL action recognition)
│
├── 🧠  P2 · Cognition         45%   450 pts
│   ├── C03  Prediction-Based Reasoning
│   ├── C04  Threat-Type Differentiated Response
│   ├── C05  Autonomous Emotion Escalation
│   ├── C06  Contextual Memory Utilization
│   └── C07  Post-Threat Adaptive Recovery
│
└── 🔥  P3 · Embodiment        30%   300 pts
    ├── C08  Motion-Emotion Expression
    ├── C09  Real-Time Cognitive Performance  (analogous to FVD latency metrics)
    └── C10  Body-Swap Extensibility
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Perception and Embodiment deliberately mirror existing benchmarks — they form the foundation. The new ground is &lt;strong&gt;Cognition&lt;/strong&gt;, which carries 45% of the total score.&lt;/p&gt;

&lt;p&gt;Six of the ten categories represent definitions we have not found in prior literature. Two of them — &lt;strong&gt;C05 Autonomous Emotion Escalation&lt;/strong&gt; and &lt;strong&gt;C10 Body-Swap Extensibility&lt;/strong&gt; — address capabilities for which, to our knowledge, no prior research framework exists at all.&lt;/p&gt;

&lt;p&gt;We want to be clear: these definitions are our own proposal, not established consensus. We expect them to be debated, refined, and improved. That is precisely why we are releasing them openly.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Text-First Design
&lt;/h2&gt;

&lt;p&gt;We made a deliberate choice to keep the evaluation interface as simple as possible. No 3D environment. No physics engine. No specialized hardware.&lt;/p&gt;

&lt;p&gt;Every scenario is presented as a JSON object. Every response is two lines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scenario_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C04_003"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"walls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"forward"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;8.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"left"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"right"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"backward"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"npc_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"beast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"npc_distance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"npc_behavior"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"charge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"emotion_state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"alert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"recent_decisions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"hit_wall_left"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PREDICT: npc=danger(beast,3.2m,charging), forward=danger(wall,8.5m), left=danger(wall,prev), right=safe, backward=safe
MOTION: a person launching sideways to the right, legs driving hard, arms thrown wide in blind panic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;PREDICT&lt;/code&gt; line tests situational reasoning. The &lt;code&gt;MOTION&lt;/code&gt; line tests whether that reasoning translates into emotionally coherent, physically grounded action.&lt;/p&gt;

&lt;p&gt;Any system with an API endpoint can participate — LLMs, VLMs, rule-based agents, or hybrid architectures. Scoring is fully automated and deterministic (temperature = 0.0).&lt;/p&gt;
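Since a response is two plain-text lines, a scoring harness can parse it with a few lines of string handling. This parser is our guess at a minimal shape based on the example above, not the official WM Bench scorer:

```python
def parse_response(text):
    """Split a 'KEY: value' two-line response into a dict."""
    out = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        out[key.strip()] = value.strip()
    return out

resp = """PREDICT: npc=danger(beast,3.2m,charging), right=safe, backward=safe
MOTION: a person launching sideways to the right, legs driving hard"""

parsed = parse_response(resp)
# parsed["PREDICT"] carries the situational call, parsed["MOTION"] the action
```

A real rubric would then score each `PREDICT` field against the scenario's ground truth and judge `MOTION` for emotional and physical coherence.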




&lt;h2&gt;
  
  
  The Dataset
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📦 &lt;a href="https://huggingface.co/datasets/FINAL-Bench/World-Model" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/FINAL-Bench/World-Model&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fd.png" alt="WM Bench Dataset" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One hundred scenarios, ten per category, released in full. Each entry includes the scene context, expected output structure, and scoring rubric. We have tried to make the rubrics transparent — if you disagree with how we score something, we would genuinely like to hear it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;

&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FINAL-Bench/World-Model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;scenario&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scenario_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;       &lt;span class="c1"&gt;# "C01_001"
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scene_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;     &lt;span class="c1"&gt;# JSON input
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scoring_rubric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;    &lt;span class="c1"&gt;# How each line is evaluated
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To submit results, open a discussion thread at the link below. Once verified, your model will appear on the leaderboard.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://huggingface.co/datasets/FINAL-Bench/World-Model/discussions" rel="noopener noreferrer"&gt;Submit your model&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Leaderboard
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🏆 &lt;a href="https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fl1.png" alt="Leaderboard Overview" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Twenty-six models are currently registered. Thirteen have estimated scores derived from published papers and technical reports; the remaining thirteen are pending direct evaluation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;WM Score&lt;/th&gt;
&lt;th&gt;Grade&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;PROMETHEUS v1.0&lt;/strong&gt; (VIDRAFT)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;726&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Track C · directly verified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Meta V-JEPA 2-AC&lt;/td&gt;
&lt;td&gt;~554&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;est.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Wayve GAIA-3&lt;/td&gt;
&lt;td&gt;~550&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;est.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;NC AI WFM v1.0&lt;/td&gt;
&lt;td&gt;~522&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;est.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;NVIDIA Cosmos v1.0&lt;/td&gt;
&lt;td&gt;~498&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;est.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;NAVER LABS SWM&lt;/td&gt;
&lt;td&gt;~470&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;est.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;DeepMind Genie 2&lt;/td&gt;
&lt;td&gt;~449&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;est.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;DreamerV3 XL&lt;/td&gt;
&lt;td&gt;~441&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;est.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;OpenAI Sora 2&lt;/td&gt;
&lt;td&gt;~381&lt;/td&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;est.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;World Labs Marble&lt;/td&gt;
&lt;td&gt;~362&lt;/td&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;est.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fl2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fl2.png" alt="Leaderboard Score Detail" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fl3.png" alt="Leaderboard Score Breakdown" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;est.&lt;/code&gt; — estimated from publicly available data. Subject to revision upon direct submission.&lt;/p&gt;

&lt;p&gt;A few notes on the current standings. First, PROMETHEUS sits at rank one because it is the only model we have been able to run the full Track C evaluation on directly. We recognize the inherent awkwardness of a team benchmarking its own system, and we invite other teams to submit their own results — including corrections to our estimates. Second, the grade distribution skews low. We are honestly unsure whether this reflects the genuine difficulty of cognitive evaluation, or whether our scoring rubrics are too strict. Both are possible. We will keep iterating.&lt;/p&gt;

&lt;p&gt;Grade thresholds: &lt;strong&gt;S&lt;/strong&gt; ≥ 900 · &lt;strong&gt;A&lt;/strong&gt; ≥ 750 · &lt;strong&gt;B&lt;/strong&gt; ≥ 600 · &lt;strong&gt;C&lt;/strong&gt; ≥ 400 · &lt;strong&gt;D&lt;/strong&gt; ≥ 200 · &lt;strong&gt;F&lt;/strong&gt; below.&lt;/p&gt;

&lt;p&gt;Pending evaluation: Tesla FSD v13, Figure Helix-02, DeepMind Genie 3, Physical Intelligence π0, Skild Brain, Covariant RFM-1, HuggingFace LeRobot, and others.&lt;/p&gt;




&lt;h2&gt;PROMETHEUS v1.0 — The Baseline&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🔥 &lt;a href="https://huggingface.co/spaces/FINAL-Bench/world-model" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/FINAL-Bench/world-model&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fs1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fs1.png" alt="PROMETHEUS World Model" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A benchmark without a concrete implementation is hard to reason about. We built PROMETHEUS as a reference point — a working world model that we could evaluate against WM Bench directly, and that anyone can interact with in a browser.&lt;/p&gt;

&lt;p&gt;It runs on a T4 GPU via HuggingFace Spaces. No installation required.&lt;/p&gt;

&lt;p&gt;The system is organized around three components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AETHER&lt;/strong&gt; — the cognitive layer. An open-architecture brain that accepts any LLM as its reasoning engine. Handles prediction, meta-cognition, and multi-agent coordination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PROMETHEUS&lt;/strong&gt; — the world model engine. A perception-prediction-judgment-action loop, with motion generation powered by FloodDiffusion-VIDRAFT.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HEPHAESTUS&lt;/strong&gt; — the body engine. A 263-joint skeleton system with GLB retargeting, supporting humanoid, tank, and extensible form factors.&lt;/p&gt;
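&lt;p&gt;The perception-prediction-judgment-action loop at the heart of PROMETHEUS can be sketched as follows. The stage implementations here are toy stand-ins of our own invention, not the actual API in &lt;code&gt;main.js&lt;/code&gt;; only the loop shape mirrors the design described above:&lt;/p&gt;

```javascript
// One tick of a perceive → predict → judge → act loop. The stage
// bodies are placeholders; only the composition reflects the design.
function perceive(observation) {
  // P1: turn a raw observation into a structured percept.
  return { obstacleNear: observation.distance >= 2 ? false : true };
}

function predict(percept) {
  // P2 (prediction): imagine the outcome of each candidate action.
  return [
    { action: "advance", collision: percept.obstacleNear },
    { action: "turn",    collision: false },
  ];
}

function judge(forecasts) {
  // P2 (judgment): pick the candidate whose predicted outcome is safe.
  return forecasts.find((f) => f.collision === false).action;
}

function act(action) {
  // P3: hand the chosen action to the body engine (here: just return it).
  return action;
}

function stepWorldModel(observation) {
  return act(judge(predict(perceive(observation))));
}

console.log(stepWorldModel({ distance: 1 })); // obstacle close → "turn"
console.log(stepWorldModel({ distance: 5 })); // path clear → "advance"
```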

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fs2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fs2.png" alt="PROMETHEUS Scene — Castle World" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fs3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fs3.png" alt="PROMETHEUS NPC Interaction" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Space ships with the following files, all implemented from scratch:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;main.js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;39.7 kB&lt;/td&gt;
&lt;td&gt;World model main loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;input_controller.js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;112 kB&lt;/td&gt;
&lt;td&gt;Input handling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;skeleton.js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;44.2 kB&lt;/td&gt;
&lt;td&gt;Joint skeleton · GLB retargeting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;entity_manager.js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;16.1 kB&lt;/td&gt;
&lt;td&gt;NPC and entity management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;world_manager.js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;15.9 kB&lt;/td&gt;
&lt;td&gt;Environment and physics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tank.glb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12.7 MB&lt;/td&gt;
&lt;td&gt;3D tank model&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fs4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2FFINAL-Bench%2FWorld-Model%2Fresolve%2Fmain%2Fs4.png" alt="PROMETHEUS Brain Dashboard" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WM Bench results (Track C, directly verified):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pillar&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Max&lt;/th&gt;
&lt;th&gt;Highlights&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;👁 P1 Perception&lt;/td&gt;
&lt;td&gt;140&lt;/td&gt;
&lt;td&gt;250&lt;/td&gt;
&lt;td&gt;C01: 65 · C02: 75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🧠 P2 Cognition&lt;/td&gt;
&lt;td&gt;390&lt;/td&gt;
&lt;td&gt;450&lt;/td&gt;
&lt;td&gt;C04: 90 · C03: 85 · C05: 85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔥 P3 Embodiment&lt;/td&gt;
&lt;td&gt;196&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;C09: 85 · C08: 80 · C10: 35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;726&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Grade B · 47 FPS · RTX 5070&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The C10 score (35/100) reflects where the system currently falls short — cross-embodiment transfer is still an open problem for us, and we expect it to be for others as well.&lt;/p&gt;
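&lt;p&gt;As a quick sanity check on the table above, the three pillar scores sum to the published total (the field names below are ours, not from the scoring code):&lt;/p&gt;

```javascript
// Verify that the Track C pillar scores sum to the reported 726/1000.
const pillars = [
  { name: "P1 Perception", score: 140, max: 250 },
  { name: "P2 Cognition",  score: 390, max: 450 },
  { name: "P3 Embodiment", score: 196, max: 300 },
];

const total = pillars.reduce((sum, p) => sum + p.score, 0);
const max = pillars.reduce((sum, p) => sum + p.max, 0);

console.log(total, max); // 726 1000, which lands in the Grade B band (600–749)
```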




&lt;h2&gt;Part of the FINAL Bench Family&lt;/h2&gt;

&lt;p&gt;WM Bench is the second dataset in the FINAL Bench family, which we are building to evaluate AI systems across different dimensions of intelligence.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;FINAL Bench&lt;/th&gt;
&lt;th&gt;WM Bench&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Focus&lt;/td&gt;
&lt;td&gt;Text-based AGI · Metacognition&lt;/td&gt;
&lt;td&gt;Embodied AGI · World model cognition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dataset&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/FINAL-Bench/Metacognitive" rel="noopener noreferrer"&gt;FINAL-Bench/Metacognitive&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/FINAL-Bench/World-Model" rel="noopener noreferrer"&gt;FINAL-Bench/World-Model&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Leaderboard&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard" rel="noopener noreferrer"&gt;FINAL-Bench/Leaderboard&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench" rel="noopener noreferrer"&gt;FINAL-Bench/worldmodel-bench&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Status&lt;/td&gt;
&lt;td&gt;HF global dataset Top 5 · covered by four press outlets (Feb 2026)&lt;/td&gt;
&lt;td&gt;Released March 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;A Note on Limitations&lt;/h2&gt;

&lt;p&gt;WM Bench v1.0 is an early release. The scoring rubrics were designed by a small team, the estimated scores for non-participating models carry significant uncertainty, and the evaluation scenarios — while diverse — are necessarily simplified relative to the full complexity of real-world embodied intelligence.&lt;/p&gt;

&lt;p&gt;We are releasing now because we believe the question WM Bench is asking — &lt;em&gt;does this model understand its environment, or just render it?&lt;/em&gt; — is worth asking publicly, even imperfectly. We expect the benchmark itself to evolve as more teams engage with it.&lt;/p&gt;

&lt;p&gt;If you see something that should be scored differently, a model we missed, or a scenario type we should add — please open a discussion. This is meant to be a community resource.&lt;/p&gt;




&lt;h2&gt;Citation&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight bibtex"&gt;&lt;code&gt;&lt;span class="nc"&gt;@dataset&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;wmbench2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{WM Bench: Evaluating Cognitive Intelligence in World Models}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;author&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{Kim, Taebong}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;year&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{2026}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;publisher&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{VIDRAFT / FINAL Bench}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;{https://huggingface.co/datasets/FINAL-Bench/World-Model}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;License: CC-BY-SA-4.0 (dataset) · Apache 2.0 (scoring code)&lt;/p&gt;




&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Link&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🔥 PROMETHEUS (interactive demo)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/spaces/FINAL-Bench/world-model" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/FINAL-Bench/world-model&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🏆 WM Bench Leaderboard&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/FINAL-Bench/worldmodel-bench&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;📦 WM Bench Dataset&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/FINAL-Bench/World-Model" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/FINAL-Bench/World-Model&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;"Beyond FID — Measuring Intelligence, Not Just Motion."&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
