<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Zhixiang Li</title>
    <description>The latest articles on Forem by Zhixiang Li (@zhixiangli).</description>
    <link>https://forem.com/zhixiangli</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3799148%2Fc4774bbb-5f63-4584-9645-4b1d3021c6fc.png</url>
      <title>Forem: Zhixiang Li</title>
      <link>https://forem.com/zhixiangli</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/zhixiangli"/>
    <language>en</language>
    <item>
      <title>How I Built a Readable AlphaZero From Scratch — A Deep Dive Into the Code</title>
      <dc:creator>Zhixiang Li</dc:creator>
      <pubDate>Sun, 01 Mar 2026 09:52:29 +0000</pubDate>
      <link>https://forem.com/zhixiangli/how-i-built-a-readable-alphazero-from-scratch-a-deep-dive-into-the-code-4dn</link>
      <guid>https://forem.com/zhixiangli/how-i-built-a-readable-alphazero-from-scratch-a-deep-dive-into-the-code-4dn</guid>
      <description>&lt;p&gt;Most AlphaZero repositories fall into one of two traps: they're either so heavily optimised that the algorithm is buried under infrastructure, or they're toy demos that don't actually produce a strong player. I wanted something in the middle — clean enough to &lt;em&gt;read&lt;/em&gt;, strong enough to &lt;em&gt;beat you at Gomoku&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The result is &lt;strong&gt;&lt;a href="https://github.com/zhixiangli/alphazero-board-games" rel="noopener noreferrer"&gt;alphazero-board-games&lt;/a&gt;&lt;/strong&gt;: a lightweight AlphaZero implementation covering Gomoku (9×9 and 15×15) and Connect4, with pretrained checkpoints you can play against immediately.&lt;/p&gt;

&lt;p&gt;In this post I'm going to pull apart every major component and explain exactly what's happening and &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Picture: What AlphaZero Actually Does
&lt;/h2&gt;

&lt;p&gt;Before we touch any code, let's lock down the algorithm at a conceptual level, because many blog posts conflate AlphaGo, AlphaGo Zero, and AlphaZero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AlphaZero (2017)&lt;/strong&gt; learns entirely from self-play — no human games, no handcrafted features. The training loop has three interlocked components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A residual neural network&lt;/strong&gt; with two heads: a &lt;strong&gt;policy head&lt;/strong&gt; (probability distribution over moves) and a &lt;strong&gt;value head&lt;/strong&gt; (estimated game outcome, from –1 for a certain loss to +1 for a certain win).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monte Carlo Tree Search (MCTS)&lt;/strong&gt; guided by the network — the network's policy priors bias &lt;em&gt;which&lt;/em&gt; branches MCTS explores; the network's value replaces random rollouts at leaf nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A self-play RL loop&lt;/strong&gt; where the network plays against itself, generates game records, and trains on them. A stronger network generates better data, which trains an even stronger network.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The feedback loop is self-bootstrapping. Starting from random weights, after enough iterations the agent discovers strategies that took humans centuries to develop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Repository Layout
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alphazero/          ← shared core (game API, MCTS, network, RL loop)
gomoku_9_9/         ← 9×9 Gomoku rules + trainer + terminal player
gomoku_15_15/       ← 15×15 Gomoku rules + trainer + terminal player
connect4/           ← Connect4 rules + trainer + terminal player
scripts/            ← utility shell scripts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The clean separation between the &lt;strong&gt;core engine&lt;/strong&gt; (&lt;code&gt;alphazero/&lt;/code&gt;) and the &lt;strong&gt;game presets&lt;/strong&gt; is the most important architectural decision in the repo. Adding a new game means implementing one interface — you never touch the MCTS or training code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Component 1: The Abstract Game API
&lt;/h2&gt;

&lt;p&gt;Every game in this project implements the same abstract interface. Roughly, it exposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;get_current_player()&lt;/code&gt; — whose turn it is (encoded as +1 / –1 for the two players)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_valid_moves()&lt;/code&gt; — a flat boolean mask over the action space&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;apply_move(action)&lt;/code&gt; — mutate the state with a move&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_game_result()&lt;/code&gt; — returns &lt;code&gt;None&lt;/code&gt; (ongoing), &lt;code&gt;+1&lt;/code&gt; (current player won), &lt;code&gt;–1&lt;/code&gt; (lost), or &lt;code&gt;0&lt;/code&gt; (draw)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;encode_state()&lt;/code&gt; — converts the board into a tensor suitable for the neural network&lt;/li&gt;
&lt;/ul&gt;
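&lt;p&gt;As an illustrative sketch (method names mirror the list above, but the repo's exact signatures may differ), the interface could be written as a Python ABC:&lt;/p&gt;

```python
from abc import ABC, abstractmethod


class Game(ABC):
    """Abstract two-player, zero-sum board game (illustrative sketch)."""

    @abstractmethod
    def get_current_player(self):
        """Return +1 or -1 for the player to move."""

    @abstractmethod
    def get_valid_moves(self):
        """Return a flat boolean mask over the action space."""

    @abstractmethod
    def apply_move(self, action):
        """Mutate the state by playing `action`."""

    @abstractmethod
    def get_game_result(self):
        """None while ongoing; +1 / -1 / 0 from the current player's view."""

    @abstractmethod
    def encode_state(self):
        """Return a (planes, H, W) float tensor for the network."""
```

&lt;p&gt;MCTS and the training loop only ever talk to this interface, which is what makes new games plug-and-play.&lt;/p&gt;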

&lt;p&gt;That last method deserves its own paragraph.&lt;/p&gt;

&lt;h3&gt;
  
  
  Board Encoding: Planes Over Pixels
&lt;/h3&gt;

&lt;p&gt;Rather than feeding a raw 2D board matrix into the network, the state is encoded as a &lt;strong&gt;stack of binary planes&lt;/strong&gt;. For a two-player game at step &lt;em&gt;t&lt;/em&gt;, a typical encoding looks like:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plane&lt;/th&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Current player's stones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Opponent's stones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;(optional) Whose turn it is, broadcast across the board&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This multi-plane representation gives the convolutional network the same spatial locality information a human sees: where &lt;em&gt;my&lt;/em&gt; stones are, where &lt;em&gt;theirs&lt;/em&gt; are, and who moves next — all without any numeric encoding tricks that might confuse early conv layers.&lt;/p&gt;
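&lt;p&gt;A minimal encoder for the three planes in the table might look like this (a sketch; &lt;code&gt;encode_planes&lt;/code&gt; and its conventions are illustrative, not the repo's exact code):&lt;/p&gt;

```python
import numpy as np

def encode_planes(board, player):
    """board: (H, W) ints in {+1, -1, 0}; player: +1 or -1 to move.
    Returns the 3-plane encoding from the table above."""
    own = (board == player).astype(np.float32)        # plane 0: my stones
    opp = (board == -player).astype(np.float32)       # plane 1: their stones
    turn = np.full_like(own, 1.0 if player == 1 else 0.0)  # plane 2: side to move
    return np.stack([own, opp, turn])

board = np.zeros((9, 9), dtype=int)
board[4, 4] = 1    # a stone for player +1
board[4, 5] = -1   # a stone for player -1
planes = encode_planes(board, player=1)  # shape (3, 9, 9)
```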

&lt;p&gt;For Gomoku the action space is simply every empty intersection: &lt;code&gt;board_rows × board_cols&lt;/code&gt; possible moves. For Connect4 it's just the 7 columns. The network always outputs a flat vector of size equal to the action space, and the valid-move mask is applied &lt;em&gt;after&lt;/em&gt; the softmax to zero out illegal moves before re-normalising.&lt;/p&gt;
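&lt;p&gt;The mask-after-softmax order described above can be sketched in a few lines of numpy (illustrative, not the repo's exact code):&lt;/p&gt;

```python
import numpy as np

def masked_policy(logits, valid_mask):
    """Softmax over all actions, then zero illegal moves and renormalise."""
    z = np.exp(logits - logits.max())   # numerically stable softmax
    probs = z / z.sum()
    probs = probs * valid_mask          # zero out illegal actions
    return probs / probs.sum()          # renormalise over legal moves

logits = np.array([2.0, 1.0, 0.5, -1.0])
mask = np.array([1.0, 0.0, 1.0, 1.0])   # action 1 is illegal
p = masked_policy(logits, mask)         # p[1] is exactly 0, p sums to 1
```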




&lt;h2&gt;
  
  
  Component 2: The Residual Policy/Value Network
&lt;/h2&gt;

&lt;p&gt;The network is the brain. Its architecture mirrors the one in the original DeepMind paper, scaled down to be trainable on consumer hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Summary
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: (batch, planes, H, W)
  │
  ▼
Conv Block (conv → BN → ReLU)
  │
  ▼
Residual Blocks × N  ─── (conv → BN → ReLU → conv → BN → skip-add → ReLU)
  │
  ├──▶  Policy Head  →  FC → softmax → π (action probabilities)
  │
  └──▶  Value Head   →  FC → FC → tanh →  v ∈ (–1, +1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why residual blocks?&lt;/strong&gt; Vanilla deep CNNs suffer from vanishing gradients. Residual connections (skip connections that add the input to the output of a two-conv stack) let gradients flow directly backwards through the identity path, enabling much deeper networks to train reliably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why two heads sharing a backbone?&lt;/strong&gt; Policy and value are tightly correlated: a position that's good for one player tends to have a narrower set of good moves, not just a higher value. Sharing the convolutional feature extractor forces the network to learn spatial representations that are useful for &lt;em&gt;both&lt;/em&gt; tasks simultaneously. The two heads then specialise on top of shared features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;tanh&lt;/code&gt; on the value head?&lt;/strong&gt; The game result is always in &lt;code&gt;{–1, 0, +1}&lt;/code&gt; (loss / draw / win). &lt;code&gt;tanh&lt;/code&gt; naturally squashes the value head's output into &lt;code&gt;(–1, +1)&lt;/code&gt;, which matches that training signal without needing any normalisation.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Network Actually Learns
&lt;/h3&gt;

&lt;p&gt;After sufficient self-play training, the policy head learns to assign high probability to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Moves that extend winning threats&lt;/li&gt;
&lt;li&gt;Moves that block the opponent's winning threats&lt;/li&gt;
&lt;li&gt;Moves that create multiple simultaneous threats (forks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The value head learns the &lt;em&gt;game-theoretic value&lt;/em&gt; of a position — essentially "if both players play perfectly from here, who wins?" Early in training its estimates are wild guesses; after training it is accurate enough that MCTS rarely needs to search very deep to get a good evaluation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Component 3: Monte Carlo Tree Search (MCTS)
&lt;/h2&gt;

&lt;p&gt;This is where the magic really happens. MCTS is the search algorithm that &lt;em&gt;uses&lt;/em&gt; the neural network to play the game. Each MCTS search starts at the current board position (the root) and runs &lt;code&gt;N&lt;/code&gt; simulations.&lt;/p&gt;

&lt;h3&gt;
  
  
  One Simulation: Four Steps
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Selection.&lt;/strong&gt; From the root, descend the tree by repeatedly picking the child with the highest &lt;strong&gt;PUCT score&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUCT(s, a) = Q(s, a)  +  c_puct × P(s, a) × √(ΣN(s, b)) / (1 + N(s, a))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Q(s, a)&lt;/code&gt; — the running average value of taking action &lt;code&gt;a&lt;/code&gt; from state &lt;code&gt;s&lt;/code&gt; (exploitation)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;P(s, a)&lt;/code&gt; — the &lt;strong&gt;prior probability&lt;/strong&gt; from the neural network's policy head (exploration bias)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;N(s, a)&lt;/code&gt; — the visit count for this edge&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;c_puct&lt;/code&gt; — a constant controlling the exploration–exploitation tradeoff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The formula has a beautiful property: a move with a high prior &lt;code&gt;P&lt;/code&gt; gets explored early (when visit counts are low, the second term is large). But as it gets visited and its &lt;code&gt;Q&lt;/code&gt; value is refined, the exploration bonus shrinks. Moves that consistently return good results rise to the top; moves that looked promising but turned out weak get deprioritised.&lt;/p&gt;
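&lt;p&gt;The formula translates directly into numpy. In this toy example, a high-prior but unvisited move wins the PUCT comparison against an already well-explored one (a sketch; &lt;code&gt;c_puct = 1.5&lt;/code&gt; is an arbitrary illustrative value):&lt;/p&gt;

```python
import numpy as np

def puct_scores(Q, P, N, c_puct=1.5):
    """PUCT(s, a) = Q + c_puct * P * sqrt(sum(N)) / (1 + N), per the formula above."""
    return Q + c_puct * P * np.sqrt(N.sum()) / (1.0 + N)

# Three children of the root: action 0 has a high prior but no visits yet.
Q = np.array([0.0, 0.3, 0.1])
P = np.array([0.6, 0.3, 0.1])
N = np.array([0.0, 10.0, 2.0])
best = int(np.argmax(puct_scores(Q, P, N)))  # the unvisited high-prior move
```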

&lt;p&gt;&lt;strong&gt;Step 2 — Expansion.&lt;/strong&gt; When we reach a node that has never been visited (a leaf), we query the neural network. The network returns &lt;code&gt;(π, v)&lt;/code&gt; — a policy vector and a value scalar. We:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store &lt;code&gt;π&lt;/code&gt; as the &lt;strong&gt;prior probabilities&lt;/strong&gt; for all children of this node&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;v&lt;/code&gt; as the value estimate (instead of playing a random rollout to the end of the game)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — Backup.&lt;/strong&gt; Propagate the value &lt;code&gt;v&lt;/code&gt; back up the path to the root, updating &lt;code&gt;Q(s, a)&lt;/code&gt; and &lt;code&gt;N(s, a)&lt;/code&gt; for every edge traversed. Critically, values are &lt;strong&gt;flipped at each ply&lt;/strong&gt; because the game alternates between players: a value of &lt;code&gt;+0.8&lt;/code&gt; for the player-to-move is &lt;code&gt;–0.8&lt;/code&gt; from their opponent's perspective.&lt;/p&gt;
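&lt;p&gt;The backup step, including the per-ply sign flip, can be sketched like this (illustrative; edge statistics are stored as plain dicts here, which is not how the repo necessarily does it):&lt;/p&gt;

```python
def backup(path, v):
    """Propagate leaf value v up the visited path, flipping sign each ply.
    `path` lists the traversed edges root-first; v is from the leaf
    player's perspective."""
    for edge in reversed(path):
        edge["N"] += 1
        edge["W"] += v                     # accumulated value for this edge
        edge["Q"] = edge["W"] / edge["N"]  # running average
        v = -v                             # opponent's perspective one ply up

path = [{"N": 0, "W": 0.0, "Q": 0.0},   # edge nearest the root
        {"N": 0, "W": 0.0, "Q": 0.0}]   # edge into the leaf
backup(path, 0.8)   # leaf edge gets +0.8, root edge gets -0.8
```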

&lt;p&gt;&lt;strong&gt;Step 4 — Move Selection.&lt;/strong&gt; After all &lt;code&gt;N&lt;/code&gt; simulations, the final move is chosen proportional to visit counts at the root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;π_mcts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;∝&lt;/span&gt; &lt;span class="nc"&gt;N&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At &lt;code&gt;temperature = 1.0&lt;/code&gt; (early training) the selection is stochastic — exploration is maximised. At &lt;code&gt;temperature → 0&lt;/code&gt; (competitive play) the move with the most visits is chosen deterministically.&lt;/p&gt;
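&lt;p&gt;Both temperature regimes can be sketched with a single function (illustrative code, not the repo's exact implementation):&lt;/p&gt;

```python
import numpy as np

def select_move_probs(visit_counts, temperature=1.0):
    """Turn root visit counts into a move distribution, as in the formula above."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    if temperature == 0:                       # deterministic: most-visited move
        probs = np.zeros_like(counts)
        probs[np.argmax(counts)] = 1.0
        return probs
    scaled = counts ** (1.0 / temperature)     # sharpen or flatten the counts
    return scaled / scaled.sum()

probs_hot = select_move_probs([10, 30, 60], temperature=1.0)  # proportional
probs_cold = select_move_probs([10, 30, 60], temperature=0)   # all mass on best
```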

&lt;h3&gt;
  
  
  Why MCTS Visit Counts Beat Raw Policy Priors
&lt;/h3&gt;

&lt;p&gt;Here's an important subtlety: we don't just play the move with the highest &lt;strong&gt;policy prior&lt;/strong&gt;. We play the move with the most &lt;strong&gt;MCTS visits&lt;/strong&gt;. Why?&lt;/p&gt;

&lt;p&gt;Because MCTS performs lookahead, recursively and many plies deep. A move with a mediocre prior can still accumulate visits if its children turn out to be promising. Over &lt;code&gt;N&lt;/code&gt; simulations, the visit count is a much stronger signal than the raw policy prior: it has been &lt;em&gt;refined&lt;/em&gt; by actually exploring the consequences. This is why AlphaZero is stronger than a greedy policy network alone: MCTS layers iterative, guided search on top of the network's intuition.&lt;/p&gt;




&lt;h2&gt;
  
  
  Component 4: The Self-Play RL Loop
&lt;/h2&gt;

&lt;p&gt;Training data isn't downloaded — it's &lt;em&gt;generated&lt;/em&gt; by the model playing against itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Data Generation Process
&lt;/h3&gt;

&lt;p&gt;For each game of self-play:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with the current board state.&lt;/li&gt;
&lt;li&gt;Run MCTS for &lt;code&gt;simulation_num&lt;/code&gt; steps → get &lt;code&gt;π_mcts&lt;/code&gt; (a probability distribution over moves).&lt;/li&gt;
&lt;li&gt;Sample a move from &lt;code&gt;π_mcts&lt;/code&gt; (with temperature) and apply it.&lt;/li&gt;
&lt;li&gt;Record the tuple &lt;code&gt;(encoded_board, π_mcts)&lt;/code&gt; for this step.&lt;/li&gt;
&lt;li&gt;Repeat until the game ends with result &lt;code&gt;r ∈ {–1, 0, +1}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Assign the value label &lt;code&gt;z&lt;/code&gt; to each step: the final result from the perspective of the player who moved at that step (&lt;code&gt;+1&lt;/code&gt; if that player went on to win, &lt;code&gt;–1&lt;/code&gt; if they lost, &lt;code&gt;0&lt;/code&gt; for a draw).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The training dataset is a collection of &lt;code&gt;(board_state, π_mcts, z)&lt;/code&gt; triples.&lt;/p&gt;
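&lt;p&gt;The value-labeling step can be sketched as follows, assuming (as a convention for this example) that &lt;code&gt;result&lt;/code&gt; is the final outcome from player &lt;code&gt;+1&lt;/code&gt;'s perspective:&lt;/p&gt;

```python
def label_steps(players, result):
    """players: who moved at each step (+1 or -1, in order).
    result: final outcome for player +1 (+1 win, -1 loss, 0 draw).
    Returns the per-step value label z, flipped to each mover's perspective."""
    return [result * p for p in players]

# Player +1 wins a 5-move game: their steps get z=+1, the opponent's z=-1.
z = label_steps([1, -1, 1, -1, 1], result=1)
```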

&lt;h3&gt;
  
  
  The Training Objective
&lt;/h3&gt;

&lt;p&gt;The network is trained to minimise a combined loss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = MSE(v, z) + CrossEntropy(π_network, π_mcts) + λ‖θ‖²
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;value loss&lt;/strong&gt; &lt;code&gt;MSE(v, z)&lt;/code&gt; trains the value head to correctly predict game outcomes.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;policy loss&lt;/strong&gt; &lt;code&gt;CrossEntropy(π_network, π_mcts)&lt;/code&gt; trains the policy head to match MCTS's refined move distribution (not just game outcomes — this is the crucial difference from plain policy gradient RL).&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;L2 regularisation&lt;/strong&gt; &lt;code&gt;λ‖θ‖²&lt;/code&gt; prevents overfitting.&lt;/li&gt;
&lt;/ul&gt;
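&lt;p&gt;The three loss terms can be computed on toy data like this (a numpy sketch; real training would use a framework's autograd, and &lt;code&gt;lam&lt;/code&gt; is an arbitrary illustrative value):&lt;/p&gt;

```python
import numpy as np

def alphazero_loss(v, z, pi_net, pi_mcts, theta, lam=1e-4):
    """L = MSE(v, z) + CrossEntropy(pi_net, pi_mcts) + lam * ||theta||^2."""
    value_loss = np.mean((v - z) ** 2)
    policy_loss = -np.mean(np.sum(pi_mcts * np.log(pi_net + 1e-10), axis=1))
    l2 = lam * np.sum(theta ** 2)
    return value_loss + policy_loss + l2

v = np.array([0.5, -0.2])                     # value-head predictions
z = np.array([1.0, -1.0])                     # game-outcome labels
pi_net = np.array([[0.7, 0.2, 0.1],           # policy-head outputs
                   [0.1, 0.8, 0.1]])
pi_mcts = np.array([[0.6, 0.3, 0.1],          # MCTS-refined targets
                    [0.0, 1.0, 0.0]])
theta = np.array([0.5, -0.5])                 # stand-in for network weights
loss = alphazero_loss(v, z, pi_net, pi_mcts, theta)
```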

&lt;p&gt;The brilliant insight: the MCTS-refined distribution &lt;code&gt;π_mcts&lt;/code&gt; is a &lt;em&gt;better&lt;/em&gt; target for the policy than the raw game result. It captures not just "this player won" but &lt;em&gt;which moves MCTS found most promising&lt;/em&gt;. The network bootstraps off its own search.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Augmentation via Symmetry
&lt;/h3&gt;

&lt;p&gt;Gomoku and Connect4 have geometric symmetries: a position and its mirror image are strategically identical. The training pipeline exploits this by augmenting each recorded position with its symmetric copies, multiplying the effective dataset size for free. A square Gomoku board admits the full set of eight symmetries (four rotations, each optionally mirrored); Connect4, because of gravity, admits only the left-right mirror.&lt;/p&gt;
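&lt;p&gt;For a square board, the eight symmetric copies can be generated with numpy, applying the same transform to the state planes and to the policy target reshaped onto the board (a sketch):&lt;/p&gt;

```python
import numpy as np

def dihedral_augment(planes, policy_2d):
    """All 8 symmetries of a square board: 4 rotations, each plain and mirrored,
    applied identically to the (planes, H, W) state and the (H, W) policy grid."""
    out = []
    for k in range(4):
        b = np.rot90(planes, k, axes=(1, 2))            # rotate each plane
        p = np.rot90(policy_2d, k)                      # rotate policy the same way
        out.append((b, p))
        out.append((np.flip(b, axis=2), np.fliplr(p)))  # mirrored copy
    return out

planes = np.random.rand(3, 9, 9)
policy = np.random.rand(9, 9)
augmented = dihedral_augment(planes, policy)  # 8 (state, policy) pairs
```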

&lt;h3&gt;
  
  
  The Training Loop in Practice
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while True:
    # Phase 1: Generate data
    games = self_play(model, num_games=N, simulation_num=S)
    replay_buffer.add(games)

    # Phase 2: Train
    for batch in replay_buffer.sample(batch_size):
        loss = compute_loss(model, batch)
        optimizer.step(loss)

    # Phase 3: Evaluate (optional)
    # compare new model vs old model in head-to-head play
    # keep the winner
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;train_interval&lt;/code&gt; parameter in the trainer controls how many self-play games are collected before each training phase — a key hyperparameter for balancing data freshness vs. compute cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  Playing Immediately (No Training Required)
&lt;/h2&gt;

&lt;p&gt;The repo includes pretrained checkpoints in each game's &lt;code&gt;data/&lt;/code&gt; directory, so you can skip all of the above and just play:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install with uv (Python 3.12+)&lt;/span&gt;
uv &lt;span class="nb"&gt;sync&lt;/span&gt;

&lt;span class="c"&gt;# Play Gomoku 15×15 in your terminal&lt;/span&gt;
uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; gomoku_15_15.stdio_play &lt;span class="nt"&gt;--human-color&lt;/span&gt; W &lt;span class="nt"&gt;--simulation-num&lt;/span&gt; 400

&lt;span class="c"&gt;# Play Connect4&lt;/span&gt;
uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; connect4.stdio_play &lt;span class="nt"&gt;--human-color&lt;/span&gt; B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Move format is intuitive: &lt;code&gt;E5&lt;/code&gt; or &lt;code&gt;E 5&lt;/code&gt; for Gomoku (column letter + row number), column number for Connect4.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--simulation-num&lt;/code&gt; flag directly controls AI strength. At 400 simulations per move the AI is strong but quick. Push it to 1200+ if you want a serious challenge (and don't mind waiting a few seconds per move).&lt;/p&gt;




&lt;h2&gt;
  
  
  Training Your Own Models
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start training from scratch (uses default config)&lt;/span&gt;
uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; gomoku_9_9.trainer

&lt;span class="c"&gt;# Override hyperparameters from the CLI&lt;/span&gt;
uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; gomoku_15_15.trainer &lt;span class="nt"&gt;-simulation_num&lt;/span&gt; 1200 &lt;span class="nt"&gt;-train_interval&lt;/span&gt; 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key hyperparameters to tune:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;simulation_num&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MCTS simulations per move. Higher → stronger AI, slower self-play&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;train_interval&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Games of self-play between training steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;learning_rate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Standard NN learning rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;batch_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Training batch size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;num_residual_blocks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Depth of the network&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;num_filters&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Width of conv layers (capacity)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On a modern laptop with a CPU (no GPU required), the 9×9 Gomoku model starts showing real strategy within a few hours of training. The 15×15 model needs more compute but the pretrained checkpoint is already strong.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Extend It: Adding a New Game
&lt;/h2&gt;

&lt;p&gt;The cleanest feature of this architecture is that adding a new game is a well-defined, isolated task. You need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new directory, e.g. &lt;code&gt;tictactoe/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Implement the &lt;code&gt;Game&lt;/code&gt; abstract interface with your rules: valid moves, move application, win/draw detection, and board encoding.&lt;/li&gt;
&lt;li&gt;Create a &lt;code&gt;trainer.py&lt;/code&gt; that instantiates your game class and calls the shared training loop.&lt;/li&gt;
&lt;li&gt;Create a &lt;code&gt;stdio_play.py&lt;/code&gt; that instantiates your game and the MCTS player for terminal play.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The MCTS, the residual network, the RL loop, the replay buffer, the loss function — you inherit all of that for free. The only game-specific code is rules + board encoding. This is the right abstraction boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Makes This Implementation Different
&lt;/h2&gt;

&lt;p&gt;There are many AlphaZero repos out there. Here's what I was optimising for with this one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Readability over performance.&lt;/strong&gt; The MCTS implementation is single-threaded and synchronous. A production system would batch neural network evaluations across parallel tree simulations. That's faster, but harder to read. This codebase is designed to be &lt;em&gt;understood&lt;/em&gt;, not to break speed records.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batteries included.&lt;/strong&gt; The pretrained checkpoints mean you get a working demo in 30 seconds. Most repos make you train for hours before you see anything interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modern Python tooling.&lt;/strong&gt; The project uses &lt;code&gt;uv&lt;/code&gt; for dependency management and &lt;code&gt;pyproject.toml&lt;/code&gt; for configuration. No &lt;code&gt;requirements.txt&lt;/code&gt; version conflicts. No &lt;code&gt;conda&lt;/code&gt; environment hell. Just &lt;code&gt;uv sync&lt;/code&gt; and you're running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-game from day one.&lt;/strong&gt; The abstract game API and shared core were designed upfront, not retrofitted. Gomoku 9×9, Gomoku 15×15, and Connect4 all live in the same repo and share every line of non-game-specific code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Key Insight Worth Internalising
&lt;/h2&gt;

&lt;p&gt;If you read nothing else from this post, read this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AlphaZero doesn't learn &lt;em&gt;moves&lt;/em&gt;. It learns to &lt;em&gt;evaluate positions&lt;/em&gt;, and then uses search (MCTS) to turn those evaluations into moves.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The network is trained to predict two things: the probability distribution over good moves (policy), and whether the current position is winning (value). MCTS uses these predictions to efficiently explore the game tree. The training data comes from MCTS itself — the network learns to be a better evaluator, which makes MCTS search better, which generates better training data, and the cycle continues.&lt;/p&gt;

&lt;p&gt;It's a beautiful closed loop. And it works from &lt;em&gt;nothing&lt;/em&gt; — no human games, no domain knowledge, just the rules of the game and enough compute.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It, Fork It, Break It
&lt;/h2&gt;

&lt;p&gt;The project is on GitHub at &lt;strong&gt;&lt;a href="https://github.com/zhixiangli/alphazero-board-games" rel="noopener noreferrer"&gt;zhixiangli/alphazero-board-games&lt;/a&gt;&lt;/strong&gt; under the Apache-2.0 license.&lt;/p&gt;

&lt;p&gt;A few things I'd love to see people build on top of it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A new game&lt;/strong&gt;: Othello, TicTacToe, or even something like Hex&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A stronger training pipeline&lt;/strong&gt;: async self-play, batched MCTS evaluation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An evaluation harness&lt;/strong&gt;: automatically pit new checkpoints against old ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A web UI&lt;/strong&gt;: replace the terminal &lt;code&gt;stdio_play&lt;/code&gt; with a browser interface&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you find a bug, have a question about the MCTS implementation, or want to discuss a design decision, open an issue or drop a comment below. And if you find the project useful — a ⭐ on GitHub goes a long way!&lt;/p&gt;

&lt;p&gt;Happy coding. ♟️&lt;/p&gt;

</description>
      <category>alphazero</category>
      <category>reinforcementlearning</category>
      <category>deeplearning</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built an AI Arena and Trained AlphaZero to Play Gomoku: Here’s How</title>
      <dc:creator>Zhixiang Li</dc:creator>
      <pubDate>Sun, 01 Mar 2026 01:58:34 +0000</pubDate>
      <link>https://forem.com/zhixiangli/i-built-an-ai-arena-and-trained-alphazero-to-play-gomoku-heres-how-588m</link>
      <guid>https://forem.com/zhixiangli/i-built-an-ai-arena-and-trained-alphazero-to-play-gomoku-heres-how-588m</guid>
      <description>&lt;p&gt;Building a board game AI is a fantastic way to dive into Reinforcement Learning and search algorithms. But once you've built your AI, a new problem arises: &lt;strong&gt;How do you actually test it against other algorithms?&lt;/strong&gt; If you write a classic Minimax agent in Java and an AlphaZero model in Python, how do you make them fight?&lt;/p&gt;

&lt;p&gt;To solve this, I built a two-part ecosystem:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/zhixiangli/gomoku-battle" rel="noopener noreferrer"&gt;Gomoku Battle&lt;/a&gt;&lt;/strong&gt;: A cross-language, cross-system arena for AI agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/zhixiangli/alphazero-board-games" rel="noopener noreferrer"&gt;AlphaZero Board Games&lt;/a&gt;&lt;/strong&gt;: A lightweight, readable AlphaZero implementation trained to dominate the arena.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is a deep dive into the architecture of both projects and how they work together.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏟️ Part 1: The Arena (Gomoku Battle)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/zhixiangli/gomoku-battle" rel="noopener noreferrer"&gt;zhixiangli/gomoku-battle&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal of &lt;code&gt;gomoku-battle&lt;/code&gt; was to create a pluggable, language-agnostic referee system. I wanted to be able to write an AI in any language, plug it into the arena, and watch it play in a UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture
&lt;/h3&gt;

&lt;p&gt;I built the platform using Java, splitting it into specialized modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;gomoku-battle-core&lt;/code&gt;: Handles the board state, win/loss rule checking, and pattern utilities.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gomoku-battle-dashboard&lt;/code&gt;: A JavaFX-based UI that provides real-time visualizations of the matches.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gomoku-battle-console&lt;/code&gt;: The referee that manages the game loop and handles inter-process communication (IPC).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cross-Language Communication via &lt;code&gt;stdio&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The secret sauce of &lt;code&gt;gomoku-battle&lt;/code&gt; is how it talks to the agents. Instead of forcing agents to implement a specific language interface, the console spawns each AI as a separate &lt;strong&gt;subprocess&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Communication happens purely over standard input/output (&lt;code&gt;stdin&lt;/code&gt; / &lt;code&gt;stdout&lt;/code&gt;) using JSON.&lt;br&gt;
When it's an agent's turn, the referee sends the board state as a JSON string containing the SGF (Smart Game Format) sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"NEXT_BLACK"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"rows"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"columns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"chessboard"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"B[96];W[a5];B[a4];W[95]"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent processes the state, calculates the best move, and simply prints its decision to &lt;code&gt;stdout&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"rowIndex"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"columnIndex"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This design completely decouples the AI logic from the game engine. You can configure your agents in a simple &lt;code&gt;battle.properties&lt;/code&gt; file, pointing the engine to a Java &lt;code&gt;.jar&lt;/code&gt; or a Python script using &lt;code&gt;uv run&lt;/code&gt;.&lt;/p&gt;
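&lt;p&gt;A minimal agent speaking this protocol can be sketched in a few lines of Python. The move choice below is a deliberately trivial centre-square stub, and the field names are taken from the example messages above; a real agent would parse the &lt;code&gt;chessboard&lt;/code&gt; sequence and run its own search instead:&lt;/p&gt;

```python
import json
import sys

def respond(request_line):
    """Parse one JSON request from the referee and answer with a move.
    Trivial stub strategy: always claim the centre square."""
    req = json.loads(request_line)
    row = req["rows"] // 2
    col = req["columns"] // 2
    return json.dumps({"rowIndex": row, "columnIndex": col})

def main():
    # One request per line on stdin; one JSON move per line on stdout.
    for line in sys.stdin:
        if line.strip():
            print(respond(line), flush=True)

# A real agent script would call main() here; it is left uncalled so the
# sketch can be imported without blocking on stdin.
```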

&lt;p&gt;To provide a baseline, I included &lt;code&gt;gomoku-battle-alphabetasearch&lt;/code&gt;, a classical AI using Alpha-Beta Pruning. But I wanted something stronger.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 Part 2: The Brain (AlphaZero Board Games)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/zhixiangli/alphazero-board-games" rel="noopener noreferrer"&gt;zhixiangli/alphazero-board-games&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To beat the Alpha-Beta baseline, I implemented the algorithm that conquered Go and Chess: &lt;strong&gt;AlphaZero&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Many AlphaZero repositories out there are either overly complex, tied heavily to one specific game, or require massive compute just to see a working demo. I built &lt;code&gt;alphazero-board-games&lt;/code&gt; to be &lt;strong&gt;clean, modular, and instantly playable.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Implementation Details
&lt;/h3&gt;

&lt;p&gt;The project is built with Python 3.12+ and uses a modular architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shared Core (&lt;code&gt;alphazero/&lt;/code&gt;)&lt;/strong&gt;: This is the heart of the engine. It contains the abstract Game API, the Monte Carlo Tree Search (MCTS) implementation, the Neural Network definitions (Residual Policy/Value networks), and the self-play Reinforcement Learning loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Game Presets&lt;/strong&gt;: I implemented the specific rules for &lt;code&gt;gomoku_9_9&lt;/code&gt;, &lt;code&gt;gomoku_15_15&lt;/code&gt;, and &lt;code&gt;connect4&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
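&lt;p&gt;I won't reproduce the repo's exact interface here, but an abstract Game API of roughly this shape is what lets the shared MCTS and training loop stay game-agnostic (the method names below are illustrative, not the actual ones):&lt;/p&gt;

```python
from abc import ABC, abstractmethod


class Game(ABC):
    """Illustrative sketch of an AlphaZero-style game interface.

    Each preset (Gomoku, Connect4, ...) implements these hooks so the
    shared MCTS and training code never touch game-specific rules.
    """

    @abstractmethod
    def initial_state(self):
        """Return the empty-board starting state."""

    @abstractmethod
    def legal_moves(self, state):
        """Return the moves playable from `state`."""

    @abstractmethod
    def next_state(self, state, move):
        """Return the state reached by playing `move`."""

    @abstractmethod
    def terminal_value(self, state):
        """Return +1/-1/0 if the game is over, or None if it continues."""

    @abstractmethod
    def encode(self, state):
        """Return the tensor representation fed to the network."""
```

&lt;p&gt;Adding a new game then means implementing one subclass, with no changes to the search or training code.&lt;/p&gt;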

&lt;p&gt;Instead of traditional rollouts, the &lt;strong&gt;MCTS&lt;/strong&gt; in this project traverses the tree until it reaches a leaf node, then queries the &lt;strong&gt;Residual Network&lt;/strong&gt;. The network outputs two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Policy ($p$)&lt;/strong&gt;: A probability distribution over possible moves (where to look).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value ($v$)&lt;/strong&gt;: An evaluation of the current board state in [-1, 1] (who is winning).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These predictions guide the MCTS to focus only on promising branches, vastly reducing the search space compared to traditional Alpha-Beta pruning.&lt;/p&gt;
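&lt;p&gt;Concretely, AlphaZero's standard selection rule is PUCT: at each node, descend into the child maximizing $Q + U$, where $Q$ is the mean observed value and $U$ is an exploration bonus scaled by the network's prior. A minimal sketch (not the repo's exact code):&lt;/p&gt;

```python
import math


def puct_score(parent_visits: int, child_visits: int,
               child_value_sum: float, prior: float,
               c_puct: float = 1.5) -> float:
    # Q: mean value observed so far for this child (0 if unvisited).
    q = child_value_sum / child_visits if child_visits else 0.0
    # U: exploration bonus; large for high-prior, rarely visited moves.
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u


def select_child(children: list[dict]) -> dict:
    # `children` holds per-move visit/value/prior stats; pick the move
    # MCTS should descend into. (Using the sum of child visits as the
    # parent count is a simplification for this sketch.)
    parent_visits = sum(c["visits"] for c in children) or 1
    return max(children, key=lambda c: puct_score(
        parent_visits, c["visits"], c["value_sum"], c["prior"]))
```

&lt;p&gt;The balance is the whole trick: unvisited moves with a strong prior get a large $U$ term and are explored first, while repeated visits shrink $U$ and let the measured $Q$ take over.&lt;/p&gt;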

&lt;h3&gt;
  
  
  "Batteries Included"
&lt;/h3&gt;

&lt;p&gt;To make the repo developer-friendly, I included &lt;strong&gt;pretrained checkpoints&lt;/strong&gt; in the &lt;code&gt;data/&lt;/code&gt; directories. You don't need to spend hours training a model to see it work. You can just clone the repo and immediately play against the AI in your terminal using the included &lt;code&gt;stdio_play.py&lt;/code&gt; scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; gomoku_15_15.stdio_play &lt;span class="nt"&gt;--human-color&lt;/span&gt; W &lt;span class="nt"&gt;--simulation-num&lt;/span&gt; 400

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And if you &lt;em&gt;do&lt;/em&gt; want to train your own models, the training loop is highly configurable right from the CLI!&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚔️ The Clash: Alpha-Beta vs. AlphaZero
&lt;/h2&gt;

&lt;p&gt;Because of the decoupled design, hooking the Python AlphaZero model into the Java Gomoku arena takes exactly one line in &lt;code&gt;battle.properties&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;player.white.cmd&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;uv run --project alphazero-board-games python gomoku-battle-alphazero/alphazero_adapter.py --simulation-num=5000&lt;/span&gt;
&lt;span class="py"&gt;player.white.alias&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;AlphaZero&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run the battle, the JavaFX dashboard shows a real-time visualization of the deep learning model outsmarting the classical search algorithm.&lt;/p&gt;

&lt;p&gt;The AlphaZero agent evaluates far fewer positions than the Alpha-Beta agent, but because its neural network has learned an intuition for spatial patterns and influence, its moves are far more strategic.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Try It Yourself!
&lt;/h2&gt;

&lt;p&gt;I built these projects to be hacked on, learned from, and extended.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Want to practice implementing Minimax, Monte Carlo, or a custom heuristic? Fork &lt;strong&gt;&lt;a href="https://github.com/zhixiangli/gomoku-battle" rel="noopener noreferrer"&gt;Gomoku Battle&lt;/a&gt;&lt;/strong&gt;, write a quick script in your favorite language, and see if it can beat my baseline.&lt;/li&gt;
&lt;li&gt;Want to learn how AlphaZero actually works under the hood, or train an AI to play Connect4? Check out &lt;strong&gt;&lt;a href="https://github.com/zhixiangli/alphazero-board-games" rel="noopener noreferrer"&gt;AlphaZero Board Games&lt;/a&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you find the projects interesting or helpful for learning AI and system design, &lt;strong&gt;I'd love it if you gave them a ⭐️ on GitHub!&lt;/strong&gt; Let me know in the comments if you have any questions about the MCTS implementation or the JavaFX integration! Happy coding! 👨‍💻♟️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>alphazero</category>
      <category>deeplearning</category>
      <category>reinforcementlearning</category>
    </item>
  </channel>
</rss>
