<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Stat Phantom</title>
    <description>The latest articles on Forem by Stat Phantom (@stat_phantom).</description>
    <link>https://forem.com/stat_phantom</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3896851%2F7aa0dd7a-fa35-4366-843f-b692329de6cf.png</url>
      <title>Forem: Stat Phantom</title>
      <link>https://forem.com/stat_phantom</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/stat_phantom"/>
    <language>en</language>
    <item>
      <title>2 Lines of Code Saved 6.4x Memory on My Snake AI</title>
      <dc:creator>Stat Phantom</dc:creator>
      <pubDate>Fri, 01 May 2026 06:36:43 +0000</pubDate>
      <link>https://forem.com/stat_phantom/2-lines-of-code-saved-64-memory-on-my-snake-ai-3dhh</link>
      <guid>https://forem.com/stat_phantom/2-lines-of-code-saved-64-memory-on-my-snake-ai-3dhh</guid>
      <description>&lt;p&gt;Greetings all! In my &lt;a href="https://dev.to/stat_phantom/a-cnn-grid-encoding-for-snake-ai-that-doubles-the-best-published-score-245p"&gt;previous post&lt;/a&gt; I covered Binary Plane Encoding, a 3-channel grid representation for Snake that doubled the best published score. Three binary channels: head, body, apple. For details check my previous post.&lt;/p&gt;

&lt;p&gt;But there was a fourth channel I left out. Direction. The snake's current heading, encoded as a uint8 (0 = up, 1 = right, 2 = down, 3 = left), is painted uniformly across a 20×20 plane due to matrix shape requirements. That's 400 elements carrying exactly 2 bits of information. A 1,600× overhead at the channel level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnb8ay2ai377xpuazjvy5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnb8ay2ai377xpuazjvy5.png" alt="Grid of all 2's" width="523" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Worse, that one integer channel with its 2 bits was blocking the entire state from being bit-packed. The other three grid channels are binary, meaning they &lt;em&gt;could&lt;/em&gt; be packed at 1 bit per element. But the direction channel, with its &lt;em&gt;scoffs&lt;/em&gt; 2 bits, can't. So the replay buffer stores the state as uint8 instead of packed bits. One channel, 2 bits, holding back one more step of memory optimisation, forcing 1,600 bytes per state instead of 250 (4 channels × 20 × 20 × 1 byte = 1,600, versus 5 binary channels × 20 × 20 = 2,000 bits, packed 8 per byte = 250).&lt;/p&gt;

&lt;p&gt;This follow-up post is about fixing that, and the pitfalls along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The First Attempt
&lt;/h2&gt;

&lt;p&gt;Four cardinal directions. Two bits encode four states. So the intuitive replacement is two binary channels instead of one integer channel: one bit for North/South, one bit for East/West. Compact, geometric, obvious.&lt;/p&gt;

&lt;p&gt;Except it doesn't work. Walk through it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwr0amhykrkvdoqoqsdq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwr0amhykrkvdoqoqsdq4.png" alt="NESW Diagram" width="800" height="727"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;North and West both map to (0, 0). &lt;strong&gt;Collision&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The failure is subtle because the scheme &lt;em&gt;seems&lt;/em&gt; right. Four directions, four possible bit combinations, should be a clean fit. But the scheme tries to answer "is there a north/south component?" and "is there an east/west component?" Cardinal movement is strictly one-dimensional. The perpendicular component is always exactly zero. What does the E/W bit say when the snake is moving north? It's not moving east. It's also not moving west. Both map to 0. "Not moving east" is identical to "not moving west" in a single bit.&lt;/p&gt;

&lt;p&gt;Two bits should be enough for four directions. They are. Just not &lt;em&gt;those&lt;/em&gt; two bits.&lt;/p&gt;
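&lt;p&gt;To make the collision concrete, here is a minimal sketch. The specific bit assignment below is my own reading of the diagram ("moving south?" and "moving east?"); any assignment of N/S and E/W presence bits collides the same way:&lt;/p&gt;

```python
# Naive scheme: bit 0 = "moving south?", bit 1 = "moving east?"
# (a hypothetical but representative assignment of the N/S + E/W bits)
naive = {
    "north": (0, 0),  # not south, not east
    "south": (1, 0),
    "east":  (0, 1),
    "west":  (0, 0),  # not south, not east -- indistinguishable from north
}

assert naive["north"] == naive["west"]   # the collision
assert len(set(naive.values())) == 3     # four directions, only three codes
```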

&lt;h2&gt;
  
  
  Ask Better Questions
&lt;/h2&gt;

&lt;p&gt;The collision happens because the N/S + E/W scheme asks the wrong questions for cardinal movement. The fix isn't more bits. It's better questions.&lt;/p&gt;

&lt;p&gt;The correct encoding uses two bits derived geometrically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Axis bit:&lt;/strong&gt; which axis is the snake travelling along? (0 = vertical, 1 = horizontal)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sign bit:&lt;/strong&gt; which direction on that axis? (0 = negative, 1 = positive)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pkvnvgzxo3b1epiui0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pkvnvgzxo3b1epiui0a.png" alt="NESW Fixed" width="800" height="733"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All four directions get unique codes. The axis bit answers "which axis?" and the sign bit answers "which end?" Both questions always have exactly one answer for cardinal movement. No ambiguity, no collisions. The specific sign convention (whether north is positive or negative) doesn't matter as long as it's internally consistent. The CNN will learn whatever mapping you give it.&lt;/p&gt;

&lt;p&gt;The first attempt was asking the wrong questions. Once you ask the right ones, two bits is plenty.&lt;/p&gt;
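&lt;p&gt;Enumerating the scheme is a quick sanity check. This sketch uses the same 0=up, 1=right, 2=down, 3=left numbering as earlier, with right and down as the positive ends (an arbitrary but internally consistent convention):&lt;/p&gt;

```python
def encode_direction(d):
    """Map direction d (0=up, 1=right, 2=down, 3=left) to (axis, sign) bits."""
    axis = d % 2              # 0 = vertical, 1 = horizontal
    sign = int(d in (1, 2))   # right and down chosen as the positive directions
    return (axis, sign)

codes = [encode_direction(d) for d in range(4)]
assert len(set(codes)) == 4   # all four directions get unique 2-bit codes
```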

&lt;p&gt;For anyone wondering about diagonal games (8 directions), the axis + sign scheme breaks because a diagonal is on both axes simultaneously. The general solution there is a 4-channel one-hot: one binary plane per cardinal direction, with two planes active for a diagonal. But for Snake, cardinal-only, the 2-channel scheme is the right choice. Don't build the generality you don't need.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Memory Maths
&lt;/h2&gt;

&lt;p&gt;This is where the change pays off. The state goes from &lt;code&gt;(4, 20, 20)&lt;/code&gt; with one integer channel to &lt;code&gt;(5, 20, 20)&lt;/code&gt; with all binary channels. Yes, adding a channel saves memory. That sounds backwards but the maths checks out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (4-channel, uint8 storage):&lt;/strong&gt; 4 × 20 × 20 = 1,600 elements at 1 byte each = 1,600 bytes per state. A 1-million-transition replay buffer (storing both state and next state): &lt;strong&gt;3.2 GB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After (5-channel, binary bit-packed):&lt;/strong&gt; 5 × 20 × 20 = 2,000 elements. Every value is now 0 or 1, so each element can be packed at 1 bit, 8 elements per byte. ⌈2,000 / 8⌉ = 250 bytes per state. The same buffer: &lt;strong&gt;500 MB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6.4× reduction.&lt;/strong&gt; Adding one channel, removing 2.7 GB.&lt;/p&gt;

&lt;p&gt;To put this in perspective: the grid encoding stored naively as float32 (before any compression) would be 6,400 bytes per state, or 12.8 GB for a 1M-transition buffer. The first post's uint8 storage cut that to 3.2 GB (4× reduction). This post's binary bit-packing cuts it again to 500 MB. Across both changes, that's a &lt;strong&gt;25.6× total reduction&lt;/strong&gt; from the uncompressed float32 starting point.&lt;/p&gt;
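&lt;p&gt;The figures above can be checked directly:&lt;/p&gt;

```python
import math

grid = 20 * 20
before = 4 * grid                  # 4 channels, uint8: 1,600 bytes per state
after = math.ceil(5 * grid / 8)    # 5 binary channels, bit-packed: 250 bytes
naive_f32 = 4 * grid * 4           # 4 channels at float32: 6,400 bytes

transitions = 1_000_000
stored = 2 * transitions           # state and next state per transition

assert (before, after, naive_f32) == (1600, 250, 6400)
assert before / after == 6.4              # uint8 vs bit-packed
assert naive_f32 / after == 25.6          # float32 vs bit-packed
assert before * stored == 3_200_000_000   # 3.2 GB buffer
assert after * stored == 500_000_000      # 500 MB buffer
```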

&lt;p&gt;And compared to the pixel-based approaches from the first post? Wei et al.'s RGB inputs would need approximately 49 GB for the same buffer. Binary Plane Encoding with binary cardinal directions brings that to 500 MB. Nearly a &lt;strong&gt;98× difference&lt;/strong&gt;. A 1-million-transition replay buffer now fits comfortably in the VRAM of a gaming laptop, hell, it fits in some EPYC CPU caches (AMD's Genoa-X packs up to 1,152 MB of L3). With pixel inputs, it wouldn't fit on most workstations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Lines of Code
&lt;/h2&gt;

&lt;p&gt;The implementation change is in &lt;code&gt;snake_cnn_env.py&lt;/code&gt;. Replace the single integer direction plane with two binary planes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: one integer channel
# grid[3] = self._direction  # 0, 1, 2, or 3
&lt;/span&gt;
  &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_direction&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# axis: 0=vertical, 1=horizontal
&lt;/span&gt;  &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_direction&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# sign: 0=negative, 1=positive
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Update &lt;code&gt;input_channels&lt;/code&gt; from 4 to 5 in the model config. Done. We now store 5 channels instead of 4, but each channel is 1 bit instead of 8. One extra channel, massively less storage.&lt;/p&gt;

&lt;p&gt;One real cost: changing &lt;code&gt;input_channels&lt;/code&gt; changes the shape of the first convolutional weight tensor. Existing checkpoints can't be loaded into a 5-channel model. This requires a fresh training run, so schedule the change at a natural break point, not mid-experiment.&lt;/p&gt;
&lt;h2&gt;
  
  
  torch.unpackbits Doesn't Exist
&lt;/h2&gt;

&lt;p&gt;The CPU side of bit-packing is trivial. &lt;code&gt;np.packbits&lt;/code&gt; and &lt;code&gt;np.unpackbits&lt;/code&gt; have existed in NumPy since 2010. Pack on write, unpack on read. Done.&lt;/p&gt;
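&lt;p&gt;For the Snake state this is a handful of lines (a minimal sketch with a random 5-channel binary state standing in for the real encoding):&lt;/p&gt;

```python
import numpy as np

# A stand-in 5x20x20 binary state (the real one comes from the environment)
state = np.random.default_rng(0).integers(0, 2, size=(5, 20, 20), dtype=np.uint8)

packed = np.packbits(state)      # flatten, pack 8 elements per byte
assert packed.nbytes == 250      # 2,000 bits in 250 bytes

restored = np.unpackbits(packed)[:state.size].reshape(state.shape)
assert np.array_equal(restored, state)   # lossless round trip
```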

&lt;p&gt;So just implement it on the GPU side, right? WRONG. The natural PyTorch equivalent would be &lt;code&gt;torch.unpackbits&lt;/code&gt;, which... doesn't exist. The function is absent from the stable API entirely, and calling it raises an &lt;code&gt;AttributeError&lt;/code&gt;. This is a genuine gap in PyTorch that anyone implementing binary storage on CUDA will hit.&lt;/p&gt;

&lt;p&gt;The community workaround I found uses bitmasks:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;unpacked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;flip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This works for a flat 1D input. It preserves the original mask values in the intermediate tensor, converts them to 0/1 via &lt;code&gt;.bool().int()&lt;/code&gt;, and flips the bit order to match the MSB-first convention of &lt;code&gt;np.unpackbits&lt;/code&gt;. Four operations, correct output.&lt;/p&gt;

&lt;p&gt;But I don't need to preserve the original mask values; I just need 0s and 1s. I thought I could do better, and I wouldn't be a programmer if I didn't try, for no better reason than... &lt;em&gt;shrugs&lt;/em&gt; I wanted to.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;shifts&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;packed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;unpacked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;packed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;shifts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# (B, packed_size, 8)
&lt;/span&gt;&lt;span class="n"&gt;unpacked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unpacked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n_elems&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="c1"&gt;# drop padding bits
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each packed byte is broadcast against 8 shift values &lt;code&gt;[7, 6, 5, 4, 3, 2, 1, 0]&lt;/code&gt;, right-shifting to move each successive bit into the least significant position. Bitwise &amp;amp; with 1 isolates it. Two operations instead of four. No &lt;code&gt;.bool().int()&lt;/code&gt; needed because &lt;code&gt;&amp;gt;&amp;gt; shift &amp;amp; 1&lt;/code&gt; always yields binary output directly. No &lt;code&gt;.flip()&lt;/code&gt; needed because the descending shift range already produces MSB-first order. Fewer intermediate tensors in VRAM during sampling.&lt;/p&gt;

&lt;p&gt;The mask approach also has a shape bug: it's written for a 1D input (flat array of bytes) and breaks on a batched 2D input &lt;code&gt;(B, packed_size)&lt;/code&gt;. The shift approach handles batched GPU sampling correctly from the start.&lt;/p&gt;
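&lt;p&gt;The shift trick is easy to verify against NumPy's reference implementation. Here is a NumPy sketch of the same logic on a batched input (using &lt;code&gt;np.bitwise_and&lt;/code&gt; in place of the infix mask operator):&lt;/p&gt;

```python
import numpy as np

B, n_elems = 4, 2000
rng = np.random.default_rng(1)
bits = rng.integers(0, 2, size=(B, n_elems), dtype=np.uint8)
packed = np.packbits(bits, axis=1)              # (B, 250)

shifts = np.arange(7, -1, -1, dtype=np.uint8)   # MSB-first, so no flip needed
shifted = packed[..., None] >> shifts           # broadcast to (B, 250, 8)
unpacked = np.bitwise_and(shifted, 1).reshape(B, -1)[:, :n_elems]

assert np.array_equal(unpacked, bits)   # matches np.unpackbits bit order
```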

&lt;p&gt;Both are fully device-resident with no CPU-GPU transfer. But two operations beats four, and not allocating intermediate tensors matters when batch size and state shape are large. Will reducing two ops make a difference? Probably not, but I saw the OPportunity and took it. And yes, I said that just for the joke.&lt;/p&gt;

&lt;p&gt;So, two lines of code changed the state representation to allow bit-packing and saved a lot of storage with no loss of data.&lt;/p&gt;
&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is part of an ongoing series building Rainbow DQN incrementally and measuring each component on Snake. The state representation work runs in parallel to the algorithm comparison. It doesn't change which Rainbow components help or hurt, but a 6.4× memory reduction means larger buffers, more parallel environments, or training on hardware that previously couldn't fit the buffer.&lt;/p&gt;

&lt;p&gt;The algorithm results are the next post.&lt;/p&gt;

&lt;p&gt;If you've hit the &lt;code&gt;torch.unpackbits&lt;/code&gt; gap yourself, or found a cleaner solution than bitwise shifts for GPU-side bit unpacking, I'd like to hear about it in the comments.&lt;/p&gt;



&lt;p&gt;&lt;em&gt;This work is part of ongoing research and the findings are planned to be submitted as a peer-reviewed paper.&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;If you missed the first post in this series:&lt;/p&gt;


&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/stat_phantom/a-cnn-grid-encoding-for-snake-ai-that-doubles-the-best-published-score-245p" class="crayons-story__hidden-navigation-link"&gt;A CNN Grid Encoding for Snake AI That DOUBLES! the Best Published Score&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/stat_phantom" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3896851%2F7aa0dd7a-fa35-4366-843f-b692329de6cf.png" alt="stat_phantom profile" class="crayons-avatar__image" width="800" height="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/stat_phantom" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Stat Phantom
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Stat Phantom
                
              
              &lt;div id="story-author-preview-content-3548283" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/stat_phantom" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3896851%2F7aa0dd7a-fa35-4366-843f-b692329de6cf.png" class="crayons-avatar__image" alt="" width="800" height="800"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Stat Phantom&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/stat_phantom/a-cnn-grid-encoding-for-snake-ai-that-doubles-the-best-published-score-245p" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/stat_phantom/a-cnn-grid-encoding-for-snake-ai-that-doubles-the-best-published-score-245p" id="article-link-3548283"&gt;
          A CNN Grid Encoding for Snake AI That DOUBLES! the Best Published Score
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/machinelearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;machinelearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/deeplearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;deeplearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/cnn"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;cnn&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/stat_phantom/a-cnn-grid-encoding-for-snake-ai-that-doubles-the-best-published-score-245p" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;5&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/stat_phantom/a-cnn-grid-encoding-for-snake-ai-that-doubles-the-best-published-score-245p#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              2&lt;span class="hidden s:inline"&gt; comments&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            10 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;


</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>A CNN Grid Encoding for Snake AI That DOUBLES! the Best Published Score</title>
      <dc:creator>Stat Phantom</dc:creator>
      <pubDate>Sat, 25 Apr 2026 04:39:23 +0000</pubDate>
      <link>https://forem.com/stat_phantom/a-cnn-grid-encoding-for-snake-ai-that-doubles-the-best-published-score-245p</link>
      <guid>https://forem.com/stat_phantom/a-cnn-grid-encoding-for-snake-ai-that-doubles-the-best-published-score-245p</guid>
      <description>&lt;p&gt;A traditional Snake game grid has only 4 states each grid point can be in: empty, head, body, or apple. And for some reason every published Snake AI paper either throws away spatial information by condensing the game state into a handful of hand-picked numbers, or buries entity identity under layers of raw pixel data that the network has to untangle. Incredibly wasteful.&lt;/p&gt;

&lt;p&gt;The solution? Binary Plane Encoding. Using it, a CNN-based model reached a record score of 125 on a 20×20 grid in 2.5 hours on a single RTX 2070, doubling the best published result of 62 (even the model's average score consistently beats that previous best). This post explains the encoding, why it works, and explores why nobody in the Snake DRL space has tried it before.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two Camps
&lt;/h2&gt;

&lt;p&gt;The published literature on deep reinforcement learning for Snake spans 2018 to 2025 and splits into two approaches to state representation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Camp one: hand-crafted feature vectors.&lt;/strong&gt; Sebastianelli et al. (2021) and Kommalapati et al. (2025) both use 11 binary features fed to a fully-connected network. Three danger flags (is there a wall or body segment directly ahead, to the left, to the right), four direction flags (which way is the snake currently heading), and four food-relative flags (is the apple above, below, left, right of the head). The network receives a pre-digested summary of the game state. It never sees the grid. It never learns spatial relationships. A human decided what matters and encoded that decision directly into the input.&lt;/p&gt;

&lt;p&gt;This works well, at least initially. Sebastianelli achieved a best score of 62 on a 20×20 grid with vanilla DQN and this 11-feature representation, using very few resources... but a hard ceiling is quickly reached. The network cannot discover and learn spatial patterns because it never sees the spatial layout. And the features themselves are Snake-specific. Those 11 binary values encode what a Snake expert thinks matters. They would be meaningless for any other game. If you want an agent that can generalise beyond a single environment, this is a dead end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Camp two: raw pixels.&lt;/strong&gt; Wei et al. (2018) and Tushar &amp;amp; Siddique (2022) both train from screenshots. Wei uses 64×64 RGB frames stacked four deep, giving 64×64×12 input. Tushar converts to binary (any non-zero pixel becomes 1) at 84×84, also four frames stacked, giving 84×84×4.&lt;/p&gt;

&lt;p&gt;The pixel approach is game-agnostic, which is its strength. But the cost is significant. Tushar's binary encoding collapses head, body, and apple into a single value. In any individual frame, every occupied cell looks identical. The agent can only figure out what's what by watching how things move across four stacked frames: food stays still, the snake moves. A single frame on its own contains zero identity information. Wei's RGB encoding preserves colour and therefore identity, but at the cost of massive input dimensionality and redundant spatial resolution (64×64 pixels to represent a 20×20 logical grid).&lt;/p&gt;

&lt;p&gt;Both pixel approaches were tested on 12×12 grids, reaching best scores of 17 (Wei) and 20 (Tushar). Neither has been applied to 20×20.&lt;/p&gt;

&lt;p&gt;Beyond the peer-reviewed literature, informal projects show similar patterns. A supervised learning approach on GitHub (Huynh, 2020) uses 7 hand-crafted features with a Keras network and reaches a best of 46, average 22 on 20×20. A Medium article (Schoberg, 2020) compares deterministic algorithms rather than learned policies, reaching 67 on 20×20 with a collision-avoiding shortest-path algorithm (no neural network involved at all).&lt;/p&gt;

&lt;p&gt;Across all of it, every neural network approach uses either compressed feature vectors or raw pixel grids.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gap
&lt;/h2&gt;

&lt;p&gt;Here is the part that surprised me. Multi-channel grid encoding is not a new idea. It is the standard state representation in board game AI.&lt;/p&gt;

&lt;p&gt;AlphaZero (Silver et al., 2018) represents chess, Go, and Shogi as multi-channel binary planes. Each piece type, colour, and game-state feature gets its own channel. The network receives a spatial tensor where every channel encodes a different semantic category of information about the board. MuZero extends this. The representation is well-established, well-understood, and has been proven at the highest levels of game AI.&lt;/p&gt;

&lt;p&gt;Snake fundamentally runs on a grid with a fixed set of positions entities can occupy. It mirrors the exact class of problem where channel-per-entity encoding has proven effective, yet no published Snake DRL paper, and no self-published project I have found, attempts this representation. (Its absence from published papers isn't surprising to me, though. Having gone through over 2,100 papers this month, I can say most papers simply follow pre-existing trends.)&lt;/p&gt;

&lt;p&gt;All of the pre-existing Snake DRL literature either pre-computes features and discards spatial representation, or captures raw pixels and forces the network to spend capacity on visual processing before it can even begin to learn the game.&lt;/p&gt;

&lt;p&gt;This is the gap. Not a novel encoding technique, but an established one applied to a domain that has ignored it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Encoding
&lt;/h2&gt;

&lt;p&gt;The state representation is a 3×20×20 binary tensor (channels first, matching the code below). Three channels, each covering the full grid:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Channel 0 (head):&lt;/strong&gt; 1 at the head position, 0 everywhere else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Channel 1 (body):&lt;/strong&gt; 1 at each body segment position, 0 elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Channel 2 (apple):&lt;/strong&gt; 1 at the apple position, 0 everywhere else.&lt;/p&gt;

&lt;p&gt;Every value is exactly 0 or 1. A single frame provides complete, unambiguous game state. Where is the head, where is the body, where is the food. No temporal stacking required. No entity disambiguation through motion inference. No feature engineering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9aay7yiqtp8qa9vzz7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9aay7yiqtp8qa9vzz7r.png" alt="Visual Representation of Encoding Layers" width="567" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The construction from game state is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encode_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head_pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body_positions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;apple_pos&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Channel 0: head
&lt;/span&gt;    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head_pos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;head_pos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="c1"&gt;# Channel 1: body
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;segment&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;body_positions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;segment&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;segment&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="c1"&gt;# Channel 2: apple
&lt;/span&gt;    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;apple_pos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;apple_pos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That produces 20×20×3 = 1,200 values per state. Compare that to the pixel approaches: Tushar's binary encoding produces 84×84×4 = 28,224 values (23× larger), and Wei's RGB produces 64×64×12 = 49,152 values (41× larger). The grid encoding captures strictly more semantic information in a fraction of the space.&lt;/p&gt;
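&lt;p&gt;The arithmetic behind those ratios is worth making explicit (a standalone sketch; the shapes come straight from the papers compared above):&lt;/p&gt;

```python
# Values per state, stored as uint8 (one byte each).
bpe_vals    = 20 * 20 * 3    # Binary Plane Encoding
tushar_vals = 84 * 84 * 4    # Tushar's stacked 84x84 binary frames
wei_vals    = 64 * 64 * 12   # Wei's 64x64x12 RGB input

print(f"BPE: {bpe_vals} values")
print(f"Binary pixels: {tushar_vals} values ({tushar_vals / bpe_vals:.1f}x)")
print(f"RGB pixels: {wei_vals} values ({wei_vals / bpe_vals:.1f}x)")
```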

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjsxg3bak0moekn59c3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjsxg3bak0moekn59c3u.png" alt="Memory Usage Comparison" width="713" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The information hierarchy makes this concrete:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Entity identity per frame&lt;/th&gt;
&lt;th&gt;Full spatial layout&lt;/th&gt;
&lt;th&gt;Game-agnostic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Binary Plane Encoding (this model)&lt;/td&gt;
&lt;td&gt;Yes, perfect&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Partial (any grid game)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RGB pixels (Wei et al.)&lt;/td&gt;
&lt;td&gt;Yes, via colour&lt;/td&gt;
&lt;td&gt;Approximate&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary pixels (Tushar)&lt;/td&gt;
&lt;td&gt;No (needs 4 frames)&lt;/td&gt;
&lt;td&gt;Approximate&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature vectors (Sebastianelli)&lt;/td&gt;
&lt;td&gt;Yes, pre-computed&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No (Snake-specific)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Binary Plane Encoding is the only representation in the reviewed literature that provides perfect entity identity, full spatial layout, and game-agnostic structure without additional processing.&lt;/p&gt;

&lt;h2&gt;The CNN Architecture&lt;/h2&gt;

&lt;p&gt;The model processing this encoding is deliberately compact:&lt;/p&gt;

&lt;p&gt;Two convolutional layers with 32 and 64 channels respectively, 3×3 kernels with same padding, followed by a single MaxPool2d that halves the spatial dimensions from 20×20 to 10×10. Two dense layers of 512 and 256 units. Mish activation throughout.&lt;/p&gt;

&lt;p&gt;The network also uses a dueling architecture (separate value and advantage streams) and NoisyLinear layers replacing standard linear layers in the fully-connected head, providing learned exploration noise instead of epsilon-greedy.&lt;/p&gt;

&lt;p&gt;This is not a large network. It doesn't need to be. The compact input representation means the convolutional backbone doesn't need depth. Two 3×3 layers with a single pooling stage are sufficient to capture the spatial relationships that matter in a 20×20 grid: proximity to walls, body segment density in nearby regions, and relative food position. The encoding has already done the hard work of structuring the information. The CNN just needs to read it.&lt;/p&gt;
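&lt;p&gt;A rough PyTorch sketch of the backbone described above (the dueling streams and NoisyLinear layers are omitted for brevity, and the four-action output head is an assumption, one logit per direction):&lt;/p&gt;

```python
import torch
import torch.nn as nn

class SnakeCNN(nn.Module):
    """Minimal sketch of the compact backbone described above.

    Omits the dueling value/advantage streams and the NoisyLinear
    layers; a plain 4-action Q-value head stands in for them.
    """
    def __init__(self, grid_size=20, n_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # 3x20x20 -> 32x20x20
            nn.Mish(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # -> 64x20x20
            nn.Mish(),
            nn.MaxPool2d(2),                              # -> 64x10x10
        )
        flat = 64 * (grid_size // 2) ** 2                 # 6,400 for a 20x20 grid
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 512), nn.Mish(),
            nn.Linear(512, 256), nn.Mish(),
            nn.Linear(256, n_actions),
        )

    def forward(self, x):
        # x: (batch, 3, grid, grid) uint8 planes; cast to float for the convs
        return self.head(self.features(x.float()))
```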

&lt;h2&gt;Previous Records&lt;/h2&gt;

&lt;p&gt;The meaningful comparisons are grouped by grid size, since raw scores are not directly comparable across different board dimensions.&lt;/p&gt;

&lt;h3&gt;20×20 Grid&lt;/h3&gt;

&lt;p&gt;The only published peer-reviewed result on a 20×20 Snake grid is Sebastianelli et al. (2021). They used an MLP with 11 hand-crafted binary features and vanilla DQN, testing 13 hyperparameter configurations across evaluation runs. Their best single score was &lt;strong&gt;62&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This work, using Binary Plane Encoding with a CNN and Rainbow DQN (incorporating C51 distributional output, dueling architecture, noisy exploration, prioritised replay, and 3-step returns), achieved a record of &lt;strong&gt;125&lt;/strong&gt; on the same grid, more than double.&lt;/p&gt;

&lt;p&gt;This isn't a cherry-picked peak. Across 55,000 episodes of sustained training, the rolling average holds between 60 and 70, and the median between 64 and 74. Sebastianelli's best single game of 62 sits below this model's average. The p10 floor (the score that 90% of episodes exceed) holds around 30, meaning even the worst games routinely outperform most published baselines. The p90 reaches into the high 90s, with individual episodes regularly breaking 100. Training to this point took approximately 2.5 hours on a single RTX 2070.&lt;/p&gt;

&lt;p&gt;An important caveat: this is not an encoding-only comparison. The improvement comes from changes across multiple axes simultaneously. State representation (grid encoding vs feature vector), architecture (CNN vs MLP), algorithm (Rainbow DQN vs vanilla DQN), and training scale (2048 parallel environments vs a smaller setup). The encoding is the enabling change that made the architecture and training scale feasible on consumer hardware, but the doubling should not be attributed to the encoding alone.&lt;/p&gt;

&lt;h3&gt;12×12 Grid&lt;/h3&gt;

&lt;p&gt;Direct score comparison across grid sizes doesn't work because a 12×12 grid has a maximum possible score of approximately 141 food items versus approximately 399 for 20×20. Board coverage (score divided by maximum possible) provides a normalised metric:&lt;/p&gt;
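&lt;p&gt;The normalisation is simple to compute directly (a sketch using the approximate maxima quoted above, ~141 for 12×12 and ~399 for 20×20):&lt;/p&gt;

```python
# Board coverage = best score / approximate maximum score for the grid.
results = [
    ("Wei et al., 12x12",            17, 141),
    ("Tushar and Siddique, 12x12",   20, 141),
    ("Sebastianelli et al., 20x20",  62, 399),
    ("This model, 20x20",           125, 399),
]
for name, score, max_score in results:
    print(f"{name}: {100 * score / max_score:.0f}% coverage")
```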

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Work&lt;/th&gt;
&lt;th&gt;Grid&lt;/th&gt;
&lt;th&gt;Best Score&lt;/th&gt;
&lt;th&gt;Board Coverage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Wei et al. (2018)&lt;/td&gt;
&lt;td&gt;12×12&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;~12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tushar &amp;amp; Siddique (2022)&lt;/td&gt;
&lt;td&gt;12×12&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;~14%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sebastianelli et al. (2021)&lt;/td&gt;
&lt;td&gt;20×20&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;~16%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;This model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;20×20&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;125&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~31%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gap persists across normalisation. At 31% board coverage, this approach covers roughly double the grid fraction of the nearest published result and more than double the pixel-based CNN approaches.&lt;/p&gt;

&lt;h3&gt;Informal results (not peer-reviewed)&lt;/h3&gt;

&lt;p&gt;For completeness: a supervised learning project (Huynh, 2020) on 20×20 achieved a best of 46, and a deterministic shortest-path algorithm (Schoberg, 2020) reached 67 on 20×20. The latter is not a learned policy. Neither is peer-reviewed.&lt;/p&gt;

&lt;h2&gt;Why It Works&lt;/h2&gt;

&lt;p&gt;The encoding's advantage operates on two levels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Information quality.&lt;/strong&gt; The network receives exactly the information it needs to play Snake, in a spatial format that CNNs are designed to process, with zero noise or redundancy. Each channel answers one question: where is the head, where is the body, where is the food. There is no ambiguity to resolve, no motion to infer, no irrelevant visual detail to filter out.&lt;/p&gt;

&lt;p&gt;Pixel inputs force the network to first learn to segment the image (which cells are the snake's body and which are background) before it can learn the spatial relationships between the segments. With Binary Plane Encoding, that segmentation is pre-constructed, leaving the network free to devote its entire capacity to learning the actual game instead of learning how to see in the first place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Information density.&lt;/strong&gt; At 1,200 values per state stored as uint8, a replay buffer holding 1,000,000 transitions fits comfortably in approximately 1.6GB of VRAM. This made a GPU-resident replay buffer and 2048 parallel environments possible on a single RTX 2070 with 8GB of VRAM.&lt;/p&gt;

&lt;p&gt;For comparison, storing Tushar's 84×84×4 binary inputs at the same buffer capacity would need approximately 28GB. Wei's 64×64×12 RGB inputs would need approximately 49GB. Neither fits on consumer hardware. You would need multiple high-end GPUs or cloud infrastructure to achieve the same training scale with pixel-based inputs.&lt;/p&gt;
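&lt;p&gt;A back-of-the-envelope check on those buffer sizes (raw state arrays only; actions, rewards, next-state copies, and the prioritised-replay tree add bookkeeping on top of the raw figures below):&lt;/p&gt;

```python
# Raw state storage for a 1,000,000-transition replay buffer, one byte per value.
# States only: the surrounding bookkeeping pushes real usage somewhat higher.
BUFFER = 1_000_000
encodings = [
    ("Binary Plane Encoding (20x20x3)", 20 * 20 * 3),
    ("Binary pixels (84x84x4)",         84 * 84 * 4),
    ("RGB pixels (64x64x12)",           64 * 64 * 12),
]
for name, values_per_state in encodings:
    gb = values_per_state * BUFFER / 1e9
    print(f"{name}: {gb:.1f} GB")
```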

&lt;p&gt;The compact encoding didn't just improve information quality. It made the training infrastructure possible. 2048 parallel environments with a GPU-resident buffer meant the replay buffer reached useful diversity faster, the distributional RL gradient signal had richer data to work with, and the agent surpassed all previous records before reaching 100,000 training episodes.&lt;/p&gt;

&lt;h2&gt;Honest Caveats&lt;/h2&gt;

&lt;p&gt;This encoding is a &lt;strong&gt;privileged state representation&lt;/strong&gt;. The agent receives information extracted directly from the game's internal data structures: exact head position, exact body segment positions, exact apple position. A human player has access to the same logical information through visual perception, but this agent receives it pre-structured without any perceptual processing.&lt;/p&gt;

&lt;p&gt;The model plateaued at 125 (the record held across more than 50,000 further episodes), but a subsequent run using a variant algorithm has already broken it, so this isn't the ceiling for the encoding. The more interesting question is whether pixel-based approaches could ever reach these scores given enough compute. Theoretically yes, but whether it's achievable in practice is unknown. Imperfections in the visual pipeline may compound through training, but that hypothesis hasn't been tested, and the performance cost of segmentation quality on Snake hasn't been quantified. Whether the gap is recoverable or structural is an open question, and one worth testing properly. If you take this on, I'd love to see what you find.&lt;/p&gt;

&lt;p&gt;Cross-paper comparisons to Sebastianelli et al. and the pixel-based approaches should be read with the privileged state in mind. The improvement reflects the combined effect of encoding quality, architecture, algorithm, and training scale. Isolating each factor's individual contribution is the purpose of the ablation study this encoding supports.&lt;/p&gt;

&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;Binary Plane Encoding is the foundation for a systematic ablation study on Rainbow DQN applied to Snake. The study adds one component at a time (Double DQN, noisy exploration, dueling architecture, prioritised experience replay, C51 distributional output), measuring each component's individual contribution in a dense-reward, vectorised-environment setting.&lt;/p&gt;

&lt;p&gt;Early results have already produced some surprises about which Rainbow components help and which ones hurt on a task like Snake. That is the next post.&lt;/p&gt;

&lt;p&gt;If you have experience with alternative state representations for grid-based game AI, or if you have seen Binary Plane Encoding applied to Snake in work I haven't found, I'd genuinely like to hear about it in the comments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This work is part of ongoing research; I plan to submit the findings as a peer-reviewed paper.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;References&lt;/h2&gt;

&lt;h3&gt;Peer-Reviewed&lt;/h3&gt;

&lt;p&gt;Sebastianelli et al. (2021) - "A Deep Q-Learning based approach applied to the Snake game" - 29th Mediterranean Conference on Control and Automation (MED). &lt;a href="https://doi.org/10.1109/MED51440.2021.9480232" rel="noopener noreferrer"&gt;DOI: 10.1109/MED51440.2021.9480232&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kommalapati et al. (2025) - "Building an AI Snake Powered by Deep Reinforcement Learning and Deep Q-Learning" - IEEE 7th International Symposium on Advanced Electrical and Communication Technologies (ISAECT). &lt;a href="https://doi.org/10.1109/ISAECT68904.2025.11318716" rel="noopener noreferrer"&gt;DOI: 10.1109/ISAECT68904.2025.11318716&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wei et al. (2018) - "Autonomous Agents in Snake Game via Deep Reinforcement Learning" - IEEE International Conference on Agents (ICA), Singapore. &lt;a href="https://doi.org/10.1109/AGENTS.2018.8460004" rel="noopener noreferrer"&gt;DOI: 10.1109/AGENTS.2018.8460004&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tushar &amp;amp; Siddique (2022) - "A Memory Efficient Deep Reinforcement Learning Approach For Snake Game Autonomous Agents" - IEEE 16th International Conference on Application of Information and Communication Technologies (AICT). &lt;a href="https://doi.org/10.1109/AICT55583.2022.10013603" rel="noopener noreferrer"&gt;DOI: 10.1109/AICT55583.2022.10013603&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Silver et al. (2018) - "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play" - Science 362, 1140-1144. &lt;a href="https://doi.org/10.1126/science.aar6404" rel="noopener noreferrer"&gt;DOI: 10.1126/science.aar6404&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Informal / Community Work&lt;/h3&gt;

&lt;p&gt;Huynh (2020) - Supervised learning Snake AI. &lt;a href="https://github.com/TimHuynh0905/snake-ai" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Schoberg (2020) - Deterministic algorithms for Snake. &lt;a href="https://medium.com/analytics-vidhya/playing-snake-with-ai-2ea68f0e914a" rel="noopener noreferrer"&gt;Medium Article&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>cnn</category>
    </item>
  </channel>
</rss>
