<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Daniel King</title>
    <description>The latest articles on Forem by Daniel King (@danfking).</description>
    <link>https://forem.com/danfking</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3892219%2Fbc627dae-2978-4190-b262-46d328d917e7.jpg</url>
      <title>Forem: Daniel King</title>
      <link>https://forem.com/danfking</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/danfking"/>
    <language>en</language>
    <item>
      <title>I trained a sprite model with agents. The data was the bottleneck.</title>
      <dc:creator>Daniel King</dc:creator>
      <pubDate>Wed, 06 May 2026 09:47:04 +0000</pubDate>
      <link>https://forem.com/danfking/i-trained-a-sprite-model-with-agents-the-data-was-the-bottleneck-16k6</link>
      <guid>https://forem.com/danfking/i-trained-a-sprite-model-with-agents-the-data-was-the-bottleneck-16k6</guid>
      <description>&lt;p&gt;I just published &lt;a href="https://github.com/danfking/pixel-llm" rel="noopener noreferrer"&gt;pixel-llm&lt;/a&gt;, a small autoregressive transformer that generates 32x32 pixel art sprites of reef sea creatures. About 2.9 million parameters, a 64-colour palette, runs on consumer hardware. Built end to end through agent sessions, with me steering rather than typing.&lt;/p&gt;

&lt;p&gt;The output is sub-par. I am sharing it anyway, because the way it failed taught me something I did not expect.&lt;/p&gt;

&lt;p&gt;The setup was narrow on purpose. I picked sea creatures because the visual vocabulary is constrained: a few zones (shallows, twilight, midnight, abyss, hadal) and a few categories (reef fish, grazer, coral, jellyfish, cephalopod, plus an abyssal catch-all). A small, well-defined domain felt like the right shape for a small model. Six categories, five zones, thirty cells in the grid. Tractable on paper.&lt;/p&gt;

&lt;p&gt;The model itself fell out fast. Agents wrote the transformer, the KV-cache inference loop, the sprite breeding via partial completion, and the post-process palette-aware shader. That last piece is the strongest output. The model produces flat colour-indexed sprites and a separate procedural shader applies directional light and ambient occlusion, staying inside the 64-colour palette by walking pre-computed luminance ramps.&lt;/p&gt;
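&lt;p&gt;The ramp walk is the whole trick, so here is a minimal sketch of it. The table and function names are illustrative, not the repo's API: each ramp is a dark-to-light run of palette indices, and shading shifts an index along its own ramp, clamping at the ends so the output can never leave the palette.&lt;/p&gt;

```python
# Hypothetical sketch of palette-aware shading via luminance ramps.
# Each ramp is a list of palette indices ordered dark to light.
RAMPS = [
    [0, 1, 2, 3],   # e.g. a blue ramp
    [4, 5, 6],      # e.g. a coral ramp
]

# Reverse lookup: palette index to (ramp id, position within that ramp).
RAMP_POS = {}
for r, ramp in enumerate(RAMPS):
    for pos, idx in enumerate(ramp):
        RAMP_POS[idx] = (r, pos)

def shade(index, delta):
    """Shift a palette index by delta steps along its luminance ramp
    (positive = lighter), clamping at the ramp ends."""
    r, pos = RAMP_POS[index]
    ramp = RAMPS[r]
    new_pos = max(0, min(len(ramp) - 1, pos + delta))
    return ramp[new_pos]
```

&lt;p&gt;The lighting pass then only has to compute a per-pixel delta from light direction and occlusion; the ramp walk guarantees the result is still one of the 64 colours.&lt;/p&gt;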

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froycgkk8xa0s3rxydrvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froycgkk8xa0s3rxydrvh.png" alt="Flat versus shaded sprites across zones and categories" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where the categories worked, you can see what I was after. Where they did not, you can see that too: two of the six categories (cephalopod and one abyssal column) never converged. Pure noise, regardless of sampling temperature.&lt;/p&gt;

&lt;p&gt;I iterated on the training data four times. A procedural synthetic generator. Wikimedia Commons photographs, downloaded and palette-quantised. Sprite-sheet extraction from OpenGameArt. A mixed corpus stitched together from all three. The validation loss kept going down. The samples for those two categories kept looking wrong. The other four held up well enough to look at.&lt;/p&gt;
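&lt;p&gt;For the photograph corpus, palette quantisation can be sketched as nearest-colour mapping in RGB space. This is a hypothetical sketch, not the repo's pipeline:&lt;/p&gt;

```python
# Hypothetical sketch: quantise RGB pixels to a fixed palette by
# nearest colour under squared Euclidean distance in RGB space.
def nearest_palette_index(pixel, palette):
    r, g, b = pixel
    def dist(entry):
        i, (pr, pg, pb) = entry
        return (r - pr) ** 2 + (g - pg) ** 2 + (b - pb) ** 2
    return min(enumerate(palette), key=dist)[0]

def quantise(image, palette):
    """Map an RGB image (rows of (r, g, b) tuples) to a grid of
    palette indices, one per pixel."""
    return [[nearest_palette_index(px, palette) for px in row] for row in image]
```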

&lt;p&gt;That is the part I want to flag. Loss is not taste. The agentic loop has a fast, local correctness signal for the code: does it run, does the loss go down, does it not crash. It does not have a corresponding signal for the data. Whether a corpus is the right shape for a problem is a slow, aesthetic judgment that arrives after a training run, after staring at sample grids, after a cycle measured in hours rather than seconds. Agents cannot close that loop on their own yet.&lt;/p&gt;

&lt;p&gt;So the work split cleanly. The model code, training scaffold, sampler, breeder, and shader were straightforward agent output. The data choices were the part where I had to keep showing up.&lt;/p&gt;

&lt;p&gt;This connects back to something I wrote about in April: when agents take over execution, the premium activity is the layer above. For a coding agent that layer is verification. For a research-flavoured agent loop, it is data curation: deciding what the model should see, recognising when the existing corpus is wrong, and recognising when the iteration has hit its ceiling.&lt;/p&gt;

&lt;p&gt;Knowing when to stop is itself the call. After the fourth dataset I judged that the agentic loop had run out of useful moves for this architecture. The next step would not be more data, it would be a different model shape. I called time, wrote the README honestly, and shipped.&lt;/p&gt;

&lt;p&gt;The repo is up at &lt;a href="https://github.com/danfking/pixel-llm" rel="noopener noreferrer"&gt;github.com/danfking/pixel-llm&lt;/a&gt; with the sample images and a fuller writeup. The interesting thing in there is not the trained model. It is the trail.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>agents</category>
      <category>pixelart</category>
    </item>
    <item>
      <title>Verification is the expensive thing now</title>
      <dc:creator>Daniel King</dc:creator>
      <pubDate>Thu, 23 Apr 2026 09:59:06 +0000</pubDate>
      <link>https://forem.com/danfking/verification-is-the-expensive-thing-now-2lag</link>
      <guid>https://forem.com/danfking/verification-is-the-expensive-thing-now-2lag</guid>
      <description>&lt;p&gt;Martin Fowler's &lt;a href="https://martinfowler.com/fragments/2026-04-02.html" rel="noopener noreferrer"&gt;latest fragments post&lt;/a&gt; collects several ideas about how AI is reshaping software development. The one that stuck with me is Ajey Gore's argument: as coding agents take over execution, verification becomes the premium activity.&lt;/p&gt;

&lt;p&gt;Gore puts it bluntly. Instead of ten engineers building, you might have three engineers plus seven people defining acceptance criteria and designing tests. The bottleneck moves from "can we write the code?" to "do we know whether the code is right?"&lt;/p&gt;

&lt;p&gt;This matches what I see daily. I run multiple Claude Code sessions in parallel, each producing working code at a pace I couldn't match alone. The hard part is never the generation. The hard part is knowing whether what came out actually does what I intended, handles the edges I care about, and doesn't quietly break something else. And it's not just me who needs to know. My team members need to look at that same output and reach the same confidence, often without the context I had when I prompted it.&lt;/p&gt;

&lt;p&gt;The cultural shift Gore describes is the part most teams will struggle with. Your Monday standup changes. Instead of "what did we ship?" the question becomes "what did we validate?" Instead of tracking output, you're tracking whether the output was right. That reframes what it means to be productive. An engineer who catches a subtle misalignment in generated code before it ships has done more valuable work than one who prompted three features into existence without checking them.&lt;/p&gt;

&lt;p&gt;This connects to something else Fowler highlights in the same post: Margaret-Anne Storey's concept of "intent debt," where the goals guiding a system are poorly documented or maintained. If you can't clearly articulate what the system should do, you can't verify that it does it. Intent debt was always a problem, but it was partially hidden when the same person writing the code also held the intent in their head. When an agent writes the code, that implicit knowledge gap becomes a concrete failure mode.&lt;/p&gt;

&lt;p&gt;I think the teams that figure out verification workflows early will have a real advantage. Not just automated tests (though those matter), but the whole practice of clearly stating intent, reviewing output critically, and building confidence that what shipped is what was meant.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentdev</category>
      <category>productivity</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>🔔 Small productivity hack that's changed how I work with Claude Code</title>
      <dc:creator>Daniel King</dc:creator>
      <pubDate>Wed, 22 Apr 2026 10:05:16 +0000</pubDate>
      <link>https://forem.com/danfking/small-productivity-hack-thats-changed-how-i-work-with-claude-code-5014</link>
      <guid>https://forem.com/danfking/small-productivity-hack-thats-changed-how-i-work-with-claude-code-5014</guid>
      <description>&lt;p&gt;I typically have half a dozen Claude Code sessions running at once, spread across different terminals and monitors, some hidden behind other windows. The visual "done" indicator is easy to miss when you're not looking at the right terminal.&lt;/p&gt;

&lt;p&gt;About a month ago I added a global hook that plays a short chime whenever any session finishes a task. Two minutes to configure. Can't live without it now.&lt;/p&gt;

&lt;p&gt;The difference is about flow. Before, I'd either stare at a terminal waiting, or context-switch and then keep interrupting myself to check which session was ready. Now the chime pulls me back at exactly the right moment. I stay in whatever I'm doing until I hear it, then go find the session that needs me. It's kept me in a flow state in a way I genuinely didn't expect from something so simple.&lt;/p&gt;

&lt;h2&gt;How to set it up&lt;/h2&gt;

&lt;p&gt;Drop this into your &lt;code&gt;~/.claude/settings.json&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Notification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"powershell.exe -NoProfile -Command &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;(New-Object Media.SoundPlayer 'C:&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;Windows&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;Media&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;chimes.wav').PlaySync()&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Windows has the sound file built in. For Mac/Linux, swap the command for &lt;code&gt;afplay&lt;/code&gt; or &lt;code&gt;paplay&lt;/code&gt; with a sound file of your choice.&lt;/p&gt;
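&lt;p&gt;For example, a macOS version of the same hook might swap the command for &lt;code&gt;afplay&lt;/code&gt; with one of the stock system sounds (the path below assumes a default install):&lt;/p&gt;

```json
{
  "hooks": {
    "Notification": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "afplay /System/Library/Sounds/Glass.aiff"
          }
        ]
      }
    ]
  }
}
```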

&lt;p&gt;The empty matcher means it fires on every notification, regardless of which project or session triggered it. Claude Code sends a notification whenever a session finishes and is waiting for input, which is exactly the moment you want to know about.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>productivity</category>
      <category>ai</category>
      <category>developer</category>
    </item>
  </channel>
</rss>
