<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nic Lydon</title>
    <description>The latest articles on Forem by Nic Lydon (@niclydon).</description>
    <link>https://forem.com/niclydon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3908081%2F03e413b1-9c57-4561-9b6e-3ee8b60bc188.png</url>
      <title>Forem: Nic Lydon</title>
      <link>https://forem.com/niclydon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/niclydon"/>
    <language>en</language>
    <item>
      <title>The Machine Zone: Ignition</title>
      <dc:creator>Nic Lydon</dc:creator>
      <pubDate>Sat, 02 May 2026 15:40:40 +0000</pubDate>
      <link>https://forem.com/niclydon/the-machine-zone-ignition-4p2k</link>
      <guid>https://forem.com/niclydon/the-machine-zone-ignition-4p2k</guid>
      <description>&lt;p&gt;I installed Claude Code on March 12. Thirty-five days later I had written 557,000 lines of code across fifteen repositories that had not existed before. None of them had existed in my name before March 17. I had never owned a git repository. I typed &lt;code&gt;git init&lt;/code&gt; on something of my own for the first time in my life on a Tuesday morning at 11:46 a.m. Eastern. Four weeks later I had fifteen repositories and half a million lines.&lt;/p&gt;

&lt;p&gt;I work in cybersecurity. I am forty-five years old. I have been working in technology for twenty years. I know what sustainable output looks like. This was not it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0bm6gxrqyqml7ge10fb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0bm6gxrqyqml7ge10fb.png" alt=" " width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The largest single project is ARIA (Adaptive Responsive Intelligent Assistant), my personal assistant and behavioral-DNA tool, at 1,033 commits and 220,536 net lines. Second is Nexus, a centralized data lake that stitches my iMessage, Gmail, health, and calendar history together, at 369 commits. Third is Chancery, an agent-orchestration and observability layer, at 322 commits. Fourth is niclydon.com, a redesign of my personal site, at 164 commits. Forge, my home-lab LLM gateway, runs on a stack I rebuilt inside the window. Broadside, an AI drafting pipeline, reads from CHANGES.md and git history across every project I maintain and writes posts in my voice.&lt;/p&gt;

&lt;p&gt;These are not side projects. This is my whole stack, remade. They function. They save me time. Some of them save my family time. I am not going to stand here and say I regret what I built, because I do not regret what I built.&lt;/p&gt;

&lt;p&gt;What I regret, insofar as I regret anything, is the rate at which this happened.&lt;/p&gt;

&lt;h2&gt;The mechanic&lt;/h2&gt;

&lt;p&gt;Claude Code is an agentic coding assistant that runs in your terminal. You describe what you want to build, and it writes the code, runs tests, fixes bugs, and commits the results. The interaction is conversational: you type a request, it takes actions, you see the results, you respond. Each exchange takes seconds.&lt;/p&gt;

&lt;p&gt;The pattern that emerged was simple: I would ask for a feature. It would build it. I would see something adjacent that needed fixing. I would ask for that. It would fix it. I would notice something else. The next request was always one keystroke away. The gap between wanting the next action and getting it rounded to whatever the API latency was — usually under two seconds.&lt;/p&gt;

&lt;p&gt;This is what I mean by "the loop." Not a metaphor. The literal interaction pattern: request, response, next request. Variable-ratio reinforcement with near-zero latency.&lt;/p&gt;

&lt;h2&gt;What the body said&lt;/h2&gt;

&lt;p&gt;I have the Apple Health export. The shell history. The git metadata. The Claude Code session transcripts. The billing records. All of it lives on disk. I pulled the numbers because adjectives rot and numbers are harder to argue with.&lt;/p&gt;

&lt;p&gt;Baseline window is January 16 through March 11. The ignition week is March 12 through March 19.&lt;/p&gt;
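
&lt;p&gt;If you want to run the same pull on your own data, the step-count piece is small. What follows is a sketch, not my actual pipeline: it assumes the stock Apple Health &lt;code&gt;export.xml&lt;/code&gt; layout, and it naively sums overlapping iPhone and Watch records, so dedupe by source before you trust the output.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of the step-count pull. Assumes the stock Apple Health
# export.xml; overlapping iPhone/Watch records will double-count.
import xml.etree.ElementTree as ET
from collections import defaultdict
from statistics import median

STEPS = "HKQuantityTypeIdentifierStepCount"
daily = defaultdict(float)

# iterparse keeps memory flat; export.xml can run to gigabytes
for _, el in ET.iterparse("export.xml"):
    if el.tag == "Record" and el.get("type") == STEPS:
        daily[el.get("startDate")[:10]] += float(el.get("value"))
    el.clear()

def window_median(start, end):
    return median(v for d, v in daily.items() if start &lt;= d &lt;= end)

print("baseline median:", window_median("2026-01-16", "2026-03-11"))
print("ignition median:", window_median("2026-03-12", "2026-03-19"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;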

&lt;p&gt;&lt;strong&gt;Steps.&lt;/strong&gt; Baseline median: 12,250 per day. Ignition week median: 1,636.5 per day. That is an &lt;strong&gt;86.6 percent drop&lt;/strong&gt;. The single lowest day in my ninety-day record is March 19 at &lt;strong&gt;243 steps&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sleep.&lt;/strong&gt; Baseline median nightly sleep: 5.88 hours. Five of the eight nights in the ignition week have &lt;strong&gt;no primary sleep detected at all&lt;/strong&gt;. The Apple Watch could not find a contiguous block long enough to call it a night. My longest bracketed gap without meaningful sleep during the week is approximately forty-eight hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sleep midpoint.&lt;/strong&gt; Baseline median: 3:39 a.m. Ignition median: 5:43 a.m. A shift of two hours and four minutes. Wake time moved four hours and forty-eight minutes later. Bedtime moved twelve minutes later. I was not going to bed earlier and sleeping in. I was going to bed at roughly my old clock and not waking up until much further into the morning. The body was compensating in the one direction it had left.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heart rate variability.&lt;/strong&gt; Baseline median: 74.7 ms. Ignition median: 66.4 ms. An eleven percent drop, and no recovery thirty days later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Photos taken.&lt;/strong&gt; Baseline median: 104 per day. Ignition median: three per day. A &lt;strong&gt;97.2 percent drop in life-documentation activity over one week&lt;/strong&gt;. The away-from-home share collapsed harder: 87 percent of my baseline photos were taken somewhere other than my house. During the ignition week: 25 percent.&lt;/p&gt;

&lt;p&gt;The phone stopped going places.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxye51cxp8asbgqwvsn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxye51cxp8asbgqwvsn.png" alt=" " width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The receipts&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;March 12, 9:19 p.m. ET.&lt;/strong&gt; Anthropic welcome email. "Ship your first commit in 5 minutes."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 13, 8:56 p.m. ET.&lt;/strong&gt; First API credit cutoff. I had been running the agent for less than twenty-four hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 13, 9:23 p.m. ET.&lt;/strong&gt; $95.63 API credit top-up. Twenty-seven minutes after the cutoff. I did not reflect. I bought more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 17, 11:46 a.m. ET.&lt;/strong&gt; The first git commit in the history of any repository I have ever owned. I have shipped code at work, on contract, as a hobbyist. I had never run &lt;code&gt;git init&lt;/code&gt; on my own machine and lived with the result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 17, later that day.&lt;/strong&gt; Apple receipt for Claude Max 20x at $249.99. I pivot from Pro to Max later that same Tuesday. The commitment is made with a tap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 19.&lt;/strong&gt; 243 steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 22, 9:31 p.m. ET.&lt;/strong&gt; Third API cutoff. &lt;strong&gt;March 22, 9:33 p.m. ET.&lt;/strong&gt; Next top-up. Seventy-eight seconds.&lt;/p&gt;

&lt;p&gt;Across the first ten days: $305.52 in API top-ups on top of the $249.99 Max subscription. I spent more on Claude credit in those ten days than I spent on groceries that month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;April 2, 1:50 a.m. ET.&lt;/strong&gt; My aunt, who was in hospice, passed away.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;April 4&lt;/strong&gt;, two days after her death, is the highest-activity Claude Code day in my entire ninety-day record. 23,476 events. 21.7 active hours. I slept 102 minutes. I took one photo. My longest unbroken session ran from April 3 at 8:34 p.m. to April 4 at 11:16 p.m. It ran straight through, from the first day after her death into the second, without pausing.&lt;/p&gt;

&lt;p&gt;I shipped production code that day that is still running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;April 5, 9:02 p.m. ET.&lt;/strong&gt; The largest single API credit top-up of the thirty-five days, $106.25, goes through. Three days after my aunt's death.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y3u6o46szmv3plbg80e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y3u6o46szmv3plbg80e.png" alt=" " width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am not going to dramatize any of this. I am listing it because when I say "the loop compounds with grief rather than interrupting for it," these are the receipts.&lt;/p&gt;

&lt;h2&gt;The machine zone&lt;/h2&gt;

&lt;p&gt;There is a researcher named Natasha Dow Schüll who spent more than a decade inside Las Vegas casinos watching slot machine players. Her book is called &lt;a href="https://a.co/d/04SLvwvA" rel="noopener noreferrer"&gt;&lt;em&gt;Addiction by Design&lt;/em&gt;&lt;/a&gt;. The thing she names is not addiction in the chemical sense. It is a state players call "the zone." A suspension of self, a narrowing of attention, a sense of the outside world fading. The machines are engineered for it. Variable reward schedules, near-misses that register as almost-wins, sensory feedback tuned to a frequency just below conscious attention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.smithsonianmag.com/science-nature/bf-skinner-the-man-who-taught-pigeons-to-play-ping-pong-and-rats-to-pull-levers-5363946/" rel="noopener noreferrer"&gt;B.F. Skinner demonstrated the mechanism in the 1950s with pigeons and a lever.&lt;/a&gt; Variable-ratio reinforcement — reward coming at unpredictable intervals — produces more persistent behavior than any other schedule. Persistent meaning: the pigeon will keep pressing the lever long after the reward has stopped. Harder to extinguish. More compulsive.&lt;/p&gt;

&lt;p&gt;The loop in Sid Meier's Civilization, the famous one-more-turn pull, is a variable-ratio reinforcement schedule with a progress bar attached. Every turn yields something, some turns yield a great deal, and the next turn is always one click away. The human who plays it is not broken. The human is operating correctly inside a system that was engineered to produce exactly this behavior.&lt;/p&gt;

&lt;p&gt;Claude Code is a variable-ratio reinforcement schedule with a diff attached. Every tool call yields something, some yield the feature you were trying to build, and the next tool call is always one keystroke away.&lt;/p&gt;

&lt;p&gt;I ran the numbers on all 2,009 Claude Code session transcripts on my two machines. The cache-read token share — the fraction of context tokens served from Anthropic's prompt cache instead of a full re-encode — is 98.32 percent across the cohort. During the week my aunt died it was 99.25 percent. The gap between "I want the next action" and "the next action arrives" is whatever the cache-read latency is. My marginal cost for one more turn has rounded to zero.&lt;/p&gt;
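
&lt;p&gt;The share itself is one small script away. A sketch, assuming the transcripts are the JSONL files Claude Code keeps under &lt;code&gt;~/.claude/projects/&lt;/code&gt; and that the usage fields carry the Anthropic API names; check both against your own files before quoting the number.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of the cache-share computation. The transcript location and
# usage field names are assumptions; verify them on your machine.
import json
from pathlib import Path

read = created = fresh = 0
for path in Path.home().glob(".claude/projects/**/*.jsonl"):
    for line in path.read_text(errors="ignore").splitlines():
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        msg = rec.get("message") if isinstance(rec, dict) else None
        usage = msg.get("usage") if isinstance(msg, dict) else None
        if usage:
            read    += usage.get("cache_read_input_tokens") or 0
            created += usage.get("cache_creation_input_tokens") or 0
            fresh   += usage.get("input_tokens") or 0

print(f"cache-read share: {read / (read + created + fresh):.2%}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;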

&lt;p&gt;The same work at sixty percent of the throughput would have ended with my body still calibrated, my inner circle still met on weekends, my camera roll full of my cat instead of empty, and my April 2 free for my aunt. The same work was available at that rate. The loop is not what made the work happen. The loop is what made me do it at a speed that did damage.&lt;/p&gt;

&lt;p&gt;Both of those things are true and they sit inside the same person and they are not going to resolve into the simpler version.&lt;/p&gt;




&lt;h2&gt;What I wrote next&lt;/h2&gt;

&lt;p&gt;This was the first of three. The other two are on Substack, since they go further into territory that isn't really dev.to-shaped:&lt;/p&gt;

&lt;p&gt;If you want to read about what the loop did to my closest relationships, that's &lt;strong&gt;&lt;a href="https://niclydon.substack.com/p/the-machine-zone-twenty-eight-times" rel="noopener noreferrer"&gt;Part II: Twenty-eight Times Slower&lt;/a&gt;&lt;/strong&gt;. My median text reply time to the people closest to me went from 1.1 minutes to 31.2 minutes in twelve days. The interesting part wasn't that it dropped. It was the shape — broadcast, not silence.&lt;/p&gt;

&lt;p&gt;If you want to read about what the loop kept building even after I thought it was over — agents that talked like me, a system writing biographies of the person building it — that's &lt;strong&gt;&lt;a href="https://niclydon.substack.com/p/the-machine-zone-the-rate" rel="noopener noreferrer"&gt;Part III: The Rate&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>mentalhealth</category>
      <category>sideprojects</category>
    </item>
    <item>
      <title>I Fixed My LLM OOM Crashes by Shrinking the Draft Model (Speculative Decoding on Real Hardware)</title>
      <dc:creator>Nic Lydon</dc:creator>
      <pubDate>Fri, 01 May 2026 23:27:56 +0000</pubDate>
      <link>https://forem.com/niclydon/i-fixed-my-llm-oom-crashes-by-shrinking-the-draft-model-speculative-decoding-on-real-hardware-1afb</link>
      <guid>https://forem.com/niclydon/i-fixed-my-llm-oom-crashes-by-shrinking-the-draft-model-speculative-decoding-on-real-hardware-1afb</guid>
      <description>&lt;p&gt;The fix was swapping a 4B draft model for a 0.6B one in my speculative decoding config. That's the whole punchline. But the path there touched every assumption I had about how spec decode interacts with VRAM budgets on consumer hardware, so here's the full story.&lt;/p&gt;




&lt;h2&gt;TL;DR&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4B draft → 0.6B draft&lt;/td&gt;
&lt;td&gt;~2 GiB saved, same MoE throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding parallelism 16 → 8&lt;/td&gt;
&lt;td&gt;~8 GiB freed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Combined&lt;/td&gt;
&lt;td&gt;Dropped from ~97 GiB to ~87.7 GiB, no more OOM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Spec decode isn't free. You're paying VRAM for both models simultaneously.&lt;/p&gt;




&lt;h2&gt;The Setup&lt;/h2&gt;

&lt;p&gt;I run a local LLM inference gateway on two AMD-based mini PCs — GMKTec EVO-X2 boxes with Strix Halo APUs and 160 GB of unified memory each. The gateway serves around 20 models through &lt;code&gt;llama-swap&lt;/code&gt;, a process manager that loads and evicts models on demand behind an OpenAI-compatible API. Think of it as a poor man's model router: one port per logical model, &lt;code&gt;llama-swap&lt;/code&gt; starts the right &lt;code&gt;llama.cpp&lt;/code&gt; process on request, and idle models get evicted when memory gets tight.&lt;/p&gt;
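
&lt;p&gt;From the client's seat the whole gateway is one OpenAI-compatible endpoint. A sketch; the host, port, and timeout are stand-ins for whatever your llama-swap config declares:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hitting the gateway. Host and port are stand-ins; the point is that
# the client never knows which llama.cpp process actually answers.
import requests

resp = requests.post(
    "http://gateway.local:8080/v1/chat/completions",
    json={
        "model": "qwen3.5-122b-a10b",  # llama-swap loads/evicts on demand
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=300,  # the first request may wait out a cold model load
)
print(resp.json()["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;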




&lt;h2&gt;Speculative Decoding (Quick Context)&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb7j2fg1ysfnwgkavghu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb7j2fg1ysfnwgkavghu.png" alt="Speculative decoding diagram" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Speculative decoding pairs a large target model with a smaller draft model. The draft proposes tokens cheaply; the target verifies them in a single forward pass. When the draft is right — and for well-matched model families, it often is — you get roughly 1.5–2× throughput. The important detail that bites people: both models are resident in memory at the same time.&lt;/p&gt;
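
&lt;p&gt;If the mechanism is new to you, the accept loop reduces to a toy like this. A sketch, greedy decoding only; &lt;code&gt;draft&lt;/code&gt; and &lt;code&gt;target&lt;/code&gt; are stand-ins that map a token-id list to the next token id:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Toy greedy speculative decoding. Both models stay resident for the
# whole loop -- that is the two-models-in-memory cost this post hit.
def speculative_step(target, draft, tokens, k=4):
    # 1. Draft proposes k tokens cheaply.
    proposed = []
    for _ in range(k):
        proposed.append(draft(tokens + proposed))
    # 2. Target verifies. A real engine checks all k positions in ONE
    #    batched forward pass; that batching is where the speedup lives.
    accepted = []
    for i in range(k):
        expected = target(tokens + accepted)
        if proposed[i] == expected:
            accepted.append(proposed[i])  # draft was right: keep it
        else:
            accepted.append(expected)     # first miss: take target's token
            break                         # and discard the stale tail
    return tokens + accepted              # 1 to k tokens per verify pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;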




&lt;h2&gt;The Bad Assumption&lt;/h2&gt;

&lt;p&gt;I was running a blanket policy: every Qwen3-family model gets the Qwen3-4B draft. Four billion parameters felt like the safe middle ground — big enough to draft well, small enough to fit. Or so I thought.&lt;/p&gt;




&lt;h2&gt;The Crash&lt;/h2&gt;

&lt;p&gt;The problem surfaced when I tried to load &lt;code&gt;qwen3.5-122b-a10b&lt;/code&gt; (roughly 71 GiB at Q4_K_M) alongside my always-resident embedding model. On paper, the embedding model was supposed to run around 16 GiB. In practice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embed:              ~23.8 GiB
122B + 4B draft:   ~73.6 GiB
─────────────────────────────
total:             ~97.6 GiB
available:         ~96.0 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Intermittent OOM crashes followed.&lt;/p&gt;
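
&lt;p&gt;The arithmetic that should have stopped me fits in a pre-flight check. A sketch with the numbers above hard-coded; the available figure is whatever your platform actually grants the GPU, not the spec-sheet total:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pre-flight memory check, numbers in GiB. "available" is what the
# platform grants the GPU, well under the spec-sheet total here.
budget = {
    "embed (parallel=16)":   23.8,
    "122B target (Q4_K_M)":  71.0,
    "4B draft + its KV":      2.6,
}
available = 96.0

used = sum(budget.values())
for name, gib in budget.items():
    print(f"{name:24s} {gib:6.1f} GiB")
print(f"{'total':24s} {used:6.1f} GiB  (available: {available:.1f} GiB)")
if used &gt; available:
    print("over budget -- expect intermittent OOM under load")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;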




&lt;h2&gt;The Diagnosis&lt;/h2&gt;

&lt;p&gt;Pulling real numbers from &lt;code&gt;rocm-smi&lt;/code&gt; told a different story than my estimates. The embedding model was actually consuming 23.8 GiB, not 16. The culprit was KV cache pre-allocation: with parallelism set to 16 and context at 8,192 tokens, the runtime was pre-allocating 16 full-context-length KV cache slots simultaneously, and that adds up fast.&lt;/p&gt;
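
&lt;p&gt;The scaling is easy to eyeball once you write it down. A sketch where the layer, head, and dimension counts are placeholders rather than my embedding model's real config; the shape of the formula is the point:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Back-of-envelope KV cache cost. Layer/head/dim values below are
# placeholders, not my model's config; the linear scaling is the point.
def kv_gib(n_layers, n_kv_heads, head_dim, ctx, parallel, bytes_per=2):
    # K and V, each ctx x (n_kv_heads * head_dim), per layer, per slot
    per_slot = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per
    return per_slot * parallel / 2**30

for p in (16, 8):
    print(f"parallel={p:2d}: {kv_gib(36, 8, 128, 8192, p):.1f} GiB of KV alone")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With those placeholder dimensions, halving parallelism halves the KV term, roughly the same order as the ~8 GiB the next section actually recovers.&lt;/p&gt;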




&lt;h2&gt;Two Knobs, Both Pulled&lt;/h2&gt;

&lt;p&gt;At that point I had two levers: reduce embedding parallelism, or shrink the draft model. I did both.&lt;/p&gt;

&lt;p&gt;Dropping embedding parallelism from 16 to 8 freed roughly 8 GiB while keeping context length at 8,192 tokens, which still comfortably covers my p99 usage around 2,532 tokens. On the draft side, the key insight was that not every model needs the same draft. A 0.6B draft — about 0.4 GiB — performs nearly as well as the 4B for MoE architectures, where sparse activation already limits how much a larger draft model can contribute. Total consumption dropped from roughly 97 GiB to around 87.7 GiB. Stable, no crashes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8wlwfue6kkom7cibroi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8wlwfue6kkom7cibroi.png" alt="VRAM usage after fix" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;What I Learned&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Measure actual VRAM usage, not estimated usage. They are not the same number.&lt;/li&gt;
&lt;li&gt;Draft model sizing should follow model architecture, not a one-size-fits-all policy.&lt;/li&gt;
&lt;li&gt;KV cache pre-allocation scales with parallelism — and it will surprise you.&lt;/li&gt;
&lt;li&gt;Spec decode costs memory. Budget for two models, not one.&lt;/li&gt;
&lt;li&gt;Working inside tight constraints forces you to understand your system at a level that comfortable headroom never would.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
