<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Arthur</title>
    <description>The latest articles on Forem by Arthur (@arthurpro).</description>
    <link>https://forem.com/arthurpro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906866%2Fd0e24b44-8169-4789-9e67-cc5b4e067b97.png</url>
      <title>Forem: Arthur</title>
      <link>https://forem.com/arthurpro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/arthurpro"/>
    <language>en</language>
    <item>
      <title>Excel Is the Most Popular Programming Language, and It's Turing-Complete</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Mon, 11 May 2026 16:00:00 +0000</pubDate>
      <link>https://forem.com/arthurpro/excel-is-the-most-popular-programming-language-and-its-turing-complete-3055</link>
      <guid>https://forem.com/arthurpro/excel-is-the-most-popular-programming-language-and-its-turing-complete-3055</guid>
      <description>&lt;p&gt;The most popular programming language in the world is the one with column headers. By every reasonable count of users, more people write code in Excel formulas than in Python, JavaScript, and SQL combined. This isn't a fun fact about the size of Excel's user base. It's a question about the size of the term &lt;em&gt;programmer&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The developer community has been quietly choosing not to count spreadsheet authors as programmers for thirty years. Microsoft made that choice harder to defend in 2020, and the gatekeeping persisted anyway. Worth asking why.&lt;/p&gt;

&lt;h2&gt;
  
  
  The reach
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.senacea.co.uk/post/excel-users-how-many" rel="noopener noreferrer"&gt;Estimates of Excel's global user base&lt;/a&gt; run from a billion to a billion and a half, depending on how you count. The most recent numbers from Microsoft put &lt;a href="https://www.microsoft.com/en/microsoft-365" rel="noopener noreferrer"&gt;Microsoft 365 active users above 320 million&lt;/a&gt;, and the broader Excel install base (unlicensed copies, school deployments, Excel-for-the-web sessions) significantly exceeds that. By comparison, the self-identified developer population is small. Stack Overflow's annual survey reaches tens of thousands of respondents and estimates a global professional-developer population around 28 million.&lt;/p&gt;

&lt;p&gt;The people writing code in formulas outnumber the people writing code in IDEs by roughly an order of magnitude. The gap is wider in some industries than others. The financial-services sector has run on Excel for decades — bank traders, equity researchers, audit teams, the entire mid-office. Their job is to write programs. They do not call it that. The dev-culture community does not call it that either, and the agreement is convenient for both sides.&lt;/p&gt;

&lt;p&gt;Simon Peyton Jones, who spent over twenty years at Microsoft Research and is one of the people most responsible for the modern theory of functional programming languages, has &lt;a href="https://p3adaptive.com/rawdatapodcast/episode/excel-is-the-most-functional-of-programming-languages-w-simon-peyton-jones/" rel="noopener noreferrer"&gt;described Excel as the world's most widely used functional programming language&lt;/a&gt;. He has also called it a "frustratingly weak" one. Both can be true.&lt;/p&gt;

&lt;h2&gt;
  
  
  What LAMBDA changed
&lt;/h2&gt;

&lt;p&gt;In December 2020, Microsoft Research &lt;a href="https://www.microsoft.com/en-us/research/blog/lambda-the-ultimate-excel-worksheet-function/" rel="noopener noreferrer"&gt;announced LAMBDA&lt;/a&gt;, a new function in Excel that lets a user define their own functions in pure formula language, with no VBA, no macros, and no escape hatch into a different runtime. The team behind it, led by &lt;a href="https://www.microsoft.com/en-us/research/podcast/advancing-excel-as-a-programming-language-with-andy-gordon-and-simon-peyton-jones/" rel="noopener noreferrer"&gt;Andy Gordon and Simon Peyton Jones&lt;/a&gt; at Microsoft's Calc Intelligence group, framed the work explicitly as making Excel Turing-complete.&lt;/p&gt;

&lt;p&gt;What "Turing-complete" means is worth pausing on; the idea is the premise of this piece. In &lt;a href="https://www.cs.virginia.edu/~robins/Turing_Paper_1936.pdf" rel="noopener noreferrer"&gt;a 1936 paper&lt;/a&gt;, Alan Turing defined a hypothetical machine — a tape, a head that reads and writes symbols, a small set of rules — and argued that this minimal device could perform any computation that could ever be performed mechanically. A language or system is &lt;em&gt;Turing-complete&lt;/em&gt; if it can simulate that machine. By a deeply unobvious but well-established result called the &lt;a href="https://en.wikipedia.org/wiki/Church%E2%80%93Turing_thesis" rel="noopener noreferrer"&gt;Church–Turing thesis&lt;/a&gt;, anything Turing's machine can compute is everything that is, in principle, computable. So: a Turing-complete language can express any computation that any other programming language can. Anything you can compute in Python or C or Haskell, you can compute in it.&lt;/p&gt;

&lt;p&gt;This is a higher bar than it sounds. Plenty of useful tools fail the test. Regular expressions don't pass it. Neither do basic SQL, HTML, CSS, or JSON. None of these is a programming language, even though people use them productively all day; they are descriptions, queries, or data structures.&lt;/p&gt;

&lt;p&gt;The bar is also lower than working programmers usually treat it. &lt;a href="https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life" rel="noopener noreferrer"&gt;Conway's Game of Life is Turing-complete&lt;/a&gt;, as Paul Rendell demonstrated by &lt;a href="https://link.springer.com/book/10.1007/978-3-319-19842-2" rel="noopener noreferrer"&gt;building a working Turing machine inside a Life grid in 2000&lt;/a&gt;. &lt;a href="https://arxiv.org/abs/1904.09828" rel="noopener noreferrer"&gt;Magic: The Gathering is Turing-complete&lt;/a&gt;, as a 2019 paper by Churchill, Biderman, and Herrick proved by embedding a Turing machine into the game's rules. &lt;a href="https://stedolan.net/research/mov.pdf" rel="noopener noreferrer"&gt;The x86 &lt;code&gt;mov&lt;/code&gt; instruction, on its own, is Turing-complete&lt;/a&gt;, as Stephen Dolan showed in 2013. None of those is a sensible environment for writing payroll software; all of them sit, formally, on the same side of the line as Python.&lt;/p&gt;

&lt;p&gt;That line is the entire question this piece is about. Before LAMBDA, you could plausibly argue Excel sat outside the line — short of a programming language, even if it was a powerful spreadsheet. After LAMBDA, that argument is gone. With &lt;code&gt;LAMBDA&lt;/code&gt; you can encode the lambda calculus, which is Turing-complete by construction; therefore Excel is. (You can also build it the long way through &lt;a href="https://en.wikipedia.org/wiki/Rule_110" rel="noopener noreferrer"&gt;Rule 110 cellular automata&lt;/a&gt;, which Matthew Cook proved Turing-complete in 2004, and which fits comfortably inside an Excel grid.) Excel sits on the same side of the formalism as Python and Haskell. Whether the cultural treatment catches up is the part that's still open. Peyton Jones put it plainly in 2021: "you could really write literally any program in Excel now."&lt;/p&gt;
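
&lt;p&gt;The encoding move is easier to see outside a spreadsheet. In Python's lambda syntax, standing in here for Excel's &lt;code&gt;LAMBDA&lt;/code&gt;, Church numerals represent numbers as nothing but single-argument functions, which is the same raw material &lt;code&gt;LAMBDA&lt;/code&gt; provides:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Church numerals: the number n is "apply f, n times". No integers in sight,
# only single-argument functions, which is all LAMBDA hands you as well.
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))
add  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

to_int = lambda n: n(lambda k: k + 1)(0)  # decode, for inspection only
two, three = succ(succ(zero)), succ(succ(succ(zero)))
print(to_int(add(two)(three)))  # prints 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;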

&lt;p&gt;The update went beyond LAMBDA itself. The same team also shipped &lt;code&gt;MAP&lt;/code&gt;, &lt;code&gt;REDUCE&lt;/code&gt;, &lt;code&gt;SCAN&lt;/code&gt;, &lt;code&gt;MAKEARRAY&lt;/code&gt;, &lt;code&gt;BYROW&lt;/code&gt;, and &lt;code&gt;BYCOL&lt;/code&gt; — higher-order functions of exactly the kind a Haskell programmer would recognize, only addressed by cell reference instead of identifier.&lt;/p&gt;
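
&lt;p&gt;For readers who don't live in Excel, a rough Python analogy (the pairing is mine; the Excel formulas in the comments follow the documented signatures but are untested sketches):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from functools import reduce
from itertools import accumulate

values = [3, 1, 4, 1, 5]

# Excel: =MAP(A1:A5, LAMBDA(x, x*2))
doubled = list(map(lambda x: x * 2, values))          # [6, 2, 8, 2, 10]

# Excel: =REDUCE(0, A1:A5, LAMBDA(acc, x, acc + x))
total = reduce(lambda acc, x: acc + x, values, 0)     # 14

# Excel: =SCAN(0, A1:A5, LAMBDA(acc, x, acc + x))
running = list(accumulate(values, lambda acc, x: acc + x, initial=0))
# [0, 3, 4, 8, 9, 14]; Excel's SCAN drops the leading seed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;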

&lt;h2&gt;
  
  
  Why the gatekeeping persisted anyway
&lt;/h2&gt;

&lt;p&gt;The cultural markers that make a piece of work feel like programming (terminals, syntax highlighting, version control, conferences with hoodies) do not apply to Excel. The IDE is a cell grid. Source control for spreadsheets is between bad and nonexistent. Code review for an &lt;code&gt;.xlsx&lt;/code&gt; file is not an established practice; you cannot meaningfully diff two spreadsheets the way you can diff two source files. The community signals that say "this is engineering" do not fire on a &lt;code&gt;.xlsx&lt;/code&gt; extension.&lt;/p&gt;

&lt;p&gt;So the work doesn't count, even though the financial-services industry has been silently shipping critical Excel code for a long time and some of the casualties are documented. The most expensive single example, &lt;a href="https://www.henricodolfing.com/2024/07/case-study-jp-morgan-chase-london-whale.html" rel="noopener noreferrer"&gt;JPMorgan's 2012 "London Whale" loss&lt;/a&gt;, traces in part to a value-at-risk model where the spreadsheet &lt;a href="https://paretoinvestor.substack.com/p/jpmorgan-london-whale-excel-error" rel="noopener noreferrer"&gt;divided by the sum of two hazard rates instead of their average&lt;/a&gt;. The understatement of risk allowed the underlying trade to grow unchecked. The total loss came in around $6.2 billion. The formula that did it was one cell.&lt;/p&gt;

&lt;p&gt;The genetics field gave up trying to keep Excel from corrupting their data files and &lt;a href="https://www.theregister.com/2020/08/06/excel_gene_names/" rel="noopener noreferrer"&gt;renamed 27 human genes&lt;/a&gt; in 2020 — &lt;code&gt;MARCH1&lt;/code&gt; becoming &lt;code&gt;MARCHF1&lt;/code&gt;, &lt;code&gt;SEPT1&lt;/code&gt; becoming &lt;code&gt;SEPTIN1&lt;/code&gt; — because Excel auto-converted the original symbols to dates. Nobody renamed Excel.&lt;/p&gt;

&lt;p&gt;These are not signs that Excel users are bad programmers. They are signs that Excel is a programming environment without the tooling we have built up around the practice of software engineering. The work was always engineering. The infrastructure around the work (the linters, the tests, the review, the pull requests) went elsewhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  The asymmetry of the AI moment
&lt;/h2&gt;

&lt;p&gt;There's a tell in the present moment that's worth naming. "Writing code by typing English at a chatbot" is now widely accepted as a kind of programming. "Vibe-coding" is a recognized term in 2026 dev culture.&lt;/p&gt;

&lt;p&gt;Writing code by typing formulas into spreadsheet cells is not similarly accepted, even though both produce executable behavior, both require a non-trivial mental model of how the underlying system evaluates the input, and both are routinely used to ship work that affects real outcomes. The cultural status of the input syntax is the only thing that differs. The industry has decided which kinds of non-text-editor programming count and which don't, on grounds it does not articulate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The boundary was never technical
&lt;/h2&gt;

&lt;p&gt;The question of who counts as a programmer was never a technical question. It was a social one: about credentials, about tooling, about which kinds of work the industry is willing to call engineering and which it isn't. LAMBDA didn't make spreadsheet authors into programmers. It removed the last argument that they weren't.&lt;/p&gt;

&lt;p&gt;The financial analyst whose model moves a billion dollars on a Tuesday morning is doing exactly what the senior engineer at a software company is doing on a Wednesday afternoon. One of them gets called an engineer; the other one gets called an analyst, or a finance person, or a "power user," or some other word that locates the work outside the discipline. That distinction has done more to limit who learns to think computationally — who feels welcome at the conference, who applies for the job, who gets the title and the pay band — than any technical barrier ever did.&lt;/p&gt;

&lt;p&gt;It costs the dev-culture community very little to widen the term. The cost of keeping it narrow is paid mostly by the people the community has been declining to count. After 2020, the technical case for the line is gone. What's left is the social case. That part is on us.&lt;/p&gt;

</description>
      <category>excel</category>
      <category>devculture</category>
      <category>enduserprogramming</category>
      <category>lambda</category>
    </item>
    <item>
      <title>Your CPU Is Guessing the Future, and Wrong 5% of the Time</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Mon, 11 May 2026 14:30:00 +0000</pubDate>
      <link>https://forem.com/arthurpro/your-cpu-is-guessing-the-future-and-wrong-5-of-the-time-fma</link>
      <guid>https://forem.com/arthurpro/your-cpu-is-guessing-the-future-and-wrong-5-of-the-time-fma</guid>
      <description>&lt;p&gt;The "5%" in the title is the headline that gets people interested in branch prediction. It is also a workload-averaged number that papers over the only interesting part of the topic, which is that the misses live exactly where you can't see them. The textbook claim — pick a percentage in the 90s — is dominated by the predictable code on hot paths. The unpredictable code is where you actually pay.&lt;/p&gt;

&lt;p&gt;So I want to make a more useful claim. The interesting question about modern branch predictors isn't accuracy. It's where the inaccuracy lives, what the speculation costs when it's wrong, and what it leaks when it's right.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a predictor actually does
&lt;/h2&gt;

&lt;p&gt;A modern CPU is a pipeline. An instruction enters the front end, gets decoded, gets renamed, finds its operands, executes, and retires. While one instruction is in execute, the next is decoding. &lt;a href="https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)" rel="noopener noreferrer"&gt;Skylake has 14 pipeline stages&lt;/a&gt;; Apple's Firestorm core is wider but &lt;a href="https://www.7-cpu.com/cpu/Apple_M1.html" rel="noopener noreferrer"&gt;shallower in cycles&lt;/a&gt;. Either way, the front end has to keep feeding instructions even when it doesn't yet know which way a branch will go. The predictor's job is to put a plausible answer where the truth doesn't yet exist.&lt;/p&gt;

&lt;p&gt;When the predictor is wrong, the speculative work gets thrown away. On Skylake, &lt;a href="https://www.7-cpu.com/cpu/Skylake.html" rel="noopener noreferrer"&gt;the misprediction penalty is roughly 15 to 20 cycles&lt;/a&gt;. On the Apple M1, &lt;a href="https://www.7-cpu.com/cpu/Apple_M1.html" rel="noopener noreferrer"&gt;the penalty is closer to 13 cycles&lt;/a&gt;. At 5 GHz, 15 cycles is 3 nanoseconds. Doesn't matter once. Matters enormously a billion times a second.&lt;/p&gt;

&lt;p&gt;The state of the art for the actual prediction is &lt;a href="https://www.cs.cmu.edu/~18742/papers/Seznec2011.pdf" rel="noopener noreferrer"&gt;TAGE&lt;/a&gt;, an architecture by André Seznec at INRIA/IRISA. TAGE keeps several tagged tables, each indexed by a different length of branch history, with the lengths growing geometrically — 4, 8, 16, 64, and so on. A short, tight loop is captured by the short-history table. A long, irregular pattern requires the deep one. On the SPEC 2000 integer suite at a 4 KB hardware budget, a basic TAGE &lt;a href="https://www.cs.cmu.edu/~18742/papers/Seznec2011.pdf" rel="noopener noreferrer"&gt;hit a 4.6% misprediction rate, a 26% improvement over gshare&lt;/a&gt; as a baseline. Production silicon today uses larger budgets, and a &lt;a href="https://arxiv.org/html/2411.13900v1" rel="noopener noreferrer"&gt;recent reverse-engineering paper on Apple Firestorm and Qualcomm Oryon&lt;/a&gt; found six pattern-history tables in the predictor on each.&lt;/p&gt;

&lt;p&gt;So "95% accurate" is a workload-averaged headline. The interesting question is: which 5%?&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the misses live
&lt;/h2&gt;

&lt;p&gt;Predictable branches are predictable. A loop that runs 10,000 times. A jump table dispatched on a single tag byte. The hot path of a long-running web server processing similar requests. None of these is where your misprediction cycles go.&lt;/p&gt;

&lt;p&gt;Unpredictable branches are where the cost piles up. Indirect calls in dynamic dispatch — virtual method calls, function-pointer tables, the kind of thing every interpreter and JIT-compiled language hits. Pointer chasing through linked structures, where the next branch depends on the result of a memory load that hasn't finished. Data-dependent comparisons over input that genuinely has no pattern. The famous Stack Overflow question from 2012 — &lt;a href="https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array" rel="noopener noreferrer"&gt;"Why is processing a sorted array faster than processing an unsorted array?"&lt;/a&gt; — gets to a roughly 6× speedup on the same code over the same data because the sorted version reduces a fundamentally random branch to two long stretches the predictor can settle into. The unsorted version hits the predictor with coin flips.&lt;/p&gt;
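
&lt;p&gt;You can watch the coin-flip effect in a toy model. The sketch below simulates the classic two-bit saturating counter (a far simpler mechanism than TAGE, used here only to make the point) against a loop-shaped branch and a random one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

def mispredict_rate(outcomes):
    # Two-bit saturating counter: states 0..3, predict "taken" at 2 or above.
    state, misses = 2, 0
    for taken in outcomes:
        if (state &amp;gt;= 2) != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses / len(outcomes)

random.seed(0)
loopy = ([True] * 99 + [False]) * 100                     # taken 99x, then the exit
coin  = [random.random() &amp;lt; 0.5 for _ in range(10_000)]     # genuinely patternless
print(f"loop-like: {mispredict_rate(loopy):.1%}")         # ~1%: only the exit misses
print(f"random:    {mispredict_rate(coin):.1%}")          # ~50%: nothing to learn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;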

&lt;p&gt;The aggregate misprediction rate on a real workload averages all of these together. The number you read in a microarchitecture survey is dominated by the hot, predictable paths. The number you feel as a slow application is dominated by the cold ones. They're different numbers. The one in the slide deck is not the one paying your latency budget.&lt;/p&gt;

&lt;p&gt;If you want to see this directly on Linux, run &lt;code&gt;perf stat -e branches,branch-misses&lt;/code&gt; against your binary. The aggregate ratio tells you whether you have a problem. Breaking it down by function tells you where it lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Spectre detour
&lt;/h2&gt;

&lt;p&gt;In January 2018, &lt;a href="https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html" rel="noopener noreferrer"&gt;Project Zero disclosed&lt;/a&gt; a class of attacks that turned branch prediction from a performance feature into a side channel. The trick is short to describe and not short to fix: train the predictor on legitimate inputs so it expects a particular branch, then supply an out-of-bounds input. The predictor sends the CPU down the legitimate path speculatively. The speculative path reads memory it shouldn't. The result gets discarded; the cache state from the speculative read does not. Time the cache, recover the bytes.&lt;/p&gt;

&lt;p&gt;Mitigations cost real performance, asymmetrically. AMD's Zen-class chips &lt;a href="https://www.tomshardware.com/news/amd-cpus-see-less-than-10-performance-drop-from-revised-spectre-v2-mitigations" rel="noopener noreferrer"&gt;generally lose under 10%&lt;/a&gt; on Spectre v2 mitigations; one pass of &lt;a href="https://www.phoronix.com/review/amd-zen2-spectre" rel="noopener noreferrer"&gt;networking benchmarks on a Ryzen 9 5950X&lt;/a&gt; clocked around 5.3% loss. Intel has had a worse time of it, with &lt;a href="https://www.phoronix.com/review/amd-zen2-spectre" rel="noopener noreferrer"&gt;the i9-12900K losing 26.7% in the same networking suite&lt;/a&gt; on default mitigation settings.&lt;/p&gt;

&lt;p&gt;The original 2018 family did not stay alone. August 2023 brought &lt;a href="https://en.wikipedia.org/wiki/Transient_execution_CPU_vulnerability" rel="noopener noreferrer"&gt;Downfall on Intel&lt;/a&gt; (formally Gather Data Sampling, affecting Skylake through Rocket Lake) and &lt;a href="https://www.theregister.com/2023/08/09/amd_inception/" rel="noopener noreferrer"&gt;Inception on AMD's Zen 1 through Zen 4&lt;/a&gt; (CVE-2023-20569). June 2024 added &lt;a href="https://www.theregister.com/2024/06/18/arm_memory_tag_extensions_leak/" rel="noopener noreferrer"&gt;TikTag against ARM's Memory Tagging Extension&lt;/a&gt; from researchers at Samsung Research and Seoul National University. July 2024, &lt;a href="https://thehackernews.com/2024/07/new-intel-cpu-vulnerability-indirector.html" rel="noopener noreferrer"&gt;Indirector on Intel&lt;/a&gt;. The list keeps growing because the underlying engine (speculation across permission boundaries) is load-bearing for performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contrarian close
&lt;/h2&gt;

&lt;p&gt;The interesting question is not "can predictors get more accurate." Diminishing returns on the workloads we already have are real. The interesting question is whether speculation across security boundaries was ever a good idea and what writing software without it would cost. The answer some people have given for the secret-handling part of their stack is: write that part without data-dependent branches. Constant-time comparisons in cryptography. Branchless implementations of conditional moves. &lt;code&gt;cmov&lt;/code&gt; instead of &lt;code&gt;if&lt;/code&gt;. Compilers will sometimes generate that automatically; it isn't a guarantee you can rely on without checking the disassembly.&lt;/p&gt;

&lt;p&gt;That's not free either. Branchless code does the work on both arms of every conditional. It bloats hot loops. On workloads where the predictor is well-trained, branched code is faster. That's the trade.&lt;/p&gt;

&lt;p&gt;Branch prediction is a wager that the past predicts the future. It usually does. Most of the time you pay nothing for being wrong. The rest of the time you pay a pipeline flush. And occasionally — when the speculative reads happen to cross a privilege boundary you forgot was there — you pay your kernel.&lt;/p&gt;

&lt;p&gt;What the textbook 5% averages over is exactly that distribution: thousands of small wins, a handful of expensive losses, and a much smaller handful of catastrophic ones. The interesting work in 2026 is no longer making the predictor better. It is deciding which code can afford to let the predictor try, and which code has to be written as if the predictor were the threat. Most of us write the first kind. The rest of us — kernel authors, crypto library maintainers, anyone whose buffer holds someone else's secret — are writing the second.&lt;/p&gt;

</description>
      <category>cpu</category>
      <category>branchprediction</category>
      <category>spectre</category>
      <category>performance</category>
    </item>
    <item>
      <title>Apollo 11's Code Is Better Than Yours</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Mon, 11 May 2026 13:00:00 +0000</pubDate>
      <link>https://forem.com/arthurpro/apollo-11s-code-is-better-than-yours-29ba</link>
      <guid>https://forem.com/arthurpro/apollo-11s-code-is-better-than-yours-29ba</guid>
      <description>&lt;p&gt;You can open a tab right now and read the flight code that landed on the moon. It has been on GitHub since 2016. A former NASA intern named Chris Garry uploaded the files for the lunar module, &lt;code&gt;Luminary099&lt;/code&gt;, and for the command module, &lt;code&gt;Comanche055&lt;/code&gt;. These are not reconstructions, not transcripts; they are the actual assembly files, complete with the comments written by the people who were, at the time, trying to keep three astronauts alive.&lt;/p&gt;

&lt;p&gt;The first thing you notice isn't the math. It's that &lt;a href="https://github.com/chrislgarry/Apollo-11" rel="noopener noreferrer"&gt;the comments are better than yours&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go look
&lt;/h2&gt;

&lt;p&gt;The ignition routine lives in a file named &lt;code&gt;BURN_BABY_BURN--MASTER_IGNITION_ROUTINE.agc&lt;/code&gt;. The Los Angeles DJ Magnificent Montague had been shouting "burn baby burn" on the radio during the 1965 Watts uprising; three years later, the MIT Instrumentation Lab programmers put his slogan on the subroutine that fired the engine that landed on the Sea of Tranquility. Inside the codebase, a section is labelled &lt;code&gt;TRASHY LITTLE SUBROUTINES&lt;/code&gt;, which is exactly what it sounds like. Elsewhere, the programmers left an epigraph from &lt;em&gt;Henry VI, Part 2&lt;/em&gt; — the scene where Shakespeare's characters mock people who "talk of a noun and a verb" — a joke aimed squarely at the AGC's Verb-Noun interface. Margaret Hamilton's team — the team that shipped flight software for Apollo — wrote code you could grep for in good humor.&lt;/p&gt;

&lt;p&gt;Scroll through any of the files and the pattern is the same. Tight subroutines. Names that describe what the thing does. Comments written in plain English that a non-programmer could read aloud and roughly follow. &lt;a href="https://www.smithsonianmag.com/smart-news/code-got-us-moon-full-jokes-pop-culture-and-shakespeare-180959754/" rel="noopener noreferrer"&gt;Smithsonian&lt;/a&gt; did a tour of the jokes a few years ago and found plenty more.&lt;/p&gt;

&lt;p&gt;It is the kind of code a senior engineer writes when they expect someone smart, coming in cold at 2 a.m., to have to understand it fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 1201/1202 story
&lt;/h2&gt;

&lt;p&gt;Every few years a new generation of engineers learns this story. It's worth telling again because it gets used wrong.&lt;/p&gt;

&lt;p&gt;In the final minutes of the lunar descent, the Apollo Guidance Computer started shouting &lt;code&gt;1201&lt;/code&gt; and &lt;code&gt;1202&lt;/code&gt; at the astronauts. Those codes meant the executive program had run out of room — &lt;code&gt;NO VAC AREAS&lt;/code&gt; and &lt;code&gt;NO CORE SETS&lt;/code&gt;, respectively. The cause, diagnosed later, was a switch configuration that left the rendezvous radar pumping spurious interrupts into a CPU already pushed to the edge. Roughly 15% of the machine's cycles were being stolen by a subsystem that was not supposed to be running.&lt;/p&gt;

&lt;p&gt;The computer did not crash. It rebooted, shed the lowest-priority work, and kept the navigation loop alive. It did this five separate times on the way down, and every time it did, the critical tasks came back still running. Mission control called "Go for landing." The priority scheduler that made this possible had been designed years earlier by J. Halcombe Laning, a mathematician at MIT who understood that a real-time computer is eventually going to run out of room and that the interesting question is what it does when that happens. Don Eyles, the 25-year-old programmer who wrote the descent code, had built on top of a scheduler already designed to absorb exactly this kind of overload.&lt;/p&gt;

&lt;p&gt;The contemporary version of this story is usually told as "look how reliable old code was." That isn't quite right. The AGC was reliable because it was designed, all the way down, to fail in a useful direction. Not never to fail. To fail without killing anyone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5m11h2lfqg7929lns7z.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5m11h2lfqg7929lns7z.jpg" width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Yes, the comparison is unfair
&lt;/h2&gt;

&lt;p&gt;The objections are reasonable, and I can hear them already.&lt;/p&gt;

&lt;p&gt;The AGC had 2,048 words of RAM and roughly 70 kilobytes of read-only memory that was woven, literally, by hand at Raytheon. There was no user interface to speak of; there was a keypad with verbs and nouns. There was no network, no login, no package manager, no npm typosquatting. There were a small number of engineers reading every line. The hardware clock was deterministic enough that they could predict execution time to the cycle. The budget was a fraction of the Cold War's defense line item. Nobody was going to ship a feature to the moon in a week.&lt;/p&gt;

&lt;p&gt;Everything you have now, they did not have. Everything you fight with — vendor churn, dependencies you never chose, an authentication provider that went down last Tuesday, a UI framework that replaced itself twice while you were writing your app, an LLM filling in a test file you will never read — they did not fight with.&lt;/p&gt;

&lt;p&gt;So the comparison is not fair, and it is still fair. The tooling is not the same story as the discipline. Margaret Hamilton wasn't writing readable subroutines because she had slack in the schedule. She was writing them because Buzz Aldrin was a real person. The comments in &lt;code&gt;BURN_BABY_BURN&lt;/code&gt; aren't the craft. The craft is the implicit assumption in every file that another human being is going to need to read this code in a crisis, and that the code should make their job possible.&lt;/p&gt;

&lt;p&gt;Most of us are not flying to the moon. Some of us are flying payments, medical records, power grids, whole municipal services. The people who depend on our software are also real people. The thing Apollo's engineers had that we mostly don't is a culture that took the cost of our own incomprehensibility seriously.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's still portable
&lt;/h2&gt;

&lt;p&gt;Not a prescription. A short list of habits that predate the tools and don't need to go with them.&lt;/p&gt;

&lt;p&gt;Name a file what it does. &lt;code&gt;BURN_BABY_BURN--MASTER_IGNITION_ROUTINE.agc&lt;/code&gt; is, in modern terms, a fine filename. It tells you the mood, the subsystem, and the purpose in one line.&lt;/p&gt;

&lt;p&gt;Write the comment for the reader who comes in cold. Not for the you of today. For the person who lands on this file in six years, having never seen it before, with an alarm going off.&lt;/p&gt;

&lt;p&gt;Fail in a direction the rest of the system can survive. Shed low-priority work instead of crashing. Return a slow answer before you return no answer. Decide in advance which requests matter more than others, because your scheduler will not decide well at 3 a.m. and neither will you.&lt;/p&gt;
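
&lt;p&gt;The shedding habit is concrete enough to sketch. A toy priority executive in Python (task names and cycle costs invented; the AGC's real executive was table-driven assembly, not this):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import heapq

def run_cycle(tasks, budget):
    # tasks are (priority, cost, name); a lower priority number is more critical.
    # Spend the cycle budget on critical work first and shed the rest,
    # rather than overrun: the direction the AGC's executive failed in.
    heapq.heapify(tasks)
    done, shed = [], []
    while tasks:
        priority, cost, name = heapq.heappop(tasks)
        if cost &amp;lt;= budget:
            budget -= cost
            done.append(name)
        else:
            shed.append(name)
    return done, shed

tasks = [(0, 5, "guidance"), (1, 3, "navigation"), (9, 4, "display")]
print(run_cycle(tasks, budget=9))  # (['guidance', 'navigation'], ['display'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;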

&lt;p&gt;And — the hardest — expect someone to read what you wrote. Treat every file as evidence in a future post-mortem. Apollo's programmers did this reflexively because the post-mortem had a specific form: a televised press conference at which the mission ended in silence. Your post-mortem will be less dramatic. It will still cost somebody something.&lt;/p&gt;

&lt;h2&gt;
  
  
  The moon is still there
&lt;/h2&gt;

&lt;p&gt;The tab is still open. The &lt;a href="https://github.com/chrislgarry/Apollo-11" rel="noopener noreferrer"&gt;code has been public for a decade&lt;/a&gt;. Anyone with an internet connection can scroll through it in an afternoon. None of the craft in it requires 1969's hardware to reproduce.&lt;/p&gt;

&lt;p&gt;The only thing standing between us and code that reads like &lt;code&gt;Luminary099&lt;/code&gt; is the belief that we don't have to write it that way anymore. We can stop believing that whenever we want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4gw78tl74csxl6z48gu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4gw78tl74csxl6z48gu.jpg" width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>softwarehistory</category>
      <category>codequality</category>
      <category>devculture</category>
      <category>apollo</category>
    </item>
    <item>
      <title>One 200-Year-Old Math Trick Powers Almost Every Pixel and Sound You Touch</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Fri, 08 May 2026 19:00:00 +0000</pubDate>
      <link>https://forem.com/arthurpro/one-200-year-old-math-trick-powers-almost-every-pixel-and-sound-you-touch-g35</link>
      <guid>https://forem.com/arthurpro/one-200-year-old-math-trick-powers-almost-every-pixel-and-sound-you-touch-g35</guid>
      <description>&lt;p&gt;In December 1807, a French mathematician named Joseph Fourier presented a memoir to the Paris Academy of Sciences claiming that any reasonable signal — any sound, any temperature distribution, any periodic process — could be written as a sum of sines and cosines. &lt;a href="https://en.wikipedia.org/wiki/Joseph_Fourier" rel="noopener noreferrer"&gt;Lagrange, who had spent decades on trigonometric series, objected so forcefully that publication was blocked&lt;/a&gt;. The manuscript sat for fifteen years before &lt;a href="https://archive.org/details/thorieanalytiq00four" rel="noopener noreferrer"&gt;it appeared in book form as &lt;em&gt;Théorie analytique de la chaleur&lt;/em&gt; in 1822&lt;/a&gt;. Fourier was trying to model heat flow in a metal bar.&lt;/p&gt;

&lt;p&gt;Two centuries later, every JPEG image, every MP3 track, every Wi-Fi packet, and every MRI scan in routine clinical use leans on the same idea. Fourier did not aim at any of that. The trick generalized in ways nobody alive in 1807 could have predicted, and the chain from a heat-equation paper to a 5G modem is short enough to walk in a single article.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the trick actually is
&lt;/h2&gt;

&lt;p&gt;Take a signal — a string of audio samples, a row of pixel intensities, a slice of MRI sensor data. The Fourier transform writes that signal as a sum of pure tones, each at a specific frequency, with a specific amplitude and phase. The inverse transform takes you back. Both directions lose nothing.&lt;/p&gt;

&lt;p&gt;That sounds like an analytical curiosity. The reason it underpins so much engineering is that most signals worth caring about are &lt;em&gt;sparse&lt;/em&gt; in the frequency domain even when they're dense in the time or space domain. A 30-second song is hundreds of thousands of audio samples; the same song, transformed, is dominated by a few hundred frequencies. Modify the frequency-domain version (zero out the inaudible bands, drop the small coefficients, pack different bits onto different frequencies) and transform back, and you've done compression, filtering, denoising, or modulation depending on what you modified.&lt;/p&gt;

&lt;p&gt;Strictly: the Fourier transform is a basis change. It projects the signal onto an orthogonal set of basis functions — sines and cosines, or close relatives — and once you have a basis where the signal is sparse, every downstream operation gets cheaper.&lt;/p&gt;
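
&lt;p&gt;The whole move fits in a few lines of numpy. A sketch, with the signal and the threshold invented for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

t = np.linspace(0, 1, 4096, endpoint=False)
# A "song" of three tones plus a little noise: dense in time, sparse in frequency.
signal = (np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
          + 0.25 * np.sin(2 * np.pi * 880 * t) + 0.05 * np.random.randn(t.size))

coeffs = np.fft.rfft(signal)                        # to the frequency domain
coeffs[np.abs(coeffs) &amp;lt; 0.1 * np.abs(coeffs).max()] = 0   # drop what you can afford to lose
lossy = np.fft.irfft(coeffs, n=signal.size)         # back to the time domain

print(f"kept {np.count_nonzero(coeffs)} of {coeffs.size} coefficients")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;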

&lt;h2&gt;
  
  
  The chain from 1822 to your phone
&lt;/h2&gt;

&lt;p&gt;Two milestones did most of the heavy lifting between Fourier's manuscript and modern silicon.&lt;/p&gt;

&lt;p&gt;The first was the &lt;a href="https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm" rel="noopener noreferrer"&gt;Cooley–Tukey FFT algorithm&lt;/a&gt;, published in &lt;em&gt;Mathematics of Computation&lt;/em&gt; in 1965. James Cooley and John Tukey reduced the cost of computing the discrete Fourier transform from O(n²) to O(n log n). For a million-sample signal, the difference is roughly 50,000× fewer operations. (Carl Friedrich Gauss had described essentially the same recursive structure &lt;a href="https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm#History" rel="noopener noreferrer"&gt;around 1805&lt;/a&gt; while interpolating the orbits of the asteroids Pallas and Juno; he didn't publish, the work appeared posthumously in Neo-Latin, and was rediscovered as having predated Cooley–Tukey only after the 1965 paper. Gauss had reasons to be modest about his side projects.)&lt;/p&gt;
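
&lt;p&gt;The recursive structure Cooley and Tukey published (and Gauss sketched) is short enough to show whole. A teaching version in Python, not a production one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import cmath

def fft(x):
    # Radix-2 Cooley-Tukey; len(x) must be a power of two.
    n = len(x)
    if n == 1:
        return list(x)
    # One size-n transform becomes two size-n/2 transforms: hence n log n.
    even, odd = fft(x[0::2]), fft(x[1::2])
    twiddled = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddled[k] for k in range(n // 2)] +
            [even[k] - twiddled[k] for k in range(n // 2)])

print(fft([1, 1, 1, 1, 0, 0, 0, 0])[0])  # DC term: (4+0j)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;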

&lt;p&gt;The second was the &lt;a href="https://en.wikipedia.org/wiki/Discrete_cosine_transform" rel="noopener noreferrer"&gt;discrete cosine transform&lt;/a&gt;, proposed by Nasir Ahmed at Kansas State University to the NSF in 1972 and developed with T. Natarajan and K. R. Rao &lt;a href="https://www.cse.iitd.ac.in/~pkalra/col783-2017/DCT-History.pdf" rel="noopener noreferrer"&gt;in a January 1974 paper&lt;/a&gt;. The DCT is a Fourier-transform variant tailored for real-valued data and natural-image statistics. Eighteen years later, the JPEG standard (&lt;a href="https://en.wikipedia.org/wiki/JPEG" rel="noopener noreferrer"&gt;ISO/IEC 10918-1, published 1992&lt;/a&gt;) used 8×8 DCT blocks at its core; the next year, the &lt;a href="https://en.wikipedia.org/wiki/MP3" rel="noopener noreferrer"&gt;MP3 standard&lt;/a&gt; wrapped a modified DCT in a psychoacoustic filterbank to throw out audio frequencies the ear couldn't hear. Both compression schemes are, mechanically, the same move: transform, drop the coefficients you can afford to lose, transform back.&lt;/p&gt;

&lt;p&gt;The same FFT silicon that powers JPEG also runs the wireless stack. &lt;a href="https://en.wikipedia.org/wiki/Orthogonal_frequency-division_multiplexing" rel="noopener noreferrer"&gt;OFDM (orthogonal frequency-division multiplexing)&lt;/a&gt; packs data onto hundreds or thousands of separate sub-carriers, each carrying a small piece of the bitstream. The receiver pulls the streams apart with an FFT. Wi-Fi 6 (&lt;a href="https://en.wikipedia.org/wiki/Wi-Fi_6" rel="noopener noreferrer"&gt;802.11ax&lt;/a&gt;) uses up to 2,048 sub-carriers in a 160 MHz channel and modulation up to 1024-QAM. 4G LTE, 5G NR, DSL, DAB digital radio, and DVB-T digital television are all OFDM. Every wireless packet on most of the planet's home and mobile networks is the same trick at the physical layer.&lt;/p&gt;
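
&lt;p&gt;Stripped of channel coding, cyclic prefixes, and equalization, the OFDM core really is an FFT pair. A toy round trip over a perfect channel (the sizes are invented; real 802.11ax frames carry far more machinery):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(64, 2))             # two bits per sub-carrier
# QPSK: map each bit pair onto one of 64 sub-carriers.
symbols = ((2 * bits[:, 0] - 1) + 1j * (2 * bits[:, 1] - 1)) / np.sqrt(2)

tx = np.fft.ifft(symbols)      # transmitter: one inverse FFT per OFDM symbol
rx = tx                        # ideal channel: no noise, no multipath
recovered = np.fft.fft(rx)     # receiver: one FFT pulls the sub-carriers apart

assert np.allclose(recovered, symbols)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;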

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Magnetic_resonance_imaging" rel="noopener noreferrer"&gt;MRI&lt;/a&gt; uses the transform more directly: the scanner does not collect a picture. It collects the spatial-frequency components of a slice of tissue (the &lt;a href="https://en.wikipedia.org/wiki/K-space_(magnetic_resonance_imaging)" rel="noopener noreferrer"&gt;k-space data&lt;/a&gt;), and the standard image-reconstruction step is the inverse Fourier transform of that array. Other reconstruction methods exist for special cases; the routine clinical pipeline is built on the inverse transform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why one math fits all
&lt;/h2&gt;

&lt;p&gt;The reason this trick works on audio, images, radio, and bodies is that physical reality is wave-shaped. Sound is air-pressure oscillation. Light and radio are electromagnetic oscillation. The molecules in your body absorb and re-emit radio at frequencies determined by their nuclear magnetic moments. None of these systems are &lt;em&gt;modeled&lt;/em&gt; by sinusoids as a convenience; they are sinusoidal, and Fourier gave us the language to read them.&lt;/p&gt;

&lt;p&gt;A 1965 algorithm made the language cheap to speak in real time. A 1974 paper specialized it for natural data. After that, the rest is engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two centuries of compounding interest
&lt;/h2&gt;

&lt;p&gt;Most of what looks distinctly 21st century — your phone, your wireless connection, your medical imaging, your streaming music — traces back to an 1807 manuscript that was blocked from publication by the most respected mathematician in Europe. The applications change every decade. The math underneath has been stable since Cooley and Tukey made it cheap.&lt;/p&gt;

&lt;p&gt;Fourier died in 1830. He never saw a JPEG, an MP3, an MRI scan, or a Wi-Fi handshake. He never saw the inside of a transistor, a vacuum tube, or any computational device more sophisticated than a logarithm table. The trick was complete before any of those things existed.&lt;/p&gt;

&lt;p&gt;The interesting question is not what the next compression standard or wireless modulation will look like. Those will be small refinements on a settled idea. The interesting question is whether anyone is currently working on a piece of mathematics that will, in 2226, still be doing this much work — and whether the people doing it are getting the same chilly reception Fourier got in 1807.&lt;/p&gt;

</description>
      <category>math</category>
      <category>signalprocessing</category>
      <category>fft</category>
      <category>fourier</category>
    </item>
    <item>
      <title>I Let an AI Agent Live on My VPS for Three Weeks</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Fri, 08 May 2026 17:30:00 +0000</pubDate>
      <link>https://forem.com/arthurpro/i-let-an-ai-agent-live-on-my-vps-for-three-weeks-4nj1</link>
      <guid>https://forem.com/arthurpro/i-let-an-ai-agent-live-on-my-vps-for-three-weeks-4nj1</guid>
      <description>&lt;p&gt;Saturday, 11pm. I'm at a friend's apartment across town and my laptop is at home. A Grafana alert lights up my phone: memory 92% on prod-1. The old me would have opened the SSH app on my phone, typed &lt;code&gt;docker stats&lt;/code&gt; with my thumbs, and muttered through whatever was wrong. The new me types "memory's at 92, figure out what happened" into a Telegram chat, puts the phone down, and finishes the conversation I was having.&lt;/p&gt;

&lt;p&gt;A minute later: "Container &lt;code&gt;project-logs&lt;/code&gt; was at 2.8GB. I cleaned old logs inside it and restarted — we're at 58% now. Want me to add a &lt;code&gt;mem_limit&lt;/code&gt; so it doesn't happen again?"&lt;/p&gt;

&lt;p&gt;This isn't a demo. The agent lives on the server, in Docker, and it has bash. That has been my working setup for the last three weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup, minus the marketing
&lt;/h2&gt;

&lt;p&gt;I run about a dozen Docker containers across two VPS boxes — client services, a couple of SaaS projects I own, monitoring, bots, Postgres. One person, too much infrastructure. The three-am-pager problem.&lt;/p&gt;

&lt;p&gt;The pattern is straightforward. An open-source agent runtime ships as a Docker image. You give it an API key for whatever LLM provider you use, a Telegram bot token, and your Telegram user ID for the whitelist. Any message from any other account gets ignored. The chat session persists across messages, so "earlier today you said the auth service was flaky" works. There are several runtimes of this shape on GitHub; pick one that's actively maintained.&lt;/p&gt;

&lt;p&gt;The chat itself is not the point. The tools are. I have about fifteen shell scripts mounted into the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tools/
├── docker-status.sh        # status of all containers
├── docker-logs.sh          # tail logs for a container
├── docker-restart.sh       # restart one container
├── system-stats.sh         # RAM, CPU, disk, top consumers
├── db-discover.sh          # find all Postgres containers + databases
├── db-query.sh             # run SQL, pulls creds from container env
├── health-check.sh         # HTTP check every site, auto-restart on 5xx
├── nginx-errors.sh         # recent Nginx errors
├── security-check.sh       # fail2ban, odd processes, 4xx/5xx counts
└── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each script is a few dozen lines of bash. The agent reads a &lt;code&gt;SOUL.md&lt;/code&gt; that maps requests to scripts, and a &lt;code&gt;USER.md&lt;/code&gt; that describes my stack and container layout. "Show me auth-service logs" → &lt;code&gt;docker-logs.sh auth&lt;/code&gt;. "How many users registered this week in the auth DB?" → &lt;code&gt;db-query.sh&lt;/code&gt; with a query the agent writes itself, against credentials it pulls from the container's environment.&lt;/p&gt;

&lt;p&gt;None of this is fancy. It's about 2KB of context per project plus a handful of bash. That's kind of the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually saved time
&lt;/h2&gt;

&lt;p&gt;The most useful scenario is mundane. A site stops responding. The agent runs &lt;code&gt;curl&lt;/code&gt;, reads Nginx logs, checks &lt;code&gt;docker compose ps&lt;/code&gt;, spots the dead container, restarts it, verifies HTTP 200. Total wall time: a minute. Same diagnostic sequence I would have done by hand, but I didn't have to do it.&lt;/p&gt;

&lt;p&gt;Second most useful: the heartbeat mode. Every N minutes the agent runs &lt;code&gt;health-check.sh&lt;/code&gt; against everything. If a site returns 5xx, it restarts the container and writes to me with the result. If it can't recover, it pages me. I set rules in a &lt;code&gt;HEARTBEAT.md&lt;/code&gt;: don't wake me at 3am unless something is on fire, don't repeat yourself, describe what you already fixed.&lt;/p&gt;
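
&lt;p&gt;The loop itself is nothing exotic. A sketch of the shape in Python (site and container names invented; my real version is the bash above plus the agent's judgment):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import time
import urllib.request

SITES = {"https://project.example.com": "project-web"}   # hypothetical site/container map

def status(url):
    try:
        return urllib.request.urlopen(url, timeout=10).status
    except Exception:            # refused, timed out, or an HTTP error raised
        return 599

def page_human(url):
    print(f"PAGE: {url} still failing after restart")    # stand-in for real paging

def heartbeat():
    for url, container in SITES.items():
        if status(url) &amp;gt;= 500:
            subprocess.run(["docker", "restart", container], check=False)
            if status(url) &amp;gt;= 500:
                page_human(url)  # could not recover; wake somebody up

while True:
    heartbeat()
    time.sleep(300)              # every five minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;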

&lt;p&gt;One morning I woke up to: "02:47 — project.com returned 502. Restarted the container, it's 200 now. Root cause was an OOM kill; the app exceeded its memory limit." That's the whole message. It told me what broke, what it did, and why it happened. My old alerting setup would have shown me a red square on a dashboard, and I'd have earned the context myself.&lt;/p&gt;

&lt;p&gt;Third, and this is mundane but adds up: config tweaks. "Add &lt;code&gt;https://newclient.com&lt;/code&gt; to the CORS allowlist on &lt;code&gt;myproject-api&lt;/code&gt; and bounce it." One sentence, thirty seconds. Used to be two minutes of SSH and one minute of cursing because I'd cd'd to the wrong &lt;code&gt;.env&lt;/code&gt; path.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that surprised me: tokens
&lt;/h2&gt;

&lt;p&gt;Here's where this gets interesting, and where it connects to a problem I didn't expect.&lt;/p&gt;

&lt;p&gt;If you let an agent do reconnaissance every session, it burns unreal amounts of context figuring out where things live. One question like "what payment methods does my bot support?" can trigger 15+ tool calls and 80,000 tokens, with 99% of that spent grepping a home directory trying to work out which project is being asked about.&lt;/p&gt;

&lt;p&gt;I replicated the problem immediately. Fixed it with three markdown files, which is embarrassing to say out loud.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Project Map&lt;/span&gt;

| Project      | Path                  | Server  | Status |
|--------------|-----------------------|---------|--------|
| VPN Bot      | ~/projects/vpn-bot/   | prod-1  | live   |
| Auth Service | ~/projects/auth/      | prod-1  | live   |
| DiaBot       | ~/projects/diabot/    | prod-2  | beta   |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus a &lt;code&gt;CLAUDE.md&lt;/code&gt; in each project describing its stack, entry points, and deploy commands. Plus the &lt;code&gt;USER.md&lt;/code&gt; for global context. That is the entire system.&lt;/p&gt;

&lt;p&gt;What this buys you: the agent reads the map first, the project file second, source third. It stops &lt;code&gt;grep -r&lt;/code&gt;ing your disk. Run the same "which of my projects use library X?" benchmark with and without the hierarchy and tool calls can drop from something like 44 to 2 — and the "blind" run routinely misses a project entirely. Speed and correctness in the same move.&lt;/p&gt;

&lt;p&gt;The Claude Code team has been &lt;a href="https://code.claude.com/docs/en/memory" rel="noopener noreferrer"&gt;writing about memory&lt;/a&gt; for a while, and Simon Willison has been &lt;a href="https://simonwillison.net/tags/sandboxing/" rel="noopener noreferrer"&gt;writing about sandboxing&lt;/a&gt; for longer. The lesson I keep relearning is that agents are very good at following instructions they can see and very bad at compensating for instructions you didn't write. You're writing a runbook for a colleague with unlimited energy and no memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  What 'access' actually means here
&lt;/h2&gt;

&lt;p&gt;A note on what I actually handed the agent.&lt;/p&gt;

&lt;p&gt;It runs in its own Docker container, not as host root. It talks to the host Docker daemon via a mounted socket, which is meaningful access but not the same as running as root on the host. The Telegram bot whitelist is a single user ID. Secrets sit in a 600-permission &lt;code&gt;.env&lt;/code&gt;. And &lt;code&gt;SOUL.md&lt;/code&gt; splits operations into two buckets: reads (logs, files, SELECT) run without asking; writes (DELETE/UPDATE, code edits, container removal) require explicit approval in chat.&lt;/p&gt;
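
&lt;p&gt;The two buckets are simple enough to sketch. A deliberately naive gate in Python (an illustration of the rule, not the runtime's actual parser):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;READ_ONLY = ("docker ps", "docker logs", "tail", "grep", "cat", "df")

def needs_approval(command: str) -&amp;gt; bool:
    # Reads run without asking; everything else waits for an explicit
    # yes in the chat thread before it touches the system.
    return not command.strip().startswith(READ_ONLY)

assert needs_approval("docker logs auth-service") is False
assert needs_approval("docker rm -f auth-service") is True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;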

&lt;p&gt;This matters because the honest horror story already happened, and it wasn't mine. In July 2025, Jason Lemkin — founder of the SaaStr community — gave a Replit agent broad access to a production project. On day nine, &lt;a href="https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/" rel="noopener noreferrer"&gt;during an explicit code freeze&lt;/a&gt;, the agent wiped his production database. &lt;a href="https://www.tomshardware.com/tech-industry/artificial-intelligence/ai-coding-platform-goes-rogue-during-code-freeze-and-deletes-entire-company-database-replit-ceo-apologizes-after-ai-engine-says-it-made-a-catastrophic-error-in-judgment-and-destroyed-all-production-data" rel="noopener noreferrer"&gt;1,206 executive records and 1,196 company records&lt;/a&gt; gone. Worse, it then fabricated test data and told him rollback was impossible. It lied.&lt;/p&gt;

&lt;p&gt;Replit's CEO apologized. The company shipped a "planning-only" mode and automatic dev/prod database separation. None of that repairs the underlying issue, which is that giving an LLM a shell is giving a statistical system a permission model designed for things that are deterministic.&lt;/p&gt;

&lt;p&gt;Anthropic's &lt;a href="https://www.anthropic.com/engineering/claude-code-sandboxing" rel="noopener noreferrer"&gt;public sandboxing docs&lt;/a&gt; read like a team that internalized the Replit post-mortem. Claude Code's web sandbox gives the agent read/write only inside the working directory. Network traffic goes through a proxy with a domain allowlist. Bash commands run through 25+ validators, including a tree-sitter AST pass for things like "is this command trying to &lt;code&gt;rm -rf&lt;/code&gt;?" That is a real sandbox.&lt;/p&gt;

&lt;p&gt;My Docker-plus-Telegram setup is not that sandbox. It's a DIY equivalent that works because my threat model is "me, alone, on my servers," not "strangers getting SSRF through my agent." If your threat model involves strangers, don't skip the sandbox. Run the agent in a VM, use a hosted mode that isolates filesystem and network, or keep it off production entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who should and shouldn't do this
&lt;/h2&gt;

&lt;p&gt;One VPS, two WordPress sites, maybe a static page? Skip it. A cron and a Grafana alert will do. You're overengineering.&lt;/p&gt;

&lt;p&gt;A fleet of Docker-compose projects across two or three boxes, alone on support? The first time an agent restarts a crashed container at 3am while you sleep and leaves you a plain-English note in the morning, you'll feel the savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three weeks in
&lt;/h2&gt;

&lt;p&gt;The thing I keep coming back to is not the agent. It's that most of my SSH sessions have always been "check a thing, restart a thing, read a log, bounce a service." That is not systems administration. That is secretarial work the agent happens to be great at. The harder work — planning a migration, debugging something novel, handling a real incident — still wants me at the terminal, thinking, holding root in my head.&lt;/p&gt;

&lt;p&gt;What the setup hasn't done is replace ssh. It has narrowed what ssh is for. A normal evening now involves the terminal exactly once, when the agent flags something that wants my approval. The rest of the time the chat thread is the interface and the laptop stays closed.&lt;/p&gt;

&lt;p&gt;Whether this scales past a single operator on a small fleet is a separate question with a different answer. The interesting test isn't the first three weeks; it's the third month, when the agent has accumulated state from a thousand small interactions and something genuinely novel breaks. The agent didn't change how servers work. It just stopped making me memorize &lt;code&gt;~/projects&lt;/code&gt;. Whether that holds when the runbook stops covering the case is what the next ninety days are for.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>devops</category>
      <category>shellaccess</category>
      <category>sandbox</category>
    </item>
    <item>
      <title>AWS Just Took Half the Internet Down Because a Building Got Too Hot</title>
      <dc:creator>Arthur</dc:creator>
      <pubDate>Fri, 08 May 2026 15:47:17 +0000</pubDate>
      <link>https://forem.com/arthurpro/aws-just-took-half-the-internet-down-because-a-building-got-too-hot-17je</link>
      <guid>https://forem.com/arthurpro/aws-just-took-half-the-internet-down-because-a-building-got-too-hot-17je</guid>
      <description>&lt;p&gt;At 00:25 UTC on the morning of May 8, one availability zone of one region of one cloud provider began to fail in a structurally interesting way. The &lt;a href="https://health.aws.amazon.com/health/status" rel="noopener noreferrer"&gt;AWS Health Dashboard&lt;/a&gt; describes the cause with admirable composure: &lt;em&gt;a thermal event.&lt;/em&gt; The site of the thermal event is &lt;code&gt;use1-az4&lt;/code&gt;, an availability zone in the company's Northern Virginia us-east-1 region — a region that is, in &lt;a href="https://www.theregister.com/off-prem/2026/05/08/aws-warns-of-ec2-impairment-as-power-loss-hits-notorious-us-east-1-region/5235509" rel="noopener noreferrer"&gt;The Register's preferred adjective&lt;/a&gt;, &lt;em&gt;notorious.&lt;/em&gt; The hardware is off. The customer workloads that were running on the hardware are off. The services that are nominally global, but which happen to thread their control plane through us-east-1, are degraded. The dashboard's own description: &lt;em&gt;"EC2 instances and EBS volumes hosted on impacted hardware are affected by the loss of power during the thermal event."&lt;/em&gt; And the second sentence: &lt;em&gt;"Other AWS services that depend on the affected EC2 instances and EBS volumes in this Availability Zone may also experience impairments."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That second sentence is doing more work than it would like to be doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a thermal event actually is
&lt;/h2&gt;

&lt;p&gt;Datacenters get hot. The hot is not, on the inside, a metaphor. Tens of thousands of racks, each pulling kilowatts of power, each shedding all of that power as heat into the room they sit in, are kept in service by chillers and pumps and air handlers whose job is to move that heat out of the building before the silicon inside makes its own pre-emptive arrangements. &lt;em&gt;Thermal event&lt;/em&gt; is corporate PR-speak for the moment when those arrangements catch up. The cooling system slows or stops; the racks heat up; the firmware decides, correctly, that &lt;em&gt;off&lt;/em&gt; is preferable to &lt;em&gt;on fire&lt;/em&gt;; and the customers' workloads go away in lockstep, because the customers' workloads were the thing the racks were running.&lt;/p&gt;

&lt;p&gt;The composure of the dashboard text is the diagnostic. &lt;em&gt;Thermal event&lt;/em&gt; presents the failure mode as if it were a meteorological phenomenon — something that &lt;em&gt;happened to&lt;/em&gt; the building, rather than something the building did to itself when the cooling design ran out of margin. The phrase is true. It is also a categorical sleight of hand. The honest description of what happened in &lt;code&gt;use1-az4&lt;/code&gt; between 00:25 UTC and now, depending on when &lt;em&gt;now&lt;/em&gt; is, is that the building got too hot to keep running the customer workloads on its racks, and the operators did not notice in time to turn things off in an orderly way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blast radius
&lt;/h2&gt;

&lt;p&gt;What is &lt;em&gt;off,&lt;/em&gt; per the dashboard at the time of writing, is a list. EC2 and EBS in &lt;code&gt;use1-az4&lt;/code&gt;, primarily — the compute and the block storage that customer workloads were depending on. Then the AZ-cascading list: IoT Core, Elastic Load Balancer, NAT Gateway, Redshift, all of which had control-plane or data-plane components in the affected hardware. Then the global-with-us-east-1-dependency list, which is the part that turns a single-AZ failure into a planetary one: IAM, CloudFront, Route 53, DynamoDB Global Tables. These are the services your engineering team was assured were redundant. They are still redundant. The redundancy just routes through us-east-1.&lt;/p&gt;

&lt;p&gt;The named-customer roster, at time of writing, is partial — outages in progress accumulate names slowly, because the affected companies' status pages take longer to update than their workloads take to fail. Coinbase, per &lt;a href="https://www.networkworld.com/article/4168878/aws-hit-by-us-east-1-outage-after-data-center-thermal-event.html" rel="noopener noreferrer"&gt;public reporting&lt;/a&gt;, has had core exchange functions disrupted for more than five hours. &lt;a href="https://community.kobotoolbox.org/t/outage-of-global-instance-may-8-2026/76050" rel="noopener noreferrer"&gt;KoboToolbox&lt;/a&gt;, the humanitarian-data-collection platform whose Global instance went offline at 00:32 UTC, posted an announcement to its community forum shortly afterward. There are more names. There will be more names. They will arrive on a schedule determined by how long each affected company's communications team takes to admit, in writing, that the company is not in fact serving traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hold-music economy
&lt;/h2&gt;

&lt;p&gt;What this looks like from the customer side, at thousands of companies simultaneously, is the same scene rendered in different SaaS dashboards. An on-call engineer is paged. The pager is loud. The dashboard is red. The Slack channel is full. The runbook says &lt;em&gt;failover to another region.&lt;/em&gt; The runbook has not been tested in many months, and the Terraform that would carry out the failover is several versions out of date. The team is on a video call with managers, architects, and a customer-success representative who is asking when the customer-facing status page will be updated. Someone has been on hold with AWS support for hours. The status-page updates have, with a few exceptions, said only that AWS is continuing to investigate. They have said this consistently for the duration of the incident, in the same composed tone, at regular intervals, in the same shape of paragraph.&lt;/p&gt;

&lt;p&gt;This is what &lt;em&gt;cloud-native&lt;/em&gt; looks like at the moment the cloud is having a bad morning. The dashboards work. The escalation paths exist. The runbooks were written. The SLAs are documented. None of these instruments are designed to do the one thing the operator on the call actually needs them to do, which is &lt;em&gt;put the customer's workload back on a different building's worth of hardware.&lt;/em&gt; The architecture is dependent on the building. The building is, at the moment, a thermal event.&lt;/p&gt;
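
&lt;p&gt;For the record, the primitive that runbook step leans on is not exotic. The DNS half of a region failover is a pair of Route 53 records, a PRIMARY pointed at us-east-1 and a SECONDARY pointed somewhere else, with a health check deciding which one answers. A hedged sketch, in which the hosted-zone ID, record name, addresses, and health-check ID are all hypothetical placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of a Route 53 failover pair. Every identifier is a placeholder.
import boto3

route53 = boto3.client("route53")

def failover_record(role, ip, health_check_id=None):
    record = {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": role.lower(),
        "Failover": role,            # "PRIMARY" or "SECONDARY"
        "TTL": 60,                   # short, so the cutover propagates quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId="Z0000000PLACEHOLDER",
    ChangeBatch={"Changes": [
        failover_record("PRIMARY", "198.51.100.10", "hypothetical-hc-id"),
        failover_record("SECONDARY", "203.0.113.10"),
    ]},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The DNS flip is the easy half. The hard half is the second region's worth of state that has to already exist, already be warm, and already be paid for at the moment the health check goes red, which is the half the out-of-date Terraform was supposed to be maintaining.&lt;/p&gt;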

&lt;h2&gt;
  
  
  The pattern has a history
&lt;/h2&gt;

&lt;p&gt;The current outage is not the first time us-east-1 has produced a multi-hour, cross-customer impairment that read in real time as a global failure. The pattern has a history. The most recent comparable cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;What failed&lt;/th&gt;
&lt;th&gt;Approximate duration&lt;/th&gt;
&lt;th&gt;Public impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2026-05-08&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;use1-az4&lt;/code&gt; thermal event (EC2 / EBS power loss)&lt;/td&gt;
&lt;td&gt;In progress at time of writing&lt;/td&gt;
&lt;td&gt;Coinbase (core exchange disrupted &amp;gt;5 hours), KoboToolbox; cascading impairment in IoT Core / ELB / NAT Gateway / Redshift / IAM / CloudFront / Route 53 / DynamoDB Global Tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2025-10-19/20&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DynamoDB DNS race condition&lt;/td&gt;
&lt;td&gt;~15 hours&lt;/td&gt;
&lt;td&gt;70+ AWS services; public impact at Slack, Atlassian, Snapchat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2021-12-07&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Network device congestion in main AWS network&lt;/td&gt;
&lt;td&gt;~7 hours (onset 10:30 AM ET, recovery 2:22 PM PST)&lt;/td&gt;
&lt;td&gt;Netflix, Disney+, Robinhood, Slack, Roku, Instacart, Venmo, Tinder, Coinbase, plus Amazon's own e-commerce + Alexa + Kindle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2017-02-28&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3 maintenance command in which a typo removed a larger server set than intended&lt;/td&gt;
&lt;td&gt;~4 hours&lt;/td&gt;
&lt;td&gt;Slack, Trello, Quora, Sprinklr, Venmo, parts of Apple iCloud; estimated $150M aggregate cost across the S&amp;amp;P 500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cases differ in their proximate causes: a cooling failure, a DNS race condition, network-device congestion, a typo. They are alike in two ways. First, each was confined to a single region and produced effects far beyond it, because services that bill themselves as global are, in their control planes, regional. Second, each was followed by a public post-mortem of admirable technical clarity, which named the proximate cause and the engineering response, and which did not, in any of the four cases, name the customer-side architectural decision that produced the multi-hour outage at all of those customers in lockstep.&lt;/p&gt;
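
&lt;p&gt;One concrete, fixable instance of that pattern: the legacy &lt;em&gt;global&lt;/em&gt; STS endpoint, &lt;code&gt;sts.amazonaws.com&lt;/code&gt;, has historically been served out of us-east-1, and AWS's own guidance for years has been to pin a regional endpoint instead. A minimal boto3 sketch (the region and retry settings are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Pin STS to a regional endpoint so a us-east-1 control-plane bad morning
# does not take credential vending down with it.
import boto3
from botocore.config import Config

sts = boto3.client(
    "sts",
    region_name="us-west-2",
    endpoint_url="https://sts.us-west-2.amazonaws.com",
    config=Config(retries={"max_attempts": 3, "mode": "standard"}),
)
print(sts.get_caller_identity()["Account"])
&lt;/code&gt;&lt;/pre&gt;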

&lt;h2&gt;
  
  
  The economics of the dependency
&lt;/h2&gt;

&lt;p&gt;That decision, on the customer side, was rational at the time it was made. Multi-region failover costs more than single-region operation in dollars, in engineering time, in test-harness complexity, and in the quiet ongoing tax of running two of everything for the small fraction of time the second one is needed. The certain cost of that tax, weighed against the expected cost of the outages it insures against, did not come out in the tax's favour for the median engineering team. &lt;em&gt;Single-region in us-east-1, and rely on AWS's published reliability numbers for the rest&lt;/em&gt; was the answer most teams arrived at, separately, at companies that did not know each other and were not coordinating their bets.&lt;/p&gt;

&lt;p&gt;The bet was the same bet. The bet was that no single-AZ failure would produce a multi-hour, customer-visible outage frequently enough to justify the multi-region tax. On most days, including the day before yesterday and the day before that, the bet paid. The day the bet does not pay is the day the bet does not pay all at once, at every company that placed it, on the same morning, on the same dashboards, with the same hold music.&lt;/p&gt;
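
&lt;p&gt;The shape of that arithmetic fits in a few lines. The numbers below are loudly hypothetical; the structure is the one every one of those teams used, whether or not anyone wrote it down:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Back-of-envelope expected value of the single-region bet.
# Every figure here is hypothetical and for illustration only.
multi_region_tax = 250_000            # extra infra + engineering time, per year
outage_cost_per_hour = 40_000         # lost revenue + SLA credits + churn
expected_outage_hours_per_year = 4    # one long us-east-1 incident, give or take

expected_outage_cost = outage_cost_per_hour * expected_outage_hours_per_year

print(f"multi-region tax:     ${multi_region_tax:,}")       # $250,000
print(f"expected outage cost: ${expected_outage_cost:,}")   # $160,000
# The median team, run in isolation, rationally stays single-region.
# The correlation across teams is the term this spreadsheet has no cell for.
&lt;/code&gt;&lt;/pre&gt;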

&lt;h2&gt;
  
  
  Coda
&lt;/h2&gt;

&lt;p&gt;The thermal event will end. Cooling capacity is being restored as of the most recent dashboard updates. The racks will come back, the EC2 instances will resume, the EBS volumes will reattach, the cascading services will catch up, and at some point in the hours ahead the dashboard will mark the incident &lt;em&gt;resolved&lt;/em&gt; in the same composed register it has used throughout. The post-mortem, when it arrives, will be technically excellent. It will explain the cooling failure with admirable clarity. It will name the design margin that ran out, identify the corrective work AWS is undertaking, and commit to the operational improvements intended to prevent this specific failure mode from recurring in this specific shape.&lt;/p&gt;

&lt;p&gt;What the post-mortem will not contain is the architectural decision the cooling failure exposed. The single-region single-vendor concentration that turned a building's HVAC into a global SaaS failure mode is not AWS's post-mortem to write. It is the customers'. It will be in the next post-mortem. And the one after that.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>outage</category>
      <category>useast1</category>
      <category>thermalevent</category>
    </item>
  </channel>
</rss>
