<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Naz Quadri</title>
    <description>The latest articles on Forem by Naz Quadri (@nazq).</description>
    <link>https://forem.com/nazq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3827975%2F4335d9d1-47be-432a-bbc3-f08557839c40.jpeg</url>
      <title>Forem: Naz Quadri</title>
      <link>https://forem.com/nazq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nazq"/>
    <language>en</language>
    <item>
      <title>The Invisible Negotiation Between Your Laptop and the Air</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:05:01 +0000</pubDate>
      <link>https://forem.com/nazq/the-invisible-negotiation-between-your-laptop-and-the-air-5b63</link>
      <guid>https://forem.com/nazq/the-invisible-negotiation-between-your-laptop-and-the-air-5b63</guid>
      <description>&lt;h1&gt;
  
  
  The Invisible Negotiation Between Your Laptop and the Air
&lt;/h1&gt;

&lt;h2&gt;
  
  
  WiFi: Radio Physics, Collision Avoidance, and the Name That Means Nothing
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~15 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You typed a URL into your browser and hit Enter.&lt;/p&gt;

&lt;p&gt;Before a single byte of your request left your laptop, your WiFi card performed a dance of radio physics, collision avoidance, and cryptographic negotiation that would make a diplomat proud. It listened to the air to make sure nobody else was talking. It picked a random backoff timer in case someone else had the same idea. It encrypted your data with a key derived from a four-way handshake that happened when you first connected. Then it modulated your bits across 52 subcarrier frequencies simultaneously, transmitted them as radio waves, and waited for an acknowledgement -- all in under a millisecond.&lt;/p&gt;

&lt;p&gt;You saw a webpage load. Here's what actually happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Name Means Nothing
&lt;/h2&gt;

&lt;p&gt;Let's get this out of the way: WiFi doesn't stand for anything. If I'm found to be wrong, the internet is welcome to string me up.&lt;/p&gt;

&lt;p&gt;It's not "Wireless Fidelity." That's a backronym -- a meaning retrofitted onto a word that was chosen for entirely different reasons. In 1999, the &lt;strong&gt;WiFi Alliance&lt;/strong&gt; (then called WECA -- the Wireless Ethernet Compatibility Alliance) hired Interbrand, the same branding firm that named Prozac, to come up with a consumer-friendly name for the IEEE 802.11 standard. The engineers had been calling it "IEEE 802.11b Direct Sequence" in marketing materials 🤦🤦‍♂️🤦‍♀️. Nobody was buying it. Literally -- consumers didn't understand what it was.&lt;/p&gt;

&lt;p&gt;Interbrand proposed ten names. The consortium picked "WiFi" because it was short, memorable, and rhymed with "hi-fi" -- a term people already associated with quality audio equipment. The parallel was intentional: hi-fi meant high-quality sound; WiFi would mean high-quality wireless.&lt;/p&gt;

&lt;p&gt;Then someone on the Alliance's board insisted on adding a tagline: "The Standard for Wireless Fidelity." Phil Belanger, a founding member of the WiFi Alliance, has publicly called this a mistake. The tagline made people assume WiFi &lt;em&gt;stood for&lt;/em&gt; Wireless Fidelity, the way hi-fi stands for high fidelity. It doesn't. The WiFi Alliance eventually dropped the tagline, but the damage was done. Twenty-five years later, people still think it's an acronym.&lt;/p&gt;

&lt;p&gt;The 802.11 committee itself started in 1990. The first standard was ratified in 1997, offering a blistering 2 Mbit/s. Your morning coffee took longer to brew than a file transfer took to time out.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Standards Naming Disaster
&lt;/h2&gt;

&lt;p&gt;If you thought USB naming was bad -- and it is -- WiFi standards naming is worse. For two decades, the WiFi Alliance used the IEEE committee letter suffixes as consumer-facing product names. 802.11a. 802.11b. 802.11g. 802.11n. 802.11ac. 802.11ax. These aren't even alphabetical by release date. 802.11a and 802.11b were ratified the same year, but 802.11b shipped to consumers first. 802.11n came after 802.11g. The letters tell you nothing about which is newer, faster, or better.&lt;/p&gt;

&lt;p&gt;In 2018, the WiFi Alliance finally admitted nobody could remember the letter soup and introduced generational numbering. Retroactively -- though only WiFi 4 and later are official names; "WiFi 1" through "WiFi 3" are informal retronyms the industry back-filled.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;IEEE Standard&lt;/th&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;th&gt;Max Speed&lt;/th&gt;
&lt;th&gt;Channel Width&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 1&lt;/td&gt;
&lt;td&gt;802.11b&lt;/td&gt;
&lt;td&gt;1999&lt;/td&gt;
&lt;td&gt;2.4 GHz&lt;/td&gt;
&lt;td&gt;11 Mbit/s&lt;/td&gt;
&lt;td&gt;20 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 2&lt;/td&gt;
&lt;td&gt;802.11a&lt;/td&gt;
&lt;td&gt;1999&lt;/td&gt;
&lt;td&gt;5 GHz&lt;/td&gt;
&lt;td&gt;54 Mbit/s&lt;/td&gt;
&lt;td&gt;20 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 3&lt;/td&gt;
&lt;td&gt;802.11g&lt;/td&gt;
&lt;td&gt;2003&lt;/td&gt;
&lt;td&gt;2.4 GHz&lt;/td&gt;
&lt;td&gt;54 Mbit/s&lt;/td&gt;
&lt;td&gt;20 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 4&lt;/td&gt;
&lt;td&gt;802.11n&lt;/td&gt;
&lt;td&gt;2009&lt;/td&gt;
&lt;td&gt;2.4 / 5 GHz&lt;/td&gt;
&lt;td&gt;600 Mbit/s&lt;/td&gt;
&lt;td&gt;20/40 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 5&lt;/td&gt;
&lt;td&gt;802.11ac&lt;/td&gt;
&lt;td&gt;2014&lt;/td&gt;
&lt;td&gt;5 GHz&lt;/td&gt;
&lt;td&gt;6.9 Gbit/s&lt;/td&gt;
&lt;td&gt;20/40/80/160 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 6&lt;/td&gt;
&lt;td&gt;802.11ax&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;2.4 / 5 GHz&lt;/td&gt;
&lt;td&gt;9.6 Gbit/s&lt;/td&gt;
&lt;td&gt;20/40/80/160 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 6E&lt;/td&gt;
&lt;td&gt;802.11ax&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;6 GHz&lt;/td&gt;
&lt;td&gt;9.6 Gbit/s&lt;/td&gt;
&lt;td&gt;20/40/80/160 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 7&lt;/td&gt;
&lt;td&gt;802.11be&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;2.4 / 5 / 6 GHz&lt;/td&gt;
&lt;td&gt;46 Gbit/s&lt;/td&gt;
&lt;td&gt;20/40/80/160/320 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The retroactive rename was a good idea executed a decade too late. Your router's box now says "WiFi 6" instead of "802.11ax," which is progress. But the industry spent 20 years training people on the letter system, so every spec sheet still lists both. We're stuck in a bilingual world.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymuf45kf62aywitvy6a7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymuf45kf62aywitvy6a7.jpeg" alt="WiFi standards timeline from 802.11a to 802.11be, with retroactive generational naming applied in 2018" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How WiFi Actually Works at the Radio Level
&lt;/h2&gt;

&lt;p&gt;Bluetooth, as I covered in the last post, hops across 79 narrow channels 1,600 times per second. WiFi does the opposite. It parks on a wide channel and splits its data across dozens of subcarrier frequencies simultaneously.&lt;/p&gt;

&lt;p&gt;The technique is called &lt;strong&gt;OFDM&lt;/strong&gt; -- Orthogonal Frequency Division Multiplexing. The "orthogonal" part is the key: the subcarriers are spaced so that the peak of each one lines up with the nulls of its neighbors. They overlap in frequency but don't interfere with each other. It's like a choir where everyone sings at a slightly different pitch -- the voices overlap, but you can still pick out each one.&lt;/p&gt;

&lt;p&gt;A standard 20 MHz WiFi channel is divided into 64 subcarrier slots. 52 of them are used -- 48 carry data and 4 are pilots for synchronization -- while the remaining 12 are left empty as guard bands. Each data subcarrier carries a portion of your data independently. If one subcarrier hits interference, the others keep going. The receiver reassembles the pieces.&lt;/p&gt;
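&lt;p&gt;The arithmetic is worth doing once. A quick sketch using the 802.11a/g numbers (64 subcarrier slots in 20 MHz, 48 of them carrying data, 4-microsecond symbols):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# 802.11a/g OFDM arithmetic: 20 MHz split into 64 subcarrier slots,
# with 48 data + 4 pilot subcarriers used, 4-microsecond symbol time.
channel_hz = 20_000_000
slots = 64
spacing_hz = channel_hz / slots      # 312,500 Hz between subcarriers
data_subcarriers = 48
bits_per_subcarrier = 6              # 64-QAM
coding_rate = 3 / 4                  # forward-error-correction overhead
symbol_s = 4e-6                      # 3.2 us of data + 0.8 us guard interval
rate_bps = data_subcarriers * bits_per_subcarrier * coding_rate / symbol_s
print(spacing_hz)        # 312500.0
print(rate_bps / 1e6)    # 54.0 -- the familiar 802.11a/g top rate
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Every later generation is a variation on these four knobs: more subcarriers, denser QAM, better coding rates, shorter guard intervals.&lt;/p&gt;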

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryhvco82wleroa9ygwye.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryhvco82wleroa9ygwye.jpeg" alt="OFDM orthogonal subcarriers: overlapping waveforms with peaks at neighbors' zero-crossings, 52 subcarriers in a 20 MHz channel" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Channel Width: The Speed-vs-Coverage Tradeoff
&lt;/h3&gt;

&lt;p&gt;Wider channels mean more subcarriers mean more data. A 40 MHz channel has roughly twice the capacity of a 20 MHz channel. 80 MHz roughly doubles again. WiFi 7 introduces 320 MHz channels -- 16 times the width of the original 20 MHz channels.&lt;/p&gt;

&lt;p&gt;The tradeoff is brutal. In the 2.4 GHz band, there's only 70 MHz of usable spectrum (channels 1 through 13, though in the US only 1 through 11). A 20 MHz channel fits, but a 40 MHz channel consumes more than half the available space. An 80 MHz channel doesn't fit at all. Wider channels only work in 5 GHz and 6 GHz, where there's more room.&lt;/p&gt;
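&lt;p&gt;You can see the squeeze with integer division alone. The usable-spectrum figures below are the rough ones used in this article (the 5 GHz number assumes DFS channels are available):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Roughly how many non-overlapping channels of each width fit per band?
usable_mhz = {"2.4 GHz": 70, "5 GHz": 500, "6 GHz": 1200}
for band, mhz in usable_mhz.items():
    fits = {width: mhz // width for width in (20, 40, 80, 160, 320)}
    print(band, fits)
# 2.4 GHz fits three 20 MHz channels, one 40 MHz channel, and nothing wider.
&lt;/code&gt;&lt;/pre&gt;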

&lt;h3&gt;
  
  
  2.4 GHz vs 5 GHz vs 6 GHz
&lt;/h3&gt;

&lt;p&gt;Three frequency bands, three different physics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.4 GHz&lt;/strong&gt; reaches everywhere and is useless in an apartment building. The physics is on your side — 12.5 cm wavelength diffracts around obstacles and punches through walls — but you share 70 MHz of spectrum with every WiFi router, Bluetooth device, baby monitor, and microwave in the building. Three non-overlapping channels. Thirty competing networks. Good luck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5 GHz&lt;/strong&gt; is where the actual work happens. Twenty-five non-overlapping channels, far less congestion, but the shorter wavelength (6 cm) means each wall costs you 3-6 dB — roughly halving usable range per wall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6 GHz&lt;/strong&gt; is the land grab. WiFi 6E and 7 opened 1,200 MHz of pristine spectrum — more than 2.4 and 5 combined. Seven non-overlapping 160 MHz channels. Three 320 MHz channels. No legacy devices, no microwaves, no Bluetooth. The range is even shorter, but if your router is in the same room, who cares.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why your 2.4 GHz network reaches the garden but your 5 GHz network dies at the bedroom wall.&lt;/strong&gt; It's not a defect. It's physics. Longer wavelengths bend around and penetrate obstacles better. The 2.4 GHz signal at 12.5 cm wavelength diffracts around a doorframe. The 5 GHz signal at 6 cm wavelength gets absorbed by it.&lt;/p&gt;
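&lt;p&gt;The wavelengths quoted above fall straight out of lambda = c / f:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Wavelength in cm for each WiFi band: lambda = c / f.
c_m_per_s = 299_792_458
for ghz in (2.4, 5.0, 6.0):
    wavelength_cm = c_m_per_s / (ghz * 1e9) * 100
    print(f"{ghz} GHz: {wavelength_cm:.1f} cm")
# 2.4 GHz: 12.5 cm / 5.0 GHz: 6.0 cm / 6.0 GHz: 5.0 cm
&lt;/code&gt;&lt;/pre&gt;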

&lt;h3&gt;
  
  
  The 2.4 GHz Channel Overlap Problem
&lt;/h3&gt;

&lt;p&gt;The 2.4 GHz band has 11 channels in the US (13 in most other countries), each 20 MHz wide, spaced 5 MHz apart. They overlap. Channel 1 bleeds into channel 2, 3, 4, and 5. Only channels 1, 6, and 11 are far enough apart to not interfere with each other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnjnag36dxaav4a1zlfj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnjnag36dxaav4a1zlfj.jpeg" alt="2.4 GHz WiFi channel overlap: only channels 1, 6, and 11 are non-overlapping" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your neighbor sets their router to channel 3 and yours is on channel 1, their signal bleeds directly into yours. You'd both be better off on channel 1 -- at least then the routers can hear each other and take turns. Partial overlap is worse than complete overlap, because the radios can't decode the interfering signal well enough to defer to it, but it's strong enough to corrupt their own transmissions.&lt;/p&gt;
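&lt;p&gt;The overlap is easy to compute. Channel centers sit at 2407 + 5n MHz, so two 20 MHz channels interfere whenever their centers are closer together than 20 MHz (a sketch -- real 802.11b spectral masks are slightly wider):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def center_mhz(channel):
    # 2.4 GHz band: channel 1 is centered at 2412 MHz, then 5 MHz steps.
    return 2407 + 5 * channel

def overlap_mhz(a, b, width=20):
    # Positive when the two channels' spectra collide.
    return max(0, width - abs(center_mhz(a) - center_mhz(b)))

print(overlap_mhz(1, 3))    # 10 -- partial overlap, the worst case
print(overlap_mhz(1, 6))    # 0  -- why 1/6/11 is the classic layout
&lt;/code&gt;&lt;/pre&gt;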

&lt;p&gt;&lt;strong&gt;That's why "auto channel selection" on your router exists&lt;/strong&gt; -- and why it almost always picks 1, 6, or 11.&lt;/p&gt;




&lt;h2&gt;
  
  
  CSMA/CA: How WiFi Avoids Collisions
&lt;/h2&gt;

&lt;p&gt;WiFi is &lt;strong&gt;half-duplex&lt;/strong&gt;. Only one device can transmit on a channel at a time. Your router and your laptop take turns. Every phone, tablet, and smart thermostat connected to your network takes turns on the same channel. This is fundamentally different from Ethernet, which is full-duplex on modern switches.&lt;/p&gt;

&lt;p&gt;The mechanism that enforces turn-taking is &lt;strong&gt;CSMA/CA&lt;/strong&gt; -- Carrier Sense Multiple Access with Collision Avoidance.&lt;/p&gt;

&lt;p&gt;The algorithm is polite to a fault:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Listen.&lt;/strong&gt; Before transmitting, the device listens to the channel. If someone else is talking, wait.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wait for silence.&lt;/strong&gt; Once the channel has been quiet for a specific interval (called DIFS -- Distributed Inter-Frame Space, 34 microseconds in 802.11a), the device &lt;em&gt;still&lt;/em&gt; doesn't transmit immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random backoff.&lt;/strong&gt; It picks a random number of time slots to wait (the "contention window"). Only after this random timer expires -- and the channel is &lt;em&gt;still&lt;/em&gt; quiet -- does it transmit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transmit and wait for ACK.&lt;/strong&gt; The access point must acknowledge every frame. No ACK means the frame was lost -- retry with a larger contention window.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The random backoff is the collision avoidance part. If two devices both hear the channel go quiet at the same time, the random timer makes it unlikely they'll transmit simultaneously. Not impossible -- but unlikely. On each collision, the contention window doubles -- exponential backoff.&lt;/p&gt;

&lt;p&gt;If that pattern sounds familiar, it should. When your HTTP client retries a failed request with exponential backoff and random jitter, it's solving the same problem at a different layer. WiFi does it in microseconds at the radio level to prevent two stations from colliding again. Your web client does it in seconds at the application level to prevent a thousand clients from hammering a recovering server simultaneously. Different timescale, different medium, same insight: &lt;em&gt;randomise the retry, and double the window on failure&lt;/em&gt;. The WiFi version came first -- inherited from Ethernet's CSMA/CD in the 1980s. Your &lt;code&gt;retry_with_jitter()&lt;/code&gt; function reinvented a 40-year-old radio technique.&lt;/p&gt;
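&lt;p&gt;A minimal sketch of that shared recipe (function and parameter names are mine, not from any particular library):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

def backoff_delays(max_attempts=5, base_s=0.5, cap_s=30.0):
    # Exponential backoff with full jitter: the window doubles on each
    # failure, and the actual wait is a uniform random draw from it --
    # the CSMA/CA contention window, scaled up to application timescales.
    delays = []
    for attempt in range(max_attempts):
        window = min(cap_s, base_s * 2 ** attempt)
        delays.append(random.uniform(0.0, window))
    return delays

print(backoff_delays())    # five waits, each drawn from a doubling window
&lt;/code&gt;&lt;/pre&gt;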

&lt;p&gt;Compare this with Ethernet's &lt;strong&gt;CSMA/CD&lt;/strong&gt; (Collision &lt;em&gt;Detection&lt;/em&gt;). Ethernet devices transmit and listen simultaneously. If they detect a collision mid-transmission, both stop, send a jam signal, and retry. WiFi can't do this because a radio can't transmit and receive on the same frequency at the same time -- the transmitted signal is a billion times stronger than the received signal and would drown it out. WiFi avoids collisions because it can't detect them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hidden Node Problem
&lt;/h3&gt;

&lt;p&gt;There's a scenario CSMA/CA can't handle on its own. Imagine two laptops connected to the same access point, but on opposite sides of the building. Laptop A can hear the AP but not Laptop B. Laptop B can hear the AP but not Laptop A. Both sense an empty channel and transmit simultaneously. Their signals collide at the AP, and neither knows why their frames aren't getting through.&lt;/p&gt;

&lt;p&gt;The solution is &lt;strong&gt;RTS/CTS&lt;/strong&gt; -- Request to Send / Clear to Send. Before transmitting a large frame, a device sends a short RTS frame to the AP. The AP responds with a CTS frame that's heard by everyone in range. The CTS includes a duration field that tells all other devices to shut up for that long. This costs overhead -- two extra frames per transmission -- so it's typically only enabled for frames above a size threshold.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why WiFi Gets Slower With More Devices
&lt;/h3&gt;

&lt;p&gt;Twenty devices on one access point don't each get 1/20th of the bandwidth. They get less. Much less.&lt;/p&gt;

&lt;p&gt;Each device must wait its turn. More devices mean more contention, longer backoff windows, more collisions, more retries. The overhead is proportional to the number of devices contending, not the amount of data they're sending. A room full of idle phones still degrades your WiFi because they're all periodically transmitting management frames, probe requests, and keepalives -- each one claiming the channel for a moment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why your home WiFi feels slow during a family gathering.&lt;/strong&gt; It's not that your guests are using all your bandwidth. It's that 15 phones are all contending for the channel, and the contention overhead is eating your throughput alive. The bandwidth is there. The airtime isn't.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6p8blrhr43821hoimdc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6p8blrhr43821hoimdc.jpeg" alt="CSMA/CA contention: successful transmission with random backoff vs collision with exponential backoff retry" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Association and Authentication
&lt;/h2&gt;

&lt;p&gt;When you first connect to a WiFi network, a multi-step negotiation happens before you can send a single data frame.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding the Network
&lt;/h3&gt;

&lt;p&gt;Your device finds networks in two ways. &lt;strong&gt;Passive scanning&lt;/strong&gt;: it listens on each channel for &lt;strong&gt;beacon frames&lt;/strong&gt; that access points broadcast every ~100 milliseconds. Each beacon contains the network name (SSID), supported rates, security type, and channel information. &lt;strong&gt;Active scanning&lt;/strong&gt;: your device sends &lt;strong&gt;probe requests&lt;/strong&gt; on each channel, asking "is anyone here?" Access points respond with probe responses containing the same information.&lt;/p&gt;

&lt;p&gt;That's why your phone's WiFi list populates -- it's cycling through channels, listening for beacons and sending probes. It's also why your phone is trackable in stores: those probe requests contain your device's MAC address. Modern phones randomize the MAC in probe requests to mitigate this, but the effectiveness varies.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Four-Way Handshake (WPA2/WPA3)
&lt;/h3&gt;

&lt;p&gt;Once you select a network and provide the password, the real cryptography begins. Your password never goes over the air. Instead, both sides derive the same &lt;strong&gt;Pairwise Master Key&lt;/strong&gt; (PMK) from the password and the network name using PBKDF2 (a key derivation function that runs 4,096 iterations of HMAC-SHA1 to make brute-force guessing expensive).&lt;/p&gt;
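&lt;p&gt;This one you can reproduce directly -- the derivation is specified in IEEE 802.11i and lives in Python's standard library (the password and SSID here are made up):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib

def derive_pmk(password, ssid):
    # WPA2-PSK: PBKDF2-HMAC-SHA1, salted with the SSID,
    # 4096 iterations, 256-bit (32-byte) output.
    return hashlib.pbkdf2_hmac("sha1", password.encode(), ssid.encode(), 4096, 32)

pmk = derive_pmk("correct horse battery staple", "MyHomeWiFi")
print(pmk.hex())    # 64 hex chars -- the Pairwise Master Key
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note what's missing: randomness. The same password and SSID always yield the same PMK, which is exactly why the four-way handshake mixes in fresh nonces per session.&lt;/p&gt;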

&lt;p&gt;Then comes the &lt;strong&gt;four-way handshake&lt;/strong&gt; (sorry, no Alice and Bob this time):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AP → Client&lt;/strong&gt;: here's a random number (ANonce)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client → AP&lt;/strong&gt;: here's my random number (SNonce), plus a MIC (Message Integrity Code) proving I know the PMK&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AP → Client&lt;/strong&gt;: here's the group key (for broadcast traffic), encrypted with the derived key, plus a MIC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client → AP&lt;/strong&gt;: acknowledgement&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After these four frames, both sides have a &lt;strong&gt;Pairwise Transient Key&lt;/strong&gt; (PTK) derived from the PMK plus both random numbers. Every data frame is encrypted with this key using AES-CCMP (the WPA2 cipher, still the baseline in WPA3-Personal) or AES-GCMP-256 (used in WPA3's 192-bit Enterprise mode). The key is unique to this session, this client, this connection.&lt;/p&gt;
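&lt;p&gt;A simplified sketch of that PTK expansion. The real 802.11i PRF-384 has more detail, but the structure is this: counter-mode HMAC over a label, the two MAC addresses, and the two nonces -- sorted, so both sides build identical input:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib
import hmac

def derive_ptk(pmk, ap_mac, client_mac, anonce, snonce):
    # Sketch of the 802.11i PRF-384: counter-mode HMAC-SHA1 over a
    # label plus the sorted MACs and nonces. Sorting (min/max) is why
    # the AP and the client derive the same key from the same inputs.
    label = b"Pairwise key expansion"
    data = (min(ap_mac, client_mac) + max(ap_mac, client_mac)
            + min(anonce, snonce) + max(anonce, snonce))
    blocks = []
    for counter in range(3):    # 3 x 20-byte digests cover 48 bytes
        msg = label + b"\x00" + data + bytes([counter])
        blocks.append(hmac.new(pmk, msg, hashlib.sha1).digest())
    return b"".join(blocks)[:48]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Swap the two MAC addresses and the two nonces and you get the same 48 bytes out -- that symmetry is the whole point.&lt;/p&gt;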

&lt;h3&gt;
  
  
  Why Open WiFi Is Dangerous
&lt;/h3&gt;

&lt;p&gt;An open network (no password) has no four-way handshake. No PMK. No per-session keys. Every frame goes over the air in plaintext. Anyone with a WiFi adapter in &lt;strong&gt;monitor mode&lt;/strong&gt; (the 802.11 cousin of a wired card's promiscuous mode -- it captures raw frames without even associating) can read every byte. This includes the coffee shop WiFi, the airport WiFi, and the hotel WiFi.&lt;/p&gt;

&lt;p&gt;Monitor mode isn't exotic. A &lt;a href="https://www.alfa.com.tw/products/awus036achm" rel="noopener noreferrer"&gt;USB adapter like the Alfa AWUS036ACHM&lt;/a&gt; — about $40, fits in your pocket — puts your card into passive listening mode where it captures every frame on the channel, not just frames addressed to you. Combine it with &lt;a href="https://www.wireshark.org/" rel="noopener noreferrer"&gt;Wireshark&lt;/a&gt; and you're reading plaintext traffic in real time. This is a standard tool for network engineers and penetration testers. It's also why &lt;strong&gt;wardriving&lt;/strong&gt; was a thing — driving around with a laptop and a directional antenna, mapping open networks and logging traffic. Tools like &lt;a href="https://www.kismetwireless.net/" rel="noopener noreferrer"&gt;Kismet&lt;/a&gt; automated the process. Wardriving peaked in the mid-2000s when most networks were either open or using WEP (which could be cracked in minutes). WPA2 made passive sniffing useless against encrypted networks, but open networks remain fully transparent.&lt;/p&gt;

&lt;p&gt;HTTPS protects the content of your web requests, but DNS queries (unless you're using DoH or DoT), HTTP sites, and unencrypted app traffic are fully visible. The metadata alone — which domains you're visiting, when, how often — is valuable to an attacker. On an open network, that metadata is free for the taking.&lt;/p&gt;

&lt;h3&gt;
  
  
  WPA3-SAE: The Fix
&lt;/h3&gt;

&lt;p&gt;WPA3 replaced the PSK (Pre-Shared Key) authentication of WPA2 with &lt;strong&gt;SAE&lt;/strong&gt; -- Simultaneous Authentication of Equals. SAE uses a cryptographic protocol called Dragonfly that provides two critical properties WPA2 lacked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Forward secrecy&lt;/strong&gt;: if someone captures your encrypted traffic today and later learns the WiFi password, they &lt;em&gt;still&lt;/em&gt; can't decrypt the old traffic. Each session's keys are ephemeral and not derivable from the password alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resistance to offline dictionary attacks&lt;/strong&gt;: in WPA2, an attacker who captures the four-way handshake can take it home and brute-force the password offline. SAE makes each password guess require a live exchange with the AP, turning an offline attack into a rate-limited online one.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Beamforming and MIMO
&lt;/h2&gt;

&lt;p&gt;Early WiFi was omnidirectional -- the access point blasted signal in all directions equally, like a bare light bulb. Modern WiFi is more like a spotlight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Beamforming&lt;/strong&gt; uses multiple antennas to shape the transmitted signal so it's stronger in the direction of the receiving device. The AP sends the same data from each antenna with carefully calculated phase shifts. The signals constructively interfere in the target direction and destructively interfere elsewhere. The result: stronger signal where you need it, less wasted energy where you don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MIMO&lt;/strong&gt; -- Multiple-Input, Multiple-Output -- uses multiple antennas on both ends to send independent data streams simultaneously over the same channel. A 4x4 MIMO system (four antennas on each end) can theoretically quadruple throughput compared to a single antenna. Your laptop's WiFi card likely has 2 antennas. Your router might have 4 or 8.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MU-MIMO&lt;/strong&gt; (Multi-User MIMO) extends this to serve multiple devices simultaneously. Instead of taking turns, the AP beamforms separate streams to different devices at the same time, each on different spatial paths. WiFi 5 introduced downlink MU-MIMO (AP to devices). WiFi 6 added uplink MU-MIMO (devices to AP).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why your router has weird antennas.&lt;/strong&gt; Those protruding rods aren't decorative. Each one is a separate antenna element, and the router needs spatial diversity -- antennas positioned at different angles and locations -- to create distinct spatial streams. Internal antennas (like in mesh systems) do the same thing; they're hidden inside the enclosure, but the physics is identical.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reason Your Video Call Drops
&lt;/h2&gt;

&lt;p&gt;When your Zoom call freezes and your colleague's face turns into a Cubist painting, four things are probably going wrong at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distance and obstacles.&lt;/strong&gt; Every wall between you and your router costs roughly 3-6 dB of signal strength. A concrete wall can cost 10-15 dB. At -70 dBm (a typical "two bars" signal), your connection is viable but fragile. At -80 dBm, you're in trouble. At -85 dBm, you're disconnecting.&lt;/p&gt;
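&lt;p&gt;A back-of-envelope link budget using those loss figures (the per-obstacle numbers are nominal values from the ranges above, not calibrated measurements):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def rssi_after(start_dbm, walls=0, concrete=0, wall_db=5, concrete_db=12):
    # Subtract a nominal loss per obstacle from the starting signal.
    return start_dbm - walls * wall_db - concrete * concrete_db

print(rssi_after(-50, walls=2))               # -60 dBm: comfortable
print(rssi_after(-50, walls=3, concrete=1))   # -77 dBm: fragile
print(rssi_after(-50, walls=2, concrete=2))   # -84 dBm: about to drop
&lt;/code&gt;&lt;/pre&gt;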

&lt;p&gt;&lt;strong&gt;Channel congestion.&lt;/strong&gt; In an apartment building, your AP might share a channel with a dozen neighbors. CSMA/CA forces everyone to take turns. Your video frame waits in queue while your neighbor's smart TV downloads a firmware update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interference.&lt;/strong&gt; Microwave ovens operate at 2.45 GHz -- dead center of the 2.4 GHz WiFi band. A running microwave can obliterate WiFi channels 6 through 11. Bluetooth devices, cordless phones, baby monitors, and even poorly shielded USB 3.0 cables can add noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bufferbloat.&lt;/strong&gt; Your router buffers outgoing packets when the WiFi link is congested. Large buffers add latency without improving throughput. A 500ms buffer on a video call means your words arrive half a second late. The connection is technically up. The conversation is technically ruined. If your router supports SQM (Smart Queue Management) or FQ_CoDel (&lt;a href="https://en.wikipedia.org/wiki/CoDel" rel="noopener noreferrer"&gt;Fair Queuing Controlled Delay&lt;/a&gt;), enable it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal strength is not throughput.&lt;/strong&gt; RSSI (Received Signal Strength Indicator) tells you how loud the signal is, not how fast data moves. You can have strong RSSI and terrible throughput if the channel is congested. You can have weak RSSI and decent throughput if you're the only device on the channel. The WiFi icon on your phone shows signal strength. It tells you nothing about airtime contention, noise floor, or actual data rate.&lt;/p&gt;




&lt;h2&gt;
  
  
  WiFi 6E and WiFi 7: Finally, Room to Breathe
&lt;/h2&gt;

&lt;p&gt;The most significant WiFi advancement in the last decade isn't speed. It's spectrum.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WiFi 6E&lt;/strong&gt; opened the 6 GHz band -- 1,200 MHz of pristine, uncongested spectrum. No legacy devices. No microwaves. No Bluetooth. Every device on 6 GHz supports WiFi 6 or later, which means modern features like OFDMA (multi-user channel access) and BSS Coloring (interference mitigation) are universal. The 6 GHz band alone has more usable spectrum than 2.4 GHz and 5 GHz combined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WiFi 7&lt;/strong&gt; (802.11be) builds on this with three headline features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;320 MHz channels&lt;/strong&gt;: available in the 6 GHz band, offering absurd peak bandwidth at the cost of range. A 320 MHz channel in 6 GHz can theoretically push 46 Gbit/s with 16 spatial streams. In practice, nobody has 16 spatial streams. But even with two streams, you're looking at multi-gigabit wireless speeds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;4K-QAM&lt;/strong&gt;: &lt;a href="https://en.wikipedia.org/wiki/Quadrature_amplitude_modulation" rel="noopener noreferrer"&gt;QAM&lt;/a&gt; (Quadrature Amplitude Modulation) encodes multiple bits into each radio symbol. The number tells you how many distinct signal states it uses — and since each state is a bit pattern, the number is always a power of 2. WiFi 6 uses 1024-QAM: 2¹⁰ = 1024 states = 10 bits per symbol. WiFi 7 quadruples the states to 4096-QAM: 2¹² = 4096 states = 12 bits per symbol. Two extra bits per symbol is a 20% throughput increase for free — but distinguishing 4096 signal levels requires a much cleaner signal, so it only works at close range with minimal interference.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MLO -- Multi-Link Operation&lt;/strong&gt;: the feature I'm most excited about. A WiFi 7 device can connect to the AP on multiple bands simultaneously -- 2.4 GHz &lt;em&gt;and&lt;/em&gt; 5 GHz &lt;em&gt;and&lt;/em&gt; 6 GHz -- and aggregate the bandwidth or use the best link for each packet. Low-latency traffic goes to the least-congested band. Bulk downloads use all bands at once. If one band hits interference, traffic seamlessly shifts to another. This is the first time WiFi has been able to use multiple bands concurrently, and it fundamentally changes the reliability story.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
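&lt;p&gt;The QAM arithmetic generalizes: bits per symbol is just log2 of the constellation size, and throughput scales with it when symbol rate and coding stay fixed:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

# Bits per symbol for an n-QAM constellation is log2(n).
for states in (256, 1024, 4096):
    bits = int(math.log2(states))
    print(f"{states}-QAM: {bits} bits per symbol")

gain = math.log2(4096) / math.log2(1024) - 1
print(f"WiFi 7 vs WiFi 6 modulation gain: {gain:.0%}")    # 20%
&lt;/code&gt;&lt;/pre&gt;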

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgws5wm1q7ojd7q9iwlyz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgws5wm1q7ojd7q9iwlyz.jpeg" alt="WiFi 7 Multi-Link Operation: simultaneous connections across 2.4, 5, and 6 GHz with automatic packet rerouting" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Part That Gets Me
&lt;/h2&gt;

&lt;p&gt;WiFi is the most successful wireless technology in human history. Over 18 billion devices use it. It works in your home, your office, on airplanes, in coffee shops, in hospitals, in warehouses. It operates in unlicensed spectrum that anyone can use, governed by standards written by committee, using a name that was chosen by a branding firm and means absolutely nothing.&lt;/p&gt;

&lt;p&gt;The engineers who designed 802.11 in 1997 couldn't have imagined 46 Gbit/s wireless links or video calls with 20 participants. But the core ideas laid down in those first standards -- CSMA/CA channel access, management frames for discovery and association, and the OFDM modulation that arrived with 802.11a in 1999 -- still form the foundation of WiFi 7, nearly three decades later.&lt;/p&gt;

&lt;p&gt;Every time I open my laptop and start working without plugging in a cable, I'm relying on a system where my device listened to the air for 34 microseconds, picked a random number, waited that many slot times, modulated my data across 52 frequencies, encrypted it with a session key derived from a four-way handshake, and blasted it as radio waves toward a box on my shelf. That box did the reverse, routed my packets to the internet, and got a response back -- all before I noticed the page was loading.&lt;/p&gt;

&lt;p&gt;The best infrastructure is the kind you forget exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/IEEE_802.11" rel="noopener noreferrer"&gt;IEEE 802.11 — Wikipedia Overview&lt;/a&gt;&lt;/strong&gt; — the most accessible summary of the 802.11 family, from the original 2 Mbit/s standard through WiFi 7. Links to each amendment's technical details without requiring an IEEE subscription.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="http://www.kegel.com/c10k.html" rel="noopener noreferrer"&gt;The C10K Problem&lt;/a&gt;&lt;/strong&gt; — Dan Kegel's classic paper on handling 10,000 concurrent connections. Relevant here because CSMA/CA faces the same fundamental challenge: coordinating shared access among many contending parties without a central scheduler.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://wiki.wireshark.org/CaptureSetup/WLAN" rel="noopener noreferrer"&gt;Wireshark 802.11 Capture Setup&lt;/a&gt;&lt;/strong&gt; — how to capture WiFi frames in monitor mode with Wireshark. Seeing beacon frames, probe requests, and the four-way handshake in a packet capture is the fastest way to internalize the protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://openwrt.org/docs/start" rel="noopener noreferrer"&gt;OpenWrt Documentation&lt;/a&gt;&lt;/strong&gt; — the open-source router firmware project. Its documentation covers WiFi configuration at a level that bridges theory and practice: channel selection, transmit power, band steering, and mesh networking on real hardware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.wi-fi.org/discover-wi-fi/wi-fi-7" rel="noopener noreferrer"&gt;WiFi 7 (802.11be) Overview — WiFi Alliance&lt;/a&gt;&lt;/strong&gt; — the WiFi Alliance's summary of WiFi 7 features including MLO, 320 MHz channels, and 4K-QAM. A good non-spec-dense overview of where WiFi is heading.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri has never successfully connected to hotel WiFi on the first try and at this point considers it a personal failing. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What Is RAM, Actually?</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:04:45 +0000</pubDate>
      <link>https://forem.com/nazq/what-is-ram-actually-2fn7</link>
      <guid>https://forem.com/nazq/what-is-ram-actually-2fn7</guid>
      <description>&lt;h1&gt;
  
  
  What Is RAM, Actually?
&lt;/h1&gt;

&lt;h2&gt;
  
  
  From Leaking Capacitors to Cache Lines
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~17 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You called &lt;a href="https://nazquadri.dev/blog/the-layer-below/10-malloc/" rel="noopener noreferrer"&gt;&lt;code&gt;malloc()&lt;/code&gt;&lt;/a&gt;. The kernel gave you an address. You stored a number there, read it back, and moved on with your life.&lt;/p&gt;

&lt;p&gt;But what &lt;em&gt;is&lt;/em&gt; that address? It's not a location on a chip — not in any way you'd recognize. It's an index into a grid of billions of capacitors, each one holding a single bit as an electrical charge that is, right now, draining away. The data you stored is evaporating. A circuit you've never thought about is racing to put it back before it disappears. And it does this for every single bit, millions of times per second, whether you're using the memory or not.&lt;/p&gt;

&lt;p&gt;This is what "random access memory" actually is. Let me show you the machinery.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Smallest Unit of Memory You Own
&lt;/h2&gt;

&lt;p&gt;Every bit in your RAM stick lives in a &lt;strong&gt;1T1C cell&lt;/strong&gt;: one transistor, one capacitor. That's it. Two components per bit. Your 16GB stick has roughly 137 billion of these cells.&lt;/p&gt;

&lt;p&gt;The capacitor stores the bit. A charged capacitor is a 1. A discharged capacitor is a 0. The transistor is a gate — it connects the capacitor to the outside world when activated, and isolates it when not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh3dr9pwnii7iipcg1vs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh3dr9pwnii7iipcg1vs.jpeg" alt="1T1C DRAM cell and the 4-step read cycle — word line activates, charge flows, sense amp detects, write back" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These cells are arranged in a two-dimensional grid. Rows and columns, like a spreadsheet. Every cell in a row shares a &lt;strong&gt;word line&lt;/strong&gt; — the wire that activates all the transistors in that row simultaneously. Every cell in a column shares a &lt;strong&gt;bit line&lt;/strong&gt; — the wire that carries the data out.&lt;/p&gt;

&lt;h3&gt;
  
  
  How a Read Works
&lt;/h3&gt;

&lt;p&gt;Reading a bit from DRAM is one of those processes that sounds simple and absolutely isn't.&lt;/p&gt;

&lt;p&gt;The memory controller activates the word line for the target row. Every transistor in that row turns on. Every capacitor in that row connects to its bit line. The charge stored in each capacitor is almost nothing — about 30 femtofarads of capacitance holding less than a volt. To put "femto" in perspective: a femtofarad is 10⁻¹⁵ farads. A typical AA battery stores roughly 5,000 coulombs of charge. A DRAM cell stores about 0.000000000000030 coulombs. The static charge you build up shuffling across a carpet, a microcoulomb or so, is enough to fill tens of millions of these cells. That's how small this is. And yet the sense amplifier has to reliably distinguish it from zero.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;sense amplifier&lt;/strong&gt; at the end of each bit line detects this voltage difference. It's comparing the bit line voltage to a reference voltage, and the difference might be as small as 50 millivolts. The sense amp swings this to a full logic level — rail-high for a 1, rail-low for a 0. That's your bit.&lt;/p&gt;

&lt;p&gt;Here's the part that should bother you: &lt;strong&gt;reading is destructive&lt;/strong&gt;. Schrödinger would have loved DRAM 😹 — you can't observe the bit without destroying it.&lt;/p&gt;

&lt;p&gt;When the capacitor shares its charge with the bit line, the capacitor drains. The bit you just read is gone. The sense amplifier detected it, but the original charge is dissipated. Every single DRAM read erases the data it reads.&lt;/p&gt;

&lt;p&gt;That's why every read is followed by a &lt;strong&gt;write-back&lt;/strong&gt;. The sense amplifier, having determined the value, drives the bit line back to full voltage and recharges the capacitor. Every read is secretly a read-then-write. Your innocent &lt;code&gt;int x = array[0]&lt;/code&gt; triggers a destructive read and a restoration write at the hardware level.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Refresh Treadmill
&lt;/h3&gt;

&lt;p&gt;Even when nobody is reading your data, it's disappearing.&lt;/p&gt;

&lt;p&gt;Capacitors leak. The charge drains through the transistor's junction, through the dielectric, through physics being physics. A DRAM cell loses its charge in milliseconds. Left alone, every bit in your system would decay to garbage.&lt;/p&gt;

&lt;p&gt;The memory controller handles this with &lt;strong&gt;refresh cycles&lt;/strong&gt;. It systematically walks through every row and reads it — which, as you now know, drains the capacitors and writes the values back. The JEDEC spec for DDR4 requires every row to be refreshed within &lt;strong&gt;64 milliseconds&lt;/strong&gt;. For a chip with 65,536 rows, that's one row refresh every 976 nanoseconds.&lt;/p&gt;

&lt;p&gt;Your program has no idea this is happening. You never asked for it. You never see it. But roughly 5-10% of your memory bandwidth is consumed by refresh operations that exist solely because the hardware is fighting thermodynamics to keep your data alive.&lt;/p&gt;

&lt;p&gt;That's why it's called &lt;em&gt;dynamic&lt;/em&gt; RAM. Penny drop.&lt;/p&gt;




&lt;h2&gt;
  
  
  SRAM: The Expensive Alternative
&lt;/h2&gt;

&lt;p&gt;If DRAM is a leaky bucket that needs constant refilling, &lt;strong&gt;SRAM&lt;/strong&gt; (Static RAM) is a proper container.&lt;/p&gt;

&lt;p&gt;An SRAM cell uses &lt;strong&gt;six transistors&lt;/strong&gt; per bit instead of one transistor and one capacitor. Four transistors form a cross-coupled pair of inverters — a &lt;strong&gt;latch&lt;/strong&gt; — that holds a stable 0 or 1 as long as power is applied. Two more transistors act as access gates.&lt;/p&gt;

&lt;p&gt;No capacitor. No leaking. No refresh cycles. The data stays put.&lt;/p&gt;

&lt;p&gt;The tradeoff is size. Six transistors per bit vs. two components per bit. An SRAM cell takes many times the area of a DRAM cell on the same process node: the 1T1C cell stays tiny because its capacitor is built vertically into the die. SRAM is also more expensive per bit and draws more standby power.&lt;/p&gt;

&lt;p&gt;SRAM is what your CPU caches are made of. DRAM is what your sticks are made of. The entire cache hierarchy exists because we can afford small amounts of fast SRAM but need large amounts of cheap DRAM. That tension between speed and density shapes every modern computer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The DDR Lineage
&lt;/h2&gt;

&lt;p&gt;The DRAM in your machine is DDR SDRAM — &lt;strong&gt;Double Data Rate Synchronous Dynamic RAM&lt;/strong&gt;. "Double data rate" means it transfers data on both the rising and falling edges of the clock signal, effectively doubling throughput without doubling the clock speed.&lt;/p&gt;

&lt;p&gt;Each generation doubles the prefetch width and drops the voltage. &lt;strong&gt;DDR1&lt;/strong&gt; (2000, 2.5V, 2n prefetch) and &lt;strong&gt;DDR2&lt;/strong&gt; (2004, 1.8V, 4n prefetch) are ancient history — if either of those is still in production somewhere, someone should send flowers. &lt;strong&gt;DDR3&lt;/strong&gt; (2007, 1.5V, 8n prefetch) lasted nearly a decade and powered everything from the Mac Pro trash can to the PlayStation 4. If you built a computer between 2010 and 2016, this is what you used, and honestly it was fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DDR4&lt;/strong&gt; (2014, 1.2V, 8n prefetch with bank groups) is what most desktops and servers are still running in 2026. It added bank groups to reduce conflict penalties but kept the same prefetch width as DDR3 — the improvement was structural, not brute-force.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DDR5&lt;/strong&gt; (2021, 1.1V, 3200-6400+ MT/s, 16n prefetch) is the first generation that changed something fundamental.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;on-die ECC&lt;/strong&gt;. Every DDR5 chip has error correction built into the die itself. It can detect and correct single-bit errors within each internal transfer — not to be confused with system-level ECC (more on that shortly). You get error correction whether your motherboard supports ECC or not.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;dual channels per DIMM&lt;/strong&gt;. A DDR4 DIMM has one 64-bit channel. A DDR5 DIMM has two independent 32-bit channels. Same total width, but two independent channels mean two independent transactions can be in flight simultaneously. This cuts bank conflict stalls and improves utilization, especially for multi-threaded workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  ECC: When Bit Flips Matter
&lt;/h3&gt;

&lt;p&gt;A single bit flip in the wrong place can ruin your day.&lt;/p&gt;

&lt;p&gt;In 2003, &lt;a href="https://en.wikipedia.org/wiki/Electronic_voting_in_Belgium" rel="noopener noreferrer"&gt;Belgium's electronic voting system&lt;/a&gt; recorded an extra 4,096 votes for a candidate — later attributed to a single bit flip in RAM (flipping the 2¹² place from 0 to 1 adds exactly 4,096). Cosmic rays — high-energy particles from space — hit silicon and generate enough charge to flip a stored bit. It's rare per cell, but when you have 137 billion cells, the math gets uncomfortable. Google published a study in 2009 showing roughly one correctable error per gigabyte of RAM per year. Every first Tuesday in November, I quietly hope the cosmic rays take a day off from US voting machines; we struggle enough with societal ECC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ECC RAM&lt;/strong&gt; adds an extra chip per channel that stores parity/syndrome bits. The most common scheme uses SECDED — Single Error Correction, Double Error Detection. It can correct any single-bit error and detect (but not correct) any two-bit error. This is why every server you've ever touched runs ECC memory. The cost premium is around 10-20%, and nobody running production workloads considers it optional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rowhammer&lt;/strong&gt; makes this worse. Discovered in 2014, rowhammer exploits the physical proximity of DRAM rows. Rapidly activating the same row — "hammering" it — causes electrical interference that flips bits in adjacent rows. It's a physical attack that crosses process boundaries. Attackers have used it to escalate privileges, escape VMs, and compromise entire systems through nothing but carefully timed memory access patterns. DDR5's on-die ECC mitigates single-bit rowhammer flips, but multi-bit attacks remain an active area of research.&lt;/p&gt;




&lt;h2&gt;
  
  
  VRAM: Memory for a Different Kind of Processor
&lt;/h2&gt;

&lt;p&gt;The RAM inside your graphics card is a different beast. It's optimized for a fundamentally different access pattern.&lt;/p&gt;

&lt;p&gt;Your CPU wants low latency — it needs one specific value &lt;em&gt;now&lt;/em&gt;. A GPU wants high throughput — it needs a million values &lt;em&gt;soon&lt;/em&gt;. This difference drives the entire GDDR and HBM design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf99m68bmun1v7kauxw4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf99m68bmun1v7kauxw4.jpeg" alt="Memory types comparison — DDR5, GDDR6X, and HBM3 bus widths and bandwidth" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GDDR&lt;/strong&gt; (Graphics DDR) shares DNA with regular DDR but makes different tradeoffs. GDDR6X, used in NVIDIA's RTX 4000-series, runs at higher data rates (up to 21 Gbps per pin) and uses a wider bus (256 or 384 bits vs. DDR's 64 bits). The per-pin latency is worse than DDR, but the raw bandwidth is staggering — over 1 TB/s on a high-end GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HBM&lt;/strong&gt; (High Bandwidth Memory) takes a radically different approach. Instead of discrete chips on a PCB, HBM stacks multiple DRAM dies vertically — 8 or 12 layers tall — and connects them to the processor via an &lt;strong&gt;interposer&lt;/strong&gt;, a silicon bridge that sits beneath both the memory stacks and the processor die. Each HBM stack has a 1024-bit bus. An HBM3-equipped GPU might have six stacks, delivering over 3 TB/s of aggregate bandwidth.&lt;/p&gt;

&lt;p&gt;This is why AI accelerators use HBM — training and inference on large language models are memory-bandwidth-bound, not compute-bound. The math units can process data faster than conventional memory can feed them. HBM exists to close that gap.&lt;/p&gt;

&lt;p&gt;The NVIDIA H100 makes the cost of this choice concrete. It comes in two variants:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;H100 SXM&lt;/th&gt;
&lt;th&gt;H100 PCIe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Memory type&lt;/td&gt;
&lt;td&gt;HBM3&lt;/td&gt;
&lt;td&gt;HBM2e&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capacity&lt;/td&gt;
&lt;td&gt;80 GB&lt;/td&gt;
&lt;td&gt;80 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bandwidth&lt;/td&gt;
&lt;td&gt;3.35 TB/s&lt;/td&gt;
&lt;td&gt;~2 TB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Price (2025)&lt;/td&gt;
&lt;td&gt;~$30,000-40,000&lt;/td&gt;
&lt;td&gt;~$25,000-30,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same GPU die. Same 80GB. The SXM version costs $5,000-10,000 more, and a huge chunk of that premium is the newer HBM3 and the SXM board design that feeds it. The H200 pushed further — HBM3e, 141GB, 4.8 TB/s — and costs even more. When people ask why AI infrastructure is so expensive, a significant part of the answer is: stacking DRAM dies twelve layers tall on a silicon interposer and wiring each stack with a 1024-bit bus is not cheap.&lt;/p&gt;

&lt;p&gt;When people say a GPU has "80GB of HBM3," they're describing five stacks of vertically-interconnected DRAM dies, each with a bus wider than any DDR channel, all sitting millimeters from the compute die on a shared silicon interposer. That's not a memory stick you slot in. It's a piece of semiconductor engineering that was fabricated as part of the GPU package.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the CPU Talks to RAM
&lt;/h2&gt;

&lt;p&gt;In the old days — before 2003 — the memory controller lived on the motherboard's northbridge chip. Every memory access had to travel from the CPU, across the front-side bus, through the northbridge, and out to the DIMMs. It was slow and the bus was a bottleneck.&lt;/p&gt;

&lt;p&gt;AMD moved the memory controller onto the CPU die with the &lt;strong&gt;Athlon 64&lt;/strong&gt; in 2003. Intel followed with &lt;strong&gt;Nehalem&lt;/strong&gt; in 2008. Today, every modern CPU has an &lt;strong&gt;integrated memory controller&lt;/strong&gt; (IMC). The CPU talks directly to your RAM sticks with no intermediary chip.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Addressing Hierarchy
&lt;/h3&gt;

&lt;p&gt;A memory address isn't a flat index. It gets decomposed into a hierarchy of physical selectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Channels&lt;/strong&gt;: Most desktop CPUs have 2 memory channels (DDR4/DDR5). Servers have 4, 6, or 8. Each channel is an independent path to a set of DIMMs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ranks&lt;/strong&gt;: Each DIMM can have 1, 2, or 4 ranks. A rank is a set of chips that respond together to fill the full data width of the channel (64 bits).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Banks&lt;/strong&gt;: Each rank is divided into banks — in DDR5, typically 8 bank groups of 4 banks each, 32 banks per rank. Banks can be accessed independently, allowing parallel operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rows and columns&lt;/strong&gt;: Within a bank, data is stored in a 2D array. The controller first opens a row (loading it into the bank's row buffer), then reads specific columns from that row.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Those Timing Numbers Mean
&lt;/h3&gt;

&lt;p&gt;Your RAM sticks have numbers printed on them like "CL16-18-18-36". These are latency timings, measured in clock cycles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CAS Latency (CL)&lt;/strong&gt;: The number of clock cycles between the column address command and data appearing on the bus. CL16 at 3200 MT/s means 10 nanoseconds (16 cycles / 1600 MHz actual clock). This is the most-quoted number but only tells part of the story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tRCD&lt;/strong&gt; (RAS to CAS Delay): How long after opening a row you can issue a column read. If the row you need isn't already open, you pay this penalty first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tRP&lt;/strong&gt; (Row Precharge): How long it takes to close the current row before opening a new one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tRAS&lt;/strong&gt; (Row Active Time): Minimum time a row must stay open before it can be precharged.&lt;/p&gt;

&lt;p&gt;The worst case — you need data from a row that isn't open, and a different row &lt;em&gt;is&lt;/em&gt; — costs you tRP + tRCD + CL. That's why the "random" in Random Access Memory is misleading. &lt;strong&gt;A row hit&lt;/strong&gt; (data is in the already-open row) takes CL cycles. &lt;strong&gt;A row miss&lt;/strong&gt; (need to precharge, open a new row, then read) takes tRP + tRCD + CL. In that scenario, "random" access is 3x slower than sequential access to the same row.&lt;/p&gt;

&lt;p&gt;This is why memory access patterns matter. This is why the memory controller reorders requests to maximize row hits. This is why prefetching exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cache Hierarchy
&lt;/h2&gt;

&lt;p&gt;You now have the full picture of DRAM: slow (50-80ns), cheap, dense, and leaky. The CPU core runs at ~4-5 GHz — one clock cycle takes roughly 0.2 nanoseconds. An L1 cache hit takes about 1 nanosecond. A DRAM access takes 50-80 nanoseconds, roughly 200-400 CPU clock cycles of thumb-twiddling.&lt;/p&gt;

&lt;p&gt;This gap — the &lt;strong&gt;memory wall&lt;/strong&gt; — has been growing since the 1980s. CPU clock speeds improved roughly 1000x between 1985 and 2005. DRAM latency improved about 10x in the same period. The solution is a hierarchy of progressively larger, slower, cheaper memories that hide the latency of the level below.&lt;/p&gt;

&lt;p&gt;All caches are built from SRAM. The six-transistor cells that don't leak, don't need refresh, and switch in under a nanosecond. The price you pay is density — which is why caches are measured in kilobytes and megabytes while main memory is measured in gigabytes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rqgdgtle66sxveyvx66.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rqgdgtle66sxveyvx66.jpeg" alt="Cache hierarchy pyramid — L1, L2, L3, and DRAM with size and latency at each level" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  L1: The Core's Private Scratchpad
&lt;/h3&gt;

&lt;p&gt;Each CPU core has its own &lt;strong&gt;L1 cache&lt;/strong&gt;, split into two halves: &lt;strong&gt;L1i&lt;/strong&gt; (instructions) and &lt;strong&gt;L1d&lt;/strong&gt; (data). Typical size: 32-64 KB each. Access time: roughly &lt;strong&gt;1 nanosecond&lt;/strong&gt;, or about 4-5 clock cycles on a modern core.&lt;/p&gt;

&lt;p&gt;32 KB sounds absurd for a working set. It is. But the L1 isn't meant to hold your working set — it's meant to hold the data and instructions the core needs &lt;em&gt;right now&lt;/em&gt;. Its hit rate on typical code is 95%+ because programs exhibit &lt;strong&gt;temporal locality&lt;/strong&gt; (you use something, you'll use it again soon) and &lt;strong&gt;spatial locality&lt;/strong&gt; (you use something, you'll use its neighbor soon).&lt;/p&gt;

&lt;h3&gt;
  
  
  L2: The Per-Core Buffer
&lt;/h3&gt;

&lt;p&gt;Each core also has a private &lt;strong&gt;L2 cache&lt;/strong&gt;. Typical size: 256 KB to 1 MB. Access time: roughly &lt;strong&gt;3-4 nanoseconds&lt;/strong&gt;. It's 4-8x slower than L1 but 4-16x larger.&lt;/p&gt;

&lt;p&gt;L2 catches the misses from L1. When L1 doesn't have what the core needs, L2 usually does. Combined L1+L2 hit rates typically exceed 97% for well-behaved code.&lt;/p&gt;

&lt;h3&gt;
  
  
  L3: The Shared Pool
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;L3 cache&lt;/strong&gt; is shared across all cores. On a modern desktop CPU it ranges from 8 MB (low-end) to 64 MB (AMD's X3D chips with stacked cache). On server processors, L3 can be 128 MB or more. Access time: roughly &lt;strong&gt;10-12 nanoseconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;L3's job is to catch misses from L2 and, critically, to be the rendezvous point for data shared between cores. If core 0 writes a value that core 4 needs, L3 is where they coordinate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cache Lines: The 64-Byte Atom
&lt;/h3&gt;

&lt;p&gt;The CPU never fetches a single byte from cache or memory. The smallest unit of transfer is a &lt;strong&gt;cache line&lt;/strong&gt; — 64 bytes on modern x86 and most ARM processors (Apple silicon is a notable exception at 128 bytes).&lt;/p&gt;

&lt;p&gt;When you read &lt;code&gt;array[0]&lt;/code&gt;, the CPU fetches the entire 64-byte block containing that address. If &lt;code&gt;array&lt;/code&gt; holds 4-byte integers, you just got &lt;code&gt;array[0]&lt;/code&gt; through &lt;code&gt;array[15]&lt;/code&gt; for free. This is spatial locality in action — and it's why iterating an array sequentially is fast. Each cache line fill pays for the next 15 accesses.&lt;/p&gt;

&lt;p&gt;It's also why your struct layout matters. If your struct is 72 bytes, every access touches two cache lines. If it's 64 bytes, it fits perfectly in one. If you have an array of structs where you only ever read one field, all the other fields are wasting cache space. That's the argument for struct-of-arrays vs. array-of-structs in performance-critical code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2p0255r989sf4ceoa2dr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2p0255r989sf4ceoa2dr.jpeg" alt="False sharing — two cores invalidating the same cache line, and the fix with padding" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Write-Back vs. Write-Through
&lt;/h3&gt;

&lt;p&gt;When the CPU writes to a cached location, it has two strategies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write-through&lt;/strong&gt;: Write to the cache &lt;em&gt;and&lt;/em&gt; to the next level simultaneously. Simple, always consistent, but slow — every write incurs the latency of the slower level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write-back&lt;/strong&gt;: Write only to the cache. Mark the line as "dirty." Write it to the next level later, when the line is evicted. Faster for the common case (multiple writes to the same line before eviction), but more complex — you need to track which lines are dirty.&lt;/p&gt;

&lt;p&gt;Modern CPUs use write-back at every level. The performance difference is enormous. A write-through L1 would bottleneck on L2 latency for every store instruction. Write-back means the core can fire off dozens of writes to L1 at full speed, and the dirty lines trickle down to L2 and L3 in the background.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cache Coherency: The MESI Protocol
&lt;/h3&gt;

&lt;p&gt;The moment you have multiple cores with private caches, you have a consistency problem. If core 0 and core 4 both cache the same memory line, and core 0 writes to it, core 4's copy is stale.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;MESI protocol&lt;/strong&gt; (and its variant &lt;strong&gt;MOESI&lt;/strong&gt;, used by AMD) solves this by assigning each cache line one of four states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modified&lt;/strong&gt;: This cache has the only valid copy, and it's been written to. Main memory is stale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exclusive&lt;/strong&gt;: This cache has the only copy, and it matches main memory. Can be written without notifying anyone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared&lt;/strong&gt;: Multiple caches hold this line. All copies match main memory. Must notify others before writing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invalid&lt;/strong&gt;: This line is not valid. Must be fetched from elsewhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When core 0 writes to a Shared line, it broadcasts an invalidation to all other cores. They mark their copies Invalid. Core 0's copy becomes Modified. This takes ~40-100 nanoseconds depending on the topology — it has to cross the interconnect, hit the other core's cache controller, and wait for acknowledgment.&lt;/p&gt;

&lt;p&gt;This is the mechanism behind &lt;strong&gt;false sharing&lt;/strong&gt;, one of the most insidious performance bugs in concurrent programming. Two threads write to different variables, but those variables happen to sit in the same 64-byte cache line. The hardware sees writes to the same line from different cores and starts the invalidation ping-pong. Neither thread is doing anything wrong logically, but physically they're fighting over the same cache line. I've seen false sharing cause a 10x slowdown on workloads that looked perfectly parallel.&lt;/p&gt;

&lt;p&gt;The fix is usually alignment padding — force the two variables onto different cache lines. Most languages have annotations for this (&lt;code&gt;alignas(64)&lt;/code&gt; in C++, &lt;code&gt;#[repr(align(64))]&lt;/code&gt; in Rust, &lt;code&gt;CacheLinePad&lt;/code&gt; patterns in Go).&lt;/p&gt;

&lt;h3&gt;
  
  
  Prefetching: The CPU Guesses Your Future
&lt;/h3&gt;

&lt;p&gt;Modern CPUs don't wait for cache misses. They try to predict what you'll access next and fetch it before you ask.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;hardware prefetcher&lt;/strong&gt; monitors your access patterns. Sequential access is the easiest case — if you read cache lines N, N+1, N+2, the prefetcher starts loading N+3, N+4 before you get there. Stride patterns (every 4th element, every 8th) are also detected. Random access patterns defeat the prefetcher entirely.&lt;/p&gt;

&lt;p&gt;This is another reason sequential memory access is fast. You're not only getting 16 array elements per cache line — the prefetcher is loading the &lt;em&gt;next&lt;/em&gt; cache line while you're still processing the current one. The combination means a sequential array scan can approach the theoretical bandwidth of the memory subsystem, while random access pays the full latency penalty on every access.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers That Matter
&lt;/h2&gt;

&lt;p&gt;Here's the latency hierarchy that shapes every performance decision:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Typical Size&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;CPU Cycles (~4 GHz)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L1 cache&lt;/td&gt;
&lt;td&gt;32-64 KB&lt;/td&gt;
&lt;td&gt;~1 ns&lt;/td&gt;
&lt;td&gt;~4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L2 cache&lt;/td&gt;
&lt;td&gt;256 KB - 1 MB&lt;/td&gt;
&lt;td&gt;~4 ns&lt;/td&gt;
&lt;td&gt;~16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3 cache&lt;/td&gt;
&lt;td&gt;8-64 MB&lt;/td&gt;
&lt;td&gt;~12 ns&lt;/td&gt;
&lt;td&gt;~48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DRAM&lt;/td&gt;
&lt;td&gt;16-128 GB&lt;/td&gt;
&lt;td&gt;~50-80 ns&lt;/td&gt;
&lt;td&gt;~200-320&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every step down the hierarchy is roughly 4x slower and 10-100x larger. This isn't a coincidence — it's the design constraint. If L2 were as fast as L1, it would be the same size and cost. If DRAM were as fast as SRAM, we wouldn't need caches.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;memory wall&lt;/strong&gt; is the term for the growing gap between CPU speed and memory speed. In 1985, a CPU cycle and a memory access took roughly the same time. By 2005, the CPU was 1000x faster but memory was only 10x faster. Caches exist entirely to hide this 100x disparity. Every cache hit at L1 means the CPU avoided a 50-80 nanosecond stall — an eternity at 4 GHz.&lt;/p&gt;

&lt;p&gt;This is why data structures matter more than algorithms for many real-world workloads. A linked list traversal — pointer-chasing through random heap locations — defeats prefetching, defeats spatial locality, and hits DRAM latency on nearly every node. A flat array scan of the same data, even with a worse algorithmic complexity, can be faster because every access is an L1 hit.&lt;/p&gt;
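&lt;p&gt;The effect is visible even from a high-level language. A minimal sketch, assuming nothing beyond the standard library: sum the same contiguous buffer twice, once in order and once in a shuffled order. The work and the result are identical; only the access pattern changes, and on most machines the shuffled pass is measurably slower (Python's interpreter overhead mutes the gap, but rarely erases it).&lt;/p&gt;

```python
import random
import time
from array import array

N = 1_000_000
data = array("q", range(N))        # one contiguous buffer of 64-bit ints

seq_idx = list(range(N))
rand_idx = seq_idx[:]
random.shuffle(rand_idx)           # same indices, random visiting order

def sum_by_index(buf, indices):
    total = 0
    for i in indices:
        total += buf[i]
    return total

t0 = time.perf_counter()
s_seq = sum_by_index(data, seq_idx)
t1 = time.perf_counter()
s_rand = sum_by_index(data, rand_idx)
t2 = time.perf_counter()

assert s_seq == s_rand             # identical work, identical answer
print(f"sequential {t1 - t0:.3f}s, shuffled {t2 - t1:.3f}s")
```

&lt;p&gt;In a systems language with the interpreter out of the way, the same experiment routinely shows a several-fold gap.&lt;/p&gt;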

&lt;p&gt;I'm not saying throw away your algorithms textbook. I'm saying the cost model it assumed — that all memory accesses cost the same — hasn't been true since the 1990s.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Stack of a Single Address
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;malloc()&lt;/code&gt; gave you an address. That address maps, through the page table and memory controller, to a specific channel, rank, bank, row, and column — an index into a grid of capacitors that are leaking right now. The memory controller refreshes every row every 64 milliseconds. A sense amplifier destroyed and rebuilt your data the last time anyone read it. Six layers of caching — L1i, L1d, L2, L3, TLB, prefetch buffers — exist to hide the fact that those capacitors take 300+ CPU cycles to respond.&lt;/p&gt;

&lt;p&gt;Every struct you lay out, every array you iterate, every concurrent data structure you design — the cache hierarchy is the invisible judge of your performance. The CPU hasn't been the bottleneck for decades. Memory has.&lt;/p&gt;

&lt;p&gt;And now you know what "memory" actually is: a grid of leaking capacitors, maintained by a refresh circuit running a race against physics, fronted by a hierarchy of SRAM caches playing an elaborate prediction game about what you'll need next.&lt;/p&gt;

&lt;p&gt;The next time a profile shows a stall you can't explain, you'll know where to look.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf" rel="noopener noreferrer"&gt;What Every Programmer Should Know About Memory — Ulrich Drepper (2007)&lt;/a&gt;&lt;/strong&gt; — the definitive deep dive, still relevant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.jedec.org/standards-documents/docs/jesd79-5b" rel="noopener noreferrer"&gt;JEDEC DDR5 SDRAM Standard&lt;/a&gt;&lt;/strong&gt; — the official spec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://users.ece.cmu.edu/~yoMDram/kim-isca14.pdf" rel="noopener noreferrer"&gt;Flipping Bits in Memory Without Accessing Them — Kim et al. (2014)&lt;/a&gt;&lt;/strong&gt; — the original rowhammer paper&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://research.google/pubs/pub35162/" rel="noopener noreferrer"&gt;DRAM Errors in the Wild: A Large-Scale Field Study — Schroeder et al. (2009)&lt;/a&gt;&lt;/strong&gt; — Google's study on real-world DRAM error rates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://igoro.com/archive/gallery-of-processor-cache-effects/" rel="noopener noreferrer"&gt;Gallery of Processor Cache Effects — Igor Ostrovsky&lt;/a&gt;&lt;/strong&gt; — excellent visual demonstrations of cache behavior&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri has mass-produced more cache misses than he'd like to admit, mostly by iterating linked lists. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How a Viking King Ended Up in Your Earbuds</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:04:29 +0000</pubDate>
      <link>https://forem.com/nazq/how-a-viking-king-ended-up-in-your-earbuds-1c71</link>
      <guid>https://forem.com/nazq/how-a-viking-king-ended-up-in-your-earbuds-1c71</guid>
      <description>&lt;h1&gt;
  
  
  How a Viking King Ended Up in Your Earbuds
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Bluetooth: From Runic Initials to 1600 Hops Per Second
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~15 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You paired your headphones this morning. Your phone found them, you tapped "connect," and music started playing. That felt like nothing happened. But between that tap and the first beat of audio, your phone and your headphones agreed on an encryption key, negotiated a codec, established a frequency-hopping pattern across 79 radio channels, and started jumping between those channels 1,600 times per second -- all in a band shared with your WiFi router, your microwave, and every other wireless device in the building.&lt;/p&gt;

&lt;p&gt;The protocol that makes this work is named after a Viking king who's been dead for a thousand years.&lt;/p&gt;

&lt;p&gt;That's not a joke. It's one of the best naming stories in the history of technology.&lt;/p&gt;




&lt;h2&gt;
  
  
  Harald Bluetooth Gormsson
&lt;/h2&gt;

&lt;p&gt;In the mid-900s AD, Denmark was a mess. Warring tribes, fractured loyalties, no unified rule. Into this stepped &lt;strong&gt;Harald Gormsson&lt;/strong&gt;, who became King of Denmark around 958 AD and later King of Norway. He's remembered for two things: unifying the warring Danish tribes under a single kingdom, and converting the Danes to Christianity. His nickname was "Bluetooth" -- most likely from a conspicuously dead tooth that had turned dark blue, though some historians argue it derives from the Old Norse &lt;em&gt;blátand&lt;/em&gt; meaning "dark chieftain."&lt;/p&gt;

&lt;p&gt;Harald's real legacy was political unification. He took factions that refused to talk to each other and brought them under one protocol. One kingdom. One set of rules.&lt;/p&gt;

&lt;p&gt;Fast-forward about a thousand years to 1997.&lt;/p&gt;

&lt;p&gt;Intel engineer &lt;strong&gt;Jim Kardach&lt;/strong&gt; was at a bar in Toronto with Sven Mattisson from Ericsson, and they were wrestling with a problem: IBM, Ericsson, Nokia, Toshiba, and Intel all had competing short-range wireless standards. Each company wanted their own protocol to win. Nobody was budging.&lt;/p&gt;

&lt;p&gt;Kardach had been reading Frans G. Bengtsson's historical novel &lt;em&gt;The Long Ships&lt;/em&gt; -- a book about Vikings and the reign of Harald Bluetooth. The parallel was too perfect. Harald unified warring Scandinavian tribes. This new protocol needed to unify warring tech companies. Kardach proposed "Bluetooth" as a temporary codename while the consortium figured out a real name.&lt;/p&gt;

&lt;p&gt;The consortium was called the &lt;strong&gt;Bluetooth Special Interest Group&lt;/strong&gt; (SIG). They considered names like "RadioWire" and "PAN" (Personal Area Network). Focus groups were run. Marketing decks were produced. But by the time the final name was due, nothing had cleared trademark searches. "Bluetooth" had already stuck in the press and in internal documentation.&lt;/p&gt;

&lt;p&gt;The placeholder became the name.&lt;/p&gt;

&lt;p&gt;And the logo? It's not an abstract design. It's a &lt;strong&gt;bind rune&lt;/strong&gt; -- two runes from the Younger Futhark alphabet merged into a single glyph. &lt;strong&gt;ᚼ&lt;/strong&gt; (Hagall, Harald's H) and &lt;strong&gt;ᛒ&lt;/strong&gt; (Bjarkan, his B), overlaid on each other. Harald's initials, in the alphabet of his own era, sitting on billions of devices a millennium later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tnztmtgwrk6q8cqinwm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tnztmtgwrk6q8cqinwm.jpeg" alt="The Bluetooth logo deconstructed: Hagall and Bjarkan runes merging into the Bluetooth bind rune" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A 10th-century Viking king named the protocol in your pocket. Every time you see that angular logo on a speaker or a pair of earbuds, you're looking at runic initials from before the Norman Conquest of England.&lt;/p&gt;

&lt;p&gt;If like me you went to school in England, you've already met his family — you just don't know it. Harald Bluetooth's son was &lt;strong&gt;Sweyn Forkbeard&lt;/strong&gt;, who invaded England and was crowned king on Christmas Day 1013. Sweyn's son was &lt;strong&gt;Cnut the Great&lt;/strong&gt; — the king every English school kid learns about, and a simple misspelling away from class laughter.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; Cnut ruled England, Denmark, and Norway simultaneously — the North Sea Empire, the largest domain in northern Europe.&lt;/p&gt;

&lt;p&gt;The man who named your wireless protocol is the grandfather of the most powerful Viking king who ever sat on the English throne. Harald himself never invaded England — he was busy unifying Denmark and making the Danes Christian. But his dynasty did. The &lt;a href="https://en.wikipedia.org/wiki/Jelling_stones" rel="noopener noreferrer"&gt;Jelling Stone&lt;/a&gt; he erected — "Denmark's baptismal certificate," still standing in Jutland — declared that he "won for himself all of Denmark and Norway and made the Danes Christian." His grandson won England too.&lt;/p&gt;

&lt;p&gt;If you watched the TV show &lt;a href="https://www.imdb.com/title/tt2306299/" rel="noopener noreferrer"&gt;&lt;em&gt;Vikings&lt;/em&gt;&lt;/a&gt; hoping to spot him, you were about 80 years too late — the show ends around 878 AD, and Harald didn't reign until 958. &lt;a href="https://www.imdb.com/title/tt11311302/" rel="noopener noreferrer"&gt;&lt;em&gt;Vikings: Valhalla&lt;/em&gt;&lt;/a&gt; starts in 1002, about 16 years after he died. He fell right into the gap between both shows. The Viking age had a lot of Haralds; this is the one that matters for your earbuds.&lt;/p&gt;

&lt;p&gt;I think about this every time I pair my headphones. I can't help it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Radio: 79 Channels and a Lot of Jumping
&lt;/h2&gt;

&lt;p&gt;Bluetooth operates in the &lt;strong&gt;2.4 GHz ISM band&lt;/strong&gt; -- the same unlicensed slice of radio spectrum used by WiFi, baby monitors, cordless phones, and microwave ovens. This band is crowded. Catastrophically crowded. The reason Bluetooth works at all in this environment is a technique borrowed from military radio: &lt;strong&gt;frequency hopping spread spectrum&lt;/strong&gt; (FHSS).&lt;/p&gt;

&lt;p&gt;The 2.4 GHz band is divided into 79 channels, each 1 MHz wide, spanning from 2.402 GHz to 2.480 GHz. A Bluetooth connection doesn't sit on one channel. It hops between all 79 channels in a pseudo-random sequence, changing frequency &lt;strong&gt;1,600 times per second&lt;/strong&gt;. That's a new channel every 625 microseconds.&lt;/p&gt;
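&lt;p&gt;The channel map is plain arithmetic -- channel &lt;em&gt;k&lt;/em&gt; sits at 2402 + &lt;em&gt;k&lt;/em&gt; MHz -- and the hop rate falls straight out of the 625-microsecond slot length. A quick sketch:&lt;/p&gt;

```python
def channel_freq_mhz(k):
    """RF frequency of Classic Bluetooth channel k (0 through 78)."""
    if k not in range(79):
        raise ValueError("Classic Bluetooth has channels 0 through 78")
    return 2402 + k

SLOT_US = 625                           # dwell time per hop, in microseconds
hops_per_second = 1_000_000 // SLOT_US  # 1,600 hops every second

print(channel_freq_mhz(0), channel_freq_mhz(78), hops_per_second)
# prints: 2402 2480 1600
```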

&lt;p&gt;Why does this help? Because interference is usually narrowband. Your microwave blasts a few MHz of the 2.4 GHz band with noise. A WiFi router parks on a 20 MHz or 40 MHz chunk. But if your Bluetooth connection is only on any given frequency for 625 microseconds before jumping elsewhere, a microwave blast on channel 34 only corrupts 1 out of 79 hops. The other 78 get through clean. The protocol retransmits the missed packet on the next visit to a clean channel. You never notice.&lt;/p&gt;

&lt;p&gt;The hopping sequence is determined by the &lt;strong&gt;master device's clock and address&lt;/strong&gt;. Both devices in a connection know the sequence. To an outside observer without the sequence, the signal looks like noise spread across the entire band -- which is the entire point. FHSS was originally developed for military communications precisely because it makes signals hard to intercept or jam. It's the same pattern as the internet itself — &lt;a href="https://en.wikipedia.org/wiki/ARPANET" rel="noopener noreferrer"&gt;ARPANET&lt;/a&gt; was a military network designed to survive partial destruction, and now you use its descendant to order takeout. Military technology has a habit of ending up in your living room.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkafew0crr4c9qcl7drve.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkafew0crr4c9qcl7drve.jpeg" alt="Frequency hopping spread spectrum: Bluetooth hops across 79 channels, skipping interference zones" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bluetooth Low Energy (BLE) does things differently -- it uses only 40 channels, each 2 MHz wide, and hops at a different rate. But the principle is identical: don't sit still, don't get jammed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Piconet: One Clock to Rule Them
&lt;/h2&gt;

&lt;p&gt;When Bluetooth devices connect, they form a &lt;strong&gt;piconet&lt;/strong&gt; -- a tiny ad-hoc network with one device running the show.&lt;/p&gt;

&lt;p&gt;The device that initiates the connection becomes the &lt;strong&gt;central&lt;/strong&gt; device (historically called "master" -- the Bluetooth SIG updated the terminology in 2020). The device that accepts becomes the &lt;strong&gt;peripheral&lt;/strong&gt; (formerly "slave"). The central's clock becomes the reference clock for the entire piconet. All frequency hopping is synchronized to it.&lt;/p&gt;

&lt;p&gt;A single piconet supports up to &lt;strong&gt;7 active peripherals&lt;/strong&gt;. That's not arbitrary -- peripheral addresses are 3 bits, giving 8 values, and the all-zero address is reserved for broadcast, leaving 7 for active peripherals. Additional devices can be "parked" -- they stay synchronized to the piconet clock but can't send or receive data until unparked. Up to 255 devices can park on a single piconet.&lt;/p&gt;

&lt;p&gt;What happens when a device needs to be in two piconets at once? Your phone connected to your car stereo &lt;em&gt;and&lt;/em&gt; your smartwatch, for instance? It participates in a &lt;strong&gt;scatternet&lt;/strong&gt; -- overlapping piconets where a device acts as peripheral in one and central in another, time-slicing its radio between the two. This is why audio sometimes stutters briefly when your smartwatch gets a notification while you're streaming music to your car. The phone's radio just timesliced away from your A2DP stream for a moment to handle the watch's piconet.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pairing and Bonding: The Handshake
&lt;/h2&gt;

&lt;p&gt;Before two devices can talk, they need to establish trust. This is &lt;strong&gt;pairing&lt;/strong&gt; -- the initial key exchange. Once paired, the keys are stored so reconnection is instant. That storage is called &lt;strong&gt;bonding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The pairing process has evolved across Bluetooth versions, but the core problem is always the same: two devices that have never met need to agree on a shared secret over a radio link that anyone nearby can listen to.&lt;/p&gt;

&lt;h3&gt;
  
  
  Legacy Pairing (Bluetooth 2.0 and earlier)
&lt;/h3&gt;

&lt;p&gt;Both devices require the same &lt;strong&gt;PIN code&lt;/strong&gt;. You type it on both sides (or one side has a fixed PIN, like "0000" for simple devices). The PIN seeds the key generation. This is crude. A 4-digit PIN has only 10,000 possible values. An attacker recording the pairing exchange can brute-force it in seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secure Simple Pairing (Bluetooth 2.1+)
&lt;/h3&gt;

&lt;p&gt;SSP introduced four models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Numeric Comparison&lt;/strong&gt;: both devices display a 6-digit number, you confirm they match. Uses Elliptic Curve Diffie-Hellman (ECDH) for the key exchange. Secure against passive eavesdropping and man-in-the-middle attacks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Passkey Entry&lt;/strong&gt;: one device displays a number, you type it on the other. For devices where one side has a keyboard but no display.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out of Band (OOB)&lt;/strong&gt;: the pairing information is exchanged via a different channel -- NFC tap, QR code scan. Secure because the attacker would need to compromise both channels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Just Works&lt;/strong&gt;: ECDH key exchange with no user confirmation. Protects against passive eavesdropping but &lt;em&gt;not&lt;/em&gt; against active man-in-the-middle attacks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;"Just Works" exists because headphones don't have screens or keyboards.&lt;/strong&gt; There's no way to display a confirmation number on a $30 pair of earbuds. The trade-off is deliberate: slightly lower security in exchange for the pairing experience not requiring a device with a display. For most consumer audio, this is the right call. For medical devices or payment terminals, it's absolutely not -- which is why those use Numeric Comparison or OOB.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Stacks, One Name: Classic vs BLE
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Classic Bluetooth and Bluetooth Low Energy (BLE) are fundamentally different protocol stacks&lt;/strong&gt; that happen to share a name, a radio band, and a logo.&lt;/p&gt;

&lt;p&gt;Classic Bluetooth (formally "BR/EDR" -- Basic Rate/Enhanced Data Rate) was designed for continuous data streams. Audio. File transfer. Serial port emulation. It keeps a persistent connection, and its power consumption reflects that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bluetooth Low Energy&lt;/strong&gt; (BLE, originally marketed as "Bluetooth Smart") was introduced in the Bluetooth 4.0 specification in 2010. It was designed from scratch for a different use case: devices that send tiny amounts of data infrequently and need to run on a coin cell battery for years. Fitness trackers. Temperature sensors. AirTags.&lt;/p&gt;

&lt;p&gt;The two stacks are so different that early Bluetooth 4.0 chips came in three varieties: Classic-only, BLE-only, and "dual-mode" chips that supported both. Your phone has a dual-mode chip. Your AirTag has a BLE-only chip. The fact that they both say "Bluetooth" on the box is a branding decision, not a technical statement.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Classic (BR/EDR)&lt;/th&gt;
&lt;th&gt;BLE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Designed for&lt;/td&gt;
&lt;td&gt;Continuous streams&lt;/td&gt;
&lt;td&gt;Short bursts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Channels&lt;/td&gt;
&lt;td&gt;79 x 1 MHz&lt;/td&gt;
&lt;td&gt;40 x 2 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data rate&lt;/td&gt;
&lt;td&gt;1-3 Mbps&lt;/td&gt;
&lt;td&gt;1-2 Mbps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power&lt;/td&gt;
&lt;td&gt;Milliwatts&lt;/td&gt;
&lt;td&gt;Microwatts (sleep)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Range&lt;/td&gt;
&lt;td&gt;~10-100m&lt;/td&gt;
&lt;td&gt;~10-100m (similar)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connection&lt;/td&gt;
&lt;td&gt;Persistent&lt;/td&gt;
&lt;td&gt;Wake-transmit-sleep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical use&lt;/td&gt;
&lt;td&gt;Audio, file transfer&lt;/td&gt;
&lt;td&gt;Sensors, beacons, tags&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  BLE Advertising: Shouting Into the Void
&lt;/h2&gt;

&lt;p&gt;One of the most elegant things about BLE is that devices can broadcast data &lt;em&gt;without ever pairing&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A BLE device in &lt;strong&gt;advertising mode&lt;/strong&gt; sends out small broadcast packets -- up to 31 bytes of data -- on three dedicated advertising channels (37, 38, and 39, deliberately spread across the band to avoid single-frequency interference). These packets go out at a configurable interval, anywhere from every 20 milliseconds to every 10.24 seconds.&lt;/p&gt;

&lt;p&gt;This is how your AirTag works. It never pairs with the phones that relay its location. It broadcasts a rotating identifier on advertising channels. Every iPhone in range passively picks up these advertisements as part of normal BLE scanning, encrypts the location with Apple's key, and relays it to the Find My network. The AirTag's battery lasts a year because it does almost nothing except advertise a few bytes every couple of seconds.&lt;/p&gt;

&lt;p&gt;This is also how Bluetooth beacons work in retail stores, museums, and airports. The beacon doesn't know or care who's listening. It advertises a UUID, and any app configured to listen for that UUID can react. The beacon is stateless. The intelligence lives on the receiver side.&lt;/p&gt;

&lt;p&gt;The advertising data is structured: each field has a type and a value. Common types include the device name, service UUIDs, manufacturer-specific data, and TX power level (which receivers use to estimate distance based on signal strength). Bluetooth 5.0 extended advertising to allow much larger payloads -- up to 255 bytes in the extended format -- enabling richer broadcasts without requiring a connection.&lt;/p&gt;
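&lt;p&gt;That structure is easy to build by hand. A minimal sketch of a legacy advertising payload -- each field is a length byte, a type byte, and a value. The type codes for Flags (&lt;code&gt;0x01&lt;/code&gt;) and Complete Local Name (&lt;code&gt;0x09&lt;/code&gt;) come from the Bluetooth assigned numbers; the device name is made up:&lt;/p&gt;

```python
def ad_structure(ad_type, value):
    """One advertising data structure: [length, type, value]."""
    if isinstance(value, str):
        value = value.encode("utf-8")
    return bytes([len(value) + 1, ad_type]) + value

FLAGS = 0x01                  # AD type: Flags
COMPLETE_LOCAL_NAME = 0x09    # AD type: Complete Local Name

# Flags value 0x06 = LE General Discoverable plus BR/EDR Not Supported
payload = ad_structure(FLAGS, bytes([0x06])) + ad_structure(
    COMPLETE_LOCAL_NAME, "Thermometer"
)

assert len(payload) in range(32)   # legacy advertising caps out at 31 bytes
print(payload.hex())
```

&lt;p&gt;A scanner walks the payload the same way in reverse: read a length, read a type, slice out the value, repeat.&lt;/p&gt;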




&lt;h2&gt;
  
  
  GATT: The API Layer
&lt;/h2&gt;

&lt;p&gt;When a BLE device does establish a connection (as opposed to passively advertising), the data exchange follows a structured protocol called &lt;strong&gt;GATT&lt;/strong&gt; -- the Generic Attribute Profile. If you've ever worked with a BLE device programmatically, GATT is the API you talked to.&lt;/p&gt;

&lt;p&gt;GATT organizes data into a hierarchy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Profile&lt;/strong&gt;: a collection of services that define a use case (e.g., "Heart Rate Profile")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service&lt;/strong&gt;: a group of related data points, identified by a UUID (e.g., "Heart Rate Service" = &lt;code&gt;0x180D&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Characteristic&lt;/strong&gt;: an individual data value within a service (e.g., "Heart Rate Measurement" = &lt;code&gt;0x2A37&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Descriptor&lt;/strong&gt;: metadata about a characteristic (e.g., units, valid range, notification settings)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A BLE heart rate monitor exposes a Heart Rate Service. Inside that service is a Heart Rate Measurement characteristic that a connected phone can read or subscribe to for notifications. When the value changes, the peripheral pushes an update to the central without the central having to poll.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbf9vatule3fu65i4lenj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbf9vatule3fu65i4lenj.jpeg" alt="BLE GATT hierarchy: Profile, Service, Characteristic, and Descriptor layers with Heart Rate example" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The beauty of GATT is standardization. The Bluetooth SIG has defined hundreds of standard service and characteristic UUIDs. Any heart rate monitor from any manufacturer uses the same UUIDs for the same data. Your running app doesn't need a driver for each brand of chest strap -- it looks for service &lt;code&gt;0x180D&lt;/code&gt; and reads characteristic &lt;code&gt;0x2A37&lt;/code&gt;. The interoperability is baked into the protocol.&lt;/p&gt;
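&lt;p&gt;Decoding that characteristic takes a few lines. A sketch of a &lt;code&gt;0x2A37&lt;/code&gt; parser -- per the standard format, byte 0 is a flags field whose lowest bit says whether the reading is 8-bit or 16-bit little-endian (this sketch ignores the optional energy-expended and RR-interval fields):&lt;/p&gt;

```python
def parse_heart_rate(value):
    """Decode a Heart Rate Measurement (0x2A37) value.

    Byte 0 is a flags field; its lowest bit says whether the reading that
    follows is an 8-bit value or a 16-bit little-endian value.
    """
    flags = value[0]
    if flags % 2:                                  # bit 0 set: 16-bit format
        return int.from_bytes(value[1:3], "little")
    return value[1]                                # bit 0 clear: 8-bit format

print(parse_heart_rate(bytes([0x00, 72])))         # prints 72
print(parse_heart_rate(bytes([0x01, 0x2C, 0x01]))) # prints 300
```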




&lt;h2&gt;
  
  
  Profiles: Why Your Car Can't Display Album Art
&lt;/h2&gt;

&lt;p&gt;Back in Classic Bluetooth territory, the concept of &lt;strong&gt;profiles&lt;/strong&gt; defines what devices can actually &lt;em&gt;do&lt;/em&gt; with their connection. A profile is a specification for a particular use case.&lt;/p&gt;

&lt;p&gt;The ones you interact with daily:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A2DP&lt;/strong&gt; (Advanced Audio Distribution Profile): stereo audio streaming. This is what your headphones use for music.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HFP&lt;/strong&gt; (Hands-Free Profile): phone calls. Mono audio, microphone, call control. This is why call audio sounds worse than music -- it's a different profile with a different codec at a lower bandwidth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HID&lt;/strong&gt; (Human Interface Device): keyboards, mice, game controllers. Uses the same report format as &lt;a href="https://nazquadri.dev/blog/the-layer-below/13-usb/" rel="noopener noreferrer"&gt;USB HID&lt;/a&gt;, which is why Bluetooth keyboards can send the same scan codes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SPP&lt;/strong&gt; (Serial Port Profile): emulates an RS-232 serial port. Beloved by embedded engineers. Gives you a simple bidirectional byte stream between two devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVRCP&lt;/strong&gt; (Audio/Video Remote Control Profile): play, pause, skip, volume. And -- in newer versions -- track metadata like title, artist, and album art.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one is why your car stereo might not display album art from your phone. AVRCP has multiple versions. Version 1.0 only supports play/pause/skip. Version 1.3 added track metadata (title, artist, album). Version 1.6 added album art. If your car's Bluetooth module implements AVRCP 1.3 and your phone sends 1.6 features, the car doesn't understand the album art data. Both sides settle on the highest version they both support.&lt;/p&gt;

&lt;p&gt;This is the profile negotiation problem in miniature. Every "why doesn't X work with Y" Bluetooth complaint usually traces back to mismatched profile versions, not a fundamental incompatibility.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Audio Codec Situation
&lt;/h2&gt;

&lt;p&gt;When you stream music over A2DP, the audio must be compressed to fit through Bluetooth's bandwidth constraints. The codec doing that compression has a massive impact on what you hear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SBC&lt;/strong&gt; (Sub-Band Codec) is the mandatory baseline. Every A2DP device must support it. It was designed in the early 2000s for the computational constraints of the era. It works. It's also noticeably worse than the source material at its default settings. SBC at the standard bitrate of 328 kbps sounds muddy in the highs and lacks spatial detail. It's the reason "Bluetooth audio sounds bad" became conventional wisdom in the 2010s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AAC&lt;/strong&gt; is Apple's preferred codec. iPhones use AAC for Bluetooth audio by default. It sounds substantially better than SBC at the same bitrate, but its quality depends on the encoder implementation — and Android's AAC encoder has historically been worse than Apple's, making AAC-over-Bluetooth inconsistent across platforms. This is the kernel of truth behind every Apple user's claim that their audio sounds better than Android — but it's the &lt;em&gt;encoder&lt;/em&gt;, not the platform. Pair high-end earbuds that support aptX or LDAC with an Android phone and the codec difference evaporates. The Bose QuietComfort Ultras on my desk don't care what Apple thinks about AAC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;aptX&lt;/strong&gt; is Qualcomm's proprietary codec family. aptX Classic targets "CD-like quality." aptX HD pushes to 24-bit/48kHz. aptX Adaptive dynamically adjusts bitrate based on connection quality. All require Qualcomm licensing and hardware support on both ends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LDAC&lt;/strong&gt; is Sony's codec. It can push up to 990 kbps -- nearly three times SBC's standard rate. At its best, it approaches lossless quality over Bluetooth. At its worst (in congested RF environments), it drops to 330 kbps and falls back to SBC-like quality. LDAC is open-sourced and available in Android's AOSP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LC3&lt;/strong&gt; (Low Complexity Communication Codec) is the new standard, part of the Bluetooth 5.2 LE Audio specification. LC3 achieves better subjective quality than SBC at &lt;em&gt;half the bitrate&lt;/em&gt;. The Bluetooth SIG's own listening tests showed that subjects preferred LC3 at 160 kbps over SBC at 345 kbps. LC3 is mandatory for LE Audio devices, which means it will eventually become the universal baseline, replacing SBC's 20-year reign.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Bluetooth Audio Has Latency
&lt;/h2&gt;

&lt;p&gt;The latency you notice when watching video with Bluetooth headphones -- lips moving before words arrive -- comes from multiple sources stacking up.&lt;/p&gt;

&lt;p&gt;The audio codec needs a buffer of samples to compress. SBC's frame size is 128 samples. At 44.1 kHz, that's about 2.9 milliseconds per frame. But the codec usually batches multiple frames before transmission.&lt;/p&gt;

&lt;p&gt;The Bluetooth baseband adds its own buffering. Transmissions are scheduled into 625-microsecond slots, and the controller typically queues several encoded frames per radio packet, adding milliseconds more. The data then crosses the radio link, gets received, buffered, decoded, and sent to the DAC.&lt;/p&gt;

&lt;p&gt;Total end-to-end latency for SBC over Classic Bluetooth is typically &lt;strong&gt;150-250 milliseconds&lt;/strong&gt;. That's noticeable for video. Some phones compensate by delaying the video to match. Some don't. aptX Low Latency claims under 40 ms, but both sides need to support it.&lt;/p&gt;
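&lt;p&gt;The arithmetic is worth doing once. A back-of-the-envelope sketch -- the frame math follows from the numbers above, while the budget entries are illustrative orders of magnitude, not measurements of any particular device:&lt;/p&gt;

```python
SAMPLE_RATE = 44_100        # Hz
SBC_FRAME_SAMPLES = 128

frame_ms = SBC_FRAME_SAMPLES / SAMPLE_RATE * 1000
print(f"one SBC frame holds {frame_ms:.2f} ms of audio")   # about 2.90 ms

# Illustrative budget -- rough orders of magnitude, not measurements:
budget_ms = {
    "encoder batching (a few frames)": 4 * frame_ms,
    "link scheduling and retransmission headroom": 30.0,
    "receiver jitter buffer": 100.0,
    "decode plus DAC": 10.0,
}
total = sum(budget_ms.values())
print(f"rough end-to-end total: {total:.0f} ms")   # lands in the 150-250 ms range
```

&lt;p&gt;The receiver's jitter buffer dominates: it exists so a few corrupted hops can be retransmitted without the music skipping, and that insurance is paid for in latency.&lt;/p&gt;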

&lt;p&gt;&lt;strong&gt;LE Audio changes the game.&lt;/strong&gt; LC3 was designed for lower latency from the start. LE Audio's isochronous channels provide guaranteed timing (unlike Classic Bluetooth's retransmission-based reliability). The target for LE Audio is &lt;strong&gt;20-30 milliseconds&lt;/strong&gt; end-to-end -- below the threshold where most people perceive audio-visual desynchronization. LE Audio also brings &lt;strong&gt;Auracast&lt;/strong&gt; -- broadcast audio where one source can stream to unlimited receivers simultaneously. Think public venue announcements, silent disco, hearing aid loops. One transmitter, every listener.&lt;/p&gt;




&lt;h2&gt;
  
  
  Coexistence: Sharing the 2.4 GHz Band
&lt;/h2&gt;

&lt;p&gt;Bluetooth and WiFi both live in the 2.4 GHz ISM band. They are neighbors who didn't choose each other. The fact that they coexist at all is a small engineering miracle.&lt;/p&gt;

&lt;p&gt;WiFi (802.11b/g/n in 2.4 GHz) uses wide channels -- 20 MHz or 40 MHz. A single WiFi channel covers 20 to 40 of Bluetooth's 79 narrowband channels. WiFi transmits at higher power (up to 100 mW vs Bluetooth's typical 1-2.5 mW). By rights, WiFi should obliterate Bluetooth.&lt;/p&gt;

&lt;p&gt;It doesn't, because of &lt;strong&gt;Adaptive Frequency Hopping&lt;/strong&gt; (AFH). Introduced in Bluetooth 1.2, AFH lets a Bluetooth connection detect which channels are occupied by WiFi or other interference and remove them from the hopping sequence. If channels 20-40 are blasted by a WiFi router, the Bluetooth piconet hops across the remaining 59 channels instead. The hop rate stays the same -- 1,600 per second -- but the pattern avoids the known-bad frequencies.&lt;/p&gt;
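&lt;p&gt;The channel-map idea fits in a few lines. A simplified sketch -- real AFH classifies channels from measured packet errors, but here we just mask everything within 10 MHz of an assumed WiFi center frequency:&lt;/p&gt;

```python
def afh_channel_map(wifi_center_mhz, guard_mhz=10):
    """Bluetooth channel k sits at 2402 + k MHz (k = 0..78).
    Return the channels still usable once everything under the
    interfering WiFi channel is masked out."""
    usable = []
    for k in range(79):
        if abs(2402 + k - wifi_center_mhz) > guard_mhz:
            usable.append(k)
    return usable

# WiFi channel 6 is centered at 2437 MHz
survivors = afh_channel_map(2437)   # 58 channels left to hop across
```

&lt;p&gt;The hop rate doesn't change; only the pattern does. (The spec also requires at least 20 channels stay in the map, a floor this sketch doesn't enforce.)&lt;/p&gt;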

&lt;p&gt;On devices that have both a WiFi and Bluetooth radio (which is every phone, laptop, and tablet), a more sophisticated mechanism kicks in: &lt;strong&gt;coexistence signaling&lt;/strong&gt;. The WiFi and Bluetooth controllers share a physical wire or a bus connection. When the WiFi radio is about to transmit, it signals the Bluetooth controller, which can defer its own transmission by a fraction of a millisecond. And vice versa. This is why your phone's WiFi and Bluetooth don't constantly interfere with each other even though the antennas are millimeters apart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0nys26b2cff5jluem5yq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0nys26b2cff5jluem5yq.jpeg" alt="WiFi and Bluetooth coexistence: Adaptive Frequency Hopping avoids WiFi channels in the 2.4 GHz band" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The 5 GHz and 6 GHz WiFi bands (802.11ac/ax) don't overlap with Bluetooth at all. If your WiFi network runs exclusively on 5 GHz, there's zero contention. This is the simplest fix for the person who Googles "Bluetooth interference WiFi" -- move your WiFi to 5 GHz and the two protocols never see each other.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Part That Gets Me
&lt;/h2&gt;

&lt;p&gt;I keep coming back to the naming.&lt;/p&gt;

&lt;p&gt;In 1997, Jim Kardach was reading a novel about a Viking king who unified Scandinavian tribes. He was sitting across from engineers whose companies couldn't agree on a wireless standard. He saw the parallel and threw out a name as a placeholder.&lt;/p&gt;

&lt;p&gt;That placeholder outlived every serious marketing attempt to replace it. It outlived IBM's exit from the consortium. It outlived the original standard it was created for -- Classic Bluetooth is slowly being superseded by BLE and LE Audio, but the name persists. The runic bind-rune logo persists. Harald Bluetooth Gormsson, dead for over a thousand years, has his initials on more devices than any human in history.&lt;/p&gt;

&lt;p&gt;Harald unified tribes. Bluetooth unifies devices. The metaphor was perfect in 1997, and it's even more perfect now that the protocol connects 5 billion devices annually.&lt;/p&gt;

&lt;p&gt;I wonder what he'd make of it. A 10th-century king who couldn't read Latin, who ruled a country smaller than South Carolina, whose greatest technological achievement was a runestone — and his initials are on 5 billion devices. His name is spoken more often now than it was during his own reign. Not by subjects or chroniclers, but by people saying "turn on Bluetooth" while fumbling with their car stereo.&lt;/p&gt;

&lt;p&gt;He'd probably understand the unification part. Getting Ericsson and Nokia and Intel to agree on a protocol isn't so different from getting Jutland and Zealand to stop fighting. Both require convincing proud, territorial powers that the shared standard serves them better than the private one. Both require someone stubborn enough to hold the table together until the deal is done.&lt;/p&gt;

&lt;p&gt;The frequency hopping, the piconets, the GATT profiles — he wouldn't understand any of it. But the politics? A king who unified warring tribes would recognise a standards body instantly. Same game. Same stakes. Smaller swords. Similar beards.&lt;/p&gt;

&lt;p&gt;Not bad for a placeholder.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.bluetooth.com/specifications/specs/core-specification/" rel="noopener noreferrer"&gt;Bluetooth Core Specification&lt;/a&gt;&lt;/strong&gt; — the full spec from the Bluetooth SIG. Start with Volume 1 (Architecture &amp;amp; Terminology Overview) before diving into the radio or L2CAP layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/The_Long_Ships_(novel)" rel="noopener noreferrer"&gt;The Long Ships&lt;/a&gt;&lt;/strong&gt; — the Frans G. Bengtsson novel that Jim Kardach was reading when he proposed "Bluetooth" as a placeholder name. A Viking adventure that's far better than it has any right to be.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://devzone.nordicsemi.com/guides/short-range-guides/b/bluetooth-low-energy" rel="noopener noreferrer"&gt;Nordic Semiconductor DevZone — BLE Tutorial&lt;/a&gt;&lt;/strong&gt; — Nordic makes the chips inside most BLE peripherals. Their tutorial walks through advertising, connections, GATT profiles, and power optimization with real hardware examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://wiki.wireshark.org/CaptureSetup/Bluetooth" rel="noopener noreferrer"&gt;Wireshark Bluetooth Capture Setup&lt;/a&gt;&lt;/strong&gt; — how to sniff Bluetooth traffic with Wireshark. Requires a compatible sniffer dongle, but seeing actual HCI packets demystifies the protocol faster than any spec chapter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.bluetooth.com/learn-about-bluetooth/recent-enhancements/le-audio/" rel="noopener noreferrer"&gt;Bluetooth LE Audio and the LC3 Codec&lt;/a&gt;&lt;/strong&gt; — the Bluetooth SIG's overview of LE Audio, including the LC3 codec that replaces SBC. Lower bitrate, better quality, and the foundation for Auracast broadcast audio.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri pairs his headphones every morning and thinks about a Viking king every single time. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;The one who (supposedly) commanded the tide to halt. Cnut wasn't trying to prove he had power over the sea -- he was demonstrating to his sycophantic courtiers that royal power has limits. The story is usually told backwards. A thousand years of getting the point wrong. Sounds like a naming committee. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What Happens When You Plug Something In</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:04:14 +0000</pubDate>
      <link>https://forem.com/nazq/what-happens-when-you-plug-something-in-3n50</link>
      <guid>https://forem.com/nazq/what-happens-when-you-plug-something-in-3n50</guid>
      <description>&lt;h1&gt;
  
  
  What Happens When You Plug Something In
&lt;/h1&gt;

&lt;h2&gt;
  
  
  USB: The Protocol That Ate Every Port
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~15 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You plugged in a USB drive. A notification popped up. You dragged some files over. You yanked the cable out without clicking "safely remove" and felt a tiny pang of guilt.&lt;/p&gt;

&lt;p&gt;That guilt is warranted, but not for the reason you think. Between the moment you pushed that connector in and the moment your OS mounted a filesystem, your computer conducted a multi-round negotiation involving device identity, power budgets, endpoint capabilities, and transfer scheduling — all over two twisted wires carrying differential signals at 480 million bits per second. The drive didn't announce itself. Your computer had to &lt;em&gt;ask&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That's the first thing most people get wrong about USB. It's not a peer-to-peer protocol. It's a polled, host-controlled bus where nothing speaks unless spoken to.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Brief History of Cable Hell
&lt;/h2&gt;

&lt;p&gt;Before USB, connecting a peripheral to a PC was an exercise in suffering. Serial ports (RS-232) needed specific baud rates, stop bits, and flow control settings. Parallel ports (DB-25) were faster but required fat cables and had timing problems. PS/2 ports worked for keyboards and mice but nothing else. SCSI gave you throughput but demanded termination resistors and SCSI IDs and the patience of a monk. I am no monk. Every peripheral type had its own connector, its own driver model, its own failure mode.&lt;/p&gt;

&lt;p&gt;In 1994, a consortium of seven companies — Compaq, DEC, IBM, Intel, Microsoft, NEC, and Nortel — decided this was insane. Ajay Bhatt at Intel led the architecture work. The goal: one connector, one protocol, hot-pluggable, self-describing, and capable of powering small devices. USB 1.0 shipped in January 1996.&lt;/p&gt;

&lt;p&gt;It took another two years and the iMac G3 for anyone to actually care. Apple dropped every legacy port on that machine and bet the entire peripheral story on USB. It was either visionary or reckless. It was both.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Physical Layer: Two Wires and a Clever Trick
&lt;/h2&gt;

&lt;p&gt;A USB 2.0 cable has four wires: VBUS (+5V power), GND, D+, and D-. The data travels on D+ and D- using &lt;strong&gt;differential signaling&lt;/strong&gt; — the receiver doesn't look at the voltage on either wire individually. It looks at the &lt;em&gt;difference&lt;/em&gt; between them.&lt;/p&gt;

&lt;p&gt;Why? Noise immunity. If electromagnetic interference hits your cable, it hits both wires roughly equally. The voltage on D+ goes up by 50mV, and the voltage on D- goes up by 50mV. The difference stays the same. The signal survives. This is the same trick used by Ethernet, HDMI, and every other protocol that needs to work in environments full of switching power supplies and WiFi radios.&lt;/p&gt;

&lt;p&gt;The bus has two data states: on a full-speed link, the "J state" is D+ high, D- low, and the "K state" is the reverse (low-speed links swap the two). Bits don't map directly onto states. The encoding scheme is &lt;strong&gt;NRZI&lt;/strong&gt; -- Non-Return-to-Zero Inverted -- in which a 0 bit causes a transition between J and K and a 1 bit holds the current state. To prevent long runs of 1s from losing clock sync, the protocol uses &lt;strong&gt;bit stuffing&lt;/strong&gt;: after six consecutive 1 bits, a 0 is inserted. The receiver strips it out. This means USB's actual data throughput is slightly less than the raw bit rate.&lt;/p&gt;
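&lt;p&gt;Both tricks are a few lines each. A sketch of the transmit side -- stuff a 0 after every six consecutive 1s, then NRZI-encode (a 0 toggles the line, a 1 holds it):&lt;/p&gt;

```python
def bit_stuff(bits):
    """Insert a 0 after every run of six consecutive 1 bits."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 6:
            out.append(0)   # forced transition keeps receiver clocks in sync
            run = 0
    return out

def nrzi_encode(bits, start_level=1):
    """NRZI: a 0 bit toggles the line level, a 1 bit holds it."""
    level, line = start_level, []
    for b in bits:
        if b == 0:
            level = 1 - level
        line.append(level)
    return line
```

&lt;p&gt;Feed in seven 1 bits and the stuffer emits eight -- that forced transition is exactly the overhead that shaves real throughput below the raw bit rate.&lt;/p&gt;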

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6w38io34fozon2xhg0e.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6w38io34fozon2xhg0e.jpeg" alt="USB differential signaling — D+ and D- waveforms with common-mode noise rejection" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;USB 3.0 (SuperSpeed) added a completely separate set of wires — two differential pairs for a full-duplex link — on top of the existing USB 2.0 wires. That's why USB 3.0 cables are thicker. That's also why a USB 3.0 device works in a USB 2.0 port at USB 2.0 speeds: the old wires are still there, doing the same job they always did.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Speed Naming Disaster
&lt;/h2&gt;

&lt;p&gt;USB's speed grades are a masterclass in how not to name things. The USB Implementers Forum — the standards body that governs USB — has renamed speeds multiple times, creating a layered mess where the marketing name, the spec name, and the thing developers actually say are three different strings.&lt;/p&gt;

&lt;p&gt;Here's what actually matters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;Raw Speed&lt;/th&gt;
&lt;th&gt;Spec Name&lt;/th&gt;
&lt;th&gt;What People Say&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;USB 1.0&lt;/td&gt;
&lt;td&gt;1.5 Mbit/s&lt;/td&gt;
&lt;td&gt;Low Speed&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB 1.1&lt;/td&gt;
&lt;td&gt;12 Mbit/s&lt;/td&gt;
&lt;td&gt;Full Speed&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB 2.0&lt;/td&gt;
&lt;td&gt;480 Mbit/s&lt;/td&gt;
&lt;td&gt;Hi-Speed&lt;/td&gt;
&lt;td&gt;"USB 2"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB 3.0&lt;/td&gt;
&lt;td&gt;5 Gbit/s&lt;/td&gt;
&lt;td&gt;SuperSpeed (now "USB 3.2 Gen 1")&lt;/td&gt;
&lt;td&gt;"USB 3"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB 3.1&lt;/td&gt;
&lt;td&gt;10 Gbit/s&lt;/td&gt;
&lt;td&gt;SuperSpeed+ (now "USB 3.2 Gen 2")&lt;/td&gt;
&lt;td&gt;"USB 3.1"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB 3.2&lt;/td&gt;
&lt;td&gt;20 Gbit/s&lt;/td&gt;
&lt;td&gt;SuperSpeed+ (now "USB 3.2 Gen 2x2")&lt;/td&gt;
&lt;td&gt;(confused silence)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB4&lt;/td&gt;
&lt;td&gt;40 Gbit/s&lt;/td&gt;
&lt;td&gt;USB4 Gen 3x2&lt;/td&gt;
&lt;td&gt;"Thunderbolt, sort of"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB4 v2.0&lt;/td&gt;
&lt;td&gt;80 Gbit/s&lt;/td&gt;
&lt;td&gt;USB4 Gen 4&lt;/td&gt;
&lt;td&gt;¯\_(ツ)_/¯&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note that "Full Speed" is the &lt;em&gt;slow one&lt;/em&gt;. 12 Mbit/s. It was full speed in 1998. The name stuck forever. This is what happens when you name speeds after marketing adjectives instead of numbers.&lt;/p&gt;

&lt;p&gt;Also note that what was once called "USB 3.0" was retroactively renamed to "USB 3.2 Gen 1" — a spec version number that didn't exist when the product shipped. Cable and device manufacturers print whichever name they feel like. The result is that "USB 3.2" on a box could mean 5, 10, or 20 Gbit/s. Good luck out there!&lt;/p&gt;




&lt;h2&gt;
  
  
  Enumeration: The Descriptor Dance
&lt;/h2&gt;

&lt;p&gt;This is the part most developers never see, and it's the part that makes USB actually work.&lt;/p&gt;

&lt;p&gt;When you plug in a device, here's what happens:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Electrical detection.&lt;/strong&gt; The host sees a voltage change on D+ or D-. A full-speed device pulls D+ high through a 1.5kΩ resistor (a high-speed device attaches as full-speed first, then negotiates up with a chirp sequence). A low-speed device pulls D- high instead. The host now knows something is there, and it knows the speed class.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Reset.&lt;/strong&gt; The host drives both D+ and D- low for at least 10ms. This is a bus reset. It tells the device to abandon any previous state and start fresh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Default address.&lt;/strong&gt; After the reset, the device listens on address 0. Every USB device starts its life as address 0. Only one device can be at address 0 at a time — this is why there's a brief period after plugging in where the host won't detect a second new device.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Get Device Descriptor.&lt;/strong&gt; The host sends a control transfer to address 0, asking for the device's &lt;strong&gt;device descriptor&lt;/strong&gt; — an 18-byte structure containing the vendor ID, product ID, device class, number of configurations, and the USB spec version the device supports. This is the handshake. "Who are you?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Set Address.&lt;/strong&gt; The host assigns a unique address (1–127) to the device. From now on, the device only responds to that address. Address 0 is free for the next newcomer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Get Configuration Descriptor.&lt;/strong&gt; The host asks for the full configuration descriptor, which describes every interface and endpoint the device offers. A webcam might report a video interface with an isochronous endpoint and an audio interface with another isochronous endpoint. A USB drive reports a mass storage interface with two bulk endpoints (in and out).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Set Configuration.&lt;/strong&gt; The host picks a configuration (most devices have exactly one) and activates it.&lt;/p&gt;

&lt;p&gt;That's 7 steps, multiple round-trip transactions, all before a single byte of actual data moves. The whole process takes somewhere between 100ms and a few seconds, depending on the device and the OS.&lt;/p&gt;
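&lt;p&gt;The device descriptor from Step 4 is just an 18-byte little-endian struct. A sketch of pulling out the headline fields (offsets follow the USB 2.0 spec; the example bytes are made up):&lt;/p&gt;

```python
def parse_device_descriptor(buf):
    """Extract the headline fields from an 18-byte USB device
    descriptor. Multi-byte fields are little-endian."""
    assert len(buf) == 18 and buf[1] == 0x01   # bDescriptorType: DEVICE
    def le16(off):
        return int.from_bytes(buf[off:off + 2], "little")
    return {
        "bcdUSB": le16(2),        # spec version in BCD, 0x0200 = USB 2.0
        "bDeviceClass": buf[4],   # 0x00 = class declared per-interface
        "idVendor": le16(8),
        "idProduct": le16(10),
        "bNumConfigurations": buf[17],
    }

# made-up example: a USB 2.0 device, vendor 0x1209, product 0x0001, one config
raw = bytes([18, 1, 0x00, 0x02, 0, 0, 0, 64,
             0x09, 0x12, 0x01, 0x00, 0x00, 0x01, 1, 2, 3, 1])
info = parse_device_descriptor(raw)
```

&lt;p&gt;On Linux, &lt;code&gt;lsusb -v&lt;/code&gt; shows you these same fields for every enumerated device.&lt;/p&gt;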

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkcuj27be7xmwgpdwuty.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkcuj27be7xmwgpdwuty.jpeg" alt="USB enumeration sequence — 7-step host-device handshake from detection to configuration" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why some USB devices take a moment to "appear"&lt;/strong&gt; — your OS isn't being lazy. It's running a seven-step interrogation the device never saw coming. Nobody expects the Spanish Inquisition! 🦜&lt;/p&gt;




&lt;h2&gt;
  
  
  Device Classes: Why Your Webcam and Keyboard Share a Protocol
&lt;/h2&gt;

&lt;p&gt;USB defines &lt;strong&gt;device classes&lt;/strong&gt; — standardized interfaces that let any OS talk to any device of that type without a vendor-specific driver. This is the reason you can plug a keyboard into any computer and have it work immediately. The keyboard announces "I'm a HID device" during enumeration, and the OS loads its built-in HID driver.&lt;/p&gt;

&lt;p&gt;The important ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://nazquadri.dev/blog/the-layer-below/01-pressing-a-key/" rel="noopener noreferrer"&gt;HID (Human Interface Device)&lt;/a&gt;&lt;/strong&gt; — keyboards, mice, game controllers, touchscreens. Reports are small fixed-size packets sent on interrupt endpoints. The HID spec defines a report descriptor language that lets devices describe their own button layouts and axis mappings. It's surprisingly expressive and surprisingly painful to parse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mass Storage&lt;/strong&gt; — USB drives, external SSDs. Uses bulk transfers and speaks SCSI commands over USB (a protocol called BOT — Bulk-Only Transport). Your thumb drive is pretending to be a SCSI disk from the 1990s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDC (Communications Device Class)&lt;/strong&gt; — serial ports, network adapters. This is what Arduino boards use to show up as &lt;code&gt;/dev/ttyACM0&lt;/code&gt;. CDC-ACM (Abstract Control Model) emulates a serial port. CDC-ECM and CDC-NCM emulate Ethernet interfaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio&lt;/strong&gt; — microphones, speakers, DACs. Uses isochronous transfers for guaranteed timing. The class defines controls for volume, mute, sample rate selection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video&lt;/strong&gt; — webcams (UVC — USB Video Class). Also isochronous. Defines format negotiation so the host can request specific resolutions and frame rates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The beauty of the class system is substitutability. Any UVC-compliant webcam works with any UVC driver. That's not a given — it's an engineering achievement. Go try plugging a random printer into a random computer and watch it fail: the Printer device class is a thin spec, and most printers ship with vendor-specific protocols.&lt;/p&gt;




&lt;h2&gt;
  
  
  Endpoints and Pipes: Four Flavors of Data Transfer
&lt;/h2&gt;

&lt;p&gt;Every USB device exposes one or more &lt;strong&gt;endpoints&lt;/strong&gt; — numbered channels (0–15, each direction) that carry data in or out. Endpoint 0 is special: it's the &lt;strong&gt;control endpoint&lt;/strong&gt;, used for enumeration and device management. Every device must have it.&lt;/p&gt;

&lt;p&gt;The other endpoints use one of three transfer types, and which one a device picks determines everything about its performance characteristics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bulk transfers&lt;/strong&gt; — reliable, no timing guarantee. The host sends data when the bus is free. If a packet gets corrupted, it's retried. Used by mass storage devices and printers. You get correctness, but the data arrives "whenever." Bulk transfers get the leftover bandwidth after all other transfer types are served.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interrupt transfers&lt;/strong&gt; — small, periodic, guaranteed latency. The host polls the device at a fixed interval (between 1ms and 255ms for USB 2.0). The device either has data or it doesn't. Used by HID devices. Your keyboard gets polled every 1ms for its current key state. The word "interrupt" is misleading — this is still polling, not a hardware interrupt. The device can't yell at the host. It waits to be asked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isochronous transfers&lt;/strong&gt; — what an amazing word: &lt;em&gt;iso&lt;/em&gt;, same; &lt;em&gt;chronous&lt;/em&gt;, to do with time. These carry real-time feeds: guaranteed bandwidth, no retries. A fixed amount of bus time is reserved at a fixed interval. If a packet is corrupted, it's gone. No do-overs. Used by audio and video devices, because a retransmitted audio sample that arrives 5ms late is worse than a dropped sample. Your ears interpolate. Your audio stack doesn't have time to wait.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why audio uses isochronous transfers&lt;/strong&gt; — correctness is less important than timing. A glitch in audio is a brief click. A stall while waiting for a retransmission is a gap in playback. The protocol picks the lesser evil.&lt;/p&gt;
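&lt;p&gt;The reserved bandwidth is straightforward arithmetic. For stereo 16-bit audio over a full-speed isochronous endpoint, one packet per 1 ms frame:&lt;/p&gt;

```python
def iso_payload_per_frame(sample_rate_hz=44_100, channels=2, sample_bytes=2):
    """Bytes of audio generated per 1 ms USB frame. A non-integer
    rate like 44.1 kHz means the endpoint alternates between 44-
    and 45-sample packets to keep up on average."""
    return sample_rate_hz / 1000 * channels * sample_bytes

cd_audio = iso_payload_per_frame()          # 176.4 bytes per frame
pro_audio = iso_payload_per_frame(48_000)   # 192.0 bytes per frame
```

&lt;p&gt;The host reserves that slice of every frame whether or not the device fills it -- which is the whole point: the slot is there on time, every time.&lt;/p&gt;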




&lt;h2&gt;
  
  
  Hubs and the Tiered Star Topology
&lt;/h2&gt;

&lt;p&gt;USB's physical topology is a &lt;strong&gt;tiered star&lt;/strong&gt;. The host controller is at the root. Hubs branch outward. Devices connect to hubs (or directly to the host's root hub). The spec allows up to 5 levels of hubs and 127 devices total on one bus — a 7-bit address space, because address 0 is reserved for devices mid-enumeration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F338s65izanixahoxwrpd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F338s65izanixahoxwrpd.jpeg" alt="USB tiered star topology — root hub to 5 tiers, 127 devices" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every host controller has a &lt;strong&gt;root hub&lt;/strong&gt; built in — that's what your motherboard's USB ports connect to internally. When you plug a hub into a port, the hub itself goes through enumeration (it's a USB device too, class 09h). Then the hub monitors its downstream ports for new connections and reports them to the host.&lt;/p&gt;

&lt;p&gt;The host manages &lt;em&gt;all&lt;/em&gt; traffic. A hub doesn't make routing decisions. It's a repeater with port management. At USB 2.0 speeds, all devices on a hub share bandwidth. A hub connected to a 480 Mbit/s port gives each downstream device a time slice of that 480 Mbit/s, not 480 Mbit/s each.&lt;/p&gt;

&lt;p&gt;USB 3.0 changed this — a SuperSpeed hub is effectively two hubs in one package, carrying USB 2.0 and USB 3.0 traffic over separate wire pairs, so a slow device on one port doesn't starve a fast device on another. But the fundamental model — host-scheduled, star-topology, no peer-to-peer — remains.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Many Devices Can You Actually Connect?
&lt;/h3&gt;

&lt;p&gt;Your laptop has, say, 3 USB-C ports. Each is a root hub port. Plug a 7-port hub into each, and those hubs can cascade further. The math: 3 ports × 5 tiers of 7-port hubs = far more than 127. The address space is the hard ceiling, not the physical ports.&lt;/p&gt;

&lt;p&gt;But long before you hit 127 devices, reality intervenes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Power.&lt;/strong&gt; Each USB 2.0 port supplies 500mA at 5V (2.5W). A 7-port hub without its own power adapter divides that 2.5W across all downstream devices. Four bus-powered hubs deep and each device gets fractions of a watt. External hard drives spin down. Webcams refuse to enumerate. Your Keychron keyboard works fine because it draws 100mA — keyboards are cheap dates.&lt;/p&gt;
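&lt;p&gt;The division is brutal enough to sketch. A toy model that just splits a port's 2.5W budget evenly at each tier of bus-powered hubs (real hubs burn some power themselves, and the spec actually caps a bus-powered hub's downstream ports at 100mA each):&lt;/p&gt;

```python
def per_device_budget_mw(port_mw=2500, hub_ports=7, tiers=1):
    """Power left per leaf device after cascading bus-powered
    7-port hubs off a single 2.5 W USB 2.0 port. Simplified:
    assumes an even split and lossless hubs."""
    budget = float(port_mw)
    for _ in range(tiers):
        budget = budget / hub_ports
    return budget

one_deep = per_device_budget_mw(tiers=1)   # ~357 mW per device
two_deep = per_device_budget_mw(tiers=2)   # ~51 mW: spinning disks need not apply
```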

&lt;p&gt;&lt;strong&gt;Bandwidth.&lt;/strong&gt; All USB 2.0 devices behind a hub share 480 Mbit/s (realistically ~280 Mbit/s after protocol overhead). Two external drives on the same USB 2.0 hub will each get ~140 Mbit/s. Add a webcam streaming 1080p and the drives slow to a crawl. USB 3.x raises the ceiling to 5 Gbit/s and beyond, though devices behind the same hub still share the upstream link. And that only holds if every hub in the chain is USB 3.x. One USB 2.0 hub in the chain and everything downstream drops to 2.0 speeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enumeration storms.&lt;/strong&gt; Plug in a powered hub with 7 devices attached and the host controller has to enumerate all of them — sequentially. Each device goes through the descriptor dance: reset, address 0, Get Device Descriptor, Set Address, Get Configuration, Set Configuration. Seven devices × ~100ms each = nearly a second of the host controller doing nothing but paperwork. Plug in a hub with daisy-chained hubs below it and you can stall enumeration for several seconds. That's the pause you feel when you plug in a USB dock.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real limit for most people isn't 127 devices. It's power and bandwidth.&lt;/strong&gt; This is why every serious USB setup has powered hubs, and why Thunderbolt docks exist — they provide dedicated PCIe lanes and power delivery that USB's shared-bus model can't match.&lt;/p&gt;




&lt;h2&gt;
  
  
  USB-C: The Connector That Knows Too Much
&lt;/h2&gt;

&lt;p&gt;USB-C is not a speed. It's a &lt;strong&gt;connector&lt;/strong&gt; — a reversible 24-pin plug (remember fumbling with the old connector in the dark? 😖) that can carry USB 2.0, USB 3.x, USB4, Thunderbolt 3/4, DisplayPort, HDMI (via alt mode), analog audio, and power delivery up to 240W. All through the same physical port.&lt;/p&gt;

&lt;p&gt;This flexibility is also why USB-C cables are a minefield.&lt;/p&gt;

&lt;p&gt;The connector has 24 pins, but not all cables wire all 24. A USB-C cable that only carries USB 2.0 has 4 active data pins (D+/D-) plus power and ground. A USB 3.2 cable adds the SuperSpeed pairs. A Thunderbolt 4 cable wires everything and includes signal conditioning electronics in the connector itself (active cable). They all look identical from the outside. Buyer beware!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration Channel (CC) pins&lt;/strong&gt; — two pins on the connector used for cable detection, orientation (which way is "up" for the reversible plug), and capability negotiation. When you plug in a USB-C cable, the CC pins are the first thing that talks. They determine what the cable can carry before any data flows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;USB Power Delivery (PD)&lt;/strong&gt; runs over the CC pins as a separate protocol layer. It negotiates voltage and current between source and sink. The original USB spec allowed 5V at 100mA (500mW). USB PD 3.1 allows up to 48V at 5A (240W). That's a 480x increase in power delivery through a connector lineage that started as a peripheral attachment protocol. PD negotiation uses structured messages — the source advertises what it can provide (called PDOs — Power Data Objects), and the sink requests what it needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alternate Modes&lt;/strong&gt; let the USB-C connector carry non-USB protocols. DisplayPort alt mode repurposes the SuperSpeed lanes to carry DisplayPort signals. Thunderbolt alt mode does the same for Thunderbolt/PCIe. The CC pins negotiate which alt mode to use. This is why a single USB-C port on a laptop can drive a 4K display, charge the laptop, and connect a USB hub — different pin functions running simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikfbj7gvq3wo2682f5cr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikfbj7gvq3wo2682f5cr.jpeg" alt="USB-C 24-pin connector layout with color-coded pin groups" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why "some USB-C cables don't work"&lt;/strong&gt; — a cable that only wires USB 2.0 pins physically cannot carry a DisplayPort signal. The electronics aren't there. The connector fits, the protocol fails, and you blame the monitor.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why "Safely Remove" Exists
&lt;/h2&gt;

&lt;p&gt;You've yanked a USB drive out mid-use. Maybe nothing happened. Maybe you lost a file. The reason "safely remove" exists is a combination of write caching and in-flight transfers.&lt;/p&gt;

&lt;p&gt;When your OS writes to a USB mass storage device, it doesn't necessarily send every write immediately. The OS may &lt;strong&gt;write-cache&lt;/strong&gt; — hold dirty pages in RAM and flush them to the device in batches. This is faster (fewer small transactions on a bulk endpoint) but means that at any given moment, the device's on-disk state may be behind what the OS thinks it wrote.&lt;/p&gt;

&lt;p&gt;Clicking "safely remove" does three things: flushes all cached writes, completes any in-flight transfers, and then tells the device it's safe to power down. The OS unmounts the filesystem first, so no new writes can start.&lt;/p&gt;

&lt;p&gt;On Linux, the kernel's block layer handles this via &lt;code&gt;sync&lt;/code&gt; and cache flush commands sent as SCSI SYNCHRONIZE CACHE over the USB mass storage protocol. On macOS, the same logic runs through IOKit. On Windows, it's the "surprise removal" path if you yank without ejecting.&lt;/p&gt;

&lt;p&gt;Modern operating systems (macOS since Catalina, Windows 10 since 1809) default to a "quick removal" policy for USB drives — write caching is disabled by default, so every write goes to the device immediately. This makes safe removal less critical but makes writes slower. The trade-off: you can yank the cable guilt-free, but large file copies take longer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why copying to a USB drive feels slower than copying between internal disks&lt;/strong&gt; — beyond the raw interface speed difference, the default write policy adds per-transaction latency.&lt;/p&gt;




&lt;h2&gt;
  
  
  Power Delivery: From 500mA to Charging Laptops
&lt;/h2&gt;

&lt;p&gt;The original USB 1.0 spec in 1996 provided 5V at 100mA — 500mW, barely enough to power a mouse. A device could request a "high-power" configuration of 500mA (2.5W) during enumeration. This was elegant: the host knew exactly how much power each device needed because the device said so in its configuration descriptor.&lt;/p&gt;

&lt;p&gt;Then phones happened. People wanted to charge phones via USB. The Battery Charging Specification (BC 1.2) introduced "dedicated charging ports" that could supply 1.5A at 5V — 7.5W. Detection used the D+/D- lines (shorting them together signaled a charging port). It worked, sort of. It also created a mess of proprietary "quick charge" protocols where chargers and phones did voltage negotiation outside the USB spec.&lt;/p&gt;

&lt;p&gt;USB Power Delivery cleaned this up. PD is a protocol that runs on the CC pins of USB-C connectors, completely independent of the data lines. It supports multiple voltage levels (5V, 9V, 15V, 20V, and with PD 3.1's Extended Power Range, 28V, 36V, and 48V) and programmable current limits.&lt;/p&gt;

&lt;p&gt;A PD source advertises its capabilities. A PD sink requests what it needs. The negotiation happens in milliseconds. A laptop charger providing 20V at 5A (100W) and a phone requesting 9V at 3A (27W) use the same protocol, the same connector, and potentially the same cable.&lt;/p&gt;

&lt;p&gt;PD 3.1 Extended Power Range (EPR) pushes this to 48V at 5A — 240W. That's enough to power a high-end gaming laptop. From a protocol that started powering mice. The specification required new safety mechanisms: EPR cables must be electronically marked (the cable has a chip in its plug that reports its voltage rating to the source), and the source and sink continuously monitor for faults.&lt;/p&gt;
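&lt;p&gt;The advertisement itself is compact: each source capability is a 32-bit Power Data Object. For a fixed-supply PDO, the PD spec packs the voltage into bits 19:10 (50 mV units) and the maximum current into bits 9:0 (10 mA units). A decoding sketch, not a full PD stack (helper names are mine):&lt;/p&gt;

```c
/* Decode a USB PD "Fixed Supply" Power Data Object (PDO).
 * Bits 9:0 = max current in 10 mA units; bits 19:10 = voltage
 * in 50 mV units. Sketch only -- a real PD stack does much more. */
#include <stdint.h>

unsigned pdo_voltage_mv(uint32_t pdo) { return ((pdo >> 10) & 0x3FF) * 50u; }
unsigned pdo_current_ma(uint32_t pdo) { return (pdo & 0x3FF) * 10u; }

unsigned pdo_power_mw(uint32_t pdo) {
    /* e.g. a 20 V / 5 A capability advertises as 100 W */
    return (unsigned)((uint64_t)pdo_voltage_mv(pdo) * pdo_current_ma(pdo) / 1000u);
}
```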

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3vp6y6m8hq1c7j709f9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3vp6y6m8hq1c7j709f9.jpeg" alt="USB power delivery evolution — from 0.5W in 1996 to 240W in 2021" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Polled Bus: Everything Waits for the Host
&lt;/h2&gt;

&lt;p&gt;I keep coming back to this because it's the thing that surprises developers most: nothing on a USB bus talks unless the host asks it to.&lt;/p&gt;

&lt;p&gt;Your keyboard doesn't fire an interrupt when you press a key. The host controller sends an IN token packet to the keyboard's interrupt endpoint every 1ms. The keyboard either responds with data or responds with NAK (nothing to report). The host controller is doing this for &lt;em&gt;every device on the bus&lt;/em&gt;, continuously, interleaving transactions across all active endpoints.&lt;/p&gt;

&lt;p&gt;The host controller hardware (EHCI for USB 2.0, xHCI for USB 3.x) maintains a schedule — a frame-based timeline of transactions. Each USB frame is 1ms (USB 2.0) or 125 microseconds (USB 2.0 microframes / USB 3.x). The controller walks through the schedule every frame, issuing IN and OUT token packets, collecting responses, and reporting results to the OS via &lt;a href="https://nazquadri.dev/blog/the-layer-below/04-reading-a-file/" rel="noopener noreferrer"&gt;DMA&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from PCIe, where devices can initiate transactions (bus mastering). USB's host-controlled model is simpler, more predictable, and easier to secure — a malicious USB device can't flood the bus because it never gets to talk without permission. But it also means latency is bounded by the polling rate. You can't go faster than "one sample per frame."&lt;/p&gt;
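&lt;p&gt;A toy model makes the asymmetry concrete: the device never speaks first. (This is a deliberately simplified sketch, not a real controller interface; EHCI/xHCI do this in hardware, across every endpoint, every frame.)&lt;/p&gt;

```c
/* Toy model of USB's polled bus: the host walks its schedule each
 * frame and issues IN tokens; a device with nothing queued answers NAK. */
#include <stdbool.h>

typedef struct {
    bool has_data;   /* does the endpoint have something to report? */
    int  data;
} endpoint;

/* One IN transaction: returns true (and fills *out) on data, false on NAK. */
bool host_poll(endpoint *ep, int *out) {
    if (!ep->has_data) return false;   /* device NAKs: nothing to say */
    *out = ep->data;
    ep->has_data = false;              /* report consumed this frame */
    return true;
}
```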




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Every wire we've talked about today requires a physical connection. That cable between your keyboard and your computer is a guarantee — a dedicated channel, a known speed, a predictable latency.&lt;/p&gt;

&lt;p&gt;Bluetooth throws all of that away. It shares a radio band with WiFi, microwaves, and baby monitors. It hops between 79 channels, 1600 times per second, to avoid interference. It negotiates connections in a mesh where every device is both a peer and a potential relay. And somehow, your AirPods still manage to play music while your keyboard types and your mouse tracks — all on the same 2.4 GHz band. And then there's a Viking ...&lt;/p&gt;

&lt;p&gt;⣴⣶⣶⠶⢤⣄⠀⣤⣤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀&lt;br&gt;
⠈⠉⠛⢿⣆⠈⠻⢮⡙⢿⣯⠓⢤⣀⣀⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠙⣇⠀⠀⠹⣆⢹⣿⣿⣿⣿⣿⣿⣍⠛⠳⢦⣀⠀⠀⠀⠀⠀⠀⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠹⣆⠀⠀⠈⢻⣿⣿⣿⣿⣿⣿⣿⣷⡘⢆⠙⢷⣄⠀⠀⠀⠀⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠀⢹⡄⠀⠀⠀⠹⣿⣿⣿⣿⣿⣿⣿⣷⠘⣯⡀⠙⢷⡄⠀⠀⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠀⠈⢿⡀⢻⣀⠀⠈⠙⠛⢿⣿⠿⠿⠿⠷⠤⠤⠤⠴⠿⠦⠀⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠀⠀⣸⣷⡄⠙⢧⣄⠀⠀⠈⣿⣶⣶⣶⠖⠒⠒⠒⠒⠒⠒⠒⠂⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠀⢠⣿⣏⣿⣦⡀⠈⠀⣀⣼⣯⠙⣿⣇⠐⠿⠀⠀⠿⠂⠀⠺⠇⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠀⣸⣿⣿⣿⣿⣿⣿⣿⣿⡿⢿⣿⡿⣿⠷⣶⡶⢶⣶⣶⣶⣶⡶⠆&lt;br&gt;
⠀⠀⠀⠀⠀⣀⣿⣿⡯⢂⣿⣿⣿⣿⣫⡄⢠⡿⢡⡟⠀⠉⠙⣿⣿⣿⡿⣯⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠘⣿⣿⣿⣾⡿⠛⠙⠻⣿⣯⣶⠟⢠⠞⠀⠀⠉⠉⠛⠋⠀⢣⠹⣷⡀&lt;br&gt;
⠀⠀⠀⠀⢸⣿⣿⣿⡟⠉⠙⠢⠀⠘⠋⠀⢊⣁⣤⢷⣤⣠⣄⡀⠀⢠⣆⣀⣽⡿&lt;br&gt;
⠀⠀⠀⢀⣼⣿⣴⡿⠉⠓⠄⠀⣴⡿⠿⠿⠛⠋⠈⠀⠀⠀⢀⣤⡶⡾⠛⠛⣿⠀&lt;br&gt;
⠀⠀⠀⠀⢻⠿⣿⡉⠑⠀⠀⣼⣿⣷⠂⠢⣄⣀⠀⢀⣤⠆⡽⢀⣤⣦⣴⡶⠋⠀&lt;br&gt;
⠀⠀⠀⠀⠀⣾⢿⣿⣷⠀⠘⢿⣿⠳⣄⠀⣬⣿⣾⠟⢋⣄⣠⠟⠻⠶⢶⣆⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⣿⣼⣿⣿⡆⠀⠀⠙⣦⡙⣷⣾⠟⠛⢛⣛⣋⡁⠀⡀⢀⢀⣿⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⢸⣿⣿⣿⣿⠀⠀⠀⠘⣿⣿⠷⠚⠋⠉⠉⢻⣿⠿⣷⣿⣾⣿⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠚⠛⠉⠼⠟⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠈⠀⠀⠼⣿⠿⠋⠀⠀&lt;/p&gt;

&lt;p&gt;That's the next layer below.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.usb.org/documents" rel="noopener noreferrer"&gt;USB 2.0/3.2/4 Specifications&lt;/a&gt;&lt;/strong&gt; — the official specs from USB-IF. Dense but definitive; start with the architecture overview chapters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.beyondlogic.org/usbnutshell/usb1.shtml" rel="noopener noreferrer"&gt;USB in a NutShell&lt;/a&gt;&lt;/strong&gt; — Beyond Logic's classic tutorial that walks through USB concepts from endpoints to descriptors. The best free introduction to the protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man8/lsusb.8.html" rel="noopener noreferrer"&gt;lsusb(8) man page&lt;/a&gt;&lt;/strong&gt; — the Linux utility for listing USB devices and decoding their descriptors. Pair with &lt;code&gt;lsusb -v&lt;/code&gt; to see what your kernel actually negotiated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.usb.org/usbc" rel="noopener noreferrer"&gt;USB Type-C and USB Power Delivery&lt;/a&gt;&lt;/strong&gt; — USB-IF's overview of the USB-C connector spec and Power Delivery protocol. Explains why one cable shape now carries data, video, and 240W of power.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://nvmexpress.org/specifications/" rel="noopener noreferrer"&gt;NVM Express Base Specification&lt;/a&gt;&lt;/strong&gt; — the NVMe spec provides an instructive contrast: where USB polls for device status, NVMe uses submission and completion queues with MSI-X interrupts for near-zero-latency I/O.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri owns more USB cables than any reasonable person should and still can't find one that does Thunderbolt. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>malloc Is Not Free</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:03:58 +0000</pubDate>
      <link>https://forem.com/nazq/malloc-is-not-free-3jn2</link>
      <guid>https://forem.com/nazq/malloc-is-not-free-3jn2</guid>
      <description>&lt;h1&gt;
  
  
  malloc Is Not Free
&lt;/h1&gt;

&lt;h2&gt;
  
  
  You Called malloc(64). Here's What Actually Happened.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You called &lt;code&gt;malloc(64)&lt;/code&gt;. You got a pointer back. You stored your data. You moved on.&lt;/p&gt;

&lt;p&gt;The kernel, the MMU hardware, the page table walker, and possibly the disk were all involved.&lt;/p&gt;

&lt;p&gt;Not necessarily in every malloc call — your allocator is pretty clever about batching all that work. But at some point, behind that innocent function call, something had to go talk to the operating system, the OS had to talk to the memory hardware, and the hardware had to talk to RAM. Understanding where that happens — and when — is the difference between "I wonder why this crashed" and "oh, of course it crashed."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bug That Changed How We Allocate
&lt;/h2&gt;

&lt;p&gt;Picture this: a service allocates a 512MB buffer at startup, does a quick sanity check, then proceeds with the real work. It runs fine in staging. In production, under load, it OOM-kills itself hours later.&lt;/p&gt;

&lt;p&gt;The buffer was allocated. No error. The pointer was valid. The problem was that &lt;em&gt;allocating&lt;/em&gt; memory and &lt;em&gt;having&lt;/em&gt; memory are two different things in Linux — and we'll explain exactly why in the next three sections.&lt;/p&gt;

&lt;p&gt;If you've ever seen a &lt;code&gt;SIGSEGV&lt;/code&gt; on a line that seemed obviously fine, or watched a program get killed by the OOM killer when it had "plenty of memory," or wondered why &lt;code&gt;valgrind&lt;/code&gt; sometimes finds bugs that &lt;code&gt;asan&lt;/code&gt; misses — this is the layer where that happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  The First Player: Your Allocator
&lt;/h2&gt;

&lt;p&gt;When you call &lt;code&gt;malloc(64)&lt;/code&gt;, the first thing that intercepts your request is not the kernel. It's a userspace library — &lt;strong&gt;ptmalloc2&lt;/strong&gt; if you're on glibc (the default on most Linux systems), or jemalloc if you're on Firefox or FreeBSD, or Daan Leijen's mimalloc (2019, Microsoft Research) if you're on something that prizes both speed and memory efficiency.&lt;/p&gt;

&lt;p&gt;The allocator's job is to be fast by being greedy. On its first run, it asks the kernel for a big chunk of memory — far more than your 64 bytes. Then it subdivides that chunk itself, handing out pieces to your program without involving the kernel at all.&lt;/p&gt;

&lt;p&gt;This pool of memory is called the &lt;strong&gt;heap&lt;/strong&gt;. It's not a fixed-size thing. It starts small and grows on demand. The allocator manages it with data structures (linked lists of free blocks, size-class bins, various clever tricks depending on which allocator you're using) and tries hard to satisfy your requests from memory it already has.&lt;/p&gt;

&lt;p&gt;Most &lt;code&gt;malloc&lt;/code&gt; calls never reach the kernel. That's the whole point.&lt;/p&gt;

&lt;p&gt;That's why Java, Python, and Rust programs appear to use far more RAM than you'd expect. Their allocators (and the JVM/runtime heap managers) request large chunks upfront from the kernel to amortize the syscall overhead. The virtual size looks enormous. The resident size — physical pages actually in RAM — is often much smaller.&lt;/p&gt;
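&lt;p&gt;On Linux you can watch that gap without any tooling: &lt;code&gt;/proc/self/statm&lt;/code&gt; reports both numbers, in pages. A small helper (the function is mine) to read them:&lt;/p&gt;

```c
/* Read virtual size and resident set size for the current process.
 * /proc/self/statm's first two fields are total program size and
 * RSS, both measured in pages (Linux-only). */
#include <stdio.h>

/* Returns 0 on success and fills both sizes, in pages. */
int read_statm(long *vsz_pages, long *rss_pages) {
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f) return -1;
    int n = fscanf(f, "%ld %ld", vsz_pages, rss_pages);
    fclose(f);
    return (n == 2) ? 0 : -1;
}
```

&lt;p&gt;Call it before and after a large &lt;code&gt;malloc&lt;/code&gt;: the virtual size jumps immediately, while resident size barely moves until you actually touch the pages.&lt;/p&gt;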

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5t2c88fxsqihjz7qsl1.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5t2c88fxsqihjz7qsl1.jpeg" alt="Userspace memory layout — code, data, heap, stack, kernel" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When the Pool Runs Dry
&lt;/h2&gt;

&lt;p&gt;But eventually, the allocator's pool fills up. You've handed out more than you initially reserved. Time to ask the kernel for more.&lt;/p&gt;

&lt;p&gt;The allocator has two ways to do this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;brk()&lt;/code&gt; / &lt;code&gt;sbrk()&lt;/code&gt;&lt;/strong&gt; — the old way, and still used for small allocations. These syscalls move the "program break" — a pointer that marks the end of the heap segment. Move it up, and the virtual address space between the old break and the new break is yours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;mmap()&lt;/code&gt;&lt;/strong&gt; — the modern way, used for large allocations (typically over 128KB, though the threshold is configurable). &lt;code&gt;mmap()&lt;/code&gt; asks the kernel for a new &lt;em&gt;anonymous&lt;/em&gt; mapping — a fresh region of virtual address space not backed by any file. (You saw &lt;code&gt;mmap&lt;/code&gt; doing file-backed work in &lt;a href="https://nazquadri.dev/blog/the-layer-below/04-reading-a-file/" rel="noopener noreferrer"&gt;&lt;em&gt;What Actually Happens When You Read a File&lt;/em&gt;&lt;/a&gt; — here it's the same syscall, but for memory that doesn't map to a file at all.)&lt;/p&gt;

&lt;p&gt;Both syscalls return to the allocator with a pointer to a new region of virtual address space. The allocator updates its internal bookkeeping, carves off the chunk you asked for, and returns it to you.&lt;/p&gt;
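&lt;p&gt;The &lt;code&gt;mmap()&lt;/code&gt; path is a one-liner to reproduce. This sketch requests an anonymous mapping the way an allocator would (the wrapper name is mine):&lt;/p&gt;

```c
/* What the allocator does when its pool runs dry: ask the kernel
 * for a fresh anonymous mapping. This hands back virtual address
 * space only -- no physical frame exists until a write faults one in. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

void *grow_heap(size_t bytes) {
    void *p = mmap(NULL, bytes,
                   PROT_READ | PROT_WRITE,        /* readable, writable */
                   MAP_PRIVATE | MAP_ANONYMOUS,   /* not backed by any file */
                   -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```

&lt;p&gt;The call returns almost instantly regardless of the size requested, precisely because nothing physical has happened yet.&lt;/p&gt;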

&lt;p&gt;Here's the thing: this all happens &lt;em&gt;extremely&lt;/em&gt; fast. The kernel updates a data structure and returns. No RAM has been touched. No physical memory has been allocated. The pointer you receive points to... nothing, yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Kernel's Comfortable Lie
&lt;/h2&gt;

&lt;p&gt;This is the part that bends everyone's brain the first time: &lt;strong&gt;virtual memory is a lie the kernel tells your program.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your process has a &lt;strong&gt;virtual address space&lt;/strong&gt; — on a 64-bit system, a staggering 128TB of addressable space, minimum. The kernel hands you slices of this virtual space freely. It's cheap. It's just numbers in a data structure.&lt;/p&gt;

&lt;p&gt;Physical RAM is different. Physical RAM is scarce. A machine with 16GB of RAM has 16GB of RAM, and sharing it across dozens of processes requires actual bookkeeping.&lt;/p&gt;

&lt;p&gt;The kernel resolves this tension with a policy called &lt;strong&gt;overcommit&lt;/strong&gt;. When your allocator calls &lt;code&gt;mmap()&lt;/code&gt; for 512MB, the kernel doesn't check whether 512MB of physical RAM is available. It checks whether 512MB of virtual address space is available (almost always yes), updates its virtual memory map, and returns success.&lt;/p&gt;

&lt;p&gt;The assumption is: you'll never actually touch all that memory. Most programs allocate a lot and use a fraction. The kernel is gambling on this, and statistically it wins.&lt;/p&gt;

&lt;p&gt;This is why you can &lt;code&gt;malloc(500GB)&lt;/code&gt; on a machine with 8GB of RAM, get a non-NULL pointer back, and not crash — until you actually try to use it.&lt;/p&gt;

&lt;p&gt;That's why &lt;code&gt;malloc&lt;/code&gt; returning non-NULL doesn't mean you have memory. The kernel said yes to the virtual address. It hasn't committed any RAM. You find out at first write, via a page fault — or, if RAM is truly exhausted, via the OOM killer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// This does NOT allocate 1GB of physical RAM. It reserves&lt;/span&gt;
&lt;span class="c1"&gt;// 1GB of virtual address space. Pages only become real on first write.&lt;/span&gt;
&lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// passes on most Linux systems with default overcommit settings&lt;/span&gt;
&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;           &lt;span class="c1"&gt;// *this* is when things get interesting&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Page Fault: Where Memory Gets Real
&lt;/h2&gt;

&lt;p&gt;You write to your newly allocated memory for the first time. The CPU translates the virtual address to a physical address using the &lt;strong&gt;page table&lt;/strong&gt; — a tree of mappings the kernel maintains for each process.&lt;/p&gt;

&lt;p&gt;The page table entry for your address says: "valid virtual address, but no physical frame assigned yet."&lt;/p&gt;

&lt;p&gt;The CPU raises a &lt;strong&gt;page fault&lt;/strong&gt;. This is a hardware exception — the CPU stops executing your instruction, saves its state, and jumps to the kernel's page fault handler. Your process is suspended, mid-instruction, while the kernel handles it.&lt;/p&gt;

&lt;p&gt;The page fault handler looks at the virtual address and decides what to do. For a freshly allocated anonymous page, the answer is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find a free physical &lt;strong&gt;frame&lt;/strong&gt; (a 4KB chunk of RAM)&lt;/li&gt;
&lt;li&gt;Zero it out — this is a security requirement enforced by the kernel. Without it, you'd get physical memory that still contains another process's data. POSIX mandates zeroing for &lt;code&gt;mmap&lt;/code&gt; with &lt;code&gt;MAP_ANONYMOUS&lt;/code&gt;, but Linux zeroes all new pages regardless, because handing out stale memory is how you get information leaks.&lt;/li&gt;
&lt;li&gt;Write the mapping into the page table: "this virtual address → this physical frame"&lt;/li&gt;
&lt;li&gt;Flush the relevant address translation cache entry (the &lt;strong&gt;TLB&lt;/strong&gt; — more on this in a moment)&lt;/li&gt;
&lt;li&gt;Return to the CPU, which re-executes the faulting instruction&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your process resumes, completely unaware that anything happened. From your code's perspective, you just wrote to a pointer. But you actually just caused a hardware exception, ran kernel code, and allocated physical RAM.&lt;/p&gt;

&lt;p&gt;This is demand paging — the kernel is lazy by design. It doesn't allocate physical memory until you demand it by accessing the page.&lt;/p&gt;
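&lt;p&gt;The kernel keeps score, and you can read it: &lt;code&gt;getrusage()&lt;/code&gt; exposes the per-process minor-fault counter (faults resolved without disk I/O, which is exactly these first-touch faults). The helper name is mine:&lt;/p&gt;

```c
/* How many minor page faults has this process taken so far?
 * Sample before and after touching a fresh region: the delta is
 * roughly one fault per 4KB page on first touch. */
#include <sys/resource.h>

long minor_faults_so_far(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    return ru.ru_minflt;
}
```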

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jb14kw0252elihvewfa.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jb14kw0252elihvewfa.jpeg" alt="Page fault sequence — the process never knows it happened" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Page Table: A Tree the Hardware Walks
&lt;/h2&gt;

&lt;p&gt;Every memory access your process makes — every load, every store, every function call — has to go through address translation. The CPU takes your virtual address and turns it into a physical address.&lt;/p&gt;

&lt;p&gt;This translation happens in hardware, in a component called the &lt;strong&gt;MMU&lt;/strong&gt; (Memory Management Unit). The MMU walks the &lt;strong&gt;page table&lt;/strong&gt; — a multi-level tree structure that the kernel maintains in RAM.&lt;/p&gt;

&lt;p&gt;On x86-64 with the common 4-level configuration, a 64-bit virtual address is split into five fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 63       48 47   39 38   30 29   21 20   12 11        0
┌───────────┬───────┬───────┬───────┬───────┬──────────┐
│  (unused) │  PML4 │  PDP  │  PD   │  PT   │  Offset  │
└───────────┴───────┴───────┴───────┴───────┴──────────┘
     16        9       9       9       9        12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each level is a 512-entry table stored in a physical page. The MMU starts at the PML4 (the root, whose physical address is stored in x86-64's &lt;code&gt;CR3&lt;/code&gt; register — ARM64 calls the equivalent &lt;code&gt;TTBR0_EL1&lt;/code&gt;, but the concept is identical), indexes into it with bits 47–39, follows the pointer to the next level, indexes again, and so on down until it has the physical address of the page. It then adds the 12-bit offset to get the final physical byte address. Every architecture with virtual memory does some version of this walk — the number of levels and register names change, but the tree structure is universal.&lt;/p&gt;

&lt;p&gt;That's four memory reads just to translate one address. On every. Single. Memory. Access.&lt;/p&gt;
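&lt;p&gt;The slicing itself is nothing but shifts and masks. Here's the walk's index extraction in C (function names are mine; the bit positions come from the diagram above):&lt;/p&gt;

```c
/* Slice a canonical x86-64 virtual address into the four table
 * indices plus page offset, exactly as the MMU does during a walk. */
#include <stdint.h>

unsigned pml4_index(uint64_t va) { return (va >> 39) & 0x1FF; }  /* bits 47-39 */
unsigned pdp_index (uint64_t va) { return (va >> 30) & 0x1FF; }  /* bits 38-30 */
unsigned pd_index  (uint64_t va) { return (va >> 21) & 0x1FF; }  /* bits 29-21 */
unsigned pt_index  (uint64_t va) { return (va >> 12) & 0x1FF; }  /* bits 20-12 */
unsigned page_off  (uint64_t va) { return va & 0xFFF; }          /* bits 11-0  */
```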

&lt;p&gt;Obviously, this would be catastrophically slow if it really happened that way. The hardware maintains a cache for recent translations: the &lt;strong&gt;&lt;a href="https://nazquadri.dev/blog/the-layer-below/15-ram/" rel="noopener noreferrer"&gt;TLB&lt;/a&gt;&lt;/strong&gt; (Translation Lookaside Buffer). Most address translations hit the TLB and cost a single cycle or two. When the kernel updates a page table entry — as it does during a page fault — it has to flush the affected TLB entries, or the CPU would use stale translations.&lt;/p&gt;

&lt;p&gt;TLB flushes are one reason context switches are expensive. When the kernel switches from your process to another, it loads the other process's page table root into &lt;code&gt;CR3&lt;/code&gt;, which invalidates (most of) the TLB. The next process has to re-warm the TLB from scratch.&lt;/p&gt;

&lt;p&gt;Spectre and Meltdown exploited the relationship between virtual memory and CPU caches to leak data across process boundaries. The attacks used speculative execution to access memory the process shouldn't see, then measured &lt;em&gt;cache&lt;/em&gt; timing — not TLB timing — to observe what was read. The cache side-channel is the key: a speculatively loaded cache line leaves a timing fingerprint even after the speculative execution is rolled back. I cover the full mechanism in &lt;a href="https://nazquadri.dev/blog/the-layer-below/11-how-code-runs/" rel="noopener noreferrer"&gt;&lt;em&gt;How Your Python Code Actually Runs&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk4bfdeihghvl1izfms9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk4bfdeihghvl1izfms9.jpeg" alt="4-level page table walk — four memory reads, or the TLB shortcut" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When There Are No Free Frames
&lt;/h2&gt;

&lt;p&gt;The page fault handler needs a free physical frame. What if there aren't any?&lt;/p&gt;

&lt;p&gt;This is where the kernel earns its keep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reclaim path.&lt;/strong&gt; The kernel's memory management subsystem maintains lists of physical frames sorted roughly by how recently they were accessed — this is the &lt;strong&gt;LRU&lt;/strong&gt; (Least Recently Used) approximation. When free frames run out, the kernel tries to reclaim some.&lt;/p&gt;

&lt;p&gt;Clean pages — pages whose content matches what's on disk (think: file-backed mmaps, shared libraries) — can be simply discarded. The data is already on disk. If the process touches that page again, it page-faults back in. No data lost, just latency.&lt;/p&gt;

&lt;p&gt;Dirty pages — pages with data that exists only in RAM — have to be written somewhere before the frame can be reused. If swap space is configured, they go to swap. The page table entry is updated to "swapped out, here's the location on disk." If the process touches that page again, it takes a much more expensive page fault (disk I/O), the page is read back from swap, and execution resumes.&lt;/p&gt;

&lt;p&gt;That's why &lt;code&gt;mlock()&lt;/code&gt; exists. If you need to guarantee that a page stays in RAM — real-time audio processing, cryptographic keys that must never touch swap — you call &lt;code&gt;mlock()&lt;/code&gt;. This forces the page fault to happen immediately and pins the physical frame.&lt;/p&gt;
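&lt;p&gt;In code that looks like this (wrapper names are mine; the wipe-before-unlock pattern is the standard discipline for key material):&lt;/p&gt;

```c
/* Pin a buffer in RAM so it can never be swapped out. mlock() also
 * forces the pages to be faulted in and their frames pinned up front. */
#include <string.h>
#include <sys/mman.h>

int lock_secret(void *buf, size_t len) {
    return mlock(buf, len);   /* may fail if RLIMIT_MEMLOCK is too low */
}

/* Wipe before unlocking, so the frame never holds the secret when
 * it's returned to the kernel's free list. */
int release_secret(void *buf, size_t len) {
    memset(buf, 0, len);
    return munlock(buf, len);
}
```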

&lt;p&gt;&lt;strong&gt;When even that isn't enough.&lt;/strong&gt; If the kernel has churned through its reclaim options and still can't find a free frame, it invokes the &lt;strong&gt;OOM killer&lt;/strong&gt;: the out-of-memory killer. This is a last-resort mechanism that selects a process — based on a scoring heuristic that considers memory usage, process priority, how long it's been running, and a few other factors — and kills it with &lt;code&gt;SIGKILL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The process has no say. There's no handler to catch it. One frame you're executing instructions. Next frame, black. No cut scene. No credits. As Henry Hill would say — "And that's that."&lt;/p&gt;

&lt;p&gt;If you've ever had a long-running service vanish with no error, no log entry, no exit code — check &lt;code&gt;/var/log/kern.log&lt;/code&gt; or &lt;code&gt;dmesg&lt;/code&gt; for "Out of memory: Kill process". There's a very good chance the OOM killer found you before you found it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fork and the Copy-On-Write Trick
&lt;/h2&gt;

&lt;p&gt;Demand paging enables one of the most useful cheap operations in Unix: &lt;code&gt;fork()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fork()&lt;/code&gt; duplicates the parent's entire virtual address space, but it doesn't copy any of the physical pages. Instead, it marks all pages copy-on-write (COW): both parent and child map to the same physical frames, but the page table entries are marked read-only. The first write to any page by either party triggers a page fault; the handler copies that one frame, and the two processes get independent copies.&lt;/p&gt;

&lt;p&gt;A thousand-page process can fork in microseconds.&lt;/p&gt;

&lt;p&gt;That's why web servers and shell pipelines can spawn child processes so cheaply — the kernel is not copying megabytes of RAM, it's updating page table entries.&lt;/p&gt;
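&lt;p&gt;You can watch copy-on-write do its thing in a dozen lines. The child's write below triggers the copy; the parent's view is untouched (the function name is mine):&lt;/p&gt;

```c
/* Demonstrate COW: after fork(), parent and child share physical
 * frames until one of them writes. The child's write copies just
 * that one frame; the parent never sees the change. */
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns what the parent sees after the child mutated its copy. */
int cow_demo(void) {
    int *val = malloc(sizeof *val);
    *val = 42;
    pid_t pid = fork();
    if (pid == 0) {          /* child: this write triggers the COW fault */
        *val = 99;           /* the frame is copied; parent unaffected */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    int seen = *val;         /* still 42 in the parent */
    free(val);
    return seen;
}
```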




&lt;h2&gt;
  
  
  The Stack Is Different
&lt;/h2&gt;

&lt;p&gt;One more thing worth knowing: the stack is handled differently from the heap.&lt;/p&gt;

&lt;p&gt;The stack starts at a fixed virtual address and grows downward (on x86). The kernel doesn't actually map all of it upfront — it only maps a few pages, and relies on page faults to extend it when needed. If your function goes deeper, it faults into a new page, the kernel checks that it's still within the stack's allowed region, and extends the mapping.&lt;/p&gt;

&lt;p&gt;This is also why stack overflows aren't caught by &lt;code&gt;malloc&lt;/code&gt; failure — you don't call any allocator function. You just write past the end of the stack's mapped region, hit a page that was deliberately left unmapped (the &lt;strong&gt;guard page&lt;/strong&gt;), and get a segfault. The guard page is not an accident; it's there specifically to catch stack overflows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;infinite_recurse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;   &lt;span class="c1"&gt;// touches a new stack page on each call&lt;/span&gt;
    &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// force the page to be mapped&lt;/span&gt;
    &lt;span class="n"&gt;infinite_recurse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// → SIGSEGV on guard page after ~8000-16000 frames (default 8MB stack)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What You Actually Control
&lt;/h2&gt;

&lt;p&gt;The system we've described is mostly automatic. But you have more control than you might think:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/proc/sys/vm/overcommit_memory&lt;/code&gt;&lt;/strong&gt; — set to &lt;code&gt;2&lt;/code&gt; for strict accounting: the kernel enforces a commit limit (swap plus a configurable fraction of RAM) and refuses allocations beyond it, so &lt;code&gt;malloc&lt;/code&gt; returns NULL up front instead of the OOM killer firing later. This is appropriate for databases and other programs that prefer a clean error to a sudden kill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;mlock()&lt;/code&gt; / &lt;code&gt;mlockall()&lt;/code&gt;&lt;/strong&gt; — pin pages in RAM, prevent swapping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;madvise()&lt;/code&gt;&lt;/strong&gt; — tell the kernel how you plan to use a memory region. &lt;code&gt;MADV_SEQUENTIAL&lt;/code&gt; lets it read ahead aggressively. &lt;code&gt;MADV_FREE&lt;/code&gt; tells it the pages' contents are no longer needed, so the frames can be reclaimed without being written to swap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;malloc_trim()&lt;/code&gt;&lt;/strong&gt; (glibc-specific) — tell the allocator to release unused heap memory back to the kernel. Useful for long-running services that allocate a lot and then don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LD_PRELOAD&lt;/code&gt; swap&lt;/strong&gt; — because allocators are userspace, you can swap them out entirely. Replace ptmalloc with jemalloc or mimalloc by setting &lt;code&gt;LD_PRELOAD&lt;/code&gt;. Companies have done this to get 10–30% memory savings and significant throughput improvements with zero code changes.&lt;/li&gt;
&lt;/ul&gt;
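&lt;p&gt;&lt;code&gt;madvise()&lt;/code&gt; is the easiest of these to try. A sketch using two hints (wrapper names are mine; it uses &lt;code&gt;MADV_DONTNEED&lt;/code&gt;, the stricter, more widely available cousin of &lt;code&gt;MADV_FREE&lt;/code&gt;):&lt;/p&gt;

```c
/* Advisory hints about a mapped region's access pattern. madvise()
 * requires a page-aligned address, and since it's advisory, a
 * failure here is rarely fatal to the program. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

int hint_sequential(void *addr, size_t len) {
    /* "I'll read this front to back" -> kernel may read ahead */
    return madvise(addr, len, MADV_SEQUENTIAL);
}

int hint_done_with(void *addr, size_t len) {
    /* "I no longer need these pages" -> frames reclaimed immediately */
    return madvise(addr, len, MADV_DONTNEED);
}
```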




&lt;h2&gt;
  
  
  The Full Picture
&lt;/h2&gt;

&lt;p&gt;Let's run &lt;code&gt;malloc(64)&lt;/code&gt; one more time, end to end:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your code calls &lt;code&gt;malloc(64)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The allocator checks its free list. If a suitable chunk exists, it returns a pointer immediately. Done.&lt;/li&gt;
&lt;li&gt;If not, the allocator calls &lt;code&gt;mmap()&lt;/code&gt; (or &lt;code&gt;brk()&lt;/code&gt;) to request more virtual address space from the kernel.&lt;/li&gt;
&lt;li&gt;The kernel updates the virtual memory area (VMA) list for your process. No physical RAM involved yet.&lt;/li&gt;
&lt;li&gt;The allocator carves off 64 bytes and returns the pointer.&lt;/li&gt;
&lt;li&gt;Your code writes to that pointer for the first time.&lt;/li&gt;
&lt;li&gt;The MMU translates the virtual address. Page table says: valid range, no physical frame.&lt;/li&gt;
&lt;li&gt;CPU raises a page fault.&lt;/li&gt;
&lt;li&gt;Kernel page fault handler runs. Finds a free physical frame. Zeroes it. Updates page table. Flushes TLB entry.&lt;/li&gt;
&lt;li&gt;CPU re-executes the faulting instruction. The write completes.&lt;/li&gt;
&lt;li&gt;Your program continues, completely unaware that hardware exceptions and kernel code were involved.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's the machinery behind a function call so common you type it without thinking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/mmap.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 mmap&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/brk.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 brk&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — The kernel interfaces your allocator actually uses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/mlock.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 mlock&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — How to pin memory in RAM if you need a guarantee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.kernel.org/doc/gorman/html/understand/" rel="noopener noreferrer"&gt;Understanding the Linux Virtual Memory Manager&lt;/a&gt;&lt;/strong&gt; — Mel Gorman's book, available free online. The definitive deep dive into everything above and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man5/proc_pid_maps.5.html" rel="noopener noreferrer"&gt;&lt;code&gt;/proc/&amp;lt;pid&amp;gt;/maps&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man5/proc_pid_maps.5.html" rel="noopener noreferrer"&gt;&lt;code&gt;/proc/&amp;lt;pid&amp;gt;/smaps&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — Watch your own process's virtual memory regions in real time. &lt;code&gt;smaps&lt;/code&gt; shows RSS (resident set size) per region — the difference between virtual and physical becomes very concrete very fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://lwn.net/Articles/250967/" rel="noopener noreferrer"&gt;What every programmer should know about memory&lt;/a&gt;&lt;/strong&gt; — Ulrich Drepper's 2007 paper. Still the best single document on the full memory hierarchy.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri once had the OOM killer murder his database at 3am and deserved it. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Event Loop You're Already Using</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:03:42 +0000</pubDate>
      <link>https://forem.com/nazq/the-event-loop-youre-already-using-5305</link>
      <guid>https://forem.com/nazq/the-event-loop-youre-already-using-5305</guid>
      <description>&lt;h1&gt;
  
  
  The Event Loop You're Already Using
&lt;/h1&gt;

&lt;h2&gt;
  
  
  select, poll, epoll, and the System Calls Behind Every Async Framework
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You wrote &lt;code&gt;await fetch(url)&lt;/code&gt;. Your Node.js server handled ten thousand simultaneous connections while it waited. Your CPU usage barely moved.&lt;/p&gt;

&lt;p&gt;Here's what actually happened: your code called into a JavaScript engine, which called libuv, which called &lt;code&gt;epoll_wait&lt;/code&gt;, which asked the kernel to wake it up when any of ten thousand &lt;a href="https://nazquadri.dev/blog/the-layer-below/03-file-descriptors/" rel="noopener noreferrer"&gt;file descriptors&lt;/a&gt; had data ready. The kernel said nothing for 40 milliseconds. Then it said: "three of them are ready." Your event loop woke up and processed exactly those three. The other 9,997 connections cost you nothing while they waited.&lt;/p&gt;

&lt;p&gt;That's the whole trick. One syscall. The kernel does the waiting. You do the work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters Beyond "Async Is Fast"
&lt;/h2&gt;

&lt;p&gt;You've heard that async I/O is efficient. You may have accepted that on faith, or from a benchmark someone posted. But without understanding the layer underneath, you're flying on instruments you don't know how to read.&lt;/p&gt;

&lt;p&gt;Here's the bug you'll eventually hit: you write some async Python, everything looks right, and it's still blocking. Your entire server stalls for 300 milliseconds every time a particular function runs. You add more workers. It keeps happening. The problem is a single synchronous call buried in a dependency — a &lt;code&gt;time.sleep&lt;/code&gt;, a blocking DNS lookup, an &lt;code&gt;open()&lt;/code&gt; on a network filesystem. That call doesn't yield to the event loop. It holds the thread hostage until it returns.&lt;/p&gt;

&lt;p&gt;Understanding the mechanism is the only way to understand why that's catastrophic — and why the fix isn't "just add async," it's "figure out where you're not actually doing non-blocking I/O."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem That Needed Solving
&lt;/h2&gt;

&lt;p&gt;Before we get to the solutions, let's understand what problem they solve.&lt;/p&gt;

&lt;p&gt;It's 1983. You're writing a server. A client connects. You &lt;code&gt;read()&lt;/code&gt; from the socket. If there's no data yet, your process blocks — it goes to sleep, the CPU runs something else, and you wake up when data arrives. This is called &lt;strong&gt;blocking I/O&lt;/strong&gt;, and for one client, it's totally fine.&lt;/p&gt;

&lt;p&gt;Scale it up. A thousand clients. Each &lt;code&gt;read()&lt;/code&gt; call could block. Your single-threaded process blocks on the first client who has nothing to say yet, while the other 999 who have data sit there waiting. The obvious fix is threads — one thread per client. But a thousand threads means a thousand stacks (8MB of reserved virtual address space each, by default on Linux), a thousand kernel scheduling contexts, and constant context-switching overhead.&lt;/p&gt;

&lt;p&gt;In 1983, you couldn't afford that. Four decades later, the math still gets ugly fast. A modern web server at scale handles hundreds of thousands of connections. You cannot have hundreds of thousands of threads.&lt;/p&gt;

&lt;p&gt;What you want is a way to say: "Here's a list of a hundred thousand file descriptors. Tell me when any of them have something interesting." One call. The kernel blocks until there's work. You wake up, process exactly what's ready, go back to sleep.&lt;/p&gt;

&lt;p&gt;That's the problem. Here are the solutions, in chronological order of how well they work.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;select&lt;/code&gt;: The 1983 Hammer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;select&lt;/code&gt;&lt;/strong&gt; was the first answer, and it arrived with 4.2BSD in 1983 — the Berkeley team's first attempt to solve the multiplexing problem in a standard way. The interface looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nfds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd_set&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;readfds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd_set&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;writefds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;fd_set&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;exceptfds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;timeval&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You give it three sets of file descriptors — ones you want to read, ones you want to write, ones where you care about exceptions — and a timeout. It blocks until something is ready or the timeout expires. When it returns, the sets have been modified in place to show you &lt;em&gt;which&lt;/em&gt; ones are ready.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;fd_set&lt;/code&gt; is a bitmask. On most systems, it's 1024 bits. That's your limit: 1024 file descriptors, max. (You can recompile with a larger &lt;code&gt;FD_SETSIZE&lt;/code&gt;, but you can't escape the O(n) scan.) If you need more, &lt;code&gt;select&lt;/code&gt; literally cannot help you.&lt;/p&gt;

&lt;p&gt;But it gets worse. Every time you call &lt;code&gt;select&lt;/code&gt;, you have to rebuild the set of descriptors you care about, pass it to the kernel, and the kernel has to walk every bit of that mask to figure out which ones changed. For 1000 connections, that's 1000 checks on every call, even if only one descriptor became ready.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;select&lt;/code&gt; is O(n) where n is the number of file descriptors you're watching, &lt;em&gt;regardless of how many are actually active&lt;/em&gt;. At scale, this becomes expensive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foni56is5sf4j52y258eh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foni56is5sf4j52y258eh.jpeg" alt="select() O(n) scan — checking all 1024 bits for 3 ready fds" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;poll&lt;/code&gt;: Slightly Less Wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;poll&lt;/code&gt;&lt;/strong&gt; first appeared in System V (SVR3, 1986) and was later standardized by POSIX: the same concept, without the 1024 limit. Instead of a bitmask, you pass an array of &lt;code&gt;struct pollfd&lt;/code&gt; structures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;pollfd&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt;   &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// the file descriptor&lt;/span&gt;
    &lt;span class="kt"&gt;short&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// what you're interested in&lt;/span&gt;
    &lt;span class="kt"&gt;short&lt;/span&gt; &lt;span class="n"&gt;revents&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// what actually happened (filled by kernel)&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;pollfd&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nfds_t&lt;/span&gt; &lt;span class="n"&gt;nfds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No arbitrary limit on the number of fds. Better event granularity. Same fundamental problem: you still rebuild the entire array on every call, pass it to the kernel, and the kernel still has to walk every entry to check what's ready.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;poll&lt;/code&gt; is O(n) in the same way &lt;code&gt;select&lt;/code&gt; is. It fixed the 1024 limit and cleaned up the API, but it didn't fix the performance cliff at high connection counts.&lt;/p&gt;

&lt;p&gt;Both &lt;code&gt;select&lt;/code&gt; and &lt;code&gt;poll&lt;/code&gt; have another problem: every call copies the entire list of descriptors from user space to kernel space. For 100,000 connections, that's 100,000 copies of a struct on every call, in a tight loop. The data movement alone becomes your bottleneck — an O(n) copy cost layered on top of the O(n) scan.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;epoll&lt;/code&gt;: The Linux Answer (And Why Linux Won)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;epoll&lt;/code&gt;&lt;/strong&gt; landed in Linux 2.5.44 in 2002. It rethinks the whole interface.&lt;/p&gt;

&lt;p&gt;Instead of passing the full list of descriptors on every call, you create an &lt;code&gt;epoll&lt;/code&gt; instance — a kernel-managed data structure that persists between calls — and add file descriptors to it once. Then you just ask "what's ready?", and the kernel has the state it needs already.&lt;/p&gt;

&lt;p&gt;Three syscalls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Create the epoll instance&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;epfd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;epoll_create1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Register a file descriptor with it (once, not every loop)&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;epoll_event&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EPOLLIN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sockfd&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="n"&gt;epoll_ctl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EPOLL_CTL_ADD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sockfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Wait for events (this is the blocking call in your event loop)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;epoll_wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAX_EVENTS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout_ms&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;code&gt;epoll_wait&lt;/code&gt; returns only the descriptors that are &lt;em&gt;actually ready&lt;/em&gt;. If you're watching 100,000 connections and 3 have data, &lt;code&gt;epoll_wait&lt;/code&gt; returns 3. You process 3. The kernel doesn't enumerate the other 99,997.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;epoll&lt;/code&gt; makes the waiting O(1) in the number of watched descriptors: registering or removing a descriptor is a one-time O(log n) operation (the interest set lives in a red-black tree), not a cost paid on every call. The difference between &lt;code&gt;select&lt;/code&gt; and &lt;code&gt;epoll&lt;/code&gt; at 100,000 connections is the difference between 100,000 operations per loop iteration and approximately 3.&lt;/p&gt;

&lt;p&gt;This is why Node.js became credible at high connection counts. Node runs on libuv, which uses &lt;code&gt;epoll&lt;/code&gt; on Linux. One thread, one event loop, one &lt;code&gt;epoll_wait&lt;/code&gt; call, and the kernel does the heavy lifting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwgxqmhpe4zutc8r8rl6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwgxqmhpe4zutc8r8rl6.jpeg" alt="select/poll vs epoll — copy everything vs return only what's ready" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;kqueue&lt;/code&gt;: BSD Did It Too
&lt;/h2&gt;

&lt;p&gt;If you're on macOS or FreeBSD, the equivalent is &lt;strong&gt;&lt;code&gt;kqueue&lt;/code&gt;&lt;/strong&gt;, which appeared in FreeBSD 4.1 in 2000 — actually two years before &lt;code&gt;epoll&lt;/code&gt;. Different API, same idea: persistent kernel state, O(1) wakeups, batch event delivery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;kq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kqueue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;kevent&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ident&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sockfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EVFILT_READ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EV_ADD&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;EV_ENABLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="n"&gt;kevent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Wait for events&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;kevent&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MAX_EVENTS&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kevent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAX_EVENTS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kqueue&lt;/code&gt; is more elegant than &lt;code&gt;epoll&lt;/code&gt; — a single syscall handles both registration and waiting — and it watches more than just file descriptors. You can use the same interface to watch for process exits, signals, timers, and file system changes. The event model is unified. But it's BSD-only, so it doesn't run on Linux.&lt;/p&gt;

&lt;p&gt;This is the proliferation problem that every cross-platform runtime faces. You want &lt;code&gt;epoll&lt;/code&gt; on Linux, &lt;code&gt;kqueue&lt;/code&gt; on macOS/BSD, and IOCP on Windows (which is a completely different model). The next section is about how each major runtime solves this.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;O_NONBLOCK&lt;/code&gt;: The Prerequisite Nobody Explains
&lt;/h2&gt;

&lt;p&gt;Here's the part that trips people up. Having an efficient waiting mechanism isn't enough. The individual I/O operations themselves have to be non-blocking, or the whole thing falls apart.&lt;/p&gt;

&lt;p&gt;By default, &lt;code&gt;read()&lt;/code&gt; blocks. If you call &lt;code&gt;read()&lt;/code&gt; on a socket with no data available, your thread sleeps until data arrives. &lt;code&gt;epoll&lt;/code&gt; told you the socket was ready, so in normal operation this doesn't happen — but edge cases and bugs can still get you here. More importantly, some operations — &lt;code&gt;connect()&lt;/code&gt;, &lt;code&gt;write()&lt;/code&gt; on a full buffer — can block even when &lt;code&gt;epoll&lt;/code&gt; says you're good to go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;O_NONBLOCK&lt;/code&gt;&lt;/strong&gt; is a flag you set on a file descriptor to change this behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fcntl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F_GETFL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;fcntl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F_SETFL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;O_NONBLOCK&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;O_NONBLOCK&lt;/code&gt; set, I/O on that descriptor never blocks (for sockets and pipes, at least; regular file reads are a separate story). If a &lt;code&gt;read()&lt;/code&gt; would have blocked (no data available), it returns immediately with &lt;code&gt;EAGAIN&lt;/code&gt; or &lt;code&gt;EWOULDBLOCK&lt;/code&gt;. If a &lt;code&gt;write()&lt;/code&gt; would have blocked (send buffer full), same thing.&lt;/p&gt;

&lt;p&gt;Your code then handles &lt;code&gt;EAGAIN&lt;/code&gt; by going back to the event loop — "okay, nothing ready, I'll wait for &lt;code&gt;epoll&lt;/code&gt; to tell me when to try again."&lt;/p&gt;

&lt;p&gt;Without &lt;code&gt;O_NONBLOCK&lt;/code&gt;, even a single blocking operation inside your event loop stalls everything. The whole async model breaks down. This is where that "buried synchronous call" bug comes from: some dependency opens a file descriptor, forgets to set &lt;code&gt;O_NONBLOCK&lt;/code&gt;, calls &lt;code&gt;read()&lt;/code&gt;, and your entire event loop freezes while it waits.&lt;/p&gt;

&lt;p&gt;"Use async all the way down" isn't style advice. It's a correctness requirement. Call &lt;code&gt;time.sleep(1)&lt;/code&gt; inside an event loop and you've blocked the only thread — no callbacks fire, no connections are served, the whole system goes dark for one second. Python's &lt;code&gt;asyncio.sleep&lt;/code&gt; exists for exactly this reason: it yields back to the event loop instead of blocking the thread. Same reason you never call &lt;code&gt;requests.get&lt;/code&gt; in async code — it's a blocking HTTP client that holds the thread hostage while it waits for a response. Use &lt;code&gt;aiohttp&lt;/code&gt; or &lt;code&gt;httpx&lt;/code&gt; instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Frameworks Plug In
&lt;/h2&gt;

&lt;p&gt;Let's trace how the high-level abstractions land on these syscalls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node.js / libuv
&lt;/h3&gt;

&lt;p&gt;libuv runs a loop: check timers, check I/O callbacks, call &lt;code&gt;epoll_wait&lt;/code&gt; (Linux) or &lt;code&gt;kqueue&lt;/code&gt; (macOS) with whatever timeout makes sense given pending timers. When it wakes up, it dispatches callbacks. Your &lt;code&gt;await fetch(url)&lt;/code&gt; eventually becomes a socket, which gets registered with &lt;code&gt;epoll&lt;/code&gt;, which wakes up libuv, which calls your callback, which resolves the Promise. The "event loop" you've heard about? It's this loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python asyncio
&lt;/h3&gt;

&lt;p&gt;Same pattern, different language. &lt;code&gt;asyncio&lt;/code&gt;'s default &lt;code&gt;SelectorEventLoop&lt;/code&gt; uses Python's &lt;code&gt;selectors&lt;/code&gt; module, which picks the best available backend for the platform — &lt;code&gt;epoll&lt;/code&gt; on Linux, &lt;code&gt;kqueue&lt;/code&gt; on macOS, &lt;code&gt;select&lt;/code&gt; as a last resort. On Linux, you're already on &lt;code&gt;epoll&lt;/code&gt; out of the box. &lt;code&gt;uvloop&lt;/code&gt; wraps libuv for even better performance, but the default is not as slow as people assume. Windows gets &lt;code&gt;ProactorEventLoop&lt;/code&gt; backed by IOCP. The &lt;code&gt;await&lt;/code&gt; keyword doesn't make I/O async — the &lt;em&gt;underlying selector&lt;/em&gt; does. &lt;code&gt;await&lt;/code&gt; just lets the event loop know your coroutine is willing to pause.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tokio (Rust)
&lt;/h3&gt;

&lt;p&gt;Tokio uses &lt;a href="https://docs.rs/mio/latest/mio/" rel="noopener noreferrer"&gt;mio&lt;/a&gt;, a thin safe wrapper around &lt;code&gt;epoll&lt;/code&gt;/&lt;code&gt;kqueue&lt;/code&gt;/IOCP. Its async runtime is more explicit about what it's doing than Node.js — you can see the reactor and the executor as separate components. The reactor watches file descriptors; the executor schedules tasks. An &lt;code&gt;.await&lt;/code&gt; in Tokio suspends a task and hands control back to the executor, which runs other tasks until the reactor reports that the descriptor is ready. Then the task is rescheduled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Go goroutines
&lt;/h3&gt;

&lt;p&gt;Go's runtime is the most opaque of the four. You write blocking-looking code — &lt;code&gt;conn.Read(buf)&lt;/code&gt; — and the runtime makes it non-blocking behind your back. When a goroutine would block on I/O, the runtime parks the goroutine, registers the descriptor with the network poller (which uses &lt;code&gt;epoll&lt;/code&gt; on Linux), and continues running other goroutines on the same OS thread. When the data arrives, the poller wakes up the parked goroutine. From your perspective, it blocked. In reality, &lt;code&gt;epoll_wait&lt;/code&gt; was called and the goroutine was context-switched away.&lt;/p&gt;

&lt;p&gt;That's why Go can have a million goroutines — they're not OS threads, and they don't block OS threads when they wait on I/O.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zq9m70xt51ya6157e6e.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zq9m70xt51ya6157e6e.jpeg" alt="Four runtimes, same primitives — epoll_wait, kevent, IOCP" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What The Kernel Is Actually Doing
&lt;/h2&gt;

&lt;p&gt;It's worth briefly understanding how &lt;code&gt;epoll&lt;/code&gt; does its job efficiently.&lt;/p&gt;

&lt;p&gt;When you register a file descriptor with &lt;code&gt;epoll_ctl&lt;/code&gt;, the kernel attaches a callback to the descriptor's wait queue. This is a list of sleeping tasks that should be woken up when the descriptor becomes ready.&lt;/p&gt;

&lt;p&gt;When a packet arrives, it follows a specific path: the NIC triggers a hardware interrupt, which causes the CPU to run the network driver's interrupt handler, which feeds the packet up through the kernel's network stack, which places data in the socket's receive buffer. At that point, the socket's wait queue callbacks fire — including the one &lt;code&gt;epoll&lt;/code&gt; registered. The callback adds the descriptor to &lt;code&gt;epoll&lt;/code&gt;'s ready list.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;epoll_wait&lt;/code&gt; returns, it's just reading off that ready list. No scanning. No iteration over your 100,000 connections. The work was done when the packet arrived, not when you asked.&lt;/p&gt;

&lt;p&gt;That's why &lt;code&gt;EAGAIN&lt;/code&gt; isn't an error. When &lt;code&gt;O_NONBLOCK&lt;/code&gt; is set and &lt;code&gt;read()&lt;/code&gt; returns -1 with &lt;code&gt;errno == EAGAIN&lt;/code&gt;, the socket is politely saying "nothing ready yet." Your event loop re-registers with &lt;code&gt;epoll&lt;/code&gt; and waits. Every async I/O library handles this for you, which is why you've probably never seen it directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Same Kernel, The Same Primitives
&lt;/h2&gt;

&lt;p&gt;The abstractions you use every day — async/await, goroutines, green threads, futures — are different UX choices on top of the same three or four kernel primitives. The kernel API hasn't changed much since &lt;code&gt;epoll&lt;/code&gt; landed in 2002. What's changed is how well we've wrapped it.&lt;/p&gt;

&lt;p&gt;Node's innovation wasn't non-blocking I/O. The kernel had that. Node's innovation was making it the &lt;em&gt;default&lt;/em&gt; — putting it in a single-threaded event loop and forcing the programming model to accommodate it. Python's &lt;code&gt;asyncio&lt;/code&gt; brought the same model to a language that was threading-first. Rust's Tokio gave you the same power with compile-time correctness guarantees. Go hid the whole thing behind a familiar synchronous-looking syntax.&lt;/p&gt;

&lt;p&gt;All roads lead to &lt;code&gt;epoll_wait&lt;/code&gt;. The question is just how many layers of abstraction are between you and the call.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man7/epoll.7.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 7 epoll&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the Linux epoll interface, with edge-triggered vs. level-triggered details (which we glossed over — it's worth reading)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/select.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 select&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/poll.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 poll&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the originals, for historical grounding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="http://www.kegel.com/c10k.html" rel="noopener noreferrer"&gt;The C10K Problem&lt;/a&gt;&lt;/strong&gt; — Dan Kegel's 2001 writeup that defined the problem and surveyed every solution available at the time. A historical artifact and still essential reading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.libuv.org/en/v1.x/design.html" rel="noopener noreferrer"&gt;libuv design overview&lt;/a&gt;&lt;/strong&gt; — how Node's I/O layer works, with diagrams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://tokio.rs/blog/2019-10-scheduler" rel="noopener noreferrer"&gt;Tokio internals&lt;/a&gt;&lt;/strong&gt; — the Tokio scheduler design post, remarkably readable&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri once blocked the entire event loop with a single &lt;code&gt;time.sleep(0.1)&lt;/code&gt; and spent two hours blaming the network. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What "Connected" Means in TCP</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:03:13 +0000</pubDate>
      <link>https://forem.com/nazq/what-connected-means-in-tcp-13jb</link>
      <guid>https://forem.com/nazq/what-connected-means-in-tcp-13jb</guid>
      <description>&lt;h1&gt;
  
  
  What "Connected" Means in TCP
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Three-Packet Handshake Between Two Kernels Who've Never Met
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You called &lt;code&gt;connect()&lt;/code&gt;. Your code moved on. You're "connected."&lt;/p&gt;

&lt;p&gt;Nothing physical connected. No wire was plugged in. No circuit was closed. Three packets flew across the network and landed in two kernel data structures — a hash table entry on your machine and a hash table entry on the server — and that's it. That's the whole "connection." A gentleman's agreement between two kernels who've never met, maintained by nothing more than both sides keeping their word.&lt;/p&gt;

&lt;p&gt;The moment either kernel loses that state — crash, memory pressure, a firewall that forgets to tell anyone — your "connection" evaporates. The wire is still there. The bytes stop flowing.&lt;/p&gt;

&lt;p&gt;Most of us debug socket code for years without understanding what &lt;code&gt;connect()&lt;/code&gt; actually does. Every mysterious hang is a debt being collected. Let's fix it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The State Machine Behind &lt;code&gt;connect()&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The kernel maintains a state machine for every TCP connection. You've probably seen a diagram of it in a textbook and immediately forgotten it. That's fair — the full diagram has eleven states and looks like a fire escape route — which it is, in a way. Every state represents a moment when one side might crash and the other needs a graceful exit. &lt;a href="https://datatracker.ietf.org/doc/html/rfc793" rel="noopener noreferrer"&gt;RFC 793&lt;/a&gt; defined all eleven in 1981.&lt;/p&gt;

&lt;p&gt;But here's the part that matters: when you call &lt;code&gt;connect()&lt;/code&gt;, your kernel doesn't "connect" to anything. It starts a negotiation. A very specific, three-packet negotiation called the &lt;strong&gt;handshake&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's what actually happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your machine                    Remote machine
     │                               │
     │  ──── SYN ──────────────────► │   "I want to talk. My sequence starts at X."
     │                               │
     │  ◄─── SYN-ACK ─────────────── │   "OK. Mine starts at Y. I got your X."
     │                               │
     │  ──── ACK ──────────────────► │   "Got it. Now we're both synchronized."
     │                               │
   ESTABLISHED                  ESTABLISHED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three packets. That's the whole handshake.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;SYN&lt;/strong&gt; packet contains a randomly chosen &lt;strong&gt;sequence number&lt;/strong&gt; — a 32-bit integer that will be used to label every byte your machine sends. The server responds with its own SYN (it needs to start its own sequence), plus an ACK for yours. Your ACK completes the circle.&lt;/p&gt;

&lt;p&gt;After those three packets, both kernels have entered the &lt;code&gt;ESTABLISHED&lt;/code&gt; state in their connection tables. The "connection" now exists.&lt;/p&gt;
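&lt;p&gt;You can watch the handshake as a distinct step from userspace. On a non-blocking socket, &lt;code&gt;connect()&lt;/code&gt; fires the SYN and returns immediately (usually with &lt;code&gt;EINPROGRESS&lt;/code&gt;), and the socket becomes writable once the kernel reaches ESTABLISHED. A sketch against a local listener:&lt;/p&gt;

```python
import errno
import select
import socket

# A listener on an ephemeral localhost port plays the remote machine.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.socket()
client.setblocking(False)

# On a non-blocking socket, connect() starts the SYN / SYN-ACK / ACK
# exchange and returns immediately, usually with EINPROGRESS.
rc = client.connect_ex(listener.getsockname())

# Writability is the kernel's signal that it reached ESTABLISHED.
_, writable, _ = select.select([], [client], [], 5)

# SO_ERROR == 0 means the handshake completed successfully.
err = client.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
print(rc in (0, errno.EINPROGRESS), err)  # True 0

client.close()
listener.close()
```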

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vef7ej9zumzdmd91bd6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vef7ej9zumzdmd91bd6.jpeg" alt="TCP three-way handshake — the connection lives here, not on the wire" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A Pair of Hash Table Entries
&lt;/h2&gt;

&lt;p&gt;What does that state actually look like? On Linux, every socket is a &lt;a href="https://nazquadri.dev/blog/the-layer-below/03-file-descriptors/" rel="noopener noreferrer"&gt;file descriptor&lt;/a&gt; — the same integer handle the kernel uses for files, pipes, and everything else. The kernel maintains a hash table of sockets keyed by a four-tuple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(source IP, source port, destination IP, destination port)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This four-tuple uniquely identifies a connection. Your machine can have thousands of connections to the same server on the same port — as long as the source port is different, they're different entries in the hash table.&lt;/p&gt;
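&lt;p&gt;You can inspect the four-tuple from userspace: &lt;code&gt;getsockname()&lt;/code&gt; returns the local half and &lt;code&gt;getpeername()&lt;/code&gt; the remote half. A sketch showing two connections to the same server port that differ only by source port:&lt;/p&gt;

```python
import socket

# A localhost listener plays the server.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(2)

# Two connections to the same (destination IP, destination port).
a = socket.create_connection(server.getsockname())
b = socket.create_connection(server.getsockname())

# The four-tuple, as the kernel keys it: local half + remote half.
tuple_a = a.getsockname() + a.getpeername()
tuple_b = b.getsockname() + b.getpeername()
print(tuple_a)
print(tuple_b)
# Same destination, different source port: two distinct table entries.

for s in (a, b, server):
    s.close()
```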

&lt;p&gt;When &lt;code&gt;connect()&lt;/code&gt; returns, there is one entry in your kernel's connection table and one in the server's. That's your "connection." It's not a circuit. It's not a reserved channel. It's two structs in two hash tables on two machines, agreeing to honor a sequence number protocol.&lt;/p&gt;

&lt;p&gt;When you're not sending anything, nothing is happening on the wire. The connection just... sits there. In RAM. Two kernels keeping state about each other.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Sequence Numbers Exist
&lt;/h2&gt;

&lt;p&gt;The original problem TCP solved is: the internet is unreliable. Packets get dropped. They arrive out of order. Routers duplicate them. They might take wildly different paths.&lt;/p&gt;

&lt;p&gt;IP doesn't care. IP's job is to route packets best-effort and move on. TCP's job is to build a reliable, ordered byte stream on top of that chaos.&lt;/p&gt;

&lt;p&gt;The way it does this is with &lt;strong&gt;sequence numbers&lt;/strong&gt;. Every byte you send has a position in the stream. If packet 3 arrives before packet 2, the receiver buffers it and waits. If packet 2 never arrives, the receiver asks for it again. If the same data arrives twice (duplicated in transit), the sequence number identifies the duplicate and it gets discarded.&lt;/p&gt;
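&lt;p&gt;The receiver's side of this can be sketched in a few lines: buffer segments by sequence number, deliver bytes only when the next expected position is present, and drop duplicates. A toy model, ignoring wraparound and partial overlaps:&lt;/p&gt;

```python
def reassemble(segments, initial_seq=0):
    """Toy model of receiver-side reassembly: segments is a list of
    (sequence_number, bytes) pairs arriving in any order, possibly
    duplicated. Not the kernel's implementation."""
    buffered = {}
    expected = initial_seq
    stream = b""
    for seq, data in segments:
        if expected > seq:
            continue  # duplicate of already-delivered data: discard
        buffered[seq] = data  # out-of-order data waits here
        # Deliver every contiguous chunk now available.
        while expected in buffered:
            chunk = buffered.pop(expected)
            stream += chunk
            expected += len(chunk)
    return stream

# The segment at offset 7 arrives early, and offset 0 is duplicated.
out = reassemble([(0, b"to "), (7, b"stream"), (3, b"the "), (0, b"to ")])
print(out)  # b'to the stream'
```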

&lt;p&gt;The sequence number you start with isn't zero. It's chosen randomly, for security: if it were predictable, an attacker could inject packets into your stream by guessing the next expected sequence number. This attack is called a &lt;strong&gt;blind injection attack&lt;/strong&gt; — Kevin Mitnick used predictable sequence numbers in his 1994 attack on Tsutomu Shimomura to hijack a trusted connection. &lt;a href="https://datatracker.ietf.org/doc/html/rfc6528" rel="noopener noreferrer"&gt;RFC 6528&lt;/a&gt; randomizes initial sequence numbers specifically to prevent this. The randomness is the defense.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmco8obe1jx08alw494s.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmco8obe1jx08alw494s.jpeg" alt="TCP byte stream reordering — packets arrive out of order, bytes don't" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Buffer Between You and the Wire
&lt;/h2&gt;

&lt;p&gt;Here's the piece of the mental model that surprises people most.&lt;/p&gt;

&lt;p&gt;When you call &lt;code&gt;send()&lt;/code&gt;, your bytes do not go on the wire. They go into a buffer in the kernel.&lt;/p&gt;

&lt;p&gt;The kernel's &lt;strong&gt;socket send buffer&lt;/strong&gt; sits between your application and the network stack. &lt;code&gt;send()&lt;/code&gt; copies your bytes there and returns immediately. The kernel sends them when it decides to — based on network conditions, the receiver's capacity, and a timer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# sock = socket already in ESTABLISHED state
&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# At this point: bytes are in kernel buffer.
# They have NOT left your machine.
# The remote end has NOT received them.
# send() returned anyway.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This has a corollary that catches people off guard: &lt;code&gt;send()&lt;/code&gt; can return &lt;em&gt;before the bytes leave your machine&lt;/em&gt;. It can return while your laptop is offline. It just means "I accepted your bytes." Delivery is a separate promise, made later, without your involvement.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;receive buffer&lt;/strong&gt; works the other way. When packets arrive, the kernel puts the data in the receive buffer and sends an ACK back to the sender — "got it" — before your application calls &lt;code&gt;recv()&lt;/code&gt;. Your code might be sleeping. The kernel is already acknowledging on your behalf.&lt;/p&gt;

&lt;p&gt;This decoupling is what makes TCP reliable without making your code complicated. The kernel manages retransmits, flow control, and reordering. You see a clean byte stream.&lt;/p&gt;

&lt;p&gt;That's why &lt;code&gt;send()&lt;/code&gt; returns before the data is delivered. There's a kernel buffer between &lt;code&gt;send()&lt;/code&gt; and the wire. &lt;code&gt;send()&lt;/code&gt; accepts bytes; it doesn't send them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Window Size and the Invisible Flow Controller
&lt;/h2&gt;

&lt;p&gt;The receiver tells the sender how much space it has in its receive buffer. This is the &lt;strong&gt;window size&lt;/strong&gt;, advertised in every TCP segment.&lt;/p&gt;

&lt;p&gt;If the receiver's buffer fills up — because the application is slow to call &lt;code&gt;recv()&lt;/code&gt; — the window size shrinks. When it hits zero, the sender stops sending. Entirely.&lt;/p&gt;

&lt;p&gt;This is called &lt;strong&gt;flow control&lt;/strong&gt;, and it's operating silently in every TCP connection you've ever used. Your &lt;code&gt;send()&lt;/code&gt; call doesn't hang because of the network. It hangs because the &lt;em&gt;application on the other end isn't reading fast enough&lt;/em&gt;, and that information has propagated backward through two kernel buffers and a TCP window advertisement.&lt;/p&gt;

&lt;p&gt;That's why your socket &lt;code&gt;send()&lt;/code&gt; sometimes blocks. The buffer is full. The buffer is full because the window is zero. The window is zero because the remote application is backed up. You're feeling the pressure of something happening three hops away.&lt;/p&gt;

&lt;p&gt;That's why backpressure propagates through database connection pools: a slow consumer in the application tier pushes all the way back to the TCP send buffer of the client.&lt;/p&gt;
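&lt;p&gt;You can feel that pressure from userspace. If the peer never calls &lt;code&gt;recv()&lt;/code&gt;, a non-blocking &lt;code&gt;send()&lt;/code&gt; loop eventually fails with &lt;code&gt;EAGAIN&lt;/code&gt; (a &lt;code&gt;BlockingIOError&lt;/code&gt; in Python) once both buffers fill. A self-contained sketch over loopback:&lt;/p&gt;

```python
import socket

# A connected pair over loopback; the "server" side never reads.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
client = socket.create_connection(listener.getsockname())
server, _ = listener.accept()

client.setblocking(False)
sent = 0
try:
    while True:
        # Each send() just copies bytes into the kernel send buffer...
        sent += client.send(b"x" * 65536)
except BlockingIOError:
    # ...until it fills, because the receive buffer filled, because
    # `server` never called recv(). That's backpressure, felt locally.
    pass

print(f"kernel accepted {sent} bytes before pushing back")
for s in (client, server, listener):
    s.close()
```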




&lt;h2&gt;
  
  
  Nagle's Algorithm: The Uninvited Optimizer
&lt;/h2&gt;

&lt;p&gt;In the 1980s, a network engineer named John Nagle noticed that people were sending single-character packets over slow serial links. Every keypress in a terminal session became a 41-byte TCP/IP frame — 40 bytes of headers, 1 byte of data. The network was clogging up with tiny packets.&lt;/p&gt;

&lt;p&gt;His fix was &lt;strong&gt;Nagle's algorithm&lt;/strong&gt;: don't send a small packet if there's outstanding unacknowledged data. Wait until you have a full packet's worth, or until your outstanding data gets acknowledged.&lt;/p&gt;

&lt;p&gt;It made a lot of sense in 1984. It still makes sense for bulk data transfers.&lt;/p&gt;

&lt;p&gt;It is a disaster for latency-sensitive protocols.&lt;/p&gt;

&lt;p&gt;The classic symptom: you're writing a client that sends a small request and waits for a response. The request is 10 bytes. Nagle buffers it because you sent a header 5 milliseconds ago that hasn't been ACK'd yet. Your round-trip time triples for no reason.&lt;/p&gt;

&lt;p&gt;The fix is &lt;code&gt;TCP_NODELAY&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IPPROTO_TCP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TCP_NODELAY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's why every high-performance network library sets &lt;code&gt;TCP_NODELAY&lt;/code&gt; by default now. Nagle's algorithm is opt-out behavior inherited from a world with 1200 baud modems.&lt;/p&gt;

&lt;p&gt;That's why Redis, PostgreSQL's wire protocol, and every low-latency RPC framework explicitly disable Nagle. The moment you're doing request/response over TCP, you want control over when the packet leaves.&lt;/p&gt;
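&lt;p&gt;It's worth checking what your platform and library stack actually left you with: the option reads back via &lt;code&gt;getsockopt&lt;/code&gt;. A quick sketch:&lt;/p&gt;

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Nagle is on by default: TCP_NODELAY reads back as 0.
before = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)

# Opt out of Nagle for this socket.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
after = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)

print(before, after)
sock.close()
```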




&lt;h2&gt;
  
  
  The Slow Close: &lt;code&gt;close()&lt;/code&gt; vs &lt;code&gt;shutdown()&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Here's where the state machine comes back to bite you.&lt;/p&gt;

&lt;p&gt;When you're done with a connection, you call &lt;code&gt;close()&lt;/code&gt;. What actually happens is more involved.&lt;/p&gt;

&lt;p&gt;TCP is a &lt;strong&gt;full-duplex&lt;/strong&gt; protocol. You have a stream going in each direction, independently. Closing a connection means closing both streams, but they don't have to close at the same time.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;close()&lt;/code&gt; closes the whole socket — both directions. If you haven't read all incoming data yet, that data is lost. This bites you when you're parsing an HTTP response: you &lt;code&gt;close()&lt;/code&gt; before reading the error body and get a connection reset instead of the error message you needed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;shutdown()&lt;/code&gt; is more precise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SHUT_WR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# I'm done sending. Remote can still send to me.
&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SHUT_RD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# I'm done receiving.
&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SHUT_RDWR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Both.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you &lt;code&gt;shutdown(SHUT_WR)&lt;/code&gt;, your kernel sends a &lt;strong&gt;FIN&lt;/strong&gt; packet to the remote end. That FIN means "I'm done sending data." The remote end can still send data back to you, and you can still receive it. Both sides have to send a FIN before the connection is truly closed.&lt;/p&gt;

&lt;p&gt;The four-packet close handshake (FIN → ACK → FIN → ACK) mirrors the three-packet open, but split across time: the two sides often close independently, because one side might have more to say.&lt;/p&gt;
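&lt;p&gt;The classic use of the half-close is "send request, signal end of request, read the full response": the client's &lt;code&gt;shutdown(SHUT_WR)&lt;/code&gt; delivers EOF to the server's read loop while the reverse stream stays open. A sketch with a thread playing the server:&lt;/p&gt;

```python
import socket
import threading

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)

def serve():
    conn, _ = listener.accept()
    request = b""
    while chunk := conn.recv(4096):  # recv() returns b"" at the client's FIN
        request += chunk
    conn.sendall(b"echo:" + request)  # our direction is still open
    conn.close()

t = threading.Thread(target=serve)
t.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"hello")
client.shutdown(socket.SHUT_WR)  # FIN: "I'm done sending"

response = b""
while chunk := client.recv(4096):  # but we can still receive
    response += chunk
print(response)  # b'echo:hello'

t.join()
client.close()
listener.close()
```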

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4ft83ad6uv4mklutkvs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4ft83ad6uv4mklutkvs.jpeg" alt="TCP connection close — TIME_WAIT waits 2xMSL" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TIME_WAIT: The Ghost That Haunts Your Port Numbers
&lt;/h2&gt;

&lt;p&gt;After the final ACK, the connection isn't immediately gone. The kernel enters &lt;strong&gt;TIME_WAIT&lt;/strong&gt; state and holds the four-tuple — source IP, source port, destination IP, destination port — for &lt;strong&gt;2 × MSL&lt;/strong&gt; (Maximum Segment Lifetime).&lt;/p&gt;

&lt;p&gt;On Linux, &lt;code&gt;TCP_TIMEWAIT_LEN&lt;/code&gt; is hardcoded to 60 seconds — Linux sets this constant directly rather than computing 2×MSL, though the 2×MSL formula from &lt;a href="https://datatracker.ietf.org/doc/html/rfc793" rel="noopener noreferrer"&gt;RFC 793&lt;/a&gt; describes the intent. This is not configurable via sysctl. The similarly named &lt;code&gt;tcp_fin_timeout&lt;/code&gt; controls something else entirely — the FIN_WAIT_2 timeout, not TIME_WAIT. Confusing the two is a common mistake, and one I've made more than once.&lt;/p&gt;

&lt;p&gt;Why? Because that final ACK might have been lost. If the remote end re-sends its FIN (because it didn't get the ACK), your kernel needs to be able to ACK it. If the connection were immediately gone, your kernel would send a RST instead, which is rude and could leave the remote end confused.&lt;/p&gt;

&lt;p&gt;There's a second reason: the internet is not instantaneous. Old duplicate packets from the &lt;em&gt;previous&lt;/em&gt; connection on this same four-tuple might still be in transit somewhere. TIME_WAIT prevents a new connection from misinterpreting those ancient packets as belonging to it.&lt;/p&gt;

&lt;p&gt;This matters when you're running a server that handles thousands of short-lived connections. Every client that disconnects leaves a TIME_WAIT entry behind. On a busy server, you can accumulate tens of thousands of these entries, each holding a port number hostage for 2 minutes.&lt;/p&gt;

&lt;p&gt;That's why servers run out of ports. Not because they're out of &lt;em&gt;listening&lt;/em&gt; ports. Because the ephemeral port range — the range the kernel uses for outbound connections — is full of TIME_WAIT ghosts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# See the carnage&lt;/span&gt;
ss &lt;span class="nt"&gt;-s&lt;/span&gt;
&lt;span class="c"&gt;# TIME-WAIT: 18432  ← these are connections waiting to die&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix on Linux is &lt;code&gt;SO_REUSEADDR&lt;/code&gt;, which lets you bind to an address/port combination that's still in TIME_WAIT. Most server frameworks set this automatically. (That handles the server-restart case; for ephemeral-port exhaustion on outbound connections, there's &lt;code&gt;net.ipv4.tcp_tw_reuse&lt;/code&gt;.) When you see mysterious "address already in use" errors after restarting a server, you've met TIME_WAIT in person.&lt;/p&gt;

&lt;p&gt;That's why your server runs out of ports after handling many short-lived connections. Not listening ports — ephemeral ports. The range is full of TIME_WAIT ghosts.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gentleman's Agreement
&lt;/h2&gt;

&lt;p&gt;Let's put it all together.&lt;/p&gt;

&lt;p&gt;When you call &lt;code&gt;connect()&lt;/code&gt;, you initiate a negotiation. Three packets create state in two kernel tables. The "connection" is those two table entries, nothing more. No circuit. No reserved bandwidth. No wire that's "yours."&lt;/p&gt;

&lt;p&gt;While the connection is open, two kernel buffers — yours and theirs — mediate every byte. &lt;code&gt;send()&lt;/code&gt; means "here, kernel, deal with this." &lt;code&gt;recv()&lt;/code&gt; means "give me whatever showed up." The network happens in between, managed entirely by kernel code you didn't write, running on a schedule you don't control.&lt;/p&gt;

&lt;p&gt;The receiver's window size limits how fast you can send. Nagle's algorithm may hold your packets hostage for milliseconds. Congestion control — a whole other topic we're glossing over here — may slow your throughput based on packet loss detected on the path between you.&lt;/p&gt;

&lt;p&gt;When you close the connection, the kernel sends FINs, waits for the remote side, and then holds the ghost of the connection for up to 60 seconds.&lt;/p&gt;

&lt;p&gt;None of this is visible to your code. You called &lt;code&gt;connect()&lt;/code&gt;. You called &lt;code&gt;send()&lt;/code&gt;. You called &lt;code&gt;recv()&lt;/code&gt;. The bytes arrived in order and the stream made sense.&lt;/p&gt;

&lt;p&gt;The "connection" is a gentleman's agreement. Both sides promise to track sequence numbers, ACK each other's bytes, respect each other's window sizes, and retransmit anything that goes unacknowledged. Neither side promises to stay alive, respond quickly, or tell the other if the application crashes. There's no heartbeat. No "are you still there?" The agreement holds until one side breaks it — or just goes silent and lets the retransmission timer expire.&lt;/p&gt;

&lt;p&gt;Every TCP connection on the planet is two structs in two hash tables, honouring a handshake that happened milliseconds or hours ago. The wire doesn't care. The agreement is all there is.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man7/tcp.7.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 7 tcp&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — Linux TCP socket options. Dense. Has everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man8/ss.8.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 8 ss&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — The socket statistics tool. Use &lt;code&gt;ss -s&lt;/code&gt; to see TIME_WAIT counts in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.amazon.com/TCP-Illustrated-Vol-Addison-Wesley-Professional/dp/0201633469" rel="noopener noreferrer"&gt;TCP Illustrated, Vol. 1&lt;/a&gt;&lt;/strong&gt; — Stevens. The definitive reference. Still accurate 30 years later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://beej.us/guide/bgnet/" rel="noopener noreferrer"&gt;Beej's Guide to Network Programming&lt;/a&gt;&lt;/strong&gt; — Free, practical, shows the real system calls. Where most of us actually learned sockets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc793" rel="noopener noreferrer"&gt;RFC 793&lt;/a&gt;&lt;/strong&gt; — The original TCP specification. Readable. Worth skimming just to see how much they got right in 1981.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri has mass produced more TIME_WAIT ghosts than any responsible engineer should. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Your DNS is Lying to You</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:02:58 +0000</pubDate>
      <link>https://forem.com/nazq/your-dns-is-lying-to-you-emd</link>
      <guid>https://forem.com/nazq/your-dns-is-lying-to-you-emd</guid>
      <description>&lt;h1&gt;
  
  
  Your DNS is Lying to You
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What Actually Happens Between a URL and the First Byte
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You typed &lt;code&gt;api.example.com&lt;/code&gt; into your browser — or &lt;code&gt;curl&lt;/code&gt;'d it, or your service tried to connect to it — and something happened. Some bytes arrived. You moved on.&lt;/p&gt;

&lt;p&gt;That something was DNS. And DNS is not a lookup table. It is a distributed, eventually consistent database with a 40-year-old trust model, deployed across millions of machines that have no obligation to agree with each other. When it goes wrong — and it does go wrong — the failure modes are some of the most maddening in all of networking, because the answer you get looks valid. It's just wrong.&lt;/p&gt;

&lt;p&gt;There are four distinct roles. Most people know one of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bug That Made This Click
&lt;/h2&gt;

&lt;p&gt;Picture this: a microservice can't connect to a dependency. Health checks pass. &lt;code&gt;curl&lt;/code&gt; works fine from your laptop. The service throws connection errors that make no sense.&lt;/p&gt;

&lt;p&gt;The service is running in a Docker container. Inside the container, &lt;code&gt;curl api.internal.corp&lt;/code&gt; returns a different IP than &lt;code&gt;dig api.internal.corp&lt;/code&gt; run from the same container.&lt;/p&gt;

&lt;p&gt;Different. IP.&lt;/p&gt;

&lt;p&gt;Same host. Same moment. Different tool. Different answer.&lt;/p&gt;

&lt;p&gt;We'll learn exactly why that's possible before the end of this post.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cast of Characters
&lt;/h2&gt;

&lt;p&gt;Before the mechanics, let's name the players. There are four distinct roles in the DNS resolution chain, and conflating them is the source of most confusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stub resolver&lt;/strong&gt; lives on your machine. It's not really a DNS server — it can't do much on its own. It's the code in libc (or your OS networking stack) that takes a hostname and says "I need an IP for this" by forwarding the question to someone who can actually answer it. On Linux, that's &lt;code&gt;getaddrinfo()&lt;/code&gt;. On macOS it goes through &lt;code&gt;mDNSResponder&lt;/code&gt;. Every DNS query your applications make starts here.&lt;/p&gt;
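&lt;p&gt;The stub resolver is one call away in Python: &lt;code&gt;socket.getaddrinfo()&lt;/code&gt; hands the name to the platform resolver and returns &lt;code&gt;(family, type, proto, canonname, sockaddr)&lt;/code&gt; tuples. A quick look, using &lt;code&gt;localhost&lt;/code&gt; so the answer never leaves the machine:&lt;/p&gt;

```python
import socket

# getaddrinfo() is the stub resolver's front door. For "localhost"
# the answer comes from /etc/hosts, so no network query happens.
results = socket.getaddrinfo("localhost", 443, proto=socket.IPPROTO_TCP)

# Each entry: (family, type, proto, canonname, sockaddr)
addresses = sorted({sockaddr[0] for *_, sockaddr in results})
print(addresses)  # e.g. ['127.0.0.1', '::1']
```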

&lt;p&gt;&lt;strong&gt;The recursive resolver&lt;/strong&gt; (also called a "full-service resolver" or sometimes misleadingly the "DNS server") does the actual work. This is the server your stub resolver talks to. Its job is to walk the DNS tree from the root all the way down to a definitive answer. Your ISP runs one. Google runs one at &lt;code&gt;8.8.8.8&lt;/code&gt;. Cloudflare at &lt;code&gt;1.1.1.1&lt;/code&gt;. Your office probably has one too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The authoritative nameserver&lt;/strong&gt; actually owns the answer. If you bought &lt;code&gt;example.com&lt;/code&gt; and set up your DNS records, you pointed your registrar at some authoritative nameservers. Those servers are the canonical source of truth for your zone. They don't do recursion — they just answer questions about records they own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The root nameservers&lt;/strong&gt; are where recursion starts when a recursive resolver has no cached answer. There are exactly 13 of them by IP — hundreds of physical machines behind those 13 addresses via anycast. Why 13? Because the original DNS protocol used 512-byte UDP packets, and 13 NS records was the maximum that fit 🤷‍♂️. They don't know where &lt;code&gt;api.example.com&lt;/code&gt; is, but they know who handles &lt;code&gt;.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kfu53y2eilvlnoxvxj6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kfu53y2eilvlnoxvxj6.jpeg" alt="DNS resolution chain — four layers of delegation" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Happens When You Type a URL
&lt;/h2&gt;

&lt;p&gt;Let's trace it. You type &lt;code&gt;https://api.example.com/v1/users&lt;/code&gt; and hit Enter.&lt;/p&gt;

&lt;p&gt;Your browser extracts the hostname: &lt;code&gt;api.example.com&lt;/code&gt;. It calls into the OS resolver. Before any network packet leaves your machine, the OS checks three things in order:&lt;/p&gt;

&lt;p&gt;First, &lt;code&gt;/etc/hosts&lt;/code&gt;. This is a flat text file that predates DNS by over a decade. It's checked before anything else, unconditionally. If &lt;code&gt;api.example.com&lt;/code&gt; appears in &lt;code&gt;/etc/hosts&lt;/code&gt;, the search is over — no network query happens at all. This is why adding entries to &lt;code&gt;/etc/hosts&lt;/code&gt; works for local development, and it's also why malware has occasionally modified it to redirect banking sites. It's also why many devs are confused when their DNS changes don't seem to take effect: they have a stale hosts file entry they forgot about from six months ago.&lt;/p&gt;
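&lt;p&gt;The precedence logic is simple enough to sketch: a hosts-file hit short-circuits everything else. The parser below is a toy that works on hosts-format text, not the libc implementation:&lt;/p&gt;

```python
def parse_hosts(text):
    """Toy parser for /etc/hosts-format text into {hostname: ip}."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), ip)
    return table

sample = """
127.0.0.1   localhost
# stale entry from six months ago:
10.0.0.99   api.example.com api
"""

hosts = parse_hosts(sample)
# A hosts hit means the resolver never asks the network at all.
print(hosts.get("api.example.com"))  # 10.0.0.99
```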

&lt;p&gt;Second, the local DNS cache. Your OS, and often a local daemon (systemd-resolved on modern Linux, mDNSResponder on macOS), keeps a cache of recent answers. If the cache has a fresh entry, done.&lt;/p&gt;

&lt;p&gt;Third, and only if neither of those had an answer: a query goes out to the recursive resolver specified in &lt;code&gt;/etc/resolv.conf&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/resolv.conf — the file that decides where your DNS queries go
nameserver 192.168.1.1      # your router, probably
nameserver 8.8.8.8          # fallback: Google
search corp.internal        # try appending this domain to short names
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;nameserver&lt;/code&gt; line is the only thing most developers know about &lt;code&gt;/etc/resolv.conf&lt;/code&gt;. The &lt;code&gt;search&lt;/code&gt; directive is where it gets interesting — and where short hostnames like &lt;code&gt;db&lt;/code&gt; can silently resolve to &lt;code&gt;db.corp.internal&lt;/code&gt;, which is either convenient or baffling depending on the day.&lt;/p&gt;

&lt;p&gt;That's why &lt;code&gt;db&lt;/code&gt; works on your laptop but fails in CI: one has a &lt;code&gt;search corp.internal&lt;/code&gt; entry and the other doesn't.&lt;/p&gt;
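&lt;p&gt;The expansion order for the &lt;code&gt;search&lt;/code&gt; directive, assuming the &lt;code&gt;resolv.conf&lt;/code&gt; above and the default &lt;code&gt;ndots:1&lt;/code&gt;, looks like this:&lt;/p&gt;

```plaintext
db                 → db.corp.internal.   (no dots in the name: search suffix tried first)
                   → db.                 (the literal name, only if the suffix fails)
db.corp.internal.  → db.corp.internal.   (trailing dot: fully qualified, search list skipped)
```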




&lt;h2&gt;
  
  
  The Recursive Resolver Earns Its Name
&lt;/h2&gt;

&lt;p&gt;Your query reaches the recursive resolver. Let's say it's a cold cache — never seen &lt;code&gt;api.example.com&lt;/code&gt; before.&lt;/p&gt;

&lt;p&gt;The resolver starts at the top.&lt;/p&gt;

&lt;p&gt;It queries one of the 13 root nameserver IPs (hardcoded into all resolver software as the "root hints"). The root server doesn't know &lt;code&gt;api.example.com&lt;/code&gt;. It responds with a referral: "I don't know, but &lt;code&gt;.com&lt;/code&gt; is handled by these nameservers."&lt;/p&gt;

&lt;p&gt;The resolver then queries a &lt;code&gt;.com&lt;/code&gt; Top Level Domain (TLD) nameserver. The TLD server doesn't know &lt;code&gt;api.example.com&lt;/code&gt;. It responds with a referral: "I don't know, but &lt;code&gt;example.com&lt;/code&gt; is handled by &lt;em&gt;these&lt;/em&gt; nameservers."&lt;/p&gt;

&lt;p&gt;The resolver then queries an authoritative nameserver for &lt;code&gt;example.com&lt;/code&gt;. This one knows. It returns an A record (IPv4) or AAAA record (IPv6) for &lt;code&gt;api.example.com&lt;/code&gt;, along with a TTL — a "time to live" value in seconds.&lt;/p&gt;

&lt;p&gt;The resolver caches the answer for &lt;code&gt;TTL&lt;/code&gt; seconds and returns it to your stub resolver. Your stub resolver hands it to &lt;code&gt;getaddrinfo()&lt;/code&gt;. Your browser gets an IP. The connection starts.&lt;/p&gt;

&lt;p&gt;That whole chain — root → TLD → authoritative — happened in the background, probably in under 100ms. On a warm cache, it's a single hop and maybe 5ms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query trace for api.example.com:
  → root nameserver (hardcoded IPs)
    ← "ask .com TLD at 192.5.6.30"
  → .com TLD nameserver (192.5.6.30)
    ← "ask ns1.example.com at 93.184.216.10"
  → ns1.example.com (93.184.216.10)
    ← "api.example.com A 198.51.100.42  TTL 300"
  → your stub resolver
    ← 198.51.100.42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three round trips on a completely cold cache; fewer if the resolver already had the &lt;code&gt;.com&lt;/code&gt; or &lt;code&gt;example.com&lt;/code&gt; delegation cached. And critically: the recursive resolver that did all this work is running on someone else's machine, which you do not control, and which has its own cache that it shares with everyone else who uses it.&lt;/p&gt;




&lt;h2&gt;
  
  
  TTL Is Not a Suggestion, It's Also Not a Guarantee
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;TTL&lt;/strong&gt; (Time to Live) is the record's expiry hint. If your A record has &lt;code&gt;TTL 300&lt;/code&gt;, it means "cache this for 300 seconds, then check again."&lt;/p&gt;

&lt;p&gt;Here's what TTL cannot do: it cannot tell resolvers that already have a cached answer to throw it away. When you update a DNS record, the old answer is still valid in every cache that holds it, until their individual TTLs expire. If your TTL was 24 hours (86400 seconds), some resolvers will be serving the old answer for up to 24 hours after your change.&lt;/p&gt;

&lt;p&gt;This is why "just flush DNS" is not a real answer to a propagation problem. You can flush your local machine's cache. You cannot flush Google's cache. You cannot flush your ISP's cache. You cannot flush the cache of the recursive resolver your user's mobile carrier uses.&lt;/p&gt;

&lt;p&gt;What you &lt;em&gt;can&lt;/em&gt; do: lower your TTL well before a migration. If you know you're moving an IP next Tuesday, set your TTL to 60 seconds on Friday. Let the short TTL propagate. Do the migration. The blast radius of stale caches is 60 seconds instead of 24 hours.&lt;/p&gt;

&lt;p&gt;What you &lt;em&gt;cannot&lt;/em&gt; do: change the TTL and have it take effect immediately. The TTL change itself has to propagate, and it propagates at the &lt;em&gt;old&lt;/em&gt; TTL.&lt;/p&gt;

&lt;p&gt;That's right. The new TTL doesn't matter until the old TTL expires and resolvers re-fetch the record. Plan accordingly.&lt;/p&gt;
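&lt;p&gt;Put together, a safe migration sequence looks something like this (illustrative timings):&lt;/p&gt;

```plaintext
Fri 09:00  record has TTL 86400. Lower the TTL to 60.
           (caches holding the old record keep it, and its 86400 TTL, for up to 24h)
Sat 09:00  every cache that re-fetched now carries the 60-second TTL
Tue 10:00  change the A record
Tue 10:01  worst-case staleness is 60 seconds, not 24 hours
```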

&lt;p&gt;That's why your DNS change isn't working yet. The old answer is cached at some resolver with a TTL of 3600 and there are 47 minutes left. Verify by querying the authoritative server directly: &lt;code&gt;dig @ns1.example.com api.example.com&lt;/code&gt; — that bypasses all caches and shows what the authoritative server has right now.&lt;/p&gt;




&lt;h2&gt;
  
  
  CNAME Chains and Why They're Weird
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;A record&lt;/strong&gt; maps a name to an IP. A &lt;strong&gt;CNAME record&lt;/strong&gt; maps a name to another name — a canonical alias.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;api.example.com  CNAME  loadbalancer.us-east-1.elb.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a resolver sees a CNAME, it has to resolve the target too. So &lt;code&gt;api.example.com&lt;/code&gt; → look up &lt;code&gt;loadbalancer.us-east-1.elb.amazonaws.com&lt;/code&gt; → that returns an A record. Two lookups, one query from your perspective.&lt;/p&gt;
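&lt;p&gt;In &lt;code&gt;dig&lt;/code&gt; output you see both steps in a single answer section. An illustrative sketch (the names, TTLs, and documentation-range IP are placeholders):&lt;/p&gt;

```plaintext
;; ANSWER SECTION:
api.example.com.                           300  IN  CNAME  loadbalancer.us-east-1.elb.amazonaws.com.
loadbalancer.us-east-1.elb.amazonaws.com.   60  IN  A      203.0.113.42
```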

&lt;p&gt;CNAMEs are used everywhere — CDNs, load balancers, cloud services — because they let you point a name at another name that the provider controls. When AWS moves your load balancer, they update their A record; your CNAME keeps working.&lt;/p&gt;

&lt;p&gt;The rule everyone forgets: &lt;strong&gt;you cannot have a CNAME at a zone apex.&lt;/strong&gt; The zone apex is the bare domain itself — &lt;code&gt;example.com&lt;/code&gt; with nothing in front of it. Why? Because CNAME has to be the only record for a name (it replaces the name entirely), but the zone apex needs SOA and NS records. You can't have a CNAME and also have NS records. The DNS spec doesn't allow it.&lt;/p&gt;

&lt;p&gt;This is why CDN and DNS providers invented &lt;strong&gt;CNAME flattening&lt;/strong&gt; (Cloudflare's term; Route 53 calls its equivalent alias records). When you point &lt;code&gt;example.com&lt;/code&gt; at &lt;code&gt;example.com.cdn.cloudflare.net&lt;/code&gt;, the provider does the CNAME lookup at query time and returns a flat A record to the client. From the outside, it looks like an A record. It's not. It's a CNAME that your DNS provider is silently expanding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsynjhb21lk9badtlsoxp.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsynjhb21lk9badtlsoxp.jpeg" alt="CNAME resolution — normal vs flattened" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This matters when you're debugging. If &lt;code&gt;dig example.com A&lt;/code&gt; returns an IP directly but you know you set up a CNAME at the root, the flattening is working. If it returns a CNAME at the apex, something's wrong with your provider config. To the application layer, a flattened record and a plain A record look identical.&lt;/p&gt;

&lt;p&gt;That's why you can't put a CNAME on &lt;code&gt;example.com&lt;/code&gt; itself — CNAME semantics conflict with the SOA and NS records that every zone apex must have. Your DNS provider works around this with record flattening, which looks like an A record to the outside world.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why &lt;code&gt;dig&lt;/code&gt; and &lt;code&gt;curl&lt;/code&gt; Give Different Answers
&lt;/h2&gt;

&lt;p&gt;Back to my Docker debugging story. &lt;code&gt;curl api.internal.corp&lt;/code&gt; returned a different IP than &lt;code&gt;dig api.internal.corp&lt;/code&gt;. How?&lt;/p&gt;

&lt;p&gt;Because they use different resolution paths.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dig&lt;/code&gt; is a DNS tool. It talks directly to a DNS resolver — by default, whatever is in &lt;code&gt;/etc/resolv.conf&lt;/code&gt;, or you can specify one with &lt;code&gt;@&lt;/code&gt;. It bypasses the OS resolver entirely, bypasses the local cache, and makes a raw DNS query.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;curl&lt;/code&gt; uses &lt;code&gt;getaddrinfo()&lt;/code&gt;. That function goes through the full OS name resolution stack, including &lt;code&gt;/etc/nsswitch.conf&lt;/code&gt; — the OS's routing table for name resolution, a priority-ordered list of where to look. On a typical Linux machine it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/nsswitch.conf
hosts:          files dns myhostname
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;files&lt;/code&gt; entry means &lt;code&gt;/etc/hosts&lt;/code&gt; runs first — and &lt;code&gt;curl&lt;/code&gt; reads it, while &lt;code&gt;dig&lt;/code&gt; does not.&lt;/p&gt;
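&lt;p&gt;You can see the split directly. &lt;code&gt;getent hosts&lt;/code&gt; resolves through the same &lt;code&gt;nsswitch.conf&lt;/code&gt; path that &lt;code&gt;getaddrinfo()&lt;/code&gt; uses, so &lt;code&gt;/etc/hosts&lt;/code&gt; entries show up in its answers; &lt;code&gt;dig&lt;/code&gt; never reads that file. A quick sketch (Linux only):&lt;/p&gt;

```shell
# Resolves via nsswitch.conf: "files" runs first, so this answer
# comes straight from /etc/hosts, with no DNS query at all.
getent hosts localhost

# A dig for the same name would send a real DNS query instead,
# and would never see anything you put in /etc/hosts.
```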

&lt;p&gt;Inside Docker, it gets more interesting. Docker injects its own nameserver into the container's &lt;code&gt;/etc/resolv.conf&lt;/code&gt;, pointing at Docker's internal resolver at &lt;code&gt;127.0.0.11&lt;/code&gt;. That resolver handles Docker network DNS (container names, service names). It may return different answers for internal names than an external DNS server would. &lt;code&gt;dig&lt;/code&gt; run without arguments still reads &lt;code&gt;/etc/resolv.conf&lt;/code&gt; — so inside the container, &lt;code&gt;dig api.internal.corp&lt;/code&gt; was querying Docker's resolver, not the corporate DNS. And Docker's resolver didn't know about the internal service.&lt;/p&gt;

&lt;p&gt;The rule: &lt;strong&gt;&lt;code&gt;dig&lt;/code&gt; shows you what a DNS query returns. &lt;code&gt;curl&lt;/code&gt; shows you what the application stack resolves.&lt;/strong&gt; They are not always the same query against the same server.&lt;/p&gt;

&lt;p&gt;When they disagree, the question is: which one matches what your application uses? Usually it's the &lt;code&gt;curl&lt;/code&gt; path, because your application also calls &lt;code&gt;getaddrinfo()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's why &lt;code&gt;dig&lt;/code&gt; and &lt;code&gt;curl&lt;/code&gt; gave different answers inside that Docker container: the two tools weren't asking the same resolver, and only one resolution path knew about the internal name.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Trust Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;DNS was designed in 1983. The original protocol has no cryptographic authentication. A resolver asks a question; an authoritative server answers. There's nothing in the original design that proves the answer came from the real authoritative server.&lt;/p&gt;

&lt;p&gt;This isn't a theoretical concern. &lt;strong&gt;DNS spoofing&lt;/strong&gt; and &lt;strong&gt;cache poisoning&lt;/strong&gt; are real attacks. An attacker who can intercept or forge DNS responses can redirect any hostname to any IP — transparently, with no visible error to the user.&lt;/p&gt;

&lt;p&gt;The fix is &lt;strong&gt;DNSSEC&lt;/strong&gt; — DNS Security Extensions. DNSSEC adds cryptographic signatures to DNS records. Authoritative servers sign their records with a private key; validators check those signatures against a public key published in the parent zone. The chain of trust runs from the root all the way down.&lt;/p&gt;

&lt;p&gt;DNSSEC deployment is... fragmented. The root zone is signed. Many TLDs are signed. Many individual domains are not. And critically, many recursive resolvers don't validate DNSSEC even if the zone is signed — they just pass the signatures along. Validation has to happen somewhere for it to matter, and the chain has a lot of links.&lt;/p&gt;

&lt;p&gt;This is why a zone sometimes refuses to resolve entirely (a &lt;code&gt;SERVFAIL&lt;/code&gt; from validating resolvers) even when you're confident the records are right: someone in the delegation chain has a misconfigured or expired signing key, and validators reject the answer rather than serve it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DoT&lt;/strong&gt; (DNS over TLS) and &lt;strong&gt;DoH&lt;/strong&gt; (DNS over HTTPS) are a different layer of protection. They encrypt the query in transit — so your ISP can't see what you're looking up, and a network attacker can't intercept the packet. But they don't solve the authoritative trust problem. You're still trusting the resolver at the other end.&lt;/p&gt;

&lt;p&gt;The honest summary: DNS's trust model is "trust whoever answers." DNSSEC tries to fix that. Its deployment is patchy. DoT/DoH protect the wire, not the answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Debugging Toolkit
&lt;/h2&gt;

&lt;p&gt;When DNS is misbehaving, you want to query different layers explicitly rather than guessing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What does the authoritative server say right now?&lt;/span&gt;
dig @ns1.example.com api.example.com A

&lt;span class="c"&gt;# What does your configured resolver return (including its cache)?&lt;/span&gt;
dig api.example.com A

&lt;span class="c"&gt;# What would an uncached query to a specific public resolver return?&lt;/span&gt;
dig @8.8.8.8 api.example.com A

&lt;span class="c"&gt;# Trace the full recursive delegation chain&lt;/span&gt;
dig +trace api.example.com A

&lt;span class="c"&gt;# Show what getaddrinfo() would return (follows /etc/hosts, nsswitch.conf)&lt;/span&gt;
&lt;span class="c"&gt;# getent is part of glibc-utils on Debian/Ubuntu; not available on macOS&lt;/span&gt;
getent hosts api.example.com

&lt;span class="c"&gt;# Check your resolver configuration&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /etc/resolv.conf
resolvectl status    &lt;span class="c"&gt;# systemd-resolved environments&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;+trace&lt;/code&gt; flag on &lt;code&gt;dig&lt;/code&gt; is particularly useful when something is broken in the delegation chain itself — wrong glue records, stale DS records, missing NS entries. It shows you every hop.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;dig @authoritative&lt;/code&gt; returns the right answer but your application gets something different, the problem is between your application and that authoritative server. Work backwards: your OS cache, your container's resolver, your corporate split-horizon DNS, your VPN's DNS override.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Thing Worth Holding Onto
&lt;/h2&gt;

&lt;p&gt;I have a love-hate relationship with DNS. It's 40 years old, it was designed when the internet was a few hundred hosts, the trust model was "everyone on the network is trustworthy," and the failure modes of a globally distributed cache were... well, not fully thought through.&lt;/p&gt;

&lt;p&gt;And yet it works. Hundreds of billions of queries a day, run by competing organisations with no central coordinator, and it almost never falls over. When it does fail, the failures are the worst kind: subtle. An answer that looks valid but is stale. A resolution path that bypasses the record you just updated. A CNAME chain that doesn't behave the way you modelled it. You don't get an error. You get the wrong IP, served confidently. Oh dear, maybe LLMs learned from DNS!&lt;/p&gt;

&lt;p&gt;Every DNS debugging session I've ever had ended the same way: I wasn't querying the layer I thought I was querying. The answer is always cached somewhere you forgot to check. And you do this so rarely you basically have to re-teach yourself the DNS stack each time. &lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://linux.die.net/man/1/dig" rel="noopener noreferrer"&gt;&lt;code&gt;man 1 dig&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — More flags than you'll ever need, but &lt;code&gt;+trace&lt;/code&gt;, &lt;code&gt;+short&lt;/code&gt;, and &lt;code&gt;@server&lt;/code&gt; will cover 90% of debugging sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.oreilly.com/library/view/dns-and-bind/0596100574/" rel="noopener noreferrer"&gt;DNS and BIND, 5th ed.&lt;/a&gt;&lt;/strong&gt; — The definitive reference. Dense. Worth having on a shelf.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://howdns.works/" rel="noopener noreferrer"&gt;How DNS Works (howdns.works)&lt;/a&gt;&lt;/strong&gt; — Comic-style visual walkthrough of the resolution chain. Good for sharing with someone who's new to it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.cloudflare.com/learning/dns/what-is-dns/" rel="noopener noreferrer"&gt;Cloudflare's DNS Learning Center&lt;/a&gt;&lt;/strong&gt; — Technically accurate, well-illustrated, and they have a vested interest in you understanding DNS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc1034" rel="noopener noreferrer"&gt;RFC 1034&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc1035" rel="noopener noreferrer"&gt;RFC 1035&lt;/a&gt;&lt;/strong&gt; — The original 1987 DNS specs. Remarkably readable for an RFC. Much of what's described here is in those two documents.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri once spent four hours debugging a DNS issue that turned out to be a single line he put into /etc/hosts in 2021. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Your Process Doesn't Exist Alone</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:02:42 +0000</pubDate>
      <link>https://forem.com/nazq/your-process-doesnt-exist-alone-1pnk</link>
      <guid>https://forem.com/nazq/your-process-doesnt-exist-alone-1pnk</guid>
      <description>&lt;h1&gt;
  
  
  Your Process Doesn't Exist Alone
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Sessions, Process Groups, and Why Ctrl-C Kills the Right Thing
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You pressed Ctrl-C and the program stopped. Exactly the right program. Not its parent. Not the shell you typed from. The one you were running.&lt;/p&gt;

&lt;p&gt;That probably felt unremarkable. It shouldn't. The kernel had to figure out which processes — out of a structured hierarchy on your machine — deserved that signal. It got it right every time you've ever tried. There is infrastructure specifically designed to make that work, and it involves three layers of grouping you've never had to think about.&lt;/p&gt;

&lt;p&gt;Let's look at what's actually happening.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Ctrl-C Has to Solve
&lt;/h2&gt;

&lt;p&gt;Here's a scenario. You type this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar &lt;/span&gt;czf archive.tgz big_directory/ | pv | gpg &lt;span class="nt"&gt;--encrypt&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; archive.tgz.gpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three processes. A pipeline. You get bored waiting and press Ctrl-C.&lt;/p&gt;

&lt;p&gt;Which one should die? All three. They're one operation from your perspective. The terminal agrees.&lt;/p&gt;

&lt;p&gt;Consider: you've spawned those three processes from a bash shell. Bash itself is running. Maybe there are other background jobs. Maybe there's a long-running process you started earlier with &lt;code&gt;&amp;amp;&lt;/code&gt;. When you press Ctrl-C, none of those should die.&lt;/p&gt;

&lt;p&gt;The kernel needs a way to know which processes are "the foreground thing you're doing right now" and which are everything else. The mechanism it uses is the &lt;strong&gt;process group&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Process Groups: The First Layer
&lt;/h2&gt;

&lt;p&gt;Every process belongs to a &lt;strong&gt;process group&lt;/strong&gt;, identified by a &lt;strong&gt;PGID&lt;/strong&gt; (process group ID). When you run a pipeline like &lt;code&gt;tar | pv | gpg&lt;/code&gt;, the shell puts all three into the same process group. When you run a single command, that command becomes a process group of one.&lt;/p&gt;

&lt;p&gt;The PGID is inherited through &lt;code&gt;fork()&lt;/code&gt;. When a process forks, the child starts in the same process group as the parent. To create a new process group, a process calls &lt;code&gt;setpgid()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A pipeline works like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Shell (PGID=1000)
    │
    ├─ fork() → tar   (PGID=1001, also leader since tar's PID=1001)
    ├─ fork() → pv    (PGID=1001, setpgid'd to join tar's group)
    └─ fork() → gpg   (PGID=1001, setpgid'd to join tar's group)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shell then tells the terminal: "the foreground process group is now 1001." When you press Ctrl-C, the kernel sees that sequence, generates &lt;a href="https://nazquadri.dev/blog/the-layer-below/05-signals/" rel="noopener noreferrer"&gt;&lt;code&gt;SIGINT&lt;/code&gt;&lt;/a&gt;, and delivers it to every process in group 1001. All three die. The shell — in group 1000 — is unaffected.&lt;/p&gt;

&lt;p&gt;That's why Ctrl-C kills the right thing. And &lt;strong&gt;that's why Ctrl-C sometimes doesn't work&lt;/strong&gt; — if a program installs a &lt;code&gt;SIGINT&lt;/code&gt; handler and doesn't propagate the signal to its children, or puts its children in a different process group, Ctrl-C kills the parent but leaves the children running. You've seen this: the prompt comes back, but there are still processes chewing through CPU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff62g5cgo09xymxrpdh8o.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff62g5cgo09xymxrpdh8o.jpeg" alt="Process groups and the terminal — SIGINT hits the pipeline, not the shell" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's one process in each group that the kernel considers the &lt;strong&gt;group leader&lt;/strong&gt; — the one whose PID equals the PGID. For a pipeline, bash usually makes the first process in the pipeline the leader. If the leader dies, the group doesn't disappear; the other members keep running. It's just a label.&lt;/p&gt;
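&lt;p&gt;You can inspect all three IDs with &lt;code&gt;ps&lt;/code&gt;. A minimal check on the current shell (for a typical interactive shell, PID, PGID, and SID are all equal, because it leads its own group and session):&lt;/p&gt;

```shell
# Print the shell's PID, process group ID, and session ID.
# Run a pipeline from an interactive shell and inspect its members
# the same way: each stage shares one PGID, distinct from the shell's.
ps -o pid,pgid,sid,comm -p $$
```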




&lt;h2&gt;
  
  
  Sessions: The Second Layer
&lt;/h2&gt;

&lt;p&gt;Process groups answer "which processes are this foreground job." But there's a bigger question: which foreground job is &lt;em&gt;active right now&lt;/em&gt;, and who owns the terminal?&lt;/p&gt;

&lt;p&gt;That's where &lt;strong&gt;sessions&lt;/strong&gt; come in.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;session&lt;/strong&gt; is a collection of process groups. All the process groups you create during a login session belong to the same session. When you open a terminal and start typing, every command you run — every pipeline, every background job, everything — is in the same session.&lt;/p&gt;

&lt;p&gt;Sessions have an ID too: the &lt;strong&gt;SID&lt;/strong&gt;. The first process to call &lt;code&gt;setsid()&lt;/code&gt; creates a new session and becomes the &lt;strong&gt;session leader&lt;/strong&gt;. Its PID becomes the SID.&lt;/p&gt;

&lt;p&gt;The session has one more piece: the &lt;strong&gt;controlling terminal&lt;/strong&gt;. This is the terminal device that delivers job control signals — &lt;code&gt;SIGINT&lt;/code&gt; from Ctrl-C, &lt;code&gt;SIGTSTP&lt;/code&gt; from Ctrl-Z, &lt;code&gt;SIGHUP&lt;/code&gt; when the terminal closes. Exactly one session owns a given terminal at a time, and the terminal knows which process group is in the foreground.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session (SID=999, controlling terminal: /dev/pts/3)
    │
    ├── Process Group 999 (shell — the session leader's group)
    │       └── bash (PID=999, PGID=999)
    │
    ├── Process Group 1001 (foreground: tar | pv | gpg)
    │       ├── tar
    │       ├── pv
    │       └── gpg
    │
    └── Process Group 1002 (background job: long-running-thing &amp;amp;)
            └── long-running-thing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The terminal keeps a pointer to the &lt;strong&gt;foreground process group&lt;/strong&gt;. &lt;code&gt;SIGINT&lt;/code&gt; goes there. &lt;code&gt;SIGTSTP&lt;/code&gt; goes there. When you run a command in the foreground, the shell calls &lt;code&gt;tcsetpgrp()&lt;/code&gt; to tell the terminal which group has the focus. When the command finishes, the shell takes it back.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;ps -ej&lt;/code&gt; and look at the SID and PGID columns. Every process is accounted for — including daemons, which show &lt;code&gt;?&lt;/code&gt; in the TTY column because they have no controlling terminal.&lt;/p&gt;
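&lt;p&gt;Against the session diagram above, &lt;code&gt;ps -ej&lt;/code&gt; output would look something like this (illustrative):&lt;/p&gt;

```plaintext
  PID  PGID   SID TTY      TIME     CMD
  999   999   999 pts/3    00:00:00 bash
 1001  1001   999 pts/3    00:00:04 tar
 1002  1001   999 pts/3    00:00:01 pv
 1003  1001   999 pts/3    00:00:00 gpg
 1450  1450  1450 ?        00:00:00 some-daemon    ← no controlling terminal
```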




&lt;h2&gt;
  
  
  The Controlling Terminal: What It Actually Is
&lt;/h2&gt;

&lt;p&gt;"Controlling terminal" sounds abstract. It's not.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;controlling terminal&lt;/strong&gt; is a file descriptor — specifically a TTY or PTY device — that a session is attached to. It's the device that knows how to translate "the user pressed Ctrl-C" into a signal, and "the user pressed Ctrl-Z" into a different signal.&lt;/p&gt;

&lt;p&gt;When a session leader opens a TTY device for the first time, that TTY becomes the controlling terminal for the session. The kernel records this association in both directions: the TTY knows its session, and the session knows its TTY.&lt;/p&gt;

&lt;p&gt;Here's what "controlling terminal" buys you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Job control signals.&lt;/strong&gt; Ctrl-C, Ctrl-Z, Ctrl-\ all generate signals via the controlling terminal's line discipline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIGHUP on terminal close.&lt;/strong&gt; When the controlling terminal's last master side is closed, the kernel sends &lt;code&gt;SIGHUP&lt;/code&gt; to the session leader and the foreground process group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background I/O protection.&lt;/strong&gt; If a background process tries to read from or write to the controlling terminal without being in the foreground group, it gets &lt;code&gt;SIGTTIN&lt;/code&gt; or &lt;code&gt;SIGTTOU&lt;/code&gt;. The process stops, you see "Stopped" in the shell, and you have to foreground it. Programs that reach for &lt;code&gt;/dev/tty&lt;/code&gt; directly (rather than stdout) are explicitly targeting the controlling terminal — and the kernel enforces access control on it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b9tf3vxbk17eu56dqyy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b9tf3vxbk17eu56dqyy.jpeg" alt="PTY device as the hub — sessions, process groups, and signals" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;nohup&lt;/code&gt; and &lt;code&gt;disown&lt;/code&gt;: Why They Exist
&lt;/h2&gt;

&lt;p&gt;Two commands make more sense once you know what a session is.&lt;/p&gt;

&lt;p&gt;When your SSH connection drops, the terminal closes. The kernel notices: the master end of the PTY is gone. It sends &lt;code&gt;SIGHUP&lt;/code&gt; to the session leader and the foreground process group. If your long-running job is in that session, it gets &lt;code&gt;SIGHUP&lt;/code&gt;. The default action for &lt;code&gt;SIGHUP&lt;/code&gt; is: die.&lt;/p&gt;

&lt;p&gt;This is not a bug. It's the intended behavior. The terminal is gone, so the processes that depended on it should clean up. The problem is that sometimes you &lt;em&gt;want&lt;/em&gt; a process to keep running after you log out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;nohup&lt;/code&gt;&lt;/strong&gt; solves this by making the child process ignore &lt;code&gt;SIGHUP&lt;/code&gt; before exec. It also redirects stdout and stderr to &lt;code&gt;nohup.out&lt;/code&gt; since the terminal won't exist to receive output. The process is still in your session, still has the same controlling terminal, it just won't die when &lt;code&gt;SIGHUP&lt;/code&gt; arrives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# nohup sets SIGHUP to SIG_IGN before exec, and redirects output&lt;/span&gt;
&lt;span class="nb"&gt;nohup &lt;/span&gt;long-running-job &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;disown&lt;/code&gt;&lt;/strong&gt; is a bash shell builtin that removes the job from the shell's job table. When the &lt;em&gt;terminal closes&lt;/em&gt; and the shell receives SIGHUP, it re-delivers SIGHUP to its job groups. &lt;code&gt;disown&lt;/code&gt; removes the job from that list, so the shell won't SIGHUP it on terminal close.&lt;/p&gt;

&lt;p&gt;Neither of these is a clean solution. The process is still in the session. &lt;code&gt;disown&lt;/code&gt; doesn't change the signal disposition at all, so if the kernel's own terminal-close &lt;code&gt;SIGHUP&lt;/code&gt; reaches the process directly (say, it's in the foreground process group when the terminal dies), it still dies. And in both cases the process keeps the closed terminal as its controlling terminal, which causes problems if it ever tries to read from stdin.&lt;/p&gt;

&lt;p&gt;These are bandaids. The real solution is what comes next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why &lt;code&gt;ssh remote-host some-command &amp;amp;&lt;/code&gt; is dangerous.&lt;/strong&gt; The &lt;code&gt;some-command&lt;/code&gt; process starts on the remote host inside your SSH session. When your SSH connection drops, &lt;code&gt;SIGHUP&lt;/code&gt; fires. If &lt;code&gt;some-command&lt;/code&gt; doesn't handle &lt;code&gt;SIGHUP&lt;/code&gt;, it dies. The solution is either &lt;code&gt;nohup&lt;/code&gt;, or — better — put it inside a tmux session on the remote host.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;setsid()&lt;/code&gt;: Cutting the Cord
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;real&lt;/em&gt; way to detach a process from a terminal is &lt;code&gt;setsid()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;setsid()&lt;/code&gt; creates a new session. The calling process becomes the session leader of a brand new, empty session. This new session has no controlling terminal — and importantly, a session only gets a controlling terminal when the session leader explicitly opens a TTY. Without &lt;code&gt;O_NOCTTY&lt;/code&gt;, opening a TTY device gives you a controlling terminal whether you want one or not. So if you never open a TTY, you never get one. No controlling terminal means no &lt;code&gt;SIGHUP&lt;/code&gt;, no job control signals, no &lt;code&gt;SIGTTIN&lt;/code&gt;/&lt;code&gt;SIGTTOU&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One constraint: &lt;code&gt;setsid()&lt;/code&gt; fails if the calling process is already a process group leader. This is why the daemonization recipe forks first — the child is not a group leader, so &lt;code&gt;setsid()&lt;/code&gt; succeeds.&lt;/p&gt;

&lt;p&gt;This is what daemonization does. The classic daemon recipe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Parent forks, then exits — child is now orphaned&lt;/span&gt;
&lt;span class="c1"&gt;// (child is not a process group leader, so setsid() will work)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Child calls setsid() — new session, no controlling terminal&lt;/span&gt;
&lt;span class="n"&gt;setsid&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Fork again so we're NOT the session leader&lt;/span&gt;
&lt;span class="c1"&gt;// (without O_NOCTTY, opening a TTY would give us a controlling terminal)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Now we're a non-session-leader process in a session with no&lt;/span&gt;
&lt;span class="c1"&gt;// controlling terminal. SIGHUP cannot reach us through the terminal.&lt;/span&gt;
&lt;span class="c1"&gt;// We are truly detached.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The double-fork: by forking a second time, we become a child of the session leader — still in the new session, but not the leader. A non-leader can never become a session leader, so it can never acquire a controlling terminal by opening a TTY.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why daemonization requires a double fork.&lt;/strong&gt; You can't get to truly "no controlling terminal" in a single step if you're a session leader. The second fork is what closes that loophole.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;setsid()&lt;/code&gt; is also what PTY supervisors call in the child process before exec. When a supervisor spawns a child with a PTY, the child calls &lt;code&gt;setsid()&lt;/code&gt; first, then opens the PTY slave. That &lt;code&gt;open()&lt;/code&gt; call is what gives the child its controlling terminal — the PTY slave. This is intentional. We &lt;em&gt;want&lt;/em&gt; the child to have a controlling terminal. We just want that terminal to be the PTY we control, not whatever terminal the supervisor was launched from.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why tmux and screen Survive Terminal Closure
&lt;/h2&gt;

&lt;p&gt;Now all the pieces are in place to understand something that confused most of us for years.&lt;/p&gt;

&lt;p&gt;You connect to a server over SSH. You start &lt;code&gt;tmux&lt;/code&gt;. You run a bunch of stuff inside tmux. Your SSH connection drops. You reconnect and attach to the same tmux session. Everything is still there.&lt;/p&gt;

&lt;p&gt;How?&lt;/p&gt;

&lt;p&gt;When you start tmux, it forks a &lt;strong&gt;tmux server&lt;/strong&gt;. That server calls &lt;code&gt;setsid()&lt;/code&gt;. It is now the session leader of its own session with no controlling terminal. It is not attached to your SSH terminal at all. When your SSH connection dies, &lt;code&gt;SIGHUP&lt;/code&gt; goes to the remote login shell's session and the processes in it. The tmux server is not in that session. It's untouched.&lt;/p&gt;

&lt;p&gt;The tmux server holds the master ends of PTYs for all your "windows." Your shells and programs run on the slave ends. Those PTY slaves are controlling terminals for &lt;em&gt;their&lt;/em&gt; sessions — not for yours.&lt;/p&gt;

&lt;p&gt;When you reconnect and run &lt;code&gt;tmux attach&lt;/code&gt;, a new tmux client process starts. It connects to the tmux server over a Unix socket. The server starts forwarding the PTY master output to the new client, and the client starts forwarding your keystrokes to the server. From your perspective, it looks like you reconnected. From the shell inside tmux's perspective, absolutely nothing changed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your SSH session (terminal = /dev/pts/0)
    │
    └── tmux client ─────────Unix socket────► tmux server (setsid, no ctrl terminal)
                                                    │
                                              PTY master ─────► /dev/pts/7
                                                                      │
                                               bash (ctrl terminal = /dev/pts/7)
                                                                      │
                                               your stuff, running fine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When your SSH terminal closes, the left side of this diagram disappears. The right side doesn't care.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why the right workflow on a remote server is &lt;code&gt;tmux&lt;/code&gt; first, then work inside tmux&lt;/strong&gt; — not &lt;code&gt;nohup&lt;/code&gt; or &lt;code&gt;disown&lt;/code&gt;. The session architecture guarantees survival. The workarounds don't. tmux isn't magic. It's &lt;code&gt;setsid()&lt;/code&gt; and a Unix socket.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Signal Routing Map
&lt;/h2&gt;

&lt;p&gt;Let's put the whole picture together. When you press a key combination in your terminal:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ctrl-C&lt;/strong&gt; → The terminal line discipline intercepts it. It sends &lt;code&gt;SIGINT&lt;/code&gt; to the &lt;strong&gt;foreground process group&lt;/strong&gt; of the terminal's session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ctrl-Z&lt;/strong&gt; → The terminal line discipline intercepts it. It sends &lt;code&gt;SIGTSTP&lt;/code&gt; to the &lt;strong&gt;foreground process group&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;Ctrl-\*&lt;/em&gt; → Same routing, different signal: &lt;code&gt;SIGQUIT&lt;/code&gt;. Kills with a core dump.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal close&lt;/strong&gt; → The kernel sends &lt;code&gt;SIGHUP&lt;/code&gt; to the &lt;strong&gt;session leader&lt;/strong&gt; and the &lt;strong&gt;foreground process group&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background process reads from terminal&lt;/strong&gt; → The kernel sends &lt;code&gt;SIGTTIN&lt;/code&gt; to that process's &lt;strong&gt;process group&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsull4katpjb7l4re2it.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsull4katpjb7l4re2it.jpeg" alt="Signal delivery reference — keys, signals, and targets" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice what's absent: there is no "kill just this one process from the terminal" mechanism. &lt;strong&gt;That's why daemons show &lt;code&gt;?&lt;/code&gt; in the TTY column of &lt;code&gt;ps aux&lt;/code&gt;&lt;/strong&gt; — a daemon has no controlling terminal, and &lt;code&gt;ps&lt;/code&gt; displays that as &lt;code&gt;?&lt;/code&gt;. The mysterious &lt;code&gt;?&lt;/code&gt; that used to look like noise is a direct signal: this process is properly detached. The terminal deals in groups. If you want to kill exactly one process, you use &lt;code&gt;kill(pid, signal)&lt;/code&gt; directly — by PID. The terminal doesn't do individual targeting.&lt;/p&gt;

&lt;p&gt;This is also why &lt;code&gt;kill "$pid"&lt;/code&gt; and &lt;code&gt;kill -- "-$pid"&lt;/code&gt; behave so differently — the plain PID signals one process, while the negated PID signals that process's entire group. And &lt;code&gt;jobs -p&lt;/code&gt; prints the PID of each job's process group &lt;em&gt;leader&lt;/em&gt;, so &lt;code&gt;kill $(jobs -p)&lt;/code&gt; hits only the leaders; to take down whole pipelines you need the negated PGIDs.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means for Supervising Processes
&lt;/h2&gt;

&lt;p&gt;A PTY supervisor is a process that holds the master end of a PTY and manages a child process's terminal lifecycle. When it forks the child, the child calls &lt;code&gt;setsid()&lt;/code&gt;. This creates a new session. The child then opens the PTY slave — which becomes its controlling terminal. The child and everything it forks lives in that session. Job control signals go to that session's process groups. &lt;code&gt;SIGHUP&lt;/code&gt; will go to that session if the supervisor closes the PTY master — and we can use that deliberately to tell the supervised process to clean up.&lt;/p&gt;

&lt;p&gt;The supervisor itself is in &lt;em&gt;its own&lt;/em&gt; session, not the child's. The child's terminal lifecycle is entirely under the supervisor's control.&lt;/p&gt;

&lt;p&gt;This is exactly why &lt;code&gt;nohup&lt;/code&gt; and &lt;code&gt;disown&lt;/code&gt; feel janky — they're trying to get the benefits of this architecture without actually building it. They leave the process in the wrong session and just ask it to ignore signals. The proper solution is to put the process in its own session from the start, owned by a supervisor that manages its lifecycle explicitly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Every process belongs to a &lt;strong&gt;process group&lt;/strong&gt; (PGID). Shells put pipeline members in the same group.&lt;/li&gt;
&lt;li&gt;Ctrl-C and Ctrl-Z deliver signals to the &lt;strong&gt;foreground process group&lt;/strong&gt;, not just one process.&lt;/li&gt;
&lt;li&gt;Every process group belongs to a &lt;strong&gt;session&lt;/strong&gt; (SID). A terminal is the &lt;strong&gt;controlling terminal&lt;/strong&gt; of exactly one session.&lt;/li&gt;
&lt;li&gt;When a terminal closes, &lt;code&gt;SIGHUP&lt;/code&gt; goes to the session leader and foreground process group.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nohup&lt;/code&gt; ignores &lt;code&gt;SIGHUP&lt;/code&gt;. &lt;code&gt;disown&lt;/code&gt; removes the job from the shell's cleanup list. Both are workarounds.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;setsid()&lt;/code&gt; creates a new session with no controlling terminal. This is how daemons, &lt;code&gt;tmux&lt;/code&gt;, and PTY supervisors properly detach.&lt;/li&gt;
&lt;li&gt;tmux survives disconnection because the server calls &lt;code&gt;setsid()&lt;/code&gt; at startup and communicates via a Unix socket — it was never in your terminal's session.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/setsid.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 setsid&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/setpgid.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 setpgid&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man3/tcsetpgrp.3.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 3 tcsetpgrp&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the system calls that manage all of this directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man7/credentials.7.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 7 credentials&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — Linux's full documentation on PID, PGID, SID and how they interact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man1/ps.1.html" rel="noopener noreferrer"&gt;&lt;code&gt;ps -ej&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — run this in a terminal with a few jobs running and watch the SID/PGID columns. Everything above becomes concrete immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.linusakesson.net/programming/tty/" rel="noopener noreferrer"&gt;The TTY Demystified&lt;/a&gt;&lt;/strong&gt; — Linus Akesson's deep dive, already recommended in earlier posts. The section on job control is directly relevant here.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri once killed a production job by closing his laptop. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Signals: The Kernel's Text Messages</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:02:26 +0000</pubDate>
      <link>https://forem.com/nazq/signals-the-kernels-text-messages-27f</link>
      <guid>https://forem.com/nazq/signals-the-kernels-text-messages-27f</guid>
      <description>&lt;h1&gt;
  
  
  Signals: The Kernel's Text Messages
&lt;/h1&gt;

&lt;h2&gt;
  
  
  kill -9 Isn't What You Think It Is
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~10 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You've been saying "force kill" for years. You type &lt;code&gt;kill -9 1234&lt;/code&gt; when a process won't die, and you picture the operating system reaching in with a fist and crushing it.&lt;/p&gt;

&lt;p&gt;That's not what happens. What happens is the kernel sends the process a message. The message contains exactly one piece of information: the number 9.&lt;/p&gt;

&lt;p&gt;That's it. A number. The process gets a signal, and the signal is SIGKILL — signal 9. For a running process, termination is essentially immediate. But there's a case where even SIGKILL can't immediately kill a process: &lt;strong&gt;uninterruptible sleep&lt;/strong&gt; — when a process is blocked in kernel code that cannot be safely interrupted. That delay, and what causes it, turns out to matter enormously.&lt;/p&gt;

&lt;p&gt;Signals are one of the oldest IPC mechanisms in Unix — older than sockets, older than most of the other things you'd reach for when you want two processes to communicate. They're asynchronous, they arrive at arbitrary times, they can be caught or ignored or blocked, and some of them mean three different things depending on context. They're worth understanding.&lt;/p&gt;

&lt;p&gt;Let's look at the machinery.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a Signal Actually Is
&lt;/h2&gt;

&lt;p&gt;A signal is not a byte written to a pipe. It's not a network packet. It's a tiny notification the kernel delivers to a process, completely out of band from whatever the process is currently doing.&lt;/p&gt;

&lt;p&gt;When a signal arrives, the kernel interrupts the process mid-execution — whatever instruction it was running — and delivers the signal. The process then does one of three things, depending on how it has configured its &lt;strong&gt;signal disposition&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Default action.&lt;/strong&gt; The kernel handles it. Most signals kill the process; some stop it; a few are ignored by default. The process doesn't have a say.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catch it.&lt;/strong&gt; The process has installed a signal handler — a function. The kernel calls that function instead. When the handler returns, the process resumes whatever it was doing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignore it.&lt;/strong&gt; The process has said "I don't care about this signal." The kernel checks, shrugs, and moves on.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvey9zzzemuxvt9ie6oh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvey9zzzemuxvt9ie6oh.jpeg" alt="Signal disposition — default, caught, ignored" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The disposition model is per-process, per-signal. A process can catch SIGTERM, ignore SIGUSR1, and take the default for everything else. Most processes don't configure most signals — they just take the defaults. Think back to how many times you've intentionally installed a signal handler: rarely, and usually for something idiosyncratic. That's been my experience, at least.&lt;/p&gt;

&lt;p&gt;And two signals can never be caught or ignored: &lt;strong&gt;SIGKILL (9)&lt;/strong&gt; and &lt;strong&gt;SIGSTOP (19)&lt;/strong&gt;. The kernel handles them unconditionally. No signal handler, no &lt;code&gt;SIG_IGN&lt;/code&gt;, no blocking mask. The process cannot intercept these — but as you'll see below, that doesn't mean SIGKILL takes effect instantly in every case.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Signals You Already Know (And What's Actually Happening)
&lt;/h2&gt;

&lt;p&gt;Let's start with the familiar ones and fill in the gaps.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGTERM (15)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kill &amp;lt;pid&amp;gt;&lt;/code&gt; with no flag sends SIGTERM. This is the polite version. The default action is termination, but the process can catch it — and well-behaved servers do. They use the SIGTERM handler to flush logs, close database connections, finish in-flight requests, and exit cleanly.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;systemd&lt;/code&gt; stops a service, it sends SIGTERM first. Waits a few seconds. Then, if the process is still running, sends SIGKILL. That's the &lt;code&gt;TimeoutStopSec&lt;/code&gt; setting you've probably seen in unit files. The grace period between the two is intentional: give the process a chance to clean up before the kernel pulls the plug.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGKILL (9)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kill -9&lt;/code&gt; or &lt;code&gt;kill -SIGKILL&lt;/code&gt;. Not caught. Not ignored. Not blocked. The moment the kernel delivers SIGKILL to a running process, it's marked for immediate termination. The kernel doesn't call any signal handler, doesn't run atexit functions, doesn't flush stdio buffers.&lt;/p&gt;

&lt;p&gt;This is why &lt;code&gt;kill -9&lt;/code&gt; leaves zombie processes, half-written files, and unreleased locks behind. It's not "force kill" — it's "unconditional termination, no cleanup allowed."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why &lt;code&gt;kill -9&lt;/code&gt; sometimes can't kill a process&lt;/strong&gt;: &lt;strong&gt;uninterruptible sleep&lt;/strong&gt;, shown in &lt;code&gt;ps&lt;/code&gt; as state &lt;code&gt;D&lt;/code&gt;. This is a process that's blocked in a kernel code path that cannot be safely interrupted — usually waiting on a disk I/O that can't be cancelled, or an NFS mount that's gone stale. The kernel won't deliver SIGKILL until the process wakes from that sleep. Sometimes it never wakes. That's how you get unkillable processes, and the only fix is rebooting or waiting for the I/O to resolve.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGINT (2)
&lt;/h3&gt;

&lt;p&gt;You press Ctrl-C. SIGINT happens. The default action is termination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why Ctrl-C kills a whole pipeline.&lt;/strong&gt; The kernel's terminal driver sends SIGINT to the &lt;strong&gt;entire foreground process group&lt;/strong&gt; — not a single process. Every process in the group gets it simultaneously. When you run &lt;code&gt;cat huge_file | grep pattern | wc -l&lt;/code&gt;, all three processes are in the same foreground process group. They all get SIGINT. They all die. The pipeline collapses cleanly. I explain why it hits the whole group — and what a process group even is — in &lt;a href="https://nazquadri.dev/blog/the-layer-below/06-sessions-and-groups/" rel="noopener noreferrer"&gt;&lt;em&gt;Sessions and Process Groups&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGPIPE
&lt;/h3&gt;

&lt;p&gt;Run &lt;code&gt;python script.py | head -n 10&lt;/code&gt; and sometimes you'll see &lt;code&gt;BrokenPipeError&lt;/code&gt;. Here's what happened: &lt;code&gt;head&lt;/code&gt; read its 10 lines and exited. The pipe closed. The Python script was still writing. The kernel detected that the write end of a pipe has no readers, and sent &lt;strong&gt;SIGPIPE&lt;/strong&gt; to the still-writing process. The default action is termination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why &lt;code&gt;BrokenPipeError&lt;/code&gt; exists&lt;/strong&gt; — it's a caught SIGPIPE converted to an exception. Programs that ignore SIGPIPE (&lt;code&gt;signal(SIGPIPE, SIG_IGN)&lt;/code&gt;) will instead get an error return from &lt;code&gt;write()&lt;/code&gt;. Either way, the kernel is telling you: nobody's reading anymore, stop writing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Signals You Probably Don't Know
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SIGTSTP (20): Ctrl-Z Does This
&lt;/h3&gt;

&lt;p&gt;You press Ctrl-Z. The shell reports &lt;code&gt;[1]+  Stopped&lt;/code&gt;. The process is still alive, just suspended.&lt;/p&gt;

&lt;p&gt;What happened: the terminal driver sent SIGTSTP to the foreground process group. The default action is to &lt;strong&gt;stop&lt;/strong&gt; the process — it's taken off the kernel's run queue and no longer scheduled for CPU time. Its memory pages stay allocated, but it's frozen mid-execution.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fg&lt;/code&gt; sends SIGCONT to resume it. &lt;code&gt;bg&lt;/code&gt; also sends SIGCONT, but tells the shell to let it run in the background.&lt;/p&gt;

&lt;p&gt;Programs can catch SIGTSTP. Vim does this. When you Ctrl-Z out of vim, it saves its terminal state, restores the original terminal settings, &lt;em&gt;then&lt;/em&gt; stops itself. When you &lt;code&gt;fg&lt;/code&gt; back in, it gets SIGCONT, re-enters raw mode, and redraws the screen. That's not magic — it's a signal handler.&lt;/p&gt;

&lt;p&gt;The difference between SIGTSTP and SIGSTOP: SIGTSTP can be caught — a process can intercept it and do cleanup before stopping. SIGSTOP cannot — the kernel stops it unconditionally. The full job control cycle — &lt;code&gt;&amp;amp;&lt;/code&gt;, &lt;code&gt;fg&lt;/code&gt;, &lt;code&gt;bg&lt;/code&gt;, and how the shell orchestrates it — is in &lt;a href="https://nazquadri.dev/blog/the-layer-below/06-sessions-and-groups/" rel="noopener noreferrer"&gt;&lt;em&gt;Sessions and Process Groups&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGWINCH: Your Terminal Is Resizing
&lt;/h3&gt;

&lt;p&gt;Every time you drag the corner of your terminal window, a signal fires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SIGWINCH&lt;/strong&gt; — "window change" — is delivered to the foreground process group when the terminal dimensions change. The kernel's terminal driver detects that the master side has reported a new size, and sends the signal.&lt;/p&gt;

&lt;p&gt;Let that sink in for a moment. Window resizing is a signal event. The kernel is involved.&lt;/p&gt;

&lt;p&gt;The process catches SIGWINCH, calls &lt;code&gt;ioctl(STDOUT_FILENO, TIOCGWINSZ, &amp;amp;ws)&lt;/code&gt; to get the new dimensions, and redraws its UI. This is how vim reflows when you resize the window. How htop re-renders its columns. How your shell prompt re-wraps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscn4lgx851um94k24gdo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscn4lgx851um94k24gdo.jpeg" alt="SIGWINCH — terminal resize signal flow" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're writing a terminal application, you need to catch SIGWINCH or your UI will look wrong after any resize.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGCHLD: Your Child Died (Or Did Something Else)
&lt;/h3&gt;

&lt;p&gt;When a child process changes state — exits, is stopped, is resumed — the kernel sends &lt;strong&gt;SIGCHLD&lt;/strong&gt; to its parent.&lt;/p&gt;

&lt;p&gt;The default action is to ignore it. Most processes don't care.&lt;/p&gt;

&lt;p&gt;But shells care deeply. When bash gets SIGCHLD, it checks which child changed state, updates its job table, and prints &lt;code&gt;[1]+  Done  sleep 10&lt;/code&gt; or whatever. When a daemon forks children to handle requests, it catches SIGCHLD so it can call &lt;code&gt;waitpid()&lt;/code&gt; and reap the zombies before they accumulate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why zombie processes happen.&lt;/strong&gt; When a process exits, it doesn't fully disappear — it leaves a small record in the kernel's process table, holding the exit code. The parent is expected to call &lt;code&gt;wait()&lt;/code&gt; or &lt;code&gt;waitpid()&lt;/code&gt; to retrieve that exit code, at which point the zombie is cleaned up. If the parent never calls wait, the zombie just sits there. If the parent exits, zombies get reparented to init (PID 1), which calls wait in a loop.&lt;/p&gt;

&lt;p&gt;SIGCHLD is the kernel radioing your base camp: "One of your people turned." You can go collect what's left — the exit code — by calling &lt;code&gt;waitpid()&lt;/code&gt;. If you don't, they shamble around the process table indefinitely. The kernel literally calls them zombies. State &lt;code&gt;Z&lt;/code&gt; in &lt;code&gt;ps&lt;/code&gt;. No cure. No kill signal works — they're already dead. The only way to clear them is to collect the exit code, or let the parent die too, at which point &lt;code&gt;init&lt;/code&gt; adopts the orphans and reaps them. Rick Grimes would call that a mercy kill. The kernel calls it &lt;code&gt;wait()&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGALRM: How Timeouts Work
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;alarm(30)&lt;/code&gt; sets a timer. In 30 seconds, the kernel delivers &lt;strong&gt;SIGALRM&lt;/strong&gt; to the process. The default action is termination, but programs catch it to implement timeouts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's how shell scripts time out operations.&lt;/strong&gt; The crude bash version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kill myself after 30 seconds if still running&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;30 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &amp;amp; long-running-command
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or the cleaner way, using the &lt;code&gt;timeout&lt;/code&gt; command (coreutils):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;timeout &lt;/span&gt;30 long-running-command
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, &lt;code&gt;timeout&lt;/code&gt; does essentially what &lt;code&gt;alarm()&lt;/code&gt; does — sets a timer, catches the signal, kills the child. The real pattern in C is &lt;code&gt;alarm()&lt;/code&gt; with a signal handler that cancels or interrupts the operation. Many C programs use this to implement read timeouts, connection timeouts, and watchdog behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  SIGHUP: The Signal That Means Two Different Things
&lt;/h2&gt;

&lt;p&gt;This one deserves its own section because it's genuinely confusing until you understand the history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SIGHUP&lt;/strong&gt; — "hangup" — originally meant that your physical serial terminal had disconnected. The modem connection dropped. The carrier signal was gone. The line was dead.&lt;/p&gt;

&lt;p&gt;When that happened, the kernel sent SIGHUP to the session leader (usually the shell) of the terminal session. The shell would then die, taking its job-controlled children with it. This was the correct behavior: if your terminal is gone, there's no point keeping the session running.&lt;/p&gt;

&lt;p&gt;SIGHUP means "your terminal disconnected." The default action is termination.&lt;/p&gt;

&lt;p&gt;There's a second meaning, and it came from necessity.&lt;/p&gt;

&lt;p&gt;By the mid-1980s, Unix system administrators had noticed: SIGHUP is a signal you can catch. And servers don't have physical terminals to disconnect. So SIGHUP became the conventional signal to tell a daemon "reload your configuration without restarting."&lt;/p&gt;

&lt;p&gt;nginx, sshd, Apache, rsyslog — they all catch SIGHUP and re-read their config files. You'll see this in documentation all the time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="nt"&gt;-HUP&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /var/run/nginx.pid&lt;span class="si"&gt;)&lt;/span&gt;    &lt;span class="c"&gt;# reload nginx config&lt;/span&gt;
&lt;span class="c"&gt;# or, in modern times:&lt;/span&gt;
systemctl reload nginx                  &lt;span class="c"&gt;# which does the same thing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;That's why &lt;code&gt;nginx -s reload&lt;/code&gt; works&lt;/strong&gt; — the nginx binary locates the master process PID, sends SIGHUP, and exits. The master process catches SIGHUP, re-reads its config, and gracefully replaces its workers. It's not a special protocol. It's just a signal.&lt;/p&gt;

&lt;p&gt;The same signal means "your terminal died, please exit" to an interactive process, and "please reload your config" to a daemon. Context entirely determines the correct behavior.&lt;/p&gt;

&lt;p&gt;It's terrible API design, but it's been working for 50 years.&lt;/p&gt;




&lt;h2&gt;
  
  
  Process Groups and Why Signals Go Sideways
&lt;/h2&gt;

&lt;p&gt;Signals don't always go where you expect. When Ctrl-C kills an entire pipeline but not your shell, that's not the signal mechanism — that's &lt;strong&gt;process groups&lt;/strong&gt;, and the kernel's rules about who receives what. I cover the full process group and session architecture in &lt;a href="https://nazquadri.dev/blog/the-layer-below/06-sessions-and-groups/" rel="noopener noreferrer"&gt;&lt;em&gt;Sessions and Process Groups&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  nohup
&lt;/h2&gt;

&lt;p&gt;You've probably used &lt;code&gt;nohup&lt;/code&gt; to keep a process alive after logout. How it actually works — and why tmux is better — is in &lt;a href="https://nazquadri.dev/blog/the-layer-below/06-sessions-and-groups/" rel="noopener noreferrer"&gt;&lt;em&gt;Sessions and Process Groups&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Signals and the PTY
&lt;/h2&gt;

&lt;p&gt;Signals and file descriptors are not separate systems. They interlock at the terminal level.&lt;/p&gt;

&lt;p&gt;When a PTY supervisor — a process that holds the master end of a PTY and manages a child's terminal session — wants to send Ctrl-C to the child, there are two approaches. The wrong one: &lt;code&gt;kill(child_pid, SIGINT)&lt;/code&gt;. The right one: write the byte &lt;code&gt;0x03&lt;/code&gt; (the byte Ctrl-C generates) to the PTY master.&lt;/p&gt;

&lt;p&gt;The kernel's line discipline, running inside the PTY, processes that byte and generates SIGINT for the foreground process group inside the PTY's session. The child sees a real Ctrl-C — one that came through the terminal, the way Ctrl-C is supposed to arrive. Job control works correctly. Signal handlers see the right source.&lt;/p&gt;
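&lt;p&gt;You can watch the line discipline do this translation with Python's &lt;code&gt;pty&lt;/code&gt; module — a quick Linux sketch, not production supervisor code. The child restores the default SIGINT disposition (Python normally converts SIGINT into &lt;code&gt;KeyboardInterrupt&lt;/code&gt;), then the parent writes &lt;code&gt;0x03&lt;/code&gt; to the master and observes that the child died of a real SIGINT:&lt;/p&gt;

```python
import os
import pty
import signal
import time

pid, master_fd = pty.fork()
if pid == 0:
    # Child: pty.fork() made us a session leader with the pty slave
    # as our controlling terminal. Restore the default SIGINT
    # disposition so the signal actually terminates us.
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    time.sleep(30)
    os._exit(0)

time.sleep(0.3)                  # give the child time to start
os.write(master_fd, b"\x03")     # the byte Ctrl-C generates
_, status = os.waitpid(pid, 0)
os.close(master_fd)

# The line discipline turned 0x03 into SIGINT for the
# foreground process group inside the pty's session.
killed_by_sigint = os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGINT
print(killed_by_sigint)
```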

&lt;p&gt;When the supervisor wants to stop the whole session cleanly, it sends SIGTERM to the child's process group (using the negative PGID), waits for exit by catching SIGCHLD, then cleans up resources. SIGKILL is the fallback if the process ignores SIGTERM and the timeout expires.&lt;/p&gt;
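&lt;p&gt;The group-wide SIGTERM step looks something like this sketch — start a child in its own session (so it leads its own process group), then signal the whole group rather than one pid:&lt;/p&gt;

```python
import os
import signal
import subprocess
import time

# start_new_session=True puts the child in its own session and
# process group, the way a supervisor isolates what it manages.
child = subprocess.Popen(["sleep", "30"], start_new_session=True)
time.sleep(0.1)

# Signal the entire group: kill(-pgid, SIGTERM) under the hood.
os.killpg(os.getpgid(child.pid), signal.SIGTERM)
status = child.wait()
print(status)   # negative value: terminated by that signal number
```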

&lt;p&gt;&lt;strong&gt;That's why well-written supervisors write control bytes to the PTY master instead of sending signals directly&lt;/strong&gt; — it respects the terminal abstraction that the child process expects.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mental Model
&lt;/h2&gt;

&lt;p&gt;Signals are the kernel's notification system for processes. Asynchronous, lightweight, and limited — they carry one piece of information, the signal number, and optionally a few extra bytes in the &lt;code&gt;siginfo_t&lt;/code&gt; structure. They're not designed for data transfer. They're designed for control: "stop what you're doing," "your child just exited," "your window changed size," "someone is asking you to reload your config."&lt;/p&gt;

&lt;p&gt;The disposition model — default, catch, ignore — gives each process control over how it responds. Except for SIGKILL and SIGSTOP, which the kernel reserves for itself. Those two bypass the disposition system entirely.&lt;/p&gt;
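&lt;p&gt;All three dispositions, and the kernel's refusal to let you touch SIGKILL, fit in a few lines (a self-signaling sketch; SIGUSR1/SIGUSR2 are just convenient user-defined signals):&lt;/p&gt;

```python
import os
import signal

caught = []

def handler(signum, frame):
    caught.append(signum)

signal.signal(signal.SIGUSR1, handler)          # catch
signal.signal(signal.SIGUSR2, signal.SIG_IGN)   # ignore

os.kill(os.getpid(), signal.SIGUSR1)   # handler runs
os.kill(os.getpid(), signal.SIGUSR2)   # silently dropped

try:
    signal.signal(signal.SIGKILL, handler)      # the kernel refuses
    refused = False
except OSError:
    refused = True

print(caught == [signal.SIGUSR1], refused)
```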

&lt;p&gt;Process groups determine the blast radius — which processes a signal actually reaches. That mechanism, along with session architecture, is in &lt;a href="https://nazquadri.dev/blog/the-layer-below/06-sessions-and-groups/" rel="noopener noreferrer"&gt;&lt;em&gt;Sessions and Process Groups&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And SIGHUP is historically weird. Accept it and move on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;p&gt;Here's what we've covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A signal is a small out-of-band notification delivered by the kernel to a process.&lt;/li&gt;
&lt;li&gt;Disposition is per-process, per-signal: default action, catch with a handler, or ignore.&lt;/li&gt;
&lt;li&gt;SIGKILL (9) and SIGSTOP (19) cannot be caught or ignored — the kernel handles them unconditionally.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kill -9&lt;/code&gt; can fail to immediately kill a process in uninterruptible sleep (&lt;code&gt;D&lt;/code&gt; state).&lt;/li&gt;
&lt;li&gt;The terminal driver sends SIGINT on Ctrl-C and SIGTSTP on Ctrl-Z to the entire foreground process group.&lt;/li&gt;
&lt;li&gt;SIGPIPE fires when you write to a pipe with no readers — that's the &lt;code&gt;BrokenPipeError&lt;/code&gt; you've seen.&lt;/li&gt;
&lt;li&gt;SIGWINCH fires whenever the terminal is resized — programs catch it to redraw their UI.&lt;/li&gt;
&lt;li&gt;SIGCHLD notifies a parent when a child changes state; catching it is how you avoid zombie processes.&lt;/li&gt;
&lt;li&gt;SIGHUP means "terminal disconnected" to interactive programs, and "reload config" to daemons. Both are real.&lt;/li&gt;
&lt;li&gt;Process groups, &lt;code&gt;nohup&lt;/code&gt;, session architecture, and &lt;code&gt;setsid()&lt;/code&gt; are covered in the next post: &lt;a href="https://nazquadri.dev/blog/the-layer-below/06-sessions-and-groups/" rel="noopener noreferrer"&gt;&lt;em&gt;Sessions and Process Groups&lt;/em&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man7/signal.7.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 7 signal&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the complete Linux signal reference. Every signal, its default action, whether it can be caught.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/sigaction.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 sigaction&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the correct way to install a signal handler (not the older &lt;code&gt;signal(2)&lt;/code&gt;, which has subtle portability problems).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/kill.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 kill&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/waitpid.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 waitpid&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the syscalls for sending signals and reaping children.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.apuebook.com/" rel="noopener noreferrer"&gt;Advanced Programming in the UNIX Environment&lt;/a&gt;&lt;/strong&gt; — Stevens and Rago, Chapter 10 on signals. Still the best treatment of the full signal model in print.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri occasionally sends SIGTERM to processes that deserve a chance to clean up first. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What Actually Happens When You Read a File</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:02:11 +0000</pubDate>
      <link>https://forem.com/nazq/what-actually-happens-when-you-read-a-file-2cd4</link>
      <guid>https://forem.com/nazq/what-actually-happens-when-you-read-a-file-2cd4</guid>
      <description>&lt;h1&gt;
  
  
  What Actually Happens When You Read a File
&lt;/h1&gt;

&lt;h2&gt;
  
  
  14 Things That Happen Before You Get Your Bytes
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You wrote &lt;code&gt;data = open('file.txt').read()&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Two function calls. That's all you did. Python is happy. Your bytes are there. The whole thing took 0.317 milliseconds and you moved on with your life.&lt;/p&gt;

&lt;p&gt;Under the hood, at least four separate processors woke up to serve you. One of them is inside your storage drive. Another is a dedicated interrupt controller. A third is the flash memory controller that lives alongside the NAND chips themselves, running its own firmware, maintaining its own data structures in its own RAM. And then there's your CPU, which you thought was doing all the work.&lt;/p&gt;

&lt;p&gt;Let's go look at what actually happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Comfortable Lie
&lt;/h2&gt;

&lt;p&gt;The mental model most of us carry is: open a file, read the bytes, done. The file is "on disk." Reading it means "getting it from disk." Maybe there's a cache involved. Simple.&lt;/p&gt;

&lt;p&gt;The truth is more interesting. Here's the actual stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python f.read()
  └── C stdlib buffering
        └── read(2) syscall
              └── kernel VFS layer
                    └── filesystem driver (ext4, xfs, btrfs...)
                          └── block I/O scheduler
                                └── NVMe driver
                                      └── NVMe controller (separate ARM/RISC-V SoC)
                                            └── Flash Translation Layer
                                                  └── NAND flash cells
                                                        └── DMA → system RAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You've probably hit the effects without knowing the cause: the first read of a file takes longer than subsequent reads. Large files on SSDs can stall with weird latency spikes. Programs that read the same file concurrently don't always step on each other. None of this makes sense until you know what's underneath.&lt;/p&gt;




&lt;h2&gt;
  
  
  You Haven't Even Left Python Yet
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# this IS a syscall — the kernel is already working
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;        &lt;span class="c1"&gt;# and now it works harder
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;open()&lt;/code&gt; in Python isn't just creating a file object. CPython calls down to the &lt;code&gt;open(2)&lt;/code&gt; syscall immediately — the kernel resolves the path, walks the directory tree, loads the inode, checks permissions, and allocates a file descriptor. By the time &lt;code&gt;open()&lt;/code&gt; returns, the kernel has done real work. What it &lt;em&gt;hasn't&lt;/em&gt; done is read any data. The file is open. The bytes are still on disk.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;f.read()&lt;/code&gt; is where the data moves. Python's file object calls through its buffered I/O layer, which calls &lt;code&gt;read(2)&lt;/code&gt; — the actual POSIX syscall that triggers the chain of events this post is about.&lt;/p&gt;
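&lt;p&gt;You can see those two syscalls without Python's buffered I/O layer in the way by calling the thin &lt;code&gt;os&lt;/code&gt;-module wrappers directly — each of these is essentially one syscall (a quick sketch; the temp file is just scaffolding):&lt;/p&gt;

```python
import os
import tempfile

# Scaffolding: create a file to read.
fd, path = tempfile.mkstemp()
os.write(fd, b"raw bytes")
os.close(fd)

# The thin wrappers, one syscall each:
fd = os.open(path, os.O_RDONLY)   # open(2): path walk, inode load, fd allocation
data = os.read(fd, 4096)          # read(2): the subject of this post
os.close(fd)
os.remove(path)
print(data)
```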

&lt;p&gt;Now we're at the syscall boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Syscall — Crossing Into Kernel Space
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;read(2)&lt;/code&gt; syscall is the moment your program stops being in charge.&lt;/p&gt;

&lt;p&gt;Your code executes a special CPU instruction — &lt;code&gt;syscall&lt;/code&gt; on x86-64 — that switches the processor from user mode to kernel mode. The CPU saves your current register state, changes privilege level, and jumps to the kernel's syscall handler. Your process is now blocked, waiting. The kernel is driving.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyj17rm3rthoyqkd7tw2x.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyj17rm3rthoyqkd7tw2x.jpeg" alt="Syscall boundary — Ring 3 to Ring 0" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The kernel looks up the file descriptor you passed. &lt;a href="https://nazquadri.dev/blog/the-layer-below/03-file-descriptors/" rel="noopener noreferrer"&gt;File descriptors&lt;/a&gt; are integers — small ones, starting at 0. Behind that integer is a &lt;code&gt;struct file&lt;/code&gt; in kernel memory, containing things like the current read position, flags, and most importantly: a pointer to the VFS inode for this file.&lt;/p&gt;




&lt;h2&gt;
  
  
  The VFS — The Kernel Doesn't Know What a Filesystem Is
&lt;/h2&gt;

&lt;p&gt;It does, of course. But in the name of abstraction, the kernel doesn't operate in terms of ext4 or XFS at this level.&lt;/p&gt;

&lt;p&gt;The kernel has a &lt;strong&gt;Virtual Filesystem (VFS)&lt;/strong&gt; layer — an abstraction that sits above all actual filesystem implementations and defines a common interface. Every filesystem registers itself by providing a set of function pointers: here's how to look up a file by name, here's how to read an inode, here's how to iterate a directory.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;read(2)&lt;/code&gt; lands in the kernel, it calls through the VFS. The VFS looks at the inode for your file and calls that filesystem's &lt;code&gt;read&lt;/code&gt; function. If you're on ext4, the ext4 driver handles it. If you're on btrfs, btrfs handles it. Your process has no idea which one it is.&lt;/p&gt;

&lt;p&gt;This is why you can mount a USB drive formatted with FAT32 and &lt;code&gt;open()&lt;/code&gt; files on it with exactly the same Python code. Same syscall. Same VFS interface. Different driver underneath.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;read(fd)
  → kernel VFS layer
    → ext4_file_read_iter()   # or xfs_file_read_iter(), etc.
      → generic_file_read_iter()
        → page cache lookup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Page Cache — The Kernel's Memory is Not Your Memory
&lt;/h2&gt;

&lt;p&gt;Before any I/O happens, the kernel checks the &lt;strong&gt;page cache&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The page cache is a giant in-memory buffer the kernel maintains for file data. Every file read that came from disk is stored here. Every file write passes through here before hitting disk (in writeback mode). The page cache is the reason the second &lt;code&gt;read()&lt;/code&gt; of the same file is instant: the bytes are already sitting in kernel memory.&lt;/p&gt;

&lt;p&gt;Pages are 4KB chunks. The kernel asks: does the page cache contain the pages that cover the byte range this &lt;code&gt;read()&lt;/code&gt; asked for?&lt;/p&gt;

&lt;p&gt;If yes: copy the bytes from kernel memory into the userspace buffer your process provided. Done. No I/O. The whole chain from syscall to return took maybe 1–2 microseconds.&lt;/p&gt;

&lt;p&gt;If no: we need to go get them. This is where it gets interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why warm caches are fast and cold caches are slow.&lt;/strong&gt; A page cache hit skips everything below — the filesystem, the block layer, the NVMe controller, NAND sensing, DMA, all of it. Irrelevant when your bytes are already in kernel memory. This is why spinning up a long-running server process is slow at first — nothing is cached — and then blazing fast once the working set is warm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why &lt;code&gt;mmap()&lt;/code&gt; exists.&lt;/strong&gt; Mapping a file into virtual memory cuts out the copy step. The page cache pages are mapped directly into your process's virtual address space. When you access them, the CPU's MMU handles the translation. No &lt;code&gt;read()&lt;/code&gt; syscall, no copy into a userspace buffer — the page cache IS your buffer. For large files read once, this is often faster. For small files read repeatedly, the syscall overhead doesn't matter and &lt;code&gt;read()&lt;/code&gt; is fine. As an aside: when I first discovered &lt;code&gt;CreateFileMapping()&lt;/code&gt; (the Windoze equivalent of &lt;code&gt;mmap&lt;/code&gt;) back in the late 90s, I was amazed. I started using it for everything until one of my mentors pled with me to stop ... but I did not 😂.&lt;/p&gt;
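&lt;p&gt;Here's the mechanism in miniature — map a file and slice it like memory, no &lt;code&gt;read()&lt;/code&gt; per access (a small sketch; the temp file is just setup):&lt;/p&gt;

```python
import mmap
import os
import tempfile

# Setup: a small file to map.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello page cache")
os.close(fd)

f = open(path, "rb")
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Slicing the map touches the page cache pages directly
# through the MMU; there is no read() syscall per access.
data = m[0:5]

m.close()
f.close()
os.remove(path)
print(data)
```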

&lt;p&gt;&lt;strong&gt;That's why databases use &lt;code&gt;O_DIRECT&lt;/code&gt;.&lt;/strong&gt; PostgreSQL and other databases bypass the page cache entirely with &lt;code&gt;O_DIRECT&lt;/code&gt;, writing directly to the block layer. They maintain their own buffer pools and don't want the kernel caching their data twice. The kernel's cache is designed for general workloads; a database has better information about which pages to keep warm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrnbtvnagfd1ja34bnyc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrnbtvnagfd1ja34bnyc.jpeg" alt="Page cache — cache hit vs cache miss" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Filesystem's Job — Names Don't Mean Anything
&lt;/h2&gt;

&lt;p&gt;The filesystem translates the filename into block addresses.&lt;/p&gt;

&lt;p&gt;Inside the filesystem, files are identified by &lt;strong&gt;inodes&lt;/strong&gt;. The inode is a data structure that stores everything about a file except its name: permissions, timestamps, owner, size, and most importantly — where the data actually lives on disk.&lt;/p&gt;

&lt;p&gt;The filename is just a pointer to an inode. The inode contains block addresses. The block addresses are where the data is.&lt;/p&gt;

&lt;p&gt;For ext4, the inode uses a tree of "extent" records. An extent says "this file's data from byte offset X to offset Y is stored at block address Z on disk." Large files have multiple extents. Highly fragmented files have many extents that the filesystem has to chase down.&lt;/p&gt;

&lt;p&gt;For a small file written in one go, you might get one extent: "all your bytes are at block 12845." For a file that's been written in pieces over months, you might get dozens of extents spread across the disk.&lt;/p&gt;

&lt;p&gt;The filesystem resolves all of this and hands the block layer a list of logical block addresses to fetch.&lt;/p&gt;
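&lt;p&gt;The "names are just pointers to inodes" claim is directly observable: a hard link gives one inode a second name, and &lt;code&gt;os.stat()&lt;/code&gt; shows both names resolving to the same inode number (a small sketch using a throwaway directory):&lt;/p&gt;

```python
import os
import shutil
import tempfile

tmp = tempfile.mkdtemp()
a = os.path.join(tmp, "a.txt")
b = os.path.join(tmp, "b.txt")

with open(a, "w") as f:
    f.write("same bytes")

os.link(a, b)   # a second name for the same inode

# Two names, one inode: the filename is just a pointer.
same = os.stat(a).st_ino == os.stat(b).st_ino
print(same)

shutil.rmtree(tmp)
```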




&lt;h2&gt;
  
  
  The Block Layer — The Scheduler You Never Knew You Had
&lt;/h2&gt;

&lt;p&gt;Between the filesystem and the storage driver, there's a &lt;strong&gt;block I/O layer&lt;/strong&gt; with a scheduler.&lt;/p&gt;

&lt;p&gt;For traditional spinning hard drives, the scheduler tries to minimize seek time by reordering requests so the drive head doesn't thrash back and forth. For NVMe SSDs, the scheduler does something different: it batches requests and submits them in parallel. NVMe was designed for SSDs and supports up to 65,535 I/O queues, each up to 65,536 commands deep. The bottleneck isn't seek time; it's queue depth and controller parallelism.&lt;/p&gt;

&lt;p&gt;For NVMe on modern Linux, the default scheduler is often &lt;code&gt;none&lt;/code&gt; — the kernel trusts the device's own queuing. Check your system with &lt;code&gt;cat /sys/block/nvme0n1/queue/scheduler&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Your single &lt;code&gt;read()&lt;/code&gt; call might generate one I/O request. A larger read, or a file with many extents, might generate several requests, all in flight simultaneously.&lt;/p&gt;




&lt;h2&gt;
  
  
  The NVMe Controller — A Separate Computer
&lt;/h2&gt;

&lt;p&gt;Here's where most people's mental model breaks down completely.&lt;/p&gt;

&lt;p&gt;Your NVMe drive is not a passive block device. It's a computer.&lt;/p&gt;

&lt;p&gt;Inside your SSD, there's a dedicated &lt;strong&gt;NVMe controller&lt;/strong&gt;: an ARM (or MIPS, or RISC-V) processor running its own firmware, with its own RAM (typically up to 2GB for caching and FTL mapping tables on consumer drives; cheap drives use HMB — Host Memory Buffer — borrowing from your system RAM instead), its own instruction cache, its own operating loop. It boots when your machine powers on. It's running right now, independent of your CPU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbop8l9v0cl041urerh7l.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbop8l9v0cl041urerh7l.jpeg" alt="NVMe SSD exploded view" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The NVMe protocol is how your CPU's storage driver talks to this controller. When Linux wants to read blocks 12845–12850, it writes a command into a &lt;strong&gt;submission queue&lt;/strong&gt; in memory — a region that both the CPU and the NVMe controller can see, via PCIe. The controller polls or is notified of this command, picks it up, and starts processing it.&lt;/p&gt;

&lt;p&gt;The CPU posts the command and goes to sleep, waiting for a completion notification. The NVMe controller is now in charge.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Flash Translation Layer — Another Layer of Lies
&lt;/h2&gt;

&lt;p&gt;The NVMe controller doesn't directly address NAND flash cells. There's one more indirection: the &lt;strong&gt;Flash Translation Layer (FTL)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;NAND flash has two deeply inconvenient properties that make it unsuitable for direct use as a block device:&lt;/p&gt;

&lt;p&gt;First, you can only write to a NAND cell by erasing it first. Erasing happens in "erase blocks" — units typically 128KB to several megabytes. You can't erase individual bytes. If you want to update 4KB of data in the middle of an erase block, you have to read the whole erase block, erase it, modify the relevant part, and write it all back.&lt;/p&gt;

&lt;p&gt;Second, NAND cells wear out. Each erase cycle damages the cell's insulator a little. Consumer NAND is typically rated for 1,000–3,000 program-erase cycles. After that, the cell starts to lose charge and eventually holds incorrect data. If you wrote to the same physical location every time, it would wear out while the rest of the drive was fresh.&lt;/p&gt;

&lt;p&gt;The FTL solves both problems. It maintains a mapping table — a giant lookup table in the controller's RAM — that maps the &lt;strong&gt;logical block address&lt;/strong&gt; the host sees to the &lt;strong&gt;physical block address&lt;/strong&gt; where data actually lives in the NAND. Every write goes to a fresh location, and the FTL updates the mapping. The old physical location is marked as invalid, to be reclaimed during garbage collection.&lt;/p&gt;
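&lt;p&gt;A toy model makes the mapping-table behavior concrete — every rewrite lands on a fresh physical page and the old one becomes garbage to collect later (purely illustrative; a real FTL adds wear-leveling, caching, and persistence):&lt;/p&gt;

```python
# Toy Flash Translation Layer: logical block -> physical page mapping.
class ToyFTL:
    def __init__(self, pages):
        self.mapping = {}            # logical block -> physical page
        self.free = list(range(pages))
        self.invalid = set()         # stale pages awaiting garbage collection

    def write(self, lba):
        old = self.mapping.get(lba)
        if old is not None:
            self.invalid.add(old)    # never overwrite NAND in place
        page = self.free.pop(0)      # always write to a fresh page
        self.mapping[lba] = page
        return page

ftl = ToyFTL(pages=8)
ftl.write(12845)      # first write lands on physical page 0
ftl.write(12845)      # rewrite goes to page 1; page 0 is now invalid
print(ftl.mapping[12845], ftl.invalid)
```

TRIM, in this model, is the host telling the FTL to move a logical block's page straight into &lt;code&gt;invalid&lt;/code&gt; instead of the FTL having to guess.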

&lt;p&gt;&lt;strong&gt;That's why TRIM matters.&lt;/strong&gt; Without TRIM, the FTL doesn't know which logical blocks are no longer in use. It has to do garbage collection under load, pausing your writes while it erases blocks it &lt;em&gt;thinks&lt;/em&gt; might be reclaimable. The OS tells the drive what it deleted. The drive's FTL updates its tables. Without TRIM, your SSD gets slower over time as it accumulates "dead" mappings it can't safely reclaim.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why SSDs slow down near full.&lt;/strong&gt; The FTL runs out of fresh blocks and has to garbage-collect under load. Write amplification increases, and your writes start waiting on internal erase cycles. "90% full" means the FTL has 10% of its physical space to maneuver in.&lt;/p&gt;

&lt;p&gt;For a read, the FTL translates: "you want logical block 12845" → "that's at physical NAND page 0xAF3C14." Now the controller can address the NAND directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  NAND Sensing — Bits Are Charges
&lt;/h2&gt;

&lt;p&gt;NAND flash stores bits as electrical charge on a floating gate — a conductor surrounded by insulating oxide, sandwiched inside a transistor.&lt;/p&gt;

&lt;p&gt;A programmed cell holds trapped electrons on the floating gate, which raises the transistor's threshold voltage; it reads as a 0. An erased cell holds little or no charge and reads as a 1. Sensing the charge means applying a reference voltage to the gate and measuring whether the cell conducts. The reference voltage is carefully calibrated — and for MLC or TLC NAND (2 or 3 bits per cell), there are multiple voltage thresholds to distinguish, because each cell holds multiple charge levels representing different bit patterns.&lt;/p&gt;
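&lt;p&gt;For intuition, here's a toy MLC read: one cell stores 2 bits as one of four charge levels, and sensing amounts to finding which reference voltages the cell's threshold exceeds. The voltages and the Gray-coded mapping below are illustrative numbers, not any real part's datasheet:&lt;/p&gt;

```python
import bisect

REFS = [1.0, 2.0, 3.0]            # hypothetical reference voltages
GRAY = ["11", "10", "00", "01"]   # illustrative Gray-coded level-to-bits map

def sense(vth):
    # bisect counts how many reference voltages the cell's
    # threshold voltage exceeds; that count is the charge level.
    return GRAY[bisect.bisect(REFS, vth)]

print(sense(0.5), sense(2.5))   # erased-ish cell, then a mid-level cell
```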

&lt;p&gt;The NAND controller reads a full page at once — typically 4KB to 16KB. It applies the sense voltage across thousands of cells in parallel, latches the results into a page register, and then applies &lt;strong&gt;ECC (Error Correcting Code)&lt;/strong&gt; to detect and correct any errors.&lt;/p&gt;

&lt;p&gt;NAND cells are lossy. Fresh cells might have error rates of 1 bit per billion. Near end-of-life, that might be 1 bit per thousand. ECC is mandatory — without it, you'd get bit flips constantly.&lt;/p&gt;

&lt;p&gt;After ECC, you have the corrected page data in the controller's page register. The relevant portion is your file's blocks.&lt;/p&gt;




&lt;h2&gt;
  
  
  DMA — The CPU Doesn't Touch the Data
&lt;/h2&gt;

&lt;p&gt;When the NVMe controller transfers data from its page register to your system RAM, your CPU doesn't touch the data. You read that right. And it's not some exotic server-grade hardware optimisation; it works on that 2018 14-inch laptop you still carry around.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct Memory Access (DMA)&lt;/strong&gt; lets the NVMe controller (and other peripherals) write directly to system RAM over the PCIe bus, without interrupting the CPU. The DMA engine is a separate piece of hardware — often on the CPU die, but logically separate — that handles these memory transfers while the CPU does other things (or sleeps, in this case, since your process is blocked).&lt;/p&gt;

&lt;p&gt;The NVMe controller was told the physical address of the DMA target buffer when Linux submitted the I/O command. Now it uses the PCIe bus's memory write transactions to transfer the data directly into that buffer. No CPU instruction reads any of those bytes.&lt;/p&gt;

&lt;p&gt;The data travels: NAND → controller page register → PCIe bus → DMA engine → system RAM.&lt;/p&gt;

&lt;p&gt;Your CPU set up the transfer and will be notified when it's done. That's it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fai6c781nm7pwtl25sw8c.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fai6c781nm7pwtl25sw8c.jpeg" alt="NAND to system RAM — the CPU sleeps while DMA works" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Interrupt — MSI-X
&lt;/h2&gt;

&lt;p&gt;When the DMA transfer completes, the NVMe controller needs to tell the CPU.&lt;/p&gt;

&lt;p&gt;It does this with an &lt;strong&gt;MSI-X interrupt&lt;/strong&gt; — a "Message Signaled Interrupt eXtended." Instead of a physical signal on an interrupt pin (the old way), the controller writes a small message to a specific memory address. The interrupt controller (usually the APIC on x86 systems — another separate piece of hardware) sees this write and delivers the interrupt to the appropriate CPU core.&lt;/p&gt;

&lt;p&gt;The CPU's interrupt handler wakes up, runs the NVMe completion handler, marks the I/O as complete, and wakes the process that was blocked waiting for this data.&lt;/p&gt;

&lt;p&gt;MSI-X is worth naming because it enables one of the key performance features of NVMe: &lt;strong&gt;multiple independent queues mapped to different CPU cores&lt;/strong&gt;. Old storage interrupts all went to one CPU, which then had to fan out work. With MSI-X, each NVMe queue can interrupt a different core. The NVMe controller talks directly to all the CPUs in parallel.&lt;/p&gt;




&lt;h2&gt;
  
  
  Back in the Kernel — Assembly
&lt;/h2&gt;

&lt;p&gt;The page cache is now populated with the data from disk. The kernel copies the bytes from the page cache into the userspace buffer your process provided in the &lt;code&gt;read()&lt;/code&gt; syscall. The syscall returns.&lt;/p&gt;

&lt;p&gt;Your process unblocks. The return value is the number of bytes read. Control returns to Python's &lt;code&gt;f.read()&lt;/code&gt; implementation, then to your code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data&lt;/code&gt; has your bytes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Processor Count
&lt;/h2&gt;

&lt;p&gt;Let's count the processors involved in that single &lt;code&gt;read()&lt;/code&gt; call:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Your CPU&lt;/strong&gt; — executed the syscall, set up DMA, blocked, received the interrupt, ran the completion handler, copied bytes to userspace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The NVMe controller&lt;/strong&gt; — ARM/RISC-V SoC inside the drive, executed the FTL lookup, sent NAND commands, supervised the DMA transfer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The NAND flash controller&lt;/strong&gt; — embedded within or tightly coupled to the NAND chips, handled the page read, applied ECC, transferred data to the controller's page register&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The interrupt controller (APIC)&lt;/strong&gt; — delivered the MSI-X completion interrupt to the correct CPU core&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Four processors, minimum. More if you count the DMA engine as a separate compute element, which it arguably is.&lt;/p&gt;

&lt;p&gt;For one &lt;code&gt;read()&lt;/code&gt; call. That took 0.3 milliseconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;We started with two lines of Python. Here's what they were hiding: the VFS, a filesystem driver, a block scheduler, an NVMe command queue, a Flash Translation Layer, NAND physics, DMA hardware, and interrupt routing.&lt;/p&gt;

&lt;p&gt;None of those layers are particularly complicated on their own. It's the stack of them, the fact that each one is a complete independent system with its own logic and failure modes, that makes "reading a file" more interesting than it looks.&lt;/p&gt;

&lt;p&gt;The next time you hit an unexpected latency spike on a read, or notice that a fresh server is slower than a warmed-up one, or wonder why SSDs slow down when they're 90% full — you know where to look.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/read.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 read&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — The read syscall. Start here, then follow the rabbit hole.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html" rel="noopener noreferrer"&gt;Linux Kernel Documentation: The Page Cache&lt;/a&gt;&lt;/strong&gt; — Authoritative, dense, worth it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.anandtech.com/show/9603/inside-the-ssd-revolution-3d-nand-and-the-future-of-flash/2" rel="noopener noreferrer"&gt;Flash Memory Guide, AnandTech&lt;/a&gt;&lt;/strong&gt; — Excellent deep dive into NAND internals, written for engineers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/linux-nvme/nvme-cli" rel="noopener noreferrer"&gt;&lt;code&gt;nvme list&lt;/code&gt; / &lt;code&gt;nvme smart-log /dev/nvme0&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — Poke at your own NVMe controller. The SMART data it returns is illuminating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://nvmexpress.org/specifications/" rel="noopener noreferrer"&gt;NVM Express Base Specification&lt;/a&gt;&lt;/strong&gt; — The actual protocol spec. Freely available. Section 3 covers the submission/completion queue model.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri occasionally stares at perf trace output and wonders how many processors it takes to screw in a lightbulb. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Parallel Lanes Nobody Uses</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:00:29 +0000</pubDate>
      <link>https://forem.com/nazq/the-parallel-lanes-nobody-uses-1n35</link>
      <guid>https://forem.com/nazq/the-parallel-lanes-nobody-uses-1n35</guid>
      <description>&lt;h1&gt;
  
  
  The Parallel Lanes Nobody Uses
&lt;/h1&gt;

&lt;h2&gt;
  
  
  SIMD and the Eight-Lane Highway You've Been Driving Solo
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You ran ripgrep across a 2GB log file and it finished in half a second. grep would have taken ten. You called &lt;code&gt;np.array * 2&lt;/code&gt; and it finished before the function call overhead had time to register.&lt;/p&gt;

&lt;p&gt;Here's what actually happened: your CPU has 256-bit registers that can process 8 floats simultaneously. Those tools used all eight lanes of an eight-lane highway. Your Python for-loop uses one.&lt;/p&gt;

&lt;p&gt;This is what your CPU can actually do.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fundamental Idea
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SIMD&lt;/strong&gt; stands for Single Instruction, Multiple Data. It's not a clever trick. It's a first-class feature of every CPU you've used in the last twenty years.&lt;/p&gt;

&lt;p&gt;The idea is direct. A normal CPU instruction operates on one value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ADD rax, rbx      # add one 64-bit integer to one other 64-bit integer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A SIMD instruction operates on a packed vector of values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VADDPS ymm0, ymm1, ymm2   # add eight 32-bit floats at once
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Eight additions. One instruction. One cycle of throughput — the add itself takes a few cycles of latency, but the core can issue a new one (often two) every cycle.&lt;/p&gt;

&lt;p&gt;The register &lt;code&gt;ymm0&lt;/code&gt; is 256 bits wide. You pack 8 floats (each 32 bits) into it and treat the whole thing as a single operand. The arithmetic unit is physically wider — eight adders in parallel — and the instruction wires them all to fire simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr17g808ymejybtwzzifp.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr17g808ymejybtwzzifp.jpeg" alt="Scalar vs SIMD — 8 instructions vs 1 instruction" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not a metaphor. It's silicon.&lt;/p&gt;




&lt;h2&gt;
  
  
  How We Got Here: The Register Zoo
&lt;/h2&gt;

&lt;p&gt;The story of SIMD is a story of Intel and AMD racing to add bigger and bigger registers while pretending backward compatibility wasn't getting worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MMX (1996)&lt;/strong&gt; — Intel introduced the first SIMD extension in the Pentium MMX. Eight 64-bit registers (&lt;code&gt;mm0&lt;/code&gt;–&lt;code&gt;mm7&lt;/code&gt;) for integer operations. The catch: those registers were aliased to the &lt;em&gt;mantissa fields&lt;/em&gt; of the x87 ST(0)–ST(7) floating-point registers. Switching between MMX and x87 FP required executing &lt;code&gt;EMMS&lt;/code&gt; to reset the x87 tag word first. (I'm simplifying the aliasing here — the full story involves how x87 tracks "empty" register slots.) Programmers used it. Suffered for it. Moved on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSE (1999)&lt;/strong&gt; — Streaming SIMD Extensions. Eight new 128-bit registers (&lt;code&gt;xmm0&lt;/code&gt;–&lt;code&gt;xmm7&lt;/code&gt;), finally independent of the FPU stack. Supported 4 single-precision floats or integer variants. Used heavily for 3D graphics and audio in the early 2000s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSE2 (2001)&lt;/strong&gt; — Added double-precision floats and 128-bit integer operations. x86-64 made SSE2 mandatory, so as of 64-bit mode you can assume it exists. This is the baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSE3, SSSE3, SSE4.1, SSE4.2 (2004–2007)&lt;/strong&gt; — A string of incremental additions. String comparison instructions, dot products, population counts. Useful but baroque. The naming got embarrassing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AVX (2011)&lt;/strong&gt; — Intel widened the registers to 256 bits (&lt;code&gt;ymm0&lt;/code&gt;–&lt;code&gt;ymm15&lt;/code&gt;). Now you could do 8 floats or 4 doubles at once. The &lt;code&gt;ymm&lt;/code&gt; registers are actually the full-width versions of the &lt;code&gt;xmm&lt;/code&gt; registers — &lt;code&gt;xmm0&lt;/code&gt; is the lower 128 bits of &lt;code&gt;ymm0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AVX2 (2013)&lt;/strong&gt; — Extended AVX to integer operations and added gather instructions (load scattered values from memory into a vector register). Available on Intel Haswell and later, AMD Ryzen. This is the register set most production code targets today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AVX-512 (2017)&lt;/strong&gt; — 512-bit registers (&lt;code&gt;zmm0&lt;/code&gt;–&lt;code&gt;zmm31&lt;/code&gt;). 16 floats or 8 doubles at once. Intel pushed this hard in server chips; it's common in the data center. Desktop support is inconsistent — Intel disabled AVX-512 on Alder Lake because its efficiency cores don't implement the instructions, so a thread migrating from a P-core to an E-core mid-computation would fault; early firmware let you re-enable it with the E-cores turned off, and later steppings fused it off entirely. (The instructions are also power-hungry — early server parts downclocked noticeably under sustained 512-bit workloads.) AMD added AVX-512 starting with Zen 4. The instruction set is 300+ pages of documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8pg5z82lmqrqsoe8566.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8pg5z82lmqrqsoe8566.jpeg" alt="SIMD register width evolution — MMX to AVX-512" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The registers kept doubling. The theoretical throughput kept doubling. Most application code never noticed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the Compiler Sometimes Does This For You
&lt;/h2&gt;

&lt;p&gt;Modern compilers — GCC, Clang, MSVC, and &lt;code&gt;rustc&lt;/code&gt; (which uses LLVM) — can &lt;strong&gt;auto-vectorize&lt;/strong&gt; loops. This is when the compiler looks at your scalar loop and emits SIMD instructions for it without you asking.&lt;/p&gt;

&lt;p&gt;This works well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The loop has no data dependencies between iterations (iteration N doesn't use the result of iteration N-1)&lt;/li&gt;
&lt;li&gt;The data is contiguous in memory (array, not linked list)&lt;/li&gt;
&lt;li&gt;The compiler can prove there's no aliasing (the input and output arrays don't overlap)&lt;/li&gt;
&lt;li&gt;The trip count is known or the compiler can generate a scalar fallback for the remainder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple sum-of-squares is a textbook case the compiler handles automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;sum_squares&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compile with &lt;code&gt;--release&lt;/code&gt; targeting AVX2 and... the multiply vectorizes (&lt;code&gt;vmulps&lt;/code&gt;) but the sum stays scalar (&lt;code&gt;vaddss&lt;/code&gt;). Wait, what?&lt;/p&gt;

&lt;p&gt;Floating-point addition isn't associative — &lt;code&gt;(a + b) + c&lt;/code&gt; can give a different result from &lt;code&gt;a + (b + c)&lt;/code&gt; due to rounding. The compiler won't reorder your additions without permission, which means it can't pack 8 sums into a single &lt;code&gt;vaddps&lt;/code&gt;. Switch to integers and the story changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;sum_squares_i32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you get &lt;code&gt;vpmulld&lt;/code&gt; and &lt;code&gt;vpaddd&lt;/code&gt; on &lt;code&gt;ymm&lt;/code&gt; registers — 8 integers at once, fully vectorized. Integer addition is associative, so LLVM can reorder freely. &lt;a href="https://rust.godbolt.org/z/Y694e8jvr" rel="noopener noreferrer"&gt;See both versions side by side on Compiler Explorer →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the kind of thing that makes auto-vectorization both powerful and frustrating. The compiler is doing the right thing — it won't change your program's semantics — but it means the "just write clean code and the compiler will vectorize it" advice has a large asterisk on it.&lt;/p&gt;
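&lt;p&gt;If you do want a vectorized float sum without telling the compiler to relax floating-point semantics globally, the standard workaround is to reassociate by hand: keep several independent partial sums so there is no single dependency chain. A minimal sketch — note this changes the rounding order, so results can differ from a strict left-to-right sum in the last bits:&lt;/p&gt;

```rust
/// Sum of squares with four independent accumulators.
/// Manually reassociating gives LLVM room to keep the partial
/// sums in separate vector lanes: no accumulator depends on
/// another, so the additions can proceed in parallel.
pub fn sum_squares_unrolled(a: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let chunks = a.chunks_exact(4);
    let rem = chunks.remainder(); // tail when len % 4 != 0
    for c in chunks {
        acc[0] += c[0] * c[0];
        acc[1] += c[1] * c[1];
        acc[2] += c[2] * c[2];
        acc[3] += c[3] * c[3];
    }
    let mut total = acc[0] + acc[1] + acc[2] + acc[3];
    for &x in rem {
        total += x * x;
    }
    total
}

fn main() {
    // 1 + 4 + 9 + 16 + 25 = 55
    println!("{}", sum_squares_unrolled(&[1.0, 2.0, 3.0, 4.0, 5.0]));
}
```

&lt;p&gt;Production kernels widen this to 8 or 16 accumulators; the compiler does the same thing internally when you pass it a fast-math flag, which is exactly why that flag changes results.&lt;/p&gt;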

&lt;p&gt;This breaks down further the moment things get complicated. Add a branch inside the loop: the compiler has to use masked operations or give up. Use a data structure it can't prove is contiguous: it has to generate both a vectorized path and a scalar fallback, with a runtime check. Access non-contiguous memory: it has to use gather instructions, which are slower than you'd hope. Add any function call it can't inline: it bails entirely.&lt;/p&gt;

&lt;p&gt;Rust's ownership model actually helps here — slices guarantee contiguous memory and the borrow checker proves non-aliasing at compile time. That's information the auto-vectorizer can use. In C, the compiler has to assume two &lt;code&gt;float*&lt;/code&gt; arguments might alias unless you annotate with &lt;code&gt;restrict&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The compiler's auto-vectorizer is optimistic but conservative. You can inspect the emitted SIMD with &lt;code&gt;cargo rustc --release -- --emit asm&lt;/code&gt;, or use &lt;a href="https://godbolt.org/" rel="noopener noreferrer"&gt;Compiler Explorer&lt;/a&gt; to see exactly what LLVM generated. Read that output. It's educational in a way that is sometimes painful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Intrinsics: Taking the Wheel
&lt;/h2&gt;

&lt;p&gt;When auto-vectorization isn't enough, you can write SIMD code directly using &lt;strong&gt;intrinsics&lt;/strong&gt; — functions in Rust's &lt;code&gt;std::arch&lt;/code&gt; module that map one-to-one to specific CPU instructions.&lt;/p&gt;

&lt;p&gt;This is not assembly. You're still writing Rust. You're just telling the compiler exactly which instruction to emit. The ISA-specific code lives inside &lt;code&gt;unsafe&lt;/code&gt; blocks, making it explicit where you're stepping outside the compiler's guarantees:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[cfg(target_arch&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"x86_64"&lt;/span&gt;&lt;span class="nd"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;arch&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;x86_64&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="cd"&gt;/// Add two float slices element-wise using AVX.&lt;/span&gt;
&lt;span class="cd"&gt;/// Handles lengths that aren't a multiple of 8 with a scalar tail.&lt;/span&gt;
&lt;span class="nd"&gt;#[target_feature(enable&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"avx"&lt;/span&gt;&lt;span class="nd"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;unsafe&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;add_arrays&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="nf"&gt;.min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;va&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_mm256_loadu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="nf"&gt;.as_ptr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;   &lt;span class="c1"&gt;// load 8 floats&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;vb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_mm256_loadu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="nf"&gt;.as_ptr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;   &lt;span class="c1"&gt;// load 8 floats&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;vc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_mm256_add_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;va&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vb&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                &lt;span class="c1"&gt;// add all 8&lt;/span&gt;
        &lt;span class="nf"&gt;_mm256_storeu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="nf"&gt;.as_mut_ptr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;vc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// store 8 floats&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// scalar tail for remainder (if n % 8 != 0)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;__m256&lt;/code&gt; type is a 256-bit vector. &lt;code&gt;_mm256_loadu_ps&lt;/code&gt; loads 8 unaligned single-precision floats. &lt;code&gt;_mm256_add_ps&lt;/code&gt; adds them. One call, one instruction. The &lt;code&gt;#[target_feature(enable = "avx")]&lt;/code&gt; attribute tells the compiler this function requires AVX — calling it on hardware without AVX is undefined behavior, which is why the function is &lt;code&gt;unsafe&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Intrinsics code is not fun to write. The naming convention (&lt;code&gt;_mm256_loadu_ps&lt;/code&gt; vs &lt;code&gt;_mm256_load_ps&lt;/code&gt; vs &lt;code&gt;_mm512_loadu_ps&lt;/code&gt;) requires memorizing a taxonomy. The &lt;a href="https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html" rel="noopener noreferrer"&gt;Intel Intrinsics Guide&lt;/a&gt; is the reference — it lists every intrinsic, the instruction it maps to, the latency, and the throughput. You'll spend time there.&lt;/p&gt;

&lt;p&gt;The upside over C: Rust's type system catches width mismatches at compile time. If you accidentally pass an &lt;code&gt;__m128&lt;/code&gt; where an &lt;code&gt;__m256&lt;/code&gt; is expected, that's a type error, not a silent runtime bug. The &lt;code&gt;unsafe&lt;/code&gt; boundary also makes it easy to audit — every line that touches raw SIMD is visually contained.&lt;/p&gt;

&lt;p&gt;For a higher-level alternative, Rust's portable SIMD API (&lt;code&gt;std::simd&lt;/code&gt;) provides type-safe, architecture-independent vector types like &lt;code&gt;f32x8&lt;/code&gt;. It's available on nightly and progressing toward stable. When it lands, it will be the preferred way to write explicit SIMD without &lt;code&gt;unsafe&lt;/code&gt; or platform-specific intrinsics.&lt;/p&gt;
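&lt;p&gt;Until &lt;code&gt;std::simd&lt;/code&gt; stabilizes, a rough stand-in on stable Rust is a fixed-size array processed elementwise — compilers commonly lower loops like this to single vector instructions. A sketch; the vectorization claim depends on &lt;code&gt;--release&lt;/code&gt; and the target features, so check the assembly rather than taking my word for it:&lt;/p&gt;

```rust
/// A stable-Rust stand-in for nightly `std::simd::f32x8`: a fixed-size
/// array updated elementwise. The trip count (8) is a compile-time
/// constant with no branches or aliasing, so with --release and AVX2
/// enabled LLVM typically lowers this whole loop to one vaddps.
fn add_lanes(a: [f32; 8], b: [f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

fn main() {
    let sum = add_lanes([1.0; 8], [2.0; 8]);
    println!("{:?}", sum); // every lane is 3.0
}
```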

&lt;p&gt;Most application programmers don't write intrinsics. But the programmers who write the libraries you depend on — numpy, simdjson, ripgrep — absolutely do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where SIMD Actually Lives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  String Search
&lt;/h3&gt;

&lt;p&gt;Finding a byte in a buffer. You do it constantly, you never think about it, and it's the single operation where SIMD makes the most visceral difference. A naive loop checks one byte at a time. SIMD checks 32 at a time: &lt;code&gt;_mm256_cmpeq_epi8&lt;/code&gt; compares 32 bytes simultaneously, and &lt;code&gt;_mm256_movemask_epi8&lt;/code&gt; condenses the result into a 32-bit mask of which positions matched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;memchr&lt;/code&gt;&lt;/strong&gt; — the fundamental byte-search operation — is implemented with SIMD at every level: glibc's C implementation, and Rust's &lt;code&gt;memchr&lt;/code&gt; crate (which we'll get to in a moment). The function you call every day is already vectorized.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ripgrep&lt;/strong&gt; is fast partly because of SIMD-accelerated &lt;code&gt;memchr&lt;/code&gt;. The &lt;a href="https://github.com/BurntSushi/memchr" rel="noopener noreferrer"&gt;memchr crate&lt;/a&gt; by Andrew Gallant implements &lt;code&gt;memchr&lt;/code&gt; and &lt;code&gt;memmem&lt;/code&gt; using SSE2 and AVX2. The core idea for substring search is a SIMD &lt;strong&gt;prefilter&lt;/strong&gt;: use vector compares to find candidate positions in bulk, then verify each candidate. (The related Teddy algorithm, in the companion aho-corasick crate, applies the same idea to multiple patterns at once.) When ripgrep is blazing through a 2GB log file, it's pushing 32 bytes at a time through vectorized comparisons. This is why it outperforms grep by 5–10x on many workloads. It's not magic. It's lanes.&lt;/p&gt;

&lt;p&gt;That's also why string search benchmarks look bizarre to anyone who hasn't seen SIMD before. A loop that calls &lt;code&gt;find&lt;/code&gt; in a hot path and a SIMD-accelerated version can differ by 8x with identical O() complexity. The algorithm doesn't tell you the constant factor.&lt;/p&gt;
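&lt;p&gt;The compare-many-bytes-and-extract-a-mask idea doesn't even require vector registers. Here's a sketch of the classic SWAR ("SIMD within a register") zero-byte trick on a plain &lt;code&gt;u64&lt;/code&gt; — the same shape that fallback paths in &lt;code&gt;memchr&lt;/code&gt;-style implementations use, shown here as an illustration rather than anyone's actual production code:&lt;/p&gt;

```rust
use std::convert::TryInto;

/// SWAR byte search: test 8 bytes per iteration. XOR with the spread
/// needle turns matching bytes into 0x00, then the classic has-zero
/// trick `(x - 0x01..) & !x & 0x80..` sets the high bit of every zero
/// lane. Borrow propagation can also flag lanes *above* the first
/// match, but the lowest set bit is always a true match, and that's
/// the one trailing_zeros finds.
fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    let spread = u64::from_le_bytes([needle; 8]); // needle in all 8 lanes
    let mut i = 0;
    while i + 8 <= haystack.len() {
        let w = u64::from_le_bytes(haystack[i..i + 8].try_into().unwrap());
        let x = w ^ spread; // matching lanes become zero
        let found = x.wrapping_sub(LO) & !x & HI;
        if found != 0 {
            return Some(i + (found.trailing_zeros() / 8) as usize);
        }
        i += 8;
    }
    // scalar tail for the last len % 8 bytes
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}

fn main() {
    println!("{:?}", find_byte(b"the quick brown fox", b'q')); // Some(4)
}
```

&lt;p&gt;Same constant-factor story in miniature: identical O(n), eight lanes instead of one.&lt;/p&gt;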

&lt;h3&gt;
  
  
  JSON Parsing
&lt;/h3&gt;

&lt;p&gt;In 2019 Geoff Langdale and Daniel Lemire published a &lt;a href="https://arxiv.org/abs/1902.08318" rel="noopener noreferrer"&gt;paper&lt;/a&gt; showing that JSON parsing is fundamentally a SIMD problem, giving birth to &lt;strong&gt;simdjson&lt;/strong&gt;. The bottleneck in parsing isn't the logic — it's scanning through bytes looking for structural characters (&lt;code&gt;{&lt;/code&gt;, &lt;code&gt;}&lt;/code&gt;, &lt;code&gt;[&lt;/code&gt;, &lt;code&gt;]&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;,&lt;/code&gt;, &lt;code&gt;"&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;simdjson processes 64 bytes at a time using AVX-512 (or 32 with AVX2). It classifies every byte simultaneously — is this a structural character? Whitespace? A quote? — using bitwise SIMD operations to produce bitmasks. Then it uses those bitmasks to drive parsing without a byte-at-a-time loop.&lt;/p&gt;
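&lt;p&gt;The bitmask-driven idea is easier to see in scalar code. This sketch is a deliberately simplified classifier — it ignores strings and escaping, which real simdjson handles with additional masks — but the shape is the same: one bit per byte, then walk the set bits instead of re-scanning:&lt;/p&gt;

```rust
/// Stage-one sketch: build a bitmask over a block of up to 64 input
/// bytes, bit i set iff byte i is a JSON structural character.
/// (simdjson builds these masks with vector compares, 32-64 bytes
/// per instruction; here a plain loop, for clarity.)
fn structural_mask(block: &[u8]) -> u64 {
    assert!(block.len() <= 64);
    let mut mask = 0u64;
    for (i, &b) in block.iter().enumerate() {
        if matches!(b, b'{' | b'}' | b'[' | b']' | b':' | b',' | b'"') {
            mask |= 1 << i;
        }
    }
    mask
}

/// The parser then iterates set bits — trailing_zeros plus
/// clear-lowest-bit — visiting only structural positions.
fn structural_positions(block: &[u8]) -> Vec<usize> {
    let mut mask = structural_mask(block);
    let mut out = Vec::new();
    while mask != 0 {
        out.push(mask.trailing_zeros() as usize);
        mask &= mask - 1; // clear lowest set bit
    }
    out
}

fn main() {
    println!("{:?}", structural_positions(br#"{"a":1}"#)); // [0, 1, 3, 4, 6]
}
```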

&lt;p&gt;The result: simdjson parses JSON at 2–3 GB/s on a modern CPU. The fastest pure-scalar parser does maybe 300–500 MB/s. The 6x difference is entirely SIMD.&lt;/p&gt;

&lt;p&gt;That's why simdjson exists. That's why it's in MongoDB, ClickHouse, and dozens of other systems that care about throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  Image Processing
&lt;/h3&gt;

&lt;p&gt;Every pixel is independent. Every channel is independent. This is SIMD's dream workload — no data dependencies, no branches, just arithmetic on contiguous arrays of bytes. SSE2 processes 16 8-bit channel values at once with saturating addition (&lt;code&gt;u8x16::saturating_add&lt;/code&gt; in portable SIMD). OpenCV, libjpeg-turbo, libpng — they all have SIMD paths for their hot loops. When Photoshop applies a filter to a 24-megapixel image in under a second, this is why.&lt;/p&gt;

&lt;h3&gt;
  
  
  ML Inference
&lt;/h3&gt;

&lt;p&gt;This is the one that matters most right now.&lt;/p&gt;

&lt;p&gt;Neural network inference is fundamentally matrix multiplication: take a weight matrix, multiply by an input vector, pass through an activation function. Repeat. The core operation — multiply-accumulate on large matrices — is exactly what SIMD was built for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AVX2's fused multiply-add&lt;/strong&gt; (&lt;code&gt;_mm256_fmadd_ps&lt;/code&gt; via &lt;code&gt;std::arch&lt;/code&gt;, or &lt;code&gt;f32x8::mul_add&lt;/code&gt; in portable SIMD) does a*b + c on 8 floats in one instruction. For a naive matrix multiply loop, this is an 8x multiplier before you've thought about anything else. Add tiling for cache efficiency and you're in the range of what high-performance BLAS libraries actually do.&lt;/p&gt;
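&lt;p&gt;In Rust you can reach FMA without intrinsics: &lt;code&gt;f32::mul_add&lt;/code&gt; computes &lt;code&gt;a * b + c&lt;/code&gt; with a single rounding and compiles to an FMA instruction when the target supports one. A sketch — note the sequential fold is a dependency chain, so this demonstrates the instruction, not vectorization; real kernels combine FMA with multiple independent accumulators, as in the sum example earlier:&lt;/p&gt;

```rust
/// Dot product via fused multiply-add. f32::mul_add(x, y, acc)
/// computes x * y + acc with one rounding step; on an FMA-capable
/// target it lowers to a single instruction (without one it falls
/// back to a libm call, which is slower than separate mul + add).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .fold(0.0f32, |acc, (&x, &y)| x.mul_add(y, acc))
}

fn main() {
    // 1*4 + 2*5 + 3*6 = 32
    println!("{}", dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]));
}
```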

&lt;p&gt;&lt;strong&gt;AVX-512 with VNNI&lt;/strong&gt; (Vector Neural Network Instructions, 2019) goes further — it adds instructions specifically for quantized integer dot products used in 8-bit inference. A single &lt;code&gt;vpdpbusd&lt;/code&gt; instruction (exposed as &lt;code&gt;_mm512_dpbusd_epi32&lt;/code&gt; in intrinsics) performs 64 8-bit multiply-accumulates — four per 32-bit lane, across 16 lanes — in one instruction. llama.cpp, the library that lets you run large language models on consumer hardware, has hand-written AVX2 and AVX-512 kernels for its matrix multiplication. When you run a local model on your laptop, those kernels are running in tight loops for every token you generate.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mindset Shift
&lt;/h2&gt;

&lt;p&gt;Here's the insight that changes how you write code even if you never touch an intrinsic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SIMD forces you to think in batches, not items.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scalar code says: "for each element, do this." SIMD code says: "take 8 elements, do this to all of them at once, advance 8." The data structure implications are real.&lt;/p&gt;

&lt;h3&gt;
  
  
  Arrays of Structures vs Structures of Arrays
&lt;/h3&gt;

&lt;p&gt;Consider a particle system. You might model it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Particle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// position&lt;/span&gt;
    &lt;span class="n"&gt;vx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vz&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// velocity&lt;/span&gt;
    &lt;span class="n"&gt;mass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;particles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Particle&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_capacity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;strong&gt;AoS&lt;/strong&gt; — Array of Structures. Each particle's data is packed together. Intuitive. Natural.&lt;/p&gt;

&lt;p&gt;The goal: update all x positions — &lt;code&gt;x += vx * dt&lt;/code&gt; — for every particle.&lt;/p&gt;

&lt;p&gt;The problem: consecutive &lt;code&gt;x&lt;/code&gt; values are 28 bytes apart — one full struct stride. When you load a SIMD vector of 8 &lt;code&gt;x&lt;/code&gt; values, you also pull in &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;z&lt;/code&gt;, &lt;code&gt;vx&lt;/code&gt;, &lt;code&gt;vy&lt;/code&gt;, &lt;code&gt;vz&lt;/code&gt;, &lt;code&gt;mass&lt;/code&gt; — data you don't need. Your &lt;a href="https://nazquadri.dev/blog/the-layer-below/15-ram/" rel="noopener noreferrer"&gt;cache lines&lt;/a&gt; are full of noise. Your SIMD registers require a scatter-gather to populate.&lt;/p&gt;

&lt;p&gt;The SIMD-friendly layout is &lt;strong&gt;SoA&lt;/strong&gt; — Structure of Arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Particles&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With SoA, all &lt;code&gt;x&lt;/code&gt; values are contiguous. Loading &lt;code&gt;&amp;amp;particles.x[i..i+8]&lt;/code&gt; gives 8 consecutive &lt;code&gt;x&lt;/code&gt; values, ready to go. Loading &lt;code&gt;&amp;amp;particles.vx[i..i+8]&lt;/code&gt; gives the matching 8 &lt;code&gt;vx&lt;/code&gt; values. One fused multiply-add updates 8 particles. No scatter-gather. No cache waste.&lt;/p&gt;

&lt;p&gt;This is not a micro-optimization. The difference in a physics simulation inner loop can be 4–8x. The code is otherwise identical.&lt;/p&gt;

&lt;p&gt;That's why SoA and AoS matter — two data structures with identical asymptotic complexity and identical logical content. One is auto-vectorizable. One isn't. The difference can be 8x. Nobody mentioned this in algorithms class.&lt;/p&gt;

&lt;p&gt;This also explains why entity-component systems (ECS) — used in game engines like Unity DOTS and Bevy — look structurally odd until you see SIMD. ECS stores component data in contiguous arrays per component type, not per entity. That's SoA. The performance difference for physics and animation simulations is why the pattern exists.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1q3j6oukwox4dfajwt4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1q3j6oukwox4dfajwt4.jpeg" alt="AoS vs SoA — scattered access vs contiguous SIMD loads" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Alignment
&lt;/h3&gt;

&lt;p&gt;SIMD instructions have opinions about memory alignment. &lt;strong&gt;Aligned&lt;/strong&gt; loads — &lt;code&gt;_mm256_load_ps&lt;/code&gt; — require the address to be 32-byte aligned (the address mod 32 == 0). &lt;strong&gt;Unaligned&lt;/strong&gt; loads — &lt;code&gt;_mm256_loadu_ps&lt;/code&gt; — work on any address, but may be slower on older hardware.&lt;/p&gt;

&lt;p&gt;On modern CPUs (Intel Skylake and later, AMD Zen 2 and later), unaligned loads are as fast as aligned loads — as long as they don't cross a 64-byte cache line boundary. So the practical recipe is: align your arrays (so loads don't straddle cache lines), but use &lt;code&gt;_mm256_loadu_ps&lt;/code&gt; in the code — it costs nothing on aligned data and never faults on unaligned data.&lt;/p&gt;
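&lt;p&gt;You can see why this matters with plain Rust, no intrinsics required (the 32 here is the AVX2 register width in bytes; the printed addresses will vary run to run):&lt;/p&gt;

```rust
// Sketch: inspect the alignment of a heap-allocated f32 buffer.
// Vec's allocator only guarantees align_of::<f32>() = 4 bytes, so
// 32-byte alignment is a matter of luck unless you enforce it.
fn main() {
    let data = vec![0.0f32; 1024];
    let addr = data.as_ptr() as usize;

    println!("base address mod 32 = {}", addr % 32);
    println!("base address mod 64 = {} (cache line)", addr % 64);

    // A 32-byte (8-float) load starting at offset o within a 64-byte
    // cache line crosses into the next line only if o + 32 > 64.
    let crosses = (addr % 64) > 32;
    println!("first 8-wide load splits a cache line: {}", crosses);
}
```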

&lt;p&gt;In Rust, you control alignment with &lt;code&gt;#[repr(align(32))]&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[repr(C,&lt;/span&gt; &lt;span class="nd"&gt;align(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="nd"&gt;))]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AlignedBlock&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the equivalent of C's &lt;code&gt;__attribute__((aligned(32)))&lt;/code&gt; or &lt;code&gt;alignas(32)&lt;/code&gt;. It means: "I plan to load this with SIMD and I want the first element to be register-friendly."&lt;/p&gt;
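&lt;p&gt;You can verify the attribute with &lt;code&gt;std::mem::align_of&lt;/code&gt; and a live address (the &lt;code&gt;AlignedBlock&lt;/code&gt; type is repeated from the snippet above so this sketch is self-contained):&lt;/p&gt;

```rust
// Sketch: confirm that #[repr(C, align(32))] actually produces
// 32-byte-aligned instances.
#[repr(C, align(32))]
struct AlignedBlock {
    data: [f32; 8],
}

fn main() {
    assert_eq!(std::mem::align_of::<AlignedBlock>(), 32);
    assert_eq!(std::mem::size_of::<AlignedBlock>(), 32); // 8 * 4 bytes, no padding needed

    let block = AlignedBlock { data: [0.0; 8] };
    let addr = &block as *const AlignedBlock as usize;
    assert_eq!(addr % 32, 0); // the compiler honors the requested alignment
    println!("AlignedBlock at {:#x}, aligned: {}", addr, addr % 32 == 0);
}
```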




&lt;h2&gt;
  
  
  You Don't Need to Write Intrinsics
&lt;/h2&gt;

&lt;p&gt;The practical message is not "go rewrite your code in intrinsics." It's shorter:&lt;/p&gt;

&lt;p&gt;Write in a way the compiler can vectorize. Keep your hot loops simple and branch-free. Lay your data out contiguously in the access order you need it. Prefer SoA over AoS in performance-critical code. Reach for libraries (numpy, simdjson, BLAS, any vectorized BLAS-backed ML framework) before reaching for intrinsics.&lt;/p&gt;
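&lt;p&gt;As a sketch of "keep your hot loops simple and branch-free": replacing a data-dependent &lt;code&gt;if&lt;/code&gt; with a branchless select keeps the loop in a form auto-vectorizers handle well (function names here are illustrative):&lt;/p&gt;

```rust
// Branchy version: the `if` introduces control flow the vectorizer
// must convert to a mask -- some compilers give up here.
fn clamp_branchy(v: &mut [f32], limit: f32) {
    for x in v.iter_mut() {
        if *x > limit {
            *x = limit;
        }
    }
}

// Branch-free version: f32::min maps to a single instruction (vminps
// on x86) and vectorizes trivially.
fn clamp_branchless(v: &mut [f32], limit: f32) {
    for x in v.iter_mut() {
        *x = x.min(limit);
    }
}

fn main() {
    let mut a = vec![0.5, 2.0, -1.0, 3.5];
    let mut b = a.clone();
    clamp_branchy(&mut a, 1.0);
    clamp_branchless(&mut b, 1.0);
    assert_eq!(a, b); // same result; only the second reliably vectorizes
    println!("{:?}", b); // [0.5, 1.0, -1.0, 1.0]
}
```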

&lt;p&gt;That's why numpy is fast and a Python for-loop isn't. numpy's inner loops are SIMD-vectorized C. When you call &lt;code&gt;arr * 2&lt;/code&gt;, numpy dispatches to a vectorized multiply kernel operating on the entire array in chunks of 8 or 16 elements. Your Python for-loop multiplies one element per bytecode interpretation cycle.&lt;/p&gt;

&lt;p&gt;When two seemingly equivalent implementations show an 8x performance difference, this is frequently why. Not cache (though that's related). Not branch prediction (though that matters too). The data layout didn't let the CPU use seven of its eight lanes.&lt;/p&gt;

&lt;p&gt;If you do need explicit SIMD, Rust gives you options before you reach for raw intrinsics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;std::simd&lt;/code&gt;&lt;/strong&gt; — Rust's portable SIMD API (nightly, progressing toward stable). Type-safe vector types like &lt;code&gt;f32x8&lt;/code&gt; that compile to the best available instructions on any architecture. This is the future.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://crates.io/crates/wide" rel="noopener noreferrer"&gt;&lt;code&gt;wide&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — a stable crate providing portable SIMD types today. Good for production code that can't wait for &lt;code&gt;std::simd&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://crates.io/crates/pulp" rel="noopener noreferrer"&gt;&lt;code&gt;pulp&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — runtime CPU feature detection with safe SIMD dispatch.&lt;/li&gt;
&lt;/ul&gt;
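&lt;p&gt;All three expose the same mental model: fixed-width lanes processed in lockstep, plus a scalar tail. A dependency-free sketch of that shape in plain stable Rust, using &lt;code&gt;chunks_exact_mut(8)&lt;/code&gt; as a stand-in for an &lt;code&gt;f32x8&lt;/code&gt; lane type (the real &lt;code&gt;std::simd&lt;/code&gt;/&lt;code&gt;wide&lt;/code&gt; types compile to actual vector registers; this version merely mirrors the structure):&lt;/p&gt;

```rust
// Sketch of the lane model: process 8 floats at a time, then handle
// the remainder as scalars -- the same shape portable-SIMD code takes.
fn scale(v: &mut [f32], factor: f32) {
    let mut chunks = v.chunks_exact_mut(8);
    for lane in &mut chunks {
        // A real f32x8 would do these 8 multiplies in one vmulps.
        for x in lane.iter_mut() {
            *x *= factor;
        }
    }
    // Scalar tail for lengths not divisible by 8.
    for x in chunks.into_remainder() {
        *x *= factor;
    }
}

fn main() {
    let mut v: Vec<f32> = (0..11).map(|i| i as f32).collect();
    scale(&mut v, 2.0);
    println!("{:?}", v); // 8 elements via the lane loop, 3 via the tail
}
```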

&lt;p&gt;For C++ codebases, &lt;a href="https://github.com/google/highway" rel="noopener noreferrer"&gt;highway&lt;/a&gt; (Google's portable SIMD abstraction) serves a similar role. Don't write raw &lt;code&gt;_mm256_*&lt;/code&gt; calls unless you've exhausted the higher-level options — though in Rust, at least the type system will catch width mismatches at compile time instead of letting you discover them at midnight.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the CPU Looks Like Now
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;One instruction:
  ADD rax, rbx
  → adds two 64-bit integers
  → uses 64 bits of register space

One SIMD instruction:
  VADDPS ymm0, ymm1, ymm2
  → adds eight 32-bit floats
  → uses 256 bits of register space
  → eight physical adders firing simultaneously

Your loop over 8 million floats:
  Scalar:  8,000,000 add instructions
  AVX2:    1,000,000 add instructions (8x fewer)
  AVX-512: 500,000 add instructions (16x fewer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lanes are there. They've been there since 1999, getting wider every few years. Every calculation you've ever run in a Python loop touched one lane of a machine that had eight available.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html" rel="noopener noreferrer"&gt;Intel Intrinsics Guide&lt;/a&gt;&lt;/strong&gt; — The reference. Every intrinsic, its instruction, latency, and throughput. Searchable by operation type. Directly maps to Rust's &lt;code&gt;std::arch&lt;/code&gt; function names.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/rust-lang/rust/issues/86656" rel="noopener noreferrer"&gt;Rust &lt;code&gt;std::simd&lt;/code&gt; tracking issue&lt;/a&gt;&lt;/strong&gt; — The portable SIMD API's path to stabilization. Good overview of the design and current status.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://doc.rust-lang.org/std/arch/index.html" rel="noopener noreferrer"&gt;&lt;code&gt;std::arch&lt;/code&gt; module docs&lt;/a&gt;&lt;/strong&gt; — Rust's platform intrinsics. Every &lt;code&gt;_mm256_*&lt;/code&gt; function from the Intel guide has a corresponding Rust binding here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/BurntSushi/memchr" rel="noopener noreferrer"&gt;memchr crate (Rust)&lt;/a&gt;&lt;/strong&gt; — Andrew Gallant's SIMD-accelerated byte/substring search. Read the source and the README for a clear explanation of the Teddy algorithm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://crates.io/crates/wide" rel="noopener noreferrer"&gt;&lt;code&gt;wide&lt;/code&gt; crate&lt;/a&gt;&lt;/strong&gt; — Portable SIMD types on stable Rust. A practical alternative while &lt;code&gt;std::simd&lt;/code&gt; stabilizes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/abs/1902.08318" rel="noopener noreferrer"&gt;simdjson paper&lt;/a&gt;&lt;/strong&gt; — Lemire et al., 2019. "Parsing Gigabytes of JSON per Second." The original paper. Section 3 explains the SIMD classification step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.akkadia.org/drepper/cpumemory.pdf" rel="noopener noreferrer"&gt;"What Every Programmer Should Know About Memory" — Ulrich Drepper&lt;/a&gt;&lt;/strong&gt; — Section 6 covers SIMD and its interaction with the cache hierarchy. This was the reference when AVX didn't exist yet; the principles are unchanged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.agner.org/optimize/" rel="noopener noreferrer"&gt;Agner Fog's optimization manuals&lt;/a&gt;&lt;/strong&gt; — Table of instruction latencies and throughputs for every SIMD instruction on every microarchitecture. Dense. Invaluable if you're actually tuning.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri once hand-wrote AVX2 intrinsics for a function the Rust compiler had already vectorised better. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
