<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Naz Quadri</title>
    <description>The latest articles on Forem by Naz Quadri (@nazq).</description>
    <link>https://forem.com/nazq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3827975%2F4335d9d1-47be-432a-bbc3-f08557839c40.jpeg</url>
      <title>Forem: Naz Quadri</title>
      <link>https://forem.com/nazq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nazq"/>
    <language>en</language>
    <item>
      <title>The Invisible Negotiation Between Your Laptop and the Air</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:05:01 +0000</pubDate>
      <link>https://forem.com/nazq/the-invisible-negotiation-between-your-laptop-and-the-air-5b63</link>
      <guid>https://forem.com/nazq/the-invisible-negotiation-between-your-laptop-and-the-air-5b63</guid>
      <description>&lt;h1&gt;
  
  
  The Invisible Negotiation Between Your Laptop and the Air
&lt;/h1&gt;

&lt;h2&gt;
  
  
  WiFi: Radio Physics, Collision Avoidance, and the Name That Means Nothing
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~15 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You typed a URL into your browser and hit Enter.&lt;/p&gt;

&lt;p&gt;Before a single byte of your request left your laptop, your WiFi card performed a dance of radio physics, collision avoidance, and cryptographic negotiation that would make a diplomat proud. It listened to the air to make sure nobody else was talking. It picked a random backoff timer in case someone else had the same idea. It encrypted your data with a key derived from a four-way handshake that happened when you first connected. Then it modulated your bits across 52 subcarrier frequencies simultaneously, transmitted them as radio waves, and waited for an acknowledgement -- all in under a millisecond.&lt;/p&gt;

&lt;p&gt;You saw a webpage load. Here's what actually happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Name Means Nothing
&lt;/h2&gt;

&lt;p&gt;Let's get this out of the way: WiFi doesn't stand for anything. If I'm found to be wrong, the internet is welcome to string me up.&lt;/p&gt;

&lt;p&gt;It's not "Wireless Fidelity." That's a backronym -- a meaning retrofitted onto a word that was chosen for entirely different reasons. In 1999, the &lt;strong&gt;WiFi Alliance&lt;/strong&gt; (then called WECA -- the Wireless Ethernet Compatibility Alliance) hired Interbrand, the same branding firm that named Prozac, to come up with a consumer-friendly name for the IEEE 802.11 standard. The engineers had been calling it "IEEE 802.11b Direct Sequence" in marketing materials 🤦🤦‍♂️🤦‍♀️. Nobody was buying it. Literally -- consumers didn't understand what it was.&lt;/p&gt;

&lt;p&gt;Interbrand proposed ten names. The consortium picked "WiFi" because it was short, memorable, and rhymed with "hi-fi" -- a term people already associated with quality audio equipment. The parallel was intentional: hi-fi meant high-quality sound; WiFi would mean high-quality wireless.&lt;/p&gt;

&lt;p&gt;Then someone on the Alliance's board insisted on adding a tagline: "The Standard for Wireless Fidelity." Phil Belanger, a founding member of the WiFi Alliance, has publicly called this a mistake. The tagline made people assume WiFi &lt;em&gt;stood for&lt;/em&gt; Wireless Fidelity, the way hi-fi stands for high fidelity. It doesn't. The WiFi Alliance eventually dropped the tagline, but the damage was done. Twenty-five years later, people still think it's an acronym.&lt;/p&gt;

&lt;p&gt;The 802.11 committee itself started in 1990. The first standard was ratified in 1997, offering a blistering 2 Mbit/s. Your morning coffee took longer to brew than a file transfer took to time out.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Standards Naming Disaster
&lt;/h2&gt;

&lt;p&gt;If you thought USB naming was bad -- and it is -- WiFi standards naming is worse. For two decades, the WiFi Alliance used the IEEE committee letter suffixes as consumer-facing product names. 802.11a. 802.11b. 802.11g. 802.11n. 802.11ac. 802.11ax. These aren't even alphabetical by release date. 802.11a and 802.11b were ratified the same year, but 802.11b shipped to consumers first. 802.11n came after 802.11g. The letters tell you nothing about which is newer, faster, or better.&lt;/p&gt;

&lt;p&gt;In 2018, the WiFi Alliance finally admitted nobody could remember the letter soup and introduced generational numbering. Retroactively -- though only WiFi 4 and later are official names; "WiFi 1" through "WiFi 3" are informal retronyms the industry back-filled.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;IEEE Standard&lt;/th&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;th&gt;Max Speed&lt;/th&gt;
&lt;th&gt;Channel Width&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 1&lt;/td&gt;
&lt;td&gt;802.11b&lt;/td&gt;
&lt;td&gt;1999&lt;/td&gt;
&lt;td&gt;2.4 GHz&lt;/td&gt;
&lt;td&gt;11 Mbit/s&lt;/td&gt;
&lt;td&gt;20 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 2&lt;/td&gt;
&lt;td&gt;802.11a&lt;/td&gt;
&lt;td&gt;1999&lt;/td&gt;
&lt;td&gt;5 GHz&lt;/td&gt;
&lt;td&gt;54 Mbit/s&lt;/td&gt;
&lt;td&gt;20 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 3&lt;/td&gt;
&lt;td&gt;802.11g&lt;/td&gt;
&lt;td&gt;2003&lt;/td&gt;
&lt;td&gt;2.4 GHz&lt;/td&gt;
&lt;td&gt;54 Mbit/s&lt;/td&gt;
&lt;td&gt;20 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 4&lt;/td&gt;
&lt;td&gt;802.11n&lt;/td&gt;
&lt;td&gt;2009&lt;/td&gt;
&lt;td&gt;2.4 / 5 GHz&lt;/td&gt;
&lt;td&gt;600 Mbit/s&lt;/td&gt;
&lt;td&gt;20/40 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 5&lt;/td&gt;
&lt;td&gt;802.11ac&lt;/td&gt;
&lt;td&gt;2014&lt;/td&gt;
&lt;td&gt;5 GHz&lt;/td&gt;
&lt;td&gt;6.9 Gbit/s&lt;/td&gt;
&lt;td&gt;20/40/80/160 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 6&lt;/td&gt;
&lt;td&gt;802.11ax&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;2.4 / 5 GHz&lt;/td&gt;
&lt;td&gt;9.6 Gbit/s&lt;/td&gt;
&lt;td&gt;20/40/80/160 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 6E&lt;/td&gt;
&lt;td&gt;802.11ax&lt;/td&gt;
&lt;td&gt;2021&lt;/td&gt;
&lt;td&gt;6 GHz&lt;/td&gt;
&lt;td&gt;9.6 Gbit/s&lt;/td&gt;
&lt;td&gt;20/40/80/160 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WiFi 7&lt;/td&gt;
&lt;td&gt;802.11be&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;2.4 / 5 / 6 GHz&lt;/td&gt;
&lt;td&gt;46 Gbit/s&lt;/td&gt;
&lt;td&gt;20/40/80/160/320 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The retroactive rename was a good idea executed a decade too late. Your router's box now says "WiFi 6" instead of "802.11ax," which is progress. But the industry spent 20 years training people on the letter system, so every spec sheet still lists both. We're stuck in a bilingual world.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymuf45kf62aywitvy6a7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymuf45kf62aywitvy6a7.jpeg" alt="WiFi standards timeline from 802.11a to 802.11be, with retroactive generational naming applied in 2018" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How WiFi Actually Works at the Radio Level
&lt;/h2&gt;

&lt;p&gt;Bluetooth, as I covered in the last post, hops across 79 narrow channels 1,600 times per second. WiFi does the opposite. It parks on a wide channel and splits its data across dozens of subcarrier frequencies simultaneously.&lt;/p&gt;

&lt;p&gt;The technique is called &lt;strong&gt;OFDM&lt;/strong&gt; -- Orthogonal Frequency Division Multiplexing. The "orthogonal" part is the key: the subcarriers are spaced so that the peak of each one lines up with the nulls of its neighbors. They overlap in frequency but don't interfere with each other. It's like a choir where everyone sings at a slightly different pitch -- the voices overlap, but you can still pick out each one.&lt;/p&gt;

&lt;p&gt;A standard 20 MHz WiFi channel is divided into 64 subcarrier slots. 52 of them are used -- 48 carry data and 4 are pilots for synchronization -- while the remaining 12 are left empty as guard bands. Each data subcarrier carries a portion of your data independently. If one subcarrier hits interference, the others keep going. The receiver reassembles the pieces.&lt;/p&gt;
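&lt;p&gt;The arithmetic is worth doing once. A quick sketch using the 802.11a/g numbers (64 subcarrier slots in 20 MHz, 48 of them carrying data, 4-microsecond symbols):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# 802.11a/g OFDM arithmetic: 20 MHz split into 64 subcarrier slots,
# with 48 data + 4 pilot subcarriers used, 4-microsecond symbol time.
channel_hz = 20_000_000
slots = 64
spacing_hz = channel_hz / slots      # 312,500 Hz between subcarriers
data_subcarriers = 48
bits_per_subcarrier = 6              # 64-QAM
coding_rate = 3 / 4                  # forward-error-correction overhead
symbol_s = 4e-6                      # 3.2 us of data + 0.8 us guard interval
rate_bps = data_subcarriers * bits_per_subcarrier * coding_rate / symbol_s
print(spacing_hz)        # 312500.0
print(rate_bps / 1e6)    # 54.0 -- the familiar 802.11a/g top rate
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Every later generation is a variation on these four knobs: more subcarriers, denser QAM, better coding rates, shorter guard intervals.&lt;/p&gt;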

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryhvco82wleroa9ygwye.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryhvco82wleroa9ygwye.jpeg" alt="OFDM orthogonal subcarriers: overlapping waveforms with peaks at neighbors' zero-crossings, 52 subcarriers in a 20 MHz channel" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Channel Width: The Speed-vs-Coverage Tradeoff
&lt;/h3&gt;

&lt;p&gt;Wider channels mean more subcarriers mean more data. A 40 MHz channel has roughly twice the capacity of a 20 MHz channel. 80 MHz roughly doubles again. WiFi 7 introduces 320 MHz channels -- 16 times the width of the original 20 MHz channels.&lt;/p&gt;

&lt;p&gt;The tradeoff is brutal. In the 2.4 GHz band, there's only 70 MHz of usable spectrum (channels 1 through 13, though in the US only 1 through 11). A 20 MHz channel fits, but a 40 MHz channel consumes more than half the available space. An 80 MHz channel doesn't fit at all. Wider channels only work in 5 GHz and 6 GHz, where there's more room.&lt;/p&gt;
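&lt;p&gt;You can see the squeeze with integer division alone. The usable-spectrum figures below are the rough ones used in this article (the 5 GHz number assumes DFS channels are available):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Roughly how many non-overlapping channels of each width fit per band?
usable_mhz = {"2.4 GHz": 70, "5 GHz": 500, "6 GHz": 1200}
for band, mhz in usable_mhz.items():
    fits = {width: mhz // width for width in (20, 40, 80, 160, 320)}
    print(band, fits)
# 2.4 GHz fits three 20 MHz channels, one 40 MHz channel, and nothing wider.
&lt;/code&gt;&lt;/pre&gt;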

&lt;h3&gt;
  
  
  2.4 GHz vs 5 GHz vs 6 GHz
&lt;/h3&gt;

&lt;p&gt;Three frequency bands, three different physics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.4 GHz&lt;/strong&gt; reaches everywhere and is useless in an apartment building. The physics is on your side — 12.5 cm wavelength diffracts around obstacles and punches through walls — but you share 70 MHz of spectrum with every WiFi router, Bluetooth device, baby monitor, and microwave in the building. Three non-overlapping channels. Thirty competing networks. Good luck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5 GHz&lt;/strong&gt; is where the actual work happens. Twenty-five non-overlapping channels, far less congestion, but the shorter wavelength (6 cm) means each wall costs you 3-6 dB — roughly halving usable range per wall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6 GHz&lt;/strong&gt; is the land grab. WiFi 6E and 7 opened 1,200 MHz of pristine spectrum — more than 2.4 and 5 combined. Seven non-overlapping 160 MHz channels. Three 320 MHz channels. No legacy devices, no microwaves, no Bluetooth. The range is even shorter, but if your router is in the same room, who cares.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why your 2.4 GHz network reaches the garden but your 5 GHz network dies at the bedroom wall.&lt;/strong&gt; It's not a defect. It's physics. Longer wavelengths bend around and penetrate obstacles better. The 2.4 GHz signal at 12.5 cm wavelength diffracts around a doorframe. The 5 GHz signal at 6 cm wavelength gets absorbed by it.&lt;/p&gt;
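&lt;p&gt;The wavelengths quoted above fall straight out of lambda = c / f:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Wavelength in cm for each WiFi band: lambda = c / f.
c_m_per_s = 299_792_458
for ghz in (2.4, 5.0, 6.0):
    wavelength_cm = c_m_per_s / (ghz * 1e9) * 100
    print(f"{ghz} GHz: {wavelength_cm:.1f} cm")
# 2.4 GHz: 12.5 cm / 5.0 GHz: 6.0 cm / 6.0 GHz: 5.0 cm
&lt;/code&gt;&lt;/pre&gt;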

&lt;h3&gt;
  
  
  The 2.4 GHz Channel Overlap Problem
&lt;/h3&gt;

&lt;p&gt;The 2.4 GHz band has 11 channels in the US (13 in most other countries), each 20 MHz wide, spaced 5 MHz apart. They overlap. Channel 1 bleeds into channel 2, 3, 4, and 5. Only channels 1, 6, and 11 are far enough apart to not interfere with each other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnjnag36dxaav4a1zlfj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnjnag36dxaav4a1zlfj.jpeg" alt="2.4 GHz WiFi channel overlap: only channels 1, 6, and 11 are non-overlapping" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your neighbor sets their router to channel 3 and yours is on channel 1, their signal bleeds directly into yours. You'd both be better off on channel 1 -- at least then the routers can hear each other and take turns. Partial overlap is worse than complete overlap, because the radios can't decode the interfering signal well enough to defer to it, but it's strong enough to corrupt their own transmissions.&lt;/p&gt;
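&lt;p&gt;The overlap is easy to compute. Channel centers sit at 2407 + 5n MHz, so two 20 MHz channels interfere whenever their centers are closer together than 20 MHz (a sketch -- real 802.11b spectral masks are slightly wider):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def center_mhz(channel):
    # 2.4 GHz band: channel 1 is centered at 2412 MHz, then 5 MHz steps.
    return 2407 + 5 * channel

def overlap_mhz(a, b, width=20):
    # Positive when the two channels' spectra collide.
    return max(0, width - abs(center_mhz(a) - center_mhz(b)))

print(overlap_mhz(1, 3))    # 10 -- partial overlap, the worst case
print(overlap_mhz(1, 6))    # 0  -- why 1/6/11 is the classic layout
&lt;/code&gt;&lt;/pre&gt;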

&lt;p&gt;&lt;strong&gt;That's why "auto channel selection" on your router exists&lt;/strong&gt; -- and why it almost always picks 1, 6, or 11.&lt;/p&gt;




&lt;h2&gt;
  
  
  CSMA/CA: How WiFi Avoids Collisions
&lt;/h2&gt;

&lt;p&gt;WiFi is &lt;strong&gt;half-duplex&lt;/strong&gt;. Only one device can transmit on a channel at a time. Your router and your laptop take turns. Every phone, tablet, and smart thermostat connected to your network takes turns on the same channel. This is fundamentally different from Ethernet, which is full-duplex on modern switches.&lt;/p&gt;

&lt;p&gt;The mechanism that enforces turn-taking is &lt;strong&gt;CSMA/CA&lt;/strong&gt; -- Carrier Sense Multiple Access with Collision Avoidance.&lt;/p&gt;

&lt;p&gt;The algorithm is polite to a fault:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Listen.&lt;/strong&gt; Before transmitting, the device listens to the channel. If someone else is talking, wait.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wait for silence.&lt;/strong&gt; Once the channel has been quiet for a specific interval (called DIFS -- Distributed Inter-Frame Space, 34 microseconds in 802.11a), the device &lt;em&gt;still&lt;/em&gt; doesn't transmit immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random backoff.&lt;/strong&gt; It picks a random number of time slots to wait (the "contention window"). Only after this random timer expires -- and the channel is &lt;em&gt;still&lt;/em&gt; quiet -- does it transmit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transmit and wait for ACK.&lt;/strong&gt; The access point must acknowledge every frame. No ACK means the frame was lost -- retry with a larger contention window.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The random backoff is the collision avoidance part. If two devices both hear the channel go quiet at the same time, the random timer makes it unlikely they'll transmit simultaneously. Not impossible -- but unlikely. On each collision, the contention window doubles -- exponential backoff.&lt;/p&gt;

&lt;p&gt;If that pattern sounds familiar, it should. When your HTTP client retries a failed request with exponential backoff and random jitter, it's solving the same problem at a different layer. WiFi does it in microseconds at the radio level to prevent two stations from colliding again. Your web client does it in seconds at the application level to prevent a thousand clients from hammering a recovering server simultaneously. Different timescale, different medium, same insight: &lt;em&gt;randomise the retry, and double the window on failure&lt;/em&gt;. The WiFi version came first -- inherited from Ethernet's CSMA/CD in the 1980s. Your &lt;code&gt;retry_with_jitter()&lt;/code&gt; function reinvented a 40-year-old radio technique.&lt;/p&gt;
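&lt;p&gt;A minimal sketch of that shared recipe (function and parameter names are mine, not from any particular library):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

def backoff_delays(max_attempts=5, base_s=0.5, cap_s=30.0):
    # Exponential backoff with full jitter: the window doubles on each
    # failure, and the actual wait is a uniform random draw from it --
    # the CSMA/CA contention window, scaled up to application timescales.
    delays = []
    for attempt in range(max_attempts):
        window = min(cap_s, base_s * 2 ** attempt)
        delays.append(random.uniform(0.0, window))
    return delays

print(backoff_delays())    # five waits, each drawn from a doubling window
&lt;/code&gt;&lt;/pre&gt;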

&lt;p&gt;Compare this with Ethernet's &lt;strong&gt;CSMA/CD&lt;/strong&gt; (Collision &lt;em&gt;Detection&lt;/em&gt;). Ethernet devices transmit and listen simultaneously. If they detect a collision mid-transmission, both stop, send a jam signal, and retry. WiFi can't do this because a radio can't transmit and receive on the same frequency at the same time -- the transmitted signal is a billion times stronger than the received signal and would drown it out. WiFi avoids collisions because it can't detect them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hidden Node Problem
&lt;/h3&gt;

&lt;p&gt;There's a scenario CSMA/CA can't handle on its own. Imagine two laptops connected to the same access point, but on opposite sides of the building. Laptop A can hear the AP but not Laptop B. Laptop B can hear the AP but not Laptop A. Both sense an empty channel and transmit simultaneously. Their signals collide at the AP, and neither knows why their frames aren't getting through.&lt;/p&gt;

&lt;p&gt;The solution is &lt;strong&gt;RTS/CTS&lt;/strong&gt; -- Request to Send / Clear to Send. Before transmitting a large frame, a device sends a short RTS frame to the AP. The AP responds with a CTS frame that's heard by everyone in range. The CTS includes a duration field that tells all other devices to shut up for that long. This costs overhead -- two extra frames per transmission -- so it's typically only enabled for frames above a size threshold.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why WiFi Gets Slower With More Devices
&lt;/h3&gt;

&lt;p&gt;Twenty devices on one access point don't each get 1/20th of the bandwidth. They get less. Much less.&lt;/p&gt;

&lt;p&gt;Each device must wait its turn. More devices mean more contention, longer backoff windows, more collisions, more retries. The overhead is proportional to the number of devices contending, not the amount of data they're sending. A room full of idle phones still degrades your WiFi because they're all periodically transmitting management frames, probe requests, and keepalives -- each one claiming the channel for a moment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why your home WiFi feels slow during a family gathering.&lt;/strong&gt; It's not that your guests are using all your bandwidth. It's that 15 phones are all contending for the channel, and the contention overhead is eating your throughput alive. The bandwidth is there. The airtime isn't.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6p8blrhr43821hoimdc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6p8blrhr43821hoimdc.jpeg" alt="CSMA/CA contention: successful transmission with random backoff vs collision with exponential backoff retry" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Association and Authentication
&lt;/h2&gt;

&lt;p&gt;When you first connect to a WiFi network, a multi-step negotiation happens before you can send a single data frame.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding the Network
&lt;/h3&gt;

&lt;p&gt;Your device finds networks in two ways. &lt;strong&gt;Passive scanning&lt;/strong&gt;: it listens on each channel for &lt;strong&gt;beacon frames&lt;/strong&gt; that access points broadcast every ~100 milliseconds. Each beacon contains the network name (SSID), supported rates, security type, and channel information. &lt;strong&gt;Active scanning&lt;/strong&gt;: your device sends &lt;strong&gt;probe requests&lt;/strong&gt; on each channel, asking "is anyone here?" Access points respond with probe responses containing the same information.&lt;/p&gt;

&lt;p&gt;That's why your phone's WiFi list populates -- it's cycling through channels, listening for beacons and sending probes. It's also why your phone is trackable in stores: those probe requests contain your device's MAC address. Modern phones randomize the MAC in probe requests to mitigate this, but the effectiveness varies.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Four-Way Handshake (WPA2/WPA3)
&lt;/h3&gt;

&lt;p&gt;Once you select a network and provide the password, the real cryptography begins. Your password never goes over the air. Instead, both sides derive the same &lt;strong&gt;Pairwise Master Key&lt;/strong&gt; (PMK) from the password and the network name using PBKDF2 (a key derivation function that runs 4,096 iterations of HMAC-SHA1 to make brute-force guessing expensive).&lt;/p&gt;
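&lt;p&gt;This one you can reproduce directly -- the derivation is specified in IEEE 802.11i and lives in Python's standard library (the password and SSID here are made up):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib

def derive_pmk(password, ssid):
    # WPA2-PSK: PBKDF2-HMAC-SHA1, salted with the SSID,
    # 4096 iterations, 256-bit (32-byte) output.
    return hashlib.pbkdf2_hmac("sha1", password.encode(), ssid.encode(), 4096, 32)

pmk = derive_pmk("correct horse battery staple", "MyHomeWiFi")
print(pmk.hex())    # 64 hex chars -- the Pairwise Master Key
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note what's missing: randomness. The same password and SSID always yield the same PMK, which is exactly why the four-way handshake mixes in fresh nonces per session.&lt;/p&gt;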

&lt;p&gt;Then comes the &lt;strong&gt;four-way handshake&lt;/strong&gt; (sorry, no Alice and Bob this time):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AP → Client&lt;/strong&gt;: here's a random number (ANonce)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client → AP&lt;/strong&gt;: here's my random number (SNonce), plus a MIC (Message Integrity Code) proving I know the PMK&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AP → Client&lt;/strong&gt;: here's the group key (for broadcast traffic), encrypted with the derived key, plus a MIC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client → AP&lt;/strong&gt;: acknowledgement&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After these four frames, both sides have a &lt;strong&gt;Pairwise Transient Key&lt;/strong&gt; (PTK) derived from the PMK plus both random numbers. Every data frame is encrypted with this key using AES-CCMP (the WPA2 cipher, still the baseline in WPA3-Personal) or AES-GCMP-256 (used in WPA3's 192-bit Enterprise mode). The key is unique to this session, this client, this connection.&lt;/p&gt;
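&lt;p&gt;A simplified sketch of that PTK expansion. The real 802.11i PRF-384 has more detail, but the structure is this: counter-mode HMAC over a label, the two MAC addresses, and the two nonces -- sorted, so both sides build identical input:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib
import hmac

def derive_ptk(pmk, ap_mac, client_mac, anonce, snonce):
    # Sketch of the 802.11i PRF-384: counter-mode HMAC-SHA1 over a
    # label plus the sorted MACs and nonces. Sorting (min/max) is why
    # the AP and the client derive the same key from the same inputs.
    label = b"Pairwise key expansion"
    data = (min(ap_mac, client_mac) + max(ap_mac, client_mac)
            + min(anonce, snonce) + max(anonce, snonce))
    blocks = []
    for counter in range(3):    # 3 x 20-byte digests cover 48 bytes
        msg = label + b"\x00" + data + bytes([counter])
        blocks.append(hmac.new(pmk, msg, hashlib.sha1).digest())
    return b"".join(blocks)[:48]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Swap the two MAC addresses and the two nonces and you get the same 48 bytes out -- that symmetry is the whole point.&lt;/p&gt;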

&lt;h3&gt;
  
  
  Why Open WiFi Is Dangerous
&lt;/h3&gt;

&lt;p&gt;An open network (no password) has no four-way handshake. No PMK. No per-session keys. Every frame goes over the air in plaintext. Anyone with a WiFi adapter in &lt;strong&gt;monitor mode&lt;/strong&gt; (the 802.11 cousin of a wired card's promiscuous mode -- it captures raw frames without even associating) can read every byte. This includes the coffee shop WiFi, the airport WiFi, and the hotel WiFi.&lt;/p&gt;

&lt;p&gt;Monitor mode isn't exotic. A &lt;a href="https://www.alfa.com.tw/products/awus036achm" rel="noopener noreferrer"&gt;USB adapter like the Alfa AWUS036ACHM&lt;/a&gt; — about $40, fits in your pocket — puts your card into passive listening mode where it captures every frame on the channel, not just frames addressed to you. Combine it with &lt;a href="https://www.wireshark.org/" rel="noopener noreferrer"&gt;Wireshark&lt;/a&gt; and you're reading plaintext traffic in real time. This is a standard tool for network engineers and penetration testers. It's also why &lt;strong&gt;wardriving&lt;/strong&gt; was a thing — driving around with a laptop and a directional antenna, mapping open networks and logging traffic. Tools like &lt;a href="https://www.kismetwireless.net/" rel="noopener noreferrer"&gt;Kismet&lt;/a&gt; automated the process. Wardriving peaked in the mid-2000s when most networks were either open or using WEP (which could be cracked in minutes). WPA2 made passive sniffing useless against encrypted networks, but open networks remain fully transparent.&lt;/p&gt;

&lt;p&gt;HTTPS protects the content of your web requests, but DNS queries (unless you're using DoH or DoT), HTTP sites, and unencrypted app traffic are fully visible. The metadata alone — which domains you're visiting, when, how often — is valuable to an attacker. On an open network, that metadata is free for the taking.&lt;/p&gt;

&lt;h3&gt;
  
  
  WPA3-SAE: The Fix
&lt;/h3&gt;

&lt;p&gt;WPA3 replaced the PSK (Pre-Shared Key) authentication of WPA2 with &lt;strong&gt;SAE&lt;/strong&gt; -- Simultaneous Authentication of Equals. SAE uses a cryptographic protocol called Dragonfly that provides two critical properties WPA2 lacked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Forward secrecy&lt;/strong&gt;: if someone captures your encrypted traffic today and later learns the WiFi password, they &lt;em&gt;still&lt;/em&gt; can't decrypt the old traffic. Each session's keys are ephemeral and not derivable from the password alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resistance to offline dictionary attacks&lt;/strong&gt;: in WPA2, an attacker who captures the four-way handshake can take it home and brute-force the password offline. SAE makes each password guess require a live exchange with the AP, turning an offline attack into a rate-limited online one.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Beamforming and MIMO
&lt;/h2&gt;

&lt;p&gt;Early WiFi was omnidirectional -- the access point blasted signal in all directions equally, like a bare light bulb. Modern WiFi is more like a spotlight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Beamforming&lt;/strong&gt; uses multiple antennas to shape the transmitted signal so it's stronger in the direction of the receiving device. The AP sends the same data from each antenna with carefully calculated phase shifts. The signals constructively interfere in the target direction and destructively interfere elsewhere. The result: stronger signal where you need it, less wasted energy where you don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MIMO&lt;/strong&gt; -- Multiple-Input, Multiple-Output -- uses multiple antennas on both ends to send independent data streams simultaneously over the same channel. A 4x4 MIMO system (four antennas on each end) can theoretically quadruple throughput compared to a single antenna. Your laptop's WiFi card likely has 2 antennas. Your router might have 4 or 8.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MU-MIMO&lt;/strong&gt; (Multi-User MIMO) extends this to serve multiple devices simultaneously. Instead of taking turns, the AP beamforms separate streams to different devices at the same time, each on different spatial paths. WiFi 5 introduced downlink MU-MIMO (AP to devices). WiFi 6 added uplink MU-MIMO (devices to AP).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why your router has weird antennas.&lt;/strong&gt; Those protruding rods aren't decorative. Each one is a separate antenna element, and the router needs spatial diversity -- antennas positioned at different angles and locations -- to create distinct spatial streams. Internal antennas (like in mesh systems) do the same thing; they're hidden inside the enclosure, but the physics is identical.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reason Your Video Call Drops
&lt;/h2&gt;

&lt;p&gt;When your Zoom call freezes and your colleague's face turns into a Cubist painting, four things are probably going wrong at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distance and obstacles.&lt;/strong&gt; Every wall between you and your router costs roughly 3-6 dB of signal strength. A concrete wall can cost 10-15 dB. At -70 dBm (a typical "two bars" signal), your connection is viable but fragile. At -80 dBm, you're in trouble. At -85 dBm, you're disconnecting.&lt;/p&gt;
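&lt;p&gt;A back-of-envelope link budget using those loss figures (the per-obstacle numbers are nominal values from the ranges above, not calibrated measurements):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def rssi_after(start_dbm, walls=0, concrete=0, wall_db=5, concrete_db=12):
    # Subtract a nominal loss per obstacle from the starting signal.
    return start_dbm - walls * wall_db - concrete * concrete_db

print(rssi_after(-50, walls=2))               # -60 dBm: comfortable
print(rssi_after(-50, walls=3, concrete=1))   # -77 dBm: fragile
print(rssi_after(-50, walls=2, concrete=2))   # -84 dBm: about to drop
&lt;/code&gt;&lt;/pre&gt;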

&lt;p&gt;&lt;strong&gt;Channel congestion.&lt;/strong&gt; In an apartment building, your AP might share a channel with a dozen neighbors. CSMA/CA forces everyone to take turns. Your video frame waits in queue while your neighbor's smart TV downloads a firmware update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interference.&lt;/strong&gt; Microwave ovens operate at 2.45 GHz -- dead center of the 2.4 GHz WiFi band. A running microwave can obliterate WiFi channels 6 through 11. Bluetooth devices, cordless phones, baby monitors, and even poorly shielded USB 3.0 cables can add noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bufferbloat.&lt;/strong&gt; Your router buffers outgoing packets when the WiFi link is congested. Large buffers add latency without improving throughput. A 500ms buffer on a video call means your words arrive half a second late. The connection is technically up. The conversation is technically ruined. If your router supports SQM (Smart Queue Management) or FQ_CoDel (&lt;a href="https://en.wikipedia.org/wiki/CoDel" rel="noopener noreferrer"&gt;Fair Queuing Controlled Delay&lt;/a&gt;), enable it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal strength is not throughput.&lt;/strong&gt; RSSI (Received Signal Strength Indicator) tells you how loud the signal is, not how fast data moves. You can have strong RSSI and terrible throughput if the channel is congested. You can have weak RSSI and decent throughput if you're the only device on the channel. The WiFi icon on your phone shows signal strength. It tells you nothing about airtime contention, noise floor, or actual data rate.&lt;/p&gt;




&lt;h2&gt;
  
  
  WiFi 6E and WiFi 7: Finally, Room to Breathe
&lt;/h2&gt;

&lt;p&gt;The most significant WiFi advancement in the last decade isn't speed. It's spectrum.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WiFi 6E&lt;/strong&gt; opened the 6 GHz band -- 1,200 MHz of pristine, uncongested spectrum. No legacy devices. No microwaves. No Bluetooth. Every device on 6 GHz supports WiFi 6 or later, which means modern features like OFDMA (multi-user channel access) and BSS Coloring (interference mitigation) are universal. The 6 GHz band alone has more usable spectrum than 2.4 GHz and 5 GHz combined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WiFi 7&lt;/strong&gt; (802.11be) builds on this with three headline features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;320 MHz channels&lt;/strong&gt;: available in the 6 GHz band, offering absurd peak bandwidth at the cost of range. A 320 MHz channel in 6 GHz can theoretically push 46 Gbit/s with 16 spatial streams. In practice, nobody has 16 spatial streams. But even with two streams, you're looking at multi-gigabit wireless speeds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;4K-QAM&lt;/strong&gt;: &lt;a href="https://en.wikipedia.org/wiki/Quadrature_amplitude_modulation" rel="noopener noreferrer"&gt;QAM&lt;/a&gt; (Quadrature Amplitude Modulation) encodes multiple bits into each radio symbol. The number tells you how many distinct signal states it uses — and since each state is a bit pattern, the number is always a power of 2. WiFi 6 uses 1024-QAM: 2¹⁰ = 1024 states = 10 bits per symbol. WiFi 7 quadruples the states to 4096-QAM: 2¹² = 4096 states = 12 bits per symbol. Two extra bits per symbol is a 20% throughput increase for free — but distinguishing 4096 signal levels requires a much cleaner signal, so it only works at close range with minimal interference.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MLO -- Multi-Link Operation&lt;/strong&gt;: the feature I'm most excited about. A WiFi 7 device can connect to the AP on multiple bands simultaneously -- 2.4 GHz &lt;em&gt;and&lt;/em&gt; 5 GHz &lt;em&gt;and&lt;/em&gt; 6 GHz -- and aggregate the bandwidth or use the best link for each packet. Low-latency traffic goes to the least-congested band. Bulk downloads use all bands at once. If one band hits interference, traffic seamlessly shifts to another. This is the first time WiFi has been able to use multiple bands concurrently, and it fundamentally changes the reliability story.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
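&lt;p&gt;The QAM arithmetic generalizes: bits per symbol is just log2 of the constellation size, and throughput scales with it when symbol rate and coding stay fixed:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

# Bits per symbol for an n-QAM constellation is log2(n).
for states in (256, 1024, 4096):
    bits = int(math.log2(states))
    print(f"{states}-QAM: {bits} bits per symbol")

gain = math.log2(4096) / math.log2(1024) - 1
print(f"WiFi 7 vs WiFi 6 modulation gain: {gain:.0%}")    # 20%
&lt;/code&gt;&lt;/pre&gt;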

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgws5wm1q7ojd7q9iwlyz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgws5wm1q7ojd7q9iwlyz.jpeg" alt="WiFi 7 Multi-Link Operation: simultaneous connections across 2.4, 5, and 6 GHz with automatic packet rerouting" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Part That Gets Me
&lt;/h2&gt;

&lt;p&gt;WiFi is the most successful wireless technology in human history. Over 18 billion devices use it. It works in your home, your office, on airplanes, in coffee shops, in hospitals, in warehouses. It operates in unlicensed spectrum that anyone can use, governed by standards written by committee, using a name that was chosen by a branding firm and means absolutely nothing.&lt;/p&gt;

&lt;p&gt;The engineers who designed 802.11 in 1997 couldn't have imagined 46 Gbit/s wireless links or video calls with 20 participants. But the core ideas laid down in those first standards -- CSMA/CA channel access, management frames for discovery and association, and the OFDM modulation that arrived with 802.11a in 1999 -- still form the foundation of WiFi 7, nearly three decades later.&lt;/p&gt;

&lt;p&gt;Every time I open my laptop and start working without plugging in a cable, I'm relying on a system where my device listened to the air for 34 microseconds, picked a random number, waited that many slot times, modulated my data across 52 frequencies, encrypted it with a session key derived from a four-way handshake, and blasted it as radio waves toward a box on my shelf. That box did the reverse, routed my packets to the internet, and got a response back -- all before I noticed the page was loading.&lt;/p&gt;

&lt;p&gt;The best infrastructure is the kind you forget exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/IEEE_802.11" rel="noopener noreferrer"&gt;IEEE 802.11 — Wikipedia Overview&lt;/a&gt;&lt;/strong&gt; — the most accessible summary of the 802.11 family, from the original 2 Mbit/s standard through WiFi 7. Links to each amendment's technical details without requiring an IEEE subscription.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="http://www.kegel.com/c10k.html" rel="noopener noreferrer"&gt;The C10K Problem&lt;/a&gt;&lt;/strong&gt; — Dan Kegel's classic paper on handling 10,000 concurrent connections. Relevant here because CSMA/CA faces the same fundamental challenge: coordinating shared access among many contending parties without a central scheduler.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://wiki.wireshark.org/CaptureSetup/WLAN" rel="noopener noreferrer"&gt;Wireshark 802.11 Capture Setup&lt;/a&gt;&lt;/strong&gt; — how to capture WiFi frames in monitor mode with Wireshark. Seeing beacon frames, probe requests, and the four-way handshake in a packet capture is the fastest way to internalize the protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://openwrt.org/docs/start" rel="noopener noreferrer"&gt;OpenWrt Documentation&lt;/a&gt;&lt;/strong&gt; — the open-source router firmware project. Its documentation covers WiFi configuration at a level that bridges theory and practice: channel selection, transmit power, band steering, and mesh networking on real hardware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.wi-fi.org/discover-wi-fi/wi-fi-7" rel="noopener noreferrer"&gt;WiFi 7 (802.11be) Overview — WiFi Alliance&lt;/a&gt;&lt;/strong&gt; — the WiFi Alliance's summary of WiFi 7 features including MLO, 320 MHz channels, and 4K-QAM. A good non-spec-dense overview of where WiFi is heading.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri has never successfully connected to hotel WiFi on the first try and at this point considers it a personal failing. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What Is RAM, Actually?</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:04:45 +0000</pubDate>
      <link>https://forem.com/nazq/what-is-ram-actually-2fn7</link>
      <guid>https://forem.com/nazq/what-is-ram-actually-2fn7</guid>
      <description>&lt;h1&gt;
  
  
  What Is RAM, Actually?
&lt;/h1&gt;

&lt;h2&gt;
  
  
  From Leaking Capacitors to Cache Lines
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~17 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You called &lt;a href="https://nazquadri.dev/blog/the-layer-below/10-malloc/" rel="noopener noreferrer"&gt;&lt;code&gt;malloc()&lt;/code&gt;&lt;/a&gt;. The kernel gave you an address. You stored a number there, read it back, and moved on with your life.&lt;/p&gt;

&lt;p&gt;But what &lt;em&gt;is&lt;/em&gt; that address? It's not a location on a chip — not in any way you'd recognize. It's an index into a grid of billions of capacitors, each one holding a single bit as an electrical charge that is, right now, draining away. The data you stored is evaporating. A circuit you've never thought about is racing to put it back before it disappears. And it does this for every single bit, millions of times per second, whether you're using the memory or not.&lt;/p&gt;

&lt;p&gt;This is what "random access memory" actually is. Let me show you the machinery.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Smallest Unit of Memory You Own
&lt;/h2&gt;

&lt;p&gt;Every bit in your RAM stick lives in a &lt;strong&gt;1T1C cell&lt;/strong&gt;: one transistor, one capacitor. That's it. Two components per bit. Your 16GB stick has roughly 137 billion of these cells.&lt;/p&gt;

&lt;p&gt;The capacitor stores the bit. A charged capacitor is a 1. A discharged capacitor is a 0. The transistor is a gate — it connects the capacitor to the outside world when activated, and isolates it when not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh3dr9pwnii7iipcg1vs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh3dr9pwnii7iipcg1vs.jpeg" alt="1T1C DRAM cell and the 4-step read cycle — word line activates, charge flows, sense amp detects, write back" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These cells are arranged in a two-dimensional grid. Rows and columns, like a spreadsheet. Every cell in a row shares a &lt;strong&gt;word line&lt;/strong&gt; — the wire that activates all the transistors in that row simultaneously. Every cell in a column shares a &lt;strong&gt;bit line&lt;/strong&gt; — the wire that carries the data out.&lt;/p&gt;

&lt;h3&gt;
  
  
  How a Read Works
&lt;/h3&gt;

&lt;p&gt;Reading a bit from DRAM is one of those processes that sounds simple and absolutely isn't.&lt;/p&gt;

&lt;p&gt;The memory controller activates the word line for the target row. Every transistor in that row turns on. Every capacitor in that row connects to its bit line. The charge stored in each capacitor is almost nothing — about 30 femtofarads of capacitance holding less than a volt. To put "femto" in perspective: a femtofarad is 10⁻¹⁵ farads. A typical AA battery stores roughly 5,000 coulombs of charge. A DRAM cell stores about 0.000000000000030 coulombs. The static charge you build up shuffling across a carpet, a microcoulomb or so, is enough to fill tens of millions of these cells. That's how small this is. And yet the sense amplifier has to reliably distinguish it from zero.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;sense amplifier&lt;/strong&gt; at the end of each bit line detects this voltage difference. It's comparing the bit line voltage to a reference voltage, and the difference might be as small as 50 millivolts. The sense amp swings this to a full logic level — rail-high for a 1, rail-low for a 0. That's your bit.&lt;/p&gt;

&lt;p&gt;Here's the part that should bother you: &lt;strong&gt;reading is destructive&lt;/strong&gt;. Schrödinger would have loved DRAM 😹 — you can't observe the bit without destroying it.&lt;/p&gt;

&lt;p&gt;When the capacitor shares its charge with the bit line, the capacitor drains. The bit you just read is gone. The sense amplifier detected it, but the original charge is dissipated. Every single DRAM read erases the data it reads.&lt;/p&gt;

&lt;p&gt;That's why every read is followed by a &lt;strong&gt;write-back&lt;/strong&gt;. The sense amplifier, having determined the value, drives the bit line back to full voltage and recharges the capacitor. Every read is secretly a read-then-write. Your innocent &lt;code&gt;int x = array[0]&lt;/code&gt; triggers a destructive read and a restoration write at the hardware level.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Refresh Treadmill
&lt;/h3&gt;

&lt;p&gt;Even when nobody is reading your data, it's disappearing.&lt;/p&gt;

&lt;p&gt;Capacitors leak. The charge drains through the transistor's junction, through the dielectric, through physics being physics. A DRAM cell loses its charge in milliseconds. Left alone, every bit in your system would decay to garbage.&lt;/p&gt;

&lt;p&gt;The memory controller handles this with &lt;strong&gt;refresh cycles&lt;/strong&gt;. It systematically walks through every row and reads it — which, as you now know, drains the capacitors and writes the values back. The JEDEC spec for DDR4 requires every row to be refreshed within &lt;strong&gt;64 milliseconds&lt;/strong&gt;. For a chip with 65,536 rows, that's one row refresh every 976 nanoseconds.&lt;/p&gt;

&lt;p&gt;Your program has no idea this is happening. You never asked for it. You never see it. But roughly 5-10% of your memory bandwidth is consumed by refresh operations that exist solely because the hardware is fighting thermodynamics to keep your data alive.&lt;/p&gt;

&lt;p&gt;That's why it's called &lt;em&gt;dynamic&lt;/em&gt; RAM. Penny drop.&lt;/p&gt;




&lt;h2&gt;
  
  
  SRAM: The Expensive Alternative
&lt;/h2&gt;

&lt;p&gt;If DRAM is a leaky bucket that needs constant refilling, &lt;strong&gt;SRAM&lt;/strong&gt; (Static RAM) is a proper container.&lt;/p&gt;

&lt;p&gt;An SRAM cell uses &lt;strong&gt;six transistors&lt;/strong&gt; per bit instead of one transistor and one capacitor. Four transistors form a cross-coupled pair of inverters — a &lt;strong&gt;latch&lt;/strong&gt; — that holds a stable 0 or 1 as long as power is applied. Two more transistors act as access gates.&lt;/p&gt;

&lt;p&gt;No capacitor. No leaking. No refresh cycles. The data stays put.&lt;/p&gt;

&lt;p&gt;The tradeoff is size. Six transistors per bit vs. two components per bit. An SRAM cell takes many times the area of a DRAM cell on the same process node: the 1T1C cell stays tiny because its capacitor is built vertically into the die. SRAM is also more expensive per bit and draws more standby power.&lt;/p&gt;

&lt;p&gt;SRAM is what your CPU caches are made of. DRAM is what your sticks are made of. The entire cache hierarchy exists because we can afford small amounts of fast SRAM but need large amounts of cheap DRAM. That tension between speed and density shapes every modern computer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The DDR Lineage
&lt;/h2&gt;

&lt;p&gt;The DRAM in your machine is DDR SDRAM — &lt;strong&gt;Double Data Rate Synchronous Dynamic RAM&lt;/strong&gt;. "Double data rate" means it transfers data on both the rising and falling edges of the clock signal, effectively doubling throughput without doubling the clock speed.&lt;/p&gt;

&lt;p&gt;Each generation doubles the prefetch width and drops the voltage. &lt;strong&gt;DDR1&lt;/strong&gt; (2000, 2.5V, 2n prefetch) and &lt;strong&gt;DDR2&lt;/strong&gt; (2004, 1.8V, 4n prefetch) are ancient history — if either of those is still in production somewhere, someone should send flowers. &lt;strong&gt;DDR3&lt;/strong&gt; (2007, 1.5V, 8n prefetch) lasted nearly a decade and powered everything from the Mac Pro trash can to the PlayStation 4. If you built a computer between 2010 and 2016, this is what you used, and honestly it was fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DDR4&lt;/strong&gt; (2014, 1.2V, 8n prefetch with bank groups) is what most desktops and servers are still running in 2026. It added bank groups to reduce conflict penalties but kept the same prefetch width as DDR3 — the improvement was structural, not brute-force.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DDR5&lt;/strong&gt; (2021, 1.1V, 3200-6400+ MT/s, 16n prefetch) is the first generation that changed something fundamental.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;on-die ECC&lt;/strong&gt;. Every DDR5 chip has error correction built into the die itself. It can detect and correct single-bit errors within each internal transfer — not to be confused with system-level ECC (more on that shortly). You get error correction whether your motherboard supports ECC or not.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;dual channels per DIMM&lt;/strong&gt;. A DDR4 DIMM has one 64-bit channel. A DDR5 DIMM has two independent 32-bit channels. Same total width, but two independent channels mean two independent transactions can be in flight simultaneously. This cuts bank conflict stalls and improves utilization, especially for multi-threaded workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  ECC: When Bit Flips Matter
&lt;/h3&gt;

&lt;p&gt;A single bit flip in the wrong place can ruin your day.&lt;/p&gt;

&lt;p&gt;In 2003, &lt;a href="https://en.wikipedia.org/wiki/Electronic_voting_in_Belgium" rel="noopener noreferrer"&gt;Belgium's electronic voting system&lt;/a&gt; recorded an extra 4,096 votes for a candidate — later attributed to a single bit flip in RAM (flipping the 2¹² place from 0 to 1 adds exactly 4,096). Cosmic rays — high-energy particles from space — hit silicon and generate enough charge to flip a stored bit. It's rare per cell, but when you have 137 billion cells, the math gets uncomfortable. Google published a study in 2009 showing roughly one correctable error per gigabyte of RAM per year. Every first Tuesday in November, I quietly hope the cosmic rays take a day off from US voting machines; we struggle enough with societal ECC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ECC RAM&lt;/strong&gt; adds an extra chip per channel that stores parity/syndrome bits. The most common scheme uses SECDED — Single Error Correction, Double Error Detection. It can correct any single-bit error and detect (but not correct) any two-bit error. This is why every server you've ever touched runs ECC memory. The cost premium is around 10-20%, and nobody running production workloads considers it optional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rowhammer&lt;/strong&gt; makes this worse. Discovered in 2014, rowhammer exploits the physical proximity of DRAM rows. Rapidly activating the same row — "hammering" it — causes electrical interference that flips bits in adjacent rows. It's a physical attack that crosses process boundaries. Attackers have used it to escalate privileges, escape VMs, and compromise entire systems through nothing but carefully timed memory access patterns. DDR5's on-die ECC mitigates single-bit rowhammer flips, but multi-bit attacks remain an active area of research.&lt;/p&gt;




&lt;h2&gt;
  
  
  VRAM: Memory for a Different Kind of Processor
&lt;/h2&gt;

&lt;p&gt;The RAM inside your graphics card is a different beast. It's optimized for a fundamentally different access pattern.&lt;/p&gt;

&lt;p&gt;Your CPU wants low latency — it needs one specific value &lt;em&gt;now&lt;/em&gt;. A GPU wants high throughput — it needs a million values &lt;em&gt;soon&lt;/em&gt;. This difference drives the entire GDDR and HBM design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf99m68bmun1v7kauxw4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf99m68bmun1v7kauxw4.jpeg" alt="Memory types comparison — DDR5, GDDR6X, and HBM3 bus widths and bandwidth" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GDDR&lt;/strong&gt; (Graphics DDR) shares DNA with regular DDR but makes different tradeoffs. GDDR6X, used in NVIDIA's RTX 4000-series, runs at higher data rates (up to 21 Gbps per pin) and uses a wider bus (256 or 384 bits vs. DDR's 64 bits). The per-pin latency is worse than DDR, but the raw bandwidth is staggering — over 1 TB/s on a high-end GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HBM&lt;/strong&gt; (High Bandwidth Memory) takes a radically different approach. Instead of discrete chips on a PCB, HBM stacks multiple DRAM dies vertically — 8 or 12 layers tall — and connects them to the processor via an &lt;strong&gt;interposer&lt;/strong&gt;, a silicon bridge that sits beneath both the memory stacks and the processor die. Each HBM stack has a 1024-bit bus. An HBM3-equipped GPU might have six stacks, delivering over 3 TB/s of aggregate bandwidth.&lt;/p&gt;

&lt;p&gt;This is why AI accelerators use HBM — training and inference on large language models are memory-bandwidth-bound, not compute-bound. The math units can process data faster than conventional memory can feed them. HBM exists to close that gap.&lt;/p&gt;

&lt;p&gt;The NVIDIA H100 makes the cost of this choice concrete. It comes in two variants:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;H100 SXM&lt;/th&gt;
&lt;th&gt;H100 PCIe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Memory type&lt;/td&gt;
&lt;td&gt;HBM3&lt;/td&gt;
&lt;td&gt;HBM2e&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capacity&lt;/td&gt;
&lt;td&gt;80 GB&lt;/td&gt;
&lt;td&gt;80 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bandwidth&lt;/td&gt;
&lt;td&gt;3.35 TB/s&lt;/td&gt;
&lt;td&gt;~2 TB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Price (2025)&lt;/td&gt;
&lt;td&gt;~$30,000-40,000&lt;/td&gt;
&lt;td&gt;~$25,000-30,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same GPU die. Same 80GB. The SXM version costs $5,000-10,000 more, and a huge chunk of that premium is the newer HBM3 and the SXM board design that feeds it. The H200 pushed further — HBM3e, 141GB, 4.8 TB/s — and costs even more. When people ask why AI infrastructure is so expensive, a significant part of the answer is: stacking DRAM dies twelve layers tall on a silicon interposer and wiring each stack with a 1024-bit bus is not cheap.&lt;/p&gt;

&lt;p&gt;When people say a GPU has "80GB of HBM3," they're describing five stacks of vertically-interconnected DRAM dies, each with a bus wider than any DDR channel, all sitting millimeters from the compute die on a shared silicon interposer. That's not a memory stick you slot in. It's a piece of semiconductor engineering that was fabricated as part of the GPU package.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the CPU Talks to RAM
&lt;/h2&gt;

&lt;p&gt;In the old days — before 2003 — the memory controller lived on the motherboard's northbridge chip. Every memory access had to travel from the CPU, across the front-side bus, through the northbridge, and out to the DIMMs. It was slow and the bus was a bottleneck.&lt;/p&gt;

&lt;p&gt;AMD moved the memory controller onto the CPU die with the &lt;strong&gt;Athlon 64&lt;/strong&gt; in 2003. Intel followed with &lt;strong&gt;Nehalem&lt;/strong&gt; in 2008. Today, every modern CPU has an &lt;strong&gt;integrated memory controller&lt;/strong&gt; (IMC). The CPU talks directly to your RAM sticks with no intermediary chip.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Addressing Hierarchy
&lt;/h3&gt;

&lt;p&gt;A memory address isn't a flat index. It gets decomposed into a hierarchy of physical selectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Channels&lt;/strong&gt;: Most desktop CPUs have 2 memory channels (DDR4/DDR5). Servers have 4, 6, or 8. Each channel is an independent path to a set of DIMMs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ranks&lt;/strong&gt;: Each DIMM can have 1, 2, or 4 ranks. A rank is a set of chips that respond together to fill the full data width of the channel (64 bits).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Banks&lt;/strong&gt;: Each rank is divided into banks — in DDR5, typically 8 bank groups of 4 banks each, 32 banks per rank. Banks can be accessed independently, allowing parallel operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rows and columns&lt;/strong&gt;: Within a bank, data is stored in a 2D array. The controller first opens a row (loading it into the bank's row buffer), then reads specific columns from that row.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Those Timing Numbers Mean
&lt;/h3&gt;

&lt;p&gt;Your RAM sticks have numbers printed on them like "CL16-18-18-36". These are latency timings, measured in clock cycles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CAS Latency (CL)&lt;/strong&gt;: The number of clock cycles between the column address command and data appearing on the bus. CL16 at 3200 MT/s means 10 nanoseconds (16 cycles / 1600 MHz actual clock). This is the most-quoted number but only tells part of the story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tRCD&lt;/strong&gt; (RAS to CAS Delay): How long after opening a row you can issue a column read. If the row you need isn't already open, you pay this penalty first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tRP&lt;/strong&gt; (Row Precharge): How long it takes to close the current row before opening a new one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tRAS&lt;/strong&gt; (Row Active Time): Minimum time a row must stay open before it can be precharged.&lt;/p&gt;

&lt;p&gt;The worst case — you need data from a row that isn't open, and a different row &lt;em&gt;is&lt;/em&gt; — costs you tRP + tRCD + CL. That's why the "random" in Random Access Memory is misleading. &lt;strong&gt;A row hit&lt;/strong&gt; (data is in the already-open row) takes CL cycles. &lt;strong&gt;A row miss&lt;/strong&gt; (need to precharge, open a new row, then read) takes tRP + tRCD + CL. In that scenario, "random" access is 3x slower than sequential access to the same row.&lt;/p&gt;

&lt;p&gt;This is why memory access patterns matter. This is why the memory controller reorders requests to maximize row hits. This is why prefetching exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cache Hierarchy
&lt;/h2&gt;

&lt;p&gt;You now have the full picture of DRAM: slow (50-80ns), cheap, dense, and leaky. The CPU core runs at ~4-5 GHz — one clock cycle takes roughly 0.2 nanoseconds. An L1 cache hit takes about 1 nanosecond. A DRAM access takes 50-80 nanoseconds, roughly 200-400 CPU clock cycles of thumb-twiddling.&lt;/p&gt;

&lt;p&gt;This gap — the &lt;strong&gt;memory wall&lt;/strong&gt; — has been growing since the 1980s. CPU clock speeds improved roughly 1000x between 1985 and 2005. DRAM latency improved about 10x in the same period. The solution is a hierarchy of progressively larger, slower, cheaper memories that hide the latency of the level below.&lt;/p&gt;

&lt;p&gt;All caches are built from SRAM. The six-transistor cells that don't leak, don't need refresh, and switch in under a nanosecond. The price you pay is density — which is why caches are measured in kilobytes and megabytes while main memory is measured in gigabytes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rqgdgtle66sxveyvx66.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rqgdgtle66sxveyvx66.jpeg" alt="Cache hierarchy pyramid — L1, L2, L3, and DRAM with size and latency at each level" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  L1: The Core's Private Scratchpad
&lt;/h3&gt;

&lt;p&gt;Each CPU core has its own &lt;strong&gt;L1 cache&lt;/strong&gt;, split into two halves: &lt;strong&gt;L1i&lt;/strong&gt; (instructions) and &lt;strong&gt;L1d&lt;/strong&gt; (data). Typical size: 32-64 KB each. Access time: roughly &lt;strong&gt;1 nanosecond&lt;/strong&gt;, or about 4-5 clock cycles on a modern core.&lt;/p&gt;

&lt;p&gt;32 KB sounds absurd for a working set. It is. But the L1 isn't meant to hold your working set — it's meant to hold the data and instructions the core needs &lt;em&gt;right now&lt;/em&gt;. Its hit rate on typical code is 95%+ because programs exhibit &lt;strong&gt;temporal locality&lt;/strong&gt; (you use something, you'll use it again soon) and &lt;strong&gt;spatial locality&lt;/strong&gt; (you use something, you'll use its neighbor soon).&lt;/p&gt;

&lt;h3&gt;
  
  
  L2: The Per-Core Buffer
&lt;/h3&gt;

&lt;p&gt;Each core also has a private &lt;strong&gt;L2 cache&lt;/strong&gt;. Typical size: 256 KB to 1 MB. Access time: roughly &lt;strong&gt;3-4 nanoseconds&lt;/strong&gt;. It's 4-8x slower than L1 but 4-16x larger.&lt;/p&gt;

&lt;p&gt;L2 catches the misses from L1. When L1 doesn't have what the core needs, L2 usually does. Combined L1+L2 hit rates typically exceed 97% for well-behaved code.&lt;/p&gt;

&lt;h3&gt;
  
  
  L3: The Shared Pool
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;L3 cache&lt;/strong&gt; is shared across all cores. On a modern desktop CPU it ranges from 8 MB (low-end) to 64 MB (AMD's X3D chips with stacked cache). On server processors, L3 can be 128 MB or more. Access time: roughly &lt;strong&gt;10-12 nanoseconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;L3's job is to catch misses from L2 and, critically, to be the rendezvous point for data shared between cores. If core 0 writes a value that core 4 needs, L3 is where they coordinate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cache Lines: The 64-Byte Atom
&lt;/h3&gt;

&lt;p&gt;The CPU never fetches a single byte from cache or memory. The smallest unit of transfer is a &lt;strong&gt;cache line&lt;/strong&gt; — 64 bytes on modern x86 and most ARM processors (Apple silicon is a notable exception at 128 bytes).&lt;/p&gt;

&lt;p&gt;When you read &lt;code&gt;array[0]&lt;/code&gt;, the CPU fetches the entire 64-byte block containing that address. If &lt;code&gt;array&lt;/code&gt; holds 4-byte integers, you just got &lt;code&gt;array[0]&lt;/code&gt; through &lt;code&gt;array[15]&lt;/code&gt; for free. This is spatial locality in action — and it's why iterating an array sequentially is fast. Each cache line fill pays for the next 15 accesses.&lt;/p&gt;

&lt;p&gt;It's also why your struct layout matters. If your struct is 72 bytes, every access touches two cache lines. If it's 64 bytes, it fits perfectly in one. If you have an array of structs where you only ever read one field, all the other fields are wasting cache space. That's the argument for struct-of-arrays vs. array-of-structs in performance-critical code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2p0255r989sf4ceoa2dr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2p0255r989sf4ceoa2dr.jpeg" alt="False sharing — two cores invalidating the same cache line, and the fix with padding" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Write-Back vs. Write-Through
&lt;/h3&gt;

&lt;p&gt;When the CPU writes to a cached location, it has two strategies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write-through&lt;/strong&gt;: Write to the cache &lt;em&gt;and&lt;/em&gt; to the next level simultaneously. Simple, always consistent, but slow — every write incurs the latency of the slower level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write-back&lt;/strong&gt;: Write only to the cache. Mark the line as "dirty." Write it to the next level later, when the line is evicted. Faster for the common case (multiple writes to the same line before eviction), but more complex — you need to track which lines are dirty.&lt;/p&gt;

&lt;p&gt;Modern CPUs use write-back at every level. The performance difference is enormous. A write-through L1 would bottleneck on L2 latency for every store instruction. Write-back means the core can fire off dozens of writes to L1 at full speed, and the dirty lines trickle down to L2 and L3 in the background.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cache Coherency: The MESI Protocol
&lt;/h3&gt;

&lt;p&gt;The moment you have multiple cores with private caches, you have a consistency problem. If core 0 and core 4 both cache the same memory line, and core 0 writes to it, core 4's copy is stale.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;MESI protocol&lt;/strong&gt; (and its variant &lt;strong&gt;MOESI&lt;/strong&gt;, used by AMD) solves this by assigning each cache line one of four states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modified&lt;/strong&gt;: This cache has the only valid copy, and it's been written to. Main memory is stale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exclusive&lt;/strong&gt;: This cache has the only copy, and it matches main memory. Can be written without notifying anyone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared&lt;/strong&gt;: Multiple caches hold this line. All copies match main memory. Must notify others before writing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invalid&lt;/strong&gt;: This line is not valid. Must be fetched from elsewhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When core 0 writes to a Shared line, it broadcasts an invalidation to all other cores. They mark their copies Invalid. Core 0's copy becomes Modified. This takes ~40-100 nanoseconds depending on the topology — it has to cross the interconnect, hit the other core's cache controller, and wait for acknowledgment.&lt;/p&gt;

&lt;p&gt;This is the mechanism behind &lt;strong&gt;false sharing&lt;/strong&gt;, one of the most insidious performance bugs in concurrent programming. Two threads write to different variables, but those variables happen to sit in the same 64-byte cache line. The hardware sees writes to the same line from different cores and starts the invalidation ping-pong. Neither thread is doing anything wrong logically, but physically they're fighting over the same cache line. I've seen false sharing cause a 10x slowdown on workloads that looked perfectly parallel.&lt;/p&gt;

&lt;p&gt;The fix is usually alignment padding — force the two variables onto different cache lines. Most languages have annotations for this (&lt;code&gt;alignas(64)&lt;/code&gt; in C++, &lt;code&gt;#[repr(align(64))]&lt;/code&gt; in Rust, &lt;code&gt;CacheLinePad&lt;/code&gt; patterns in Go).&lt;/p&gt;

&lt;h3&gt;
  
  
  Prefetching: The CPU Guesses Your Future
&lt;/h3&gt;

&lt;p&gt;Modern CPUs don't wait for cache misses. They try to predict what you'll access next and fetch it before you ask.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;hardware prefetcher&lt;/strong&gt; monitors your access patterns. Sequential access is the easiest case — if you read cache lines N, N+1, N+2, the prefetcher starts loading N+3, N+4 before you get there. Stride patterns (every 4th element, every 8th) are also detected. Random access patterns defeat the prefetcher entirely.&lt;/p&gt;

&lt;p&gt;This is another reason sequential memory access is fast. You're not only getting 16 array elements per cache line — the prefetcher is loading the &lt;em&gt;next&lt;/em&gt; cache line while you're still processing the current one. The combination means a sequential array scan can approach the theoretical bandwidth of the memory subsystem, while random access pays the full latency penalty on every access.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers That Matter
&lt;/h2&gt;

&lt;p&gt;Here's the latency hierarchy that shapes every performance decision:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Typical Size&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;CPU Cycles (~4 GHz)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L1 cache&lt;/td&gt;
&lt;td&gt;32-64 KB&lt;/td&gt;
&lt;td&gt;~1 ns&lt;/td&gt;
&lt;td&gt;~4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L2 cache&lt;/td&gt;
&lt;td&gt;256 KB - 1 MB&lt;/td&gt;
&lt;td&gt;~4 ns&lt;/td&gt;
&lt;td&gt;~16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3 cache&lt;/td&gt;
&lt;td&gt;8-64 MB&lt;/td&gt;
&lt;td&gt;~12 ns&lt;/td&gt;
&lt;td&gt;~48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DRAM&lt;/td&gt;
&lt;td&gt;16-128 GB&lt;/td&gt;
&lt;td&gt;~50-80 ns&lt;/td&gt;
&lt;td&gt;~200-320&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every step down the hierarchy is roughly 4x slower and 10-100x larger. This isn't a coincidence — it's the design constraint. If L2 were as fast as L1, it would be the same size and cost. If DRAM were as fast as SRAM, we wouldn't need caches.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;memory wall&lt;/strong&gt; is the term for the growing gap between CPU speed and memory speed. In 1985, a CPU cycle and a memory access took roughly the same time. By 2005, the CPU was 1000x faster but memory was only 10x faster. Caches exist entirely to hide this 100x disparity. Every cache hit at L1 means the CPU avoided a 50-80 nanosecond stall — an eternity at 4 GHz.&lt;/p&gt;

&lt;p&gt;This is why data structures matter more than algorithms for many real-world workloads. A linked list traversal — pointer-chasing through random heap locations — defeats prefetching, defeats spatial locality, and hits DRAM latency on nearly every node. A flat array scan of the same data, even with a worse algorithmic complexity, can be faster because every access is an L1 hit.&lt;/p&gt;
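&lt;p&gt;The effect is visible even from a high-level language. A minimal sketch, assuming nothing beyond the standard library: sum the same contiguous buffer twice, once in order and once in a shuffled order. The work and the result are identical; only the access pattern changes, and on most machines the shuffled pass is measurably slower (Python's interpreter overhead mutes the gap, but rarely erases it).&lt;/p&gt;

```python
import random
import time
from array import array

N = 1_000_000
data = array("q", range(N))        # one contiguous buffer of 64-bit ints

seq_idx = list(range(N))
rand_idx = seq_idx[:]
random.shuffle(rand_idx)           # same indices, random visiting order

def sum_by_index(buf, indices):
    total = 0
    for i in indices:
        total += buf[i]
    return total

t0 = time.perf_counter()
s_seq = sum_by_index(data, seq_idx)
t1 = time.perf_counter()
s_rand = sum_by_index(data, rand_idx)
t2 = time.perf_counter()

assert s_seq == s_rand             # identical work, identical answer
print(f"sequential {t1 - t0:.3f}s, shuffled {t2 - t1:.3f}s")
```

&lt;p&gt;In a systems language with the interpreter out of the way, the same experiment routinely shows a several-fold gap.&lt;/p&gt;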

&lt;p&gt;I'm not saying throw away your algorithms textbook. I'm saying the cost model it assumed — that all memory accesses cost the same — hasn't been true since the 1990s.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Stack of a Single Address
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;malloc()&lt;/code&gt; gave you an address. That address maps, through the page table and memory controller, to a specific channel, rank, bank, row, and column — an index into a grid of capacitors that are leaking right now. The memory controller refreshes every row every 64 milliseconds. A sense amplifier destroyed and rebuilt your data the last time anyone read it. Six layers of caching — L1i, L1d, L2, L3, TLB, prefetch buffers — exist to hide the fact that those capacitors take 300+ CPU cycles to respond.&lt;/p&gt;

&lt;p&gt;Every struct you lay out, every array you iterate, every concurrent data structure you design — the cache hierarchy is the invisible judge of your performance. The CPU hasn't been the bottleneck for decades. Memory has.&lt;/p&gt;

&lt;p&gt;And now you know what "memory" actually is: a grid of leaking capacitors, maintained by a refresh circuit running a race against physics, fronted by a hierarchy of SRAM caches playing an elaborate prediction game about what you'll need next.&lt;/p&gt;

&lt;p&gt;The next time a profile shows a stall you can't explain, you'll know where to look.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf" rel="noopener noreferrer"&gt;What Every Programmer Should Know About Memory — Ulrich Drepper (2007)&lt;/a&gt;&lt;/strong&gt; — the definitive deep dive, still relevant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.jedec.org/standards-documents/docs/jesd79-5b" rel="noopener noreferrer"&gt;JEDEC DDR5 SDRAM Standard&lt;/a&gt;&lt;/strong&gt; — the official spec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://users.ece.cmu.edu/~yoMDram/kim-isca14.pdf" rel="noopener noreferrer"&gt;Flipping Bits in Memory Without Accessing Them — Kim et al. (2014)&lt;/a&gt;&lt;/strong&gt; — the original rowhammer paper&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://research.google/pubs/pub35162/" rel="noopener noreferrer"&gt;DRAM Errors in the Wild: A Large-Scale Field Study — Schroeder et al. (2009)&lt;/a&gt;&lt;/strong&gt; — Google's study on real-world DRAM error rates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://igoro.com/archive/gallery-of-processor-cache-effects/" rel="noopener noreferrer"&gt;Gallery of Processor Cache Effects — Igor Ostrovsky&lt;/a&gt;&lt;/strong&gt; — excellent visual demonstrations of cache behavior&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri has mass-produced more cache misses than he'd like to admit, mostly by iterating linked lists. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How a Viking King Ended Up in Your Earbuds</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:04:29 +0000</pubDate>
      <link>https://forem.com/nazq/how-a-viking-king-ended-up-in-your-earbuds-1c71</link>
      <guid>https://forem.com/nazq/how-a-viking-king-ended-up-in-your-earbuds-1c71</guid>
      <description>&lt;h1&gt;
  
  
  How a Viking King Ended Up in Your Earbuds
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Bluetooth: From Runic Initials to 1600 Hops Per Second
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~15 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You paired your headphones this morning. Your phone found them, you tapped "connect," and music started playing. That felt like nothing happened. But between that tap and the first beat of audio, your phone and your headphones agreed on an encryption key, negotiated a codec, established a frequency-hopping pattern across 79 radio channels, and started jumping between those channels 1,600 times per second -- all in a band shared with your WiFi router, your microwave, and every other wireless device in the building.&lt;/p&gt;

&lt;p&gt;The protocol that makes this work is named after a Viking king who's been dead for a thousand years.&lt;/p&gt;

&lt;p&gt;That's not a joke. It's one of the best naming stories in the history of technology.&lt;/p&gt;




&lt;h2&gt;
  
  
  Harald Bluetooth Gormsson
&lt;/h2&gt;

&lt;p&gt;In the mid-900s AD, Denmark was a mess. Warring tribes, fractured loyalties, no unified rule. Into this stepped &lt;strong&gt;Harald Gormsson&lt;/strong&gt;, who became King of Denmark around 958 AD and later King of Norway. He's remembered for two things: unifying the warring Danish tribes under a single kingdom, and converting the Danes to Christianity. His nickname was "Bluetooth" -- most likely from a conspicuously dead tooth that had turned dark blue, though some historians argue it derives from the Old Norse &lt;em&gt;blátand&lt;/em&gt; meaning "dark chieftain."&lt;/p&gt;

&lt;p&gt;Harald's real legacy was political unification. He took factions that refused to talk to each other and brought them under one protocol. One kingdom. One set of rules.&lt;/p&gt;

&lt;p&gt;Fast-forward about a thousand years to 1997.&lt;/p&gt;

&lt;p&gt;Intel engineer &lt;strong&gt;Jim Kardach&lt;/strong&gt; was at a bar in Toronto with Sven Mattisson from Ericsson, and they were wrestling with a problem: IBM, Ericsson, Nokia, Toshiba, and Intel all had competing short-range wireless standards. Each company wanted their own protocol to win. Nobody was budging.&lt;/p&gt;

&lt;p&gt;Kardach had been reading Frans G. Bengtsson's historical novel &lt;em&gt;The Long Ships&lt;/em&gt; -- a book about Vikings and the reign of Harald Bluetooth. The parallel was too perfect. Harald unified warring Scandinavian tribes. This new protocol needed to unify warring tech companies. Kardach proposed "Bluetooth" as a temporary codename while the consortium figured out a real name.&lt;/p&gt;

&lt;p&gt;The consortium was called the &lt;strong&gt;Bluetooth Special Interest Group&lt;/strong&gt; (SIG). They considered names like "RadioWire" and "PAN" (Personal Area Network). Focus groups were run. Marketing decks were produced. But by the time the final name was due, nothing had cleared trademark searches. "Bluetooth" had already stuck in the press and in internal documentation.&lt;/p&gt;

&lt;p&gt;The placeholder became the name.&lt;/p&gt;

&lt;p&gt;And the logo? It's not an abstract design. It's a &lt;strong&gt;bind rune&lt;/strong&gt; -- two runes from the Younger Futhark alphabet merged into a single glyph. &lt;strong&gt;ᚼ&lt;/strong&gt; (Hagall, Harald's H) and &lt;strong&gt;ᛒ&lt;/strong&gt; (Bjarkan, his B), overlaid on each other. Harald's initials, in the alphabet of his own era, sitting on billions of devices a millennium later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tnztmtgwrk6q8cqinwm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tnztmtgwrk6q8cqinwm.jpeg" alt="The Bluetooth logo deconstructed: Hagall and Bjarkan runes merging into the Bluetooth bind rune" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A 10th-century Viking king named the protocol in your pocket. Every time you see that angular logo on a speaker or a pair of earbuds, you're looking at runic initials from before the Norman Conquest of England.&lt;/p&gt;

&lt;p&gt;If like me you went to school in England, you've already met his family — you just don't know it. Harald Bluetooth's son was &lt;strong&gt;Sweyn Forkbeard&lt;/strong&gt;, who invaded England and was crowned king on Christmas Day 1013. Sweyn's son was &lt;strong&gt;Cnut the Great&lt;/strong&gt; — the king every English school kid learns about, and a simple misspelling away from class laughter.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; Cnut ruled England, Denmark, and Norway simultaneously — the North Sea Empire, the largest domain in northern Europe.&lt;/p&gt;

&lt;p&gt;The man who named your wireless protocol is the grandfather of the most powerful Viking king who ever sat on the English throne. Harald himself never invaded England — he was busy unifying Denmark and making the Danes Christian. But his dynasty did. The &lt;a href="https://en.wikipedia.org/wiki/Jelling_stones" rel="noopener noreferrer"&gt;Jelling Stone&lt;/a&gt; he erected — "Denmark's baptismal certificate," still standing in Jutland — declared that he "won for himself all of Denmark and Norway and made the Danes Christian." His grandson won England too.&lt;/p&gt;

&lt;p&gt;If you watched the TV show &lt;a href="https://www.imdb.com/title/tt2306299/" rel="noopener noreferrer"&gt;&lt;em&gt;Vikings&lt;/em&gt;&lt;/a&gt; hoping to spot him, you were about 80 years too late — the show ends around 878 AD, and Harald didn't reign until 958. &lt;a href="https://www.imdb.com/title/tt11311302/" rel="noopener noreferrer"&gt;&lt;em&gt;Vikings: Valhalla&lt;/em&gt;&lt;/a&gt; starts in 1002, about 16 years after he died. He fell right into the gap between both shows. The Viking age had a lot of Haralds; this is the one that matters for your earbuds.&lt;/p&gt;

&lt;p&gt;I think about this every time I pair my headphones. I can't help it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Radio: 79 Channels and a Lot of Jumping
&lt;/h2&gt;

&lt;p&gt;Bluetooth operates in the &lt;strong&gt;2.4 GHz ISM band&lt;/strong&gt; -- the same unlicensed slice of radio spectrum used by WiFi, baby monitors, cordless phones, and microwave ovens. This band is crowded. Catastrophically crowded. The reason Bluetooth works at all in this environment is a technique borrowed from military radio: &lt;strong&gt;frequency hopping spread spectrum&lt;/strong&gt; (FHSS).&lt;/p&gt;

&lt;p&gt;The 2.4 GHz band is divided into 79 channels, each 1 MHz wide, spanning from 2.402 GHz to 2.480 GHz. A Bluetooth connection doesn't sit on one channel. It hops between all 79 channels in a pseudo-random sequence, changing frequency &lt;strong&gt;1,600 times per second&lt;/strong&gt;. That's a new channel every 625 microseconds.&lt;/p&gt;
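&lt;p&gt;The channel map is plain arithmetic -- channel &lt;em&gt;k&lt;/em&gt; sits at 2402 + &lt;em&gt;k&lt;/em&gt; MHz -- and the hop rate falls straight out of the 625-microsecond slot length. A quick sketch:&lt;/p&gt;

```python
def channel_freq_mhz(k):
    """RF frequency of Classic Bluetooth channel k (0 through 78)."""
    if k not in range(79):
        raise ValueError("Classic Bluetooth has channels 0 through 78")
    return 2402 + k

SLOT_US = 625                           # dwell time per hop, in microseconds
hops_per_second = 1_000_000 // SLOT_US  # 1,600 hops every second

print(channel_freq_mhz(0), channel_freq_mhz(78), hops_per_second)
# prints: 2402 2480 1600
```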

&lt;p&gt;Why does this help? Because interference is usually narrowband. Your microwave blasts a few MHz of the 2.4 GHz band with noise. A WiFi router parks on a 20 MHz or 40 MHz chunk. But if your Bluetooth connection is only on any given frequency for 625 microseconds before jumping elsewhere, a microwave blast on channel 34 only corrupts 1 out of 79 hops. The other 78 get through clean. The protocol retransmits the missed packet on the next visit to a clean channel. You never notice.&lt;/p&gt;

&lt;p&gt;The hopping sequence is determined by the &lt;strong&gt;master device's clock and address&lt;/strong&gt;. Both devices in a connection know the sequence. To an outside observer without the sequence, the signal looks like noise spread across the entire band -- which is the entire point. FHSS was originally developed for military communications precisely because it makes signals hard to intercept or jam. It's the same pattern as the internet itself — &lt;a href="https://en.wikipedia.org/wiki/ARPANET" rel="noopener noreferrer"&gt;ARPANET&lt;/a&gt; was a military network designed to survive partial destruction, and now you use its descendant to order takeout. Military technology has a habit of ending up in your living room.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkafew0crr4c9qcl7drve.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkafew0crr4c9qcl7drve.jpeg" alt="Frequency hopping spread spectrum: Bluetooth hops across 79 channels, skipping interference zones" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bluetooth Low Energy (BLE) does things differently -- it uses only 40 channels, each 2 MHz wide, and hops at a different rate. But the principle is identical: don't sit still, don't get jammed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Piconet: One Clock to Rule Them
&lt;/h2&gt;

&lt;p&gt;When Bluetooth devices connect, they form a &lt;strong&gt;piconet&lt;/strong&gt; -- a tiny ad-hoc network with one device running the show.&lt;/p&gt;

&lt;p&gt;The device that initiates the connection becomes the &lt;strong&gt;central&lt;/strong&gt; device (historically called "master" -- the Bluetooth SIG updated the terminology in 2020). The device that accepts becomes the &lt;strong&gt;peripheral&lt;/strong&gt; (formerly "slave"). The central's clock becomes the reference clock for the entire piconet. All frequency hopping is synchronized to it.&lt;/p&gt;

&lt;p&gt;A single piconet supports up to &lt;strong&gt;7 active peripherals&lt;/strong&gt;. That's not arbitrary -- peripheral addresses are 3 bits, giving 8 values, and the all-zero address is reserved for broadcast, leaving 7 for active peripherals. Additional devices can be "parked" -- they stay synchronized to the piconet clock but can't send or receive data until unparked. Up to 255 devices can park on a single piconet.&lt;/p&gt;

&lt;p&gt;What happens when a device needs to be in two piconets at once? Your phone connected to your car stereo &lt;em&gt;and&lt;/em&gt; your smartwatch, for instance? It participates in a &lt;strong&gt;scatternet&lt;/strong&gt; -- overlapping piconets where a device acts as peripheral in one and central in another, time-slicing its radio between the two. This is why audio sometimes stutters briefly when your smartwatch gets a notification while you're streaming music to your car. The phone's radio just timesliced away from your A2DP stream for a moment to handle the watch's piconet.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pairing and Bonding: The Handshake
&lt;/h2&gt;

&lt;p&gt;Before two devices can talk, they need to establish trust. This is &lt;strong&gt;pairing&lt;/strong&gt; -- the initial key exchange. Once paired, the keys are stored so reconnection is instant. That storage is called &lt;strong&gt;bonding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The pairing process has evolved across Bluetooth versions, but the core problem is always the same: two devices that have never met need to agree on a shared secret over a radio link that anyone nearby can listen to.&lt;/p&gt;

&lt;h3&gt;
  
  
  Legacy Pairing (Bluetooth 2.0 and earlier)
&lt;/h3&gt;

&lt;p&gt;Both devices require the same &lt;strong&gt;PIN code&lt;/strong&gt;. You type it on both sides (or one side has a fixed PIN, like "0000" for simple devices). The PIN seeds the key generation. This is crude. A 4-digit PIN has only 10,000 possible values. An attacker recording the pairing exchange can brute-force it in seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secure Simple Pairing (Bluetooth 2.1+)
&lt;/h3&gt;

&lt;p&gt;SSP introduced four models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Numeric Comparison&lt;/strong&gt;: both devices display a 6-digit number, you confirm they match. Uses Elliptic Curve Diffie-Hellman (ECDH) for the key exchange. Secure against passive eavesdropping and man-in-the-middle attacks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Passkey Entry&lt;/strong&gt;: one device displays a number, you type it on the other. For devices where one side has a keyboard but no display.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out of Band (OOB)&lt;/strong&gt;: the pairing information is exchanged via a different channel -- NFC tap, QR code scan. Secure because the attacker would need to compromise both channels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Just Works&lt;/strong&gt;: ECDH key exchange with no user confirmation. Protects against passive eavesdropping but &lt;em&gt;not&lt;/em&gt; against active man-in-the-middle attacks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;"Just Works" exists because headphones don't have screens or keyboards.&lt;/strong&gt; There's no way to display a confirmation number on a $30 pair of earbuds. The trade-off is deliberate: slightly lower security in exchange for the pairing experience not requiring a device with a display. For most consumer audio, this is the right call. For medical devices or payment terminals, it's absolutely not -- which is why those use Numeric Comparison or OOB.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Stacks, One Name: Classic vs BLE
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Classic Bluetooth and Bluetooth Low Energy (BLE) are fundamentally different protocol stacks&lt;/strong&gt; that happen to share a name, a radio band, and a logo.&lt;/p&gt;

&lt;p&gt;Classic Bluetooth (formally "BR/EDR" -- Basic Rate/Enhanced Data Rate) was designed for continuous data streams. Audio. File transfer. Serial port emulation. It keeps a persistent connection, and its power consumption reflects that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bluetooth Low Energy&lt;/strong&gt; (BLE, originally marketed as "Bluetooth Smart") was introduced in the Bluetooth 4.0 specification in 2010. It was designed from scratch for a different use case: devices that send tiny amounts of data infrequently and need to run on a coin cell battery for years. Fitness trackers. Temperature sensors. AirTags.&lt;/p&gt;

&lt;p&gt;The two stacks are so different that early Bluetooth 4.0 chips came in three varieties: Classic-only, BLE-only, and "dual-mode" chips that supported both. Your phone has a dual-mode chip. Your AirTag has a BLE-only chip. The fact that they both say "Bluetooth" on the box is a branding decision, not a technical statement.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Classic (BR/EDR)&lt;/th&gt;
&lt;th&gt;BLE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Designed for&lt;/td&gt;
&lt;td&gt;Continuous streams&lt;/td&gt;
&lt;td&gt;Short bursts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Channels&lt;/td&gt;
&lt;td&gt;79 x 1 MHz&lt;/td&gt;
&lt;td&gt;40 x 2 MHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data rate&lt;/td&gt;
&lt;td&gt;1-3 Mbps&lt;/td&gt;
&lt;td&gt;1-2 Mbps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power&lt;/td&gt;
&lt;td&gt;Milliwatts&lt;/td&gt;
&lt;td&gt;Microwatts (sleep)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Range&lt;/td&gt;
&lt;td&gt;~10-100m&lt;/td&gt;
&lt;td&gt;~10-100m (similar)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connection&lt;/td&gt;
&lt;td&gt;Persistent&lt;/td&gt;
&lt;td&gt;Wake-transmit-sleep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical use&lt;/td&gt;
&lt;td&gt;Audio, file transfer&lt;/td&gt;
&lt;td&gt;Sensors, beacons, tags&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  BLE Advertising: Shouting Into the Void
&lt;/h2&gt;

&lt;p&gt;One of the most elegant things about BLE is that devices can broadcast data &lt;em&gt;without ever pairing&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A BLE device in &lt;strong&gt;advertising mode&lt;/strong&gt; sends out small broadcast packets -- up to 31 bytes of data -- on three dedicated advertising channels (37, 38, and 39, deliberately spread across the band to avoid single-frequency interference). These packets go out at a configurable interval, anywhere from every 20 milliseconds to every 10.24 seconds.&lt;/p&gt;

&lt;p&gt;This is how your AirTag works. It never pairs with the phones that relay its location. It broadcasts a rotating identifier on advertising channels. Every iPhone in range passively picks up these advertisements as part of normal BLE scanning, encrypts the location with Apple's key, and relays it to the Find My network. The AirTag's battery lasts a year because it does almost nothing except advertise a few bytes every couple of seconds.&lt;/p&gt;

&lt;p&gt;This is also how Bluetooth beacons work in retail stores, museums, and airports. The beacon doesn't know or care who's listening. It advertises a UUID, and any app configured to listen for that UUID can react. The beacon is stateless. The intelligence lives on the receiver side.&lt;/p&gt;

&lt;p&gt;The advertising data is structured: each field has a type and a value. Common types include the device name, service UUIDs, manufacturer-specific data, and TX power level (which receivers use to estimate distance based on signal strength). Bluetooth 5.0 extended advertising to allow much larger payloads -- up to 255 bytes in the extended format -- enabling richer broadcasts without requiring a connection.&lt;/p&gt;
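&lt;p&gt;That structure is easy to build by hand. A minimal sketch of a legacy advertising payload -- each field is a length byte, a type byte, and a value. The type codes for Flags (&lt;code&gt;0x01&lt;/code&gt;) and Complete Local Name (&lt;code&gt;0x09&lt;/code&gt;) come from the Bluetooth assigned numbers; the device name is made up:&lt;/p&gt;

```python
def ad_structure(ad_type, value):
    """One advertising data structure: [length, type, value]."""
    if isinstance(value, str):
        value = value.encode("utf-8")
    return bytes([len(value) + 1, ad_type]) + value

FLAGS = 0x01                  # AD type: Flags
COMPLETE_LOCAL_NAME = 0x09    # AD type: Complete Local Name

# Flags value 0x06 = LE General Discoverable plus BR/EDR Not Supported
payload = ad_structure(FLAGS, bytes([0x06])) + ad_structure(
    COMPLETE_LOCAL_NAME, "Thermometer"
)

assert len(payload) in range(32)   # legacy advertising caps out at 31 bytes
print(payload.hex())
```

&lt;p&gt;A scanner walks the payload the same way in reverse: read a length, read a type, slice out the value, repeat.&lt;/p&gt;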




&lt;h2&gt;
  
  
  GATT: The API Layer
&lt;/h2&gt;

&lt;p&gt;When a BLE device does establish a connection (as opposed to passively advertising), the data exchange follows a structured protocol called &lt;strong&gt;GATT&lt;/strong&gt; -- the Generic Attribute Profile. If you've ever worked with a BLE device programmatically, GATT is the API you talked to.&lt;/p&gt;

&lt;p&gt;GATT organizes data into a hierarchy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Profile&lt;/strong&gt;: a collection of services that define a use case (e.g., "Heart Rate Profile")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service&lt;/strong&gt;: a group of related data points, identified by a UUID (e.g., "Heart Rate Service" = &lt;code&gt;0x180D&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Characteristic&lt;/strong&gt;: an individual data value within a service (e.g., "Heart Rate Measurement" = &lt;code&gt;0x2A37&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Descriptor&lt;/strong&gt;: metadata about a characteristic (e.g., units, valid range, notification settings)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A BLE heart rate monitor exposes a Heart Rate Service. Inside that service is a Heart Rate Measurement characteristic that a connected phone can read or subscribe to for notifications. When the value changes, the peripheral pushes an update to the central without the central having to poll.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbf9vatule3fu65i4lenj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbf9vatule3fu65i4lenj.jpeg" alt="BLE GATT hierarchy: Profile, Service, Characteristic, and Descriptor layers with Heart Rate example" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The beauty of GATT is standardization. The Bluetooth SIG has defined hundreds of standard service and characteristic UUIDs. Any heart rate monitor from any manufacturer uses the same UUIDs for the same data. Your running app doesn't need a driver for each brand of chest strap -- it looks for service &lt;code&gt;0x180D&lt;/code&gt; and reads characteristic &lt;code&gt;0x2A37&lt;/code&gt;. The interoperability is baked into the protocol.&lt;/p&gt;
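&lt;p&gt;Decoding that characteristic takes a few lines. A sketch of a &lt;code&gt;0x2A37&lt;/code&gt; parser -- per the standard format, byte 0 is a flags field whose lowest bit says whether the reading is 8-bit or 16-bit little-endian (this sketch ignores the optional energy-expended and RR-interval fields):&lt;/p&gt;

```python
def parse_heart_rate(value):
    """Decode a Heart Rate Measurement (0x2A37) value.

    Byte 0 is a flags field; its lowest bit says whether the reading that
    follows is an 8-bit value or a 16-bit little-endian value.
    """
    flags = value[0]
    if flags % 2:                                  # bit 0 set: 16-bit format
        return int.from_bytes(value[1:3], "little")
    return value[1]                                # bit 0 clear: 8-bit format

print(parse_heart_rate(bytes([0x00, 72])))         # prints 72
print(parse_heart_rate(bytes([0x01, 0x2C, 0x01]))) # prints 300
```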




&lt;h2&gt;
  
  
  Profiles: Why Your Car Can't Display Album Art
&lt;/h2&gt;

&lt;p&gt;Back in Classic Bluetooth territory, the concept of &lt;strong&gt;profiles&lt;/strong&gt; defines what devices can actually &lt;em&gt;do&lt;/em&gt; with their connection. A profile is a specification for a particular use case.&lt;/p&gt;

&lt;p&gt;The ones you interact with daily:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A2DP&lt;/strong&gt; (Advanced Audio Distribution Profile): stereo audio streaming. This is what your headphones use for music.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HFP&lt;/strong&gt; (Hands-Free Profile): phone calls. Mono audio, microphone, call control. This is why call audio sounds worse than music -- it's a different profile with a different codec at a lower bandwidth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HID&lt;/strong&gt; (Human Interface Device): keyboards, mice, game controllers. Uses the same report format as &lt;a href="https://nazquadri.dev/blog/the-layer-below/13-usb/" rel="noopener noreferrer"&gt;USB HID&lt;/a&gt;, which is why Bluetooth keyboards can send the same scan codes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SPP&lt;/strong&gt; (Serial Port Profile): emulates an RS-232 serial port. Beloved by embedded engineers. Gives you a simple bidirectional byte stream between two devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVRCP&lt;/strong&gt; (Audio/Video Remote Control Profile): play, pause, skip, volume. And -- in newer versions -- track metadata like title, artist, and album art.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one is why your car stereo might not display album art from your phone. AVRCP has multiple versions. Version 1.0 only supports play/pause/skip. Version 1.3 added track metadata (title, artist, album). Version 1.6 added album art. If your car's Bluetooth module implements AVRCP 1.3 and your phone sends 1.6 features, the car doesn't understand the album art data. Both sides settle on the highest version they both support.&lt;/p&gt;

&lt;p&gt;This is the profile negotiation problem in miniature. Every "why doesn't X work with Y" Bluetooth complaint usually traces back to mismatched profile versions, not a fundamental incompatibility.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Audio Codec Situation
&lt;/h2&gt;

&lt;p&gt;When you stream music over A2DP, the audio must be compressed to fit through Bluetooth's bandwidth constraints. The codec doing that compression has a massive impact on what you hear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SBC&lt;/strong&gt; (Sub-Band Codec) is the mandatory baseline. Every A2DP device must support it. It was designed in the early 2000s for the computational constraints of the era. It works. It's also noticeably worse than the source material at its default settings. SBC at the standard bitrate of 328 kbps sounds muddy in the highs and lacks spatial detail. It's the reason "Bluetooth audio sounds bad" became conventional wisdom in the 2010s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AAC&lt;/strong&gt; is Apple's preferred codec. iPhones use AAC for Bluetooth audio by default. It sounds substantially better than SBC at the same bitrate, but its quality depends on the encoder implementation — and Android's AAC encoder has historically been worse than Apple's, making AAC-over-Bluetooth inconsistent across platforms. This is the kernel of truth behind every Apple user's claim that their audio sounds better than Android — but it's the &lt;em&gt;encoder&lt;/em&gt;, not the platform. Pair high-end earbuds that support aptX or LDAC with an Android phone and the codec difference evaporates. The Bose QuietComfort Ultras on my desk don't care what Apple thinks about AAC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;aptX&lt;/strong&gt; is Qualcomm's proprietary codec family. aptX Classic targets "CD-like quality." aptX HD pushes to 24-bit/48kHz. aptX Adaptive dynamically adjusts bitrate based on connection quality. All require Qualcomm licensing and hardware support on both ends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LDAC&lt;/strong&gt; is Sony's codec. It can push up to 990 kbps -- nearly three times SBC's standard rate. At its best, it approaches lossless quality over Bluetooth. At its worst (in congested RF environments), it drops to 330 kbps and falls back to SBC-like quality. LDAC is open-sourced and available in Android's AOSP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LC3&lt;/strong&gt; (Low Complexity Communication Codec) is the new standard, part of the Bluetooth 5.2 LE Audio specification. LC3 achieves better subjective quality than SBC at &lt;em&gt;half the bitrate&lt;/em&gt;. The Bluetooth SIG's own listening tests showed that subjects preferred LC3 at 160 kbps over SBC at 345 kbps. LC3 is mandatory for LE Audio devices, which means it will eventually become the universal baseline, replacing SBC's 20-year reign.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Bluetooth Audio Has Latency
&lt;/h2&gt;

&lt;p&gt;The latency you notice when watching video with Bluetooth headphones -- lips moving before words arrive -- comes from multiple sources stacking up.&lt;/p&gt;

&lt;p&gt;The audio codec needs a buffer of samples to compress. SBC's frame size is 128 samples. At 44.1 kHz, that's about 2.9 milliseconds per frame. But the codec usually batches multiple frames before transmission.&lt;/p&gt;

&lt;p&gt;The Bluetooth baseband adds its own buffering. Transmissions are scheduled into 625-microsecond slots, and the controller typically queues several encoded frames per radio packet, adding milliseconds more. The data then crosses the radio link, gets received, buffered, decoded, and sent to the DAC.&lt;/p&gt;

&lt;p&gt;Total end-to-end latency for SBC over Classic Bluetooth is typically &lt;strong&gt;150-250 milliseconds&lt;/strong&gt;. That's noticeable for video. Some phones compensate by delaying the video to match. Some don't. aptX Low Latency claims under 40 ms, but both sides need to support it.&lt;/p&gt;
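&lt;p&gt;The arithmetic is worth doing once. A back-of-the-envelope sketch -- the frame math follows from the numbers above, while the budget entries are illustrative orders of magnitude, not measurements of any particular device:&lt;/p&gt;

```python
SAMPLE_RATE = 44_100        # Hz
SBC_FRAME_SAMPLES = 128

frame_ms = SBC_FRAME_SAMPLES / SAMPLE_RATE * 1000
print(f"one SBC frame holds {frame_ms:.2f} ms of audio")   # about 2.90 ms

# Illustrative budget -- rough orders of magnitude, not measurements:
budget_ms = {
    "encoder batching (a few frames)": 4 * frame_ms,
    "link scheduling and retransmission headroom": 30.0,
    "receiver jitter buffer": 100.0,
    "decode plus DAC": 10.0,
}
total = sum(budget_ms.values())
print(f"rough end-to-end total: {total:.0f} ms")   # lands in the 150-250 ms range
```

&lt;p&gt;The receiver's jitter buffer dominates: it exists so a few corrupted hops can be retransmitted without the music skipping, and that insurance is paid for in latency.&lt;/p&gt;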

&lt;p&gt;&lt;strong&gt;LE Audio changes the game.&lt;/strong&gt; LC3 was designed for lower latency from the start. LE Audio's isochronous channels provide guaranteed timing (unlike Classic Bluetooth's retransmission-based reliability). The target for LE Audio is &lt;strong&gt;20-30 milliseconds&lt;/strong&gt; end-to-end -- below the threshold where most people perceive audio-visual desynchronization. LE Audio also brings &lt;strong&gt;Auracast&lt;/strong&gt; -- broadcast audio where one source can stream to unlimited receivers simultaneously. Think public venue announcements, silent disco, hearing aid loops. One transmitter, every listener.&lt;/p&gt;




&lt;h2&gt;
  
  
  Coexistence: Sharing the 2.4 GHz Band
&lt;/h2&gt;

&lt;p&gt;Bluetooth and WiFi both live in the 2.4 GHz ISM band. They are neighbors who didn't choose each other. The fact that they coexist at all is a small engineering miracle.&lt;/p&gt;

&lt;p&gt;WiFi (802.11b/g/n in 2.4 GHz) uses wide channels -- 20 MHz or 40 MHz. A single WiFi channel covers 20 to 40 of Bluetooth's 79 narrowband channels. WiFi transmits at higher power (up to 100 mW vs Bluetooth's typical 1-2.5 mW). By rights, WiFi should obliterate Bluetooth.&lt;/p&gt;

&lt;p&gt;It doesn't, because of &lt;strong&gt;Adaptive Frequency Hopping&lt;/strong&gt; (AFH). Introduced in Bluetooth 1.2, AFH lets a Bluetooth connection detect which channels are occupied by WiFi or other interference and remove them from the hopping sequence. If channels 20-40 are blasted by a WiFi router, the Bluetooth piconet hops across the remaining 59 channels instead. The hop rate stays the same -- 1,600 per second -- but the pattern avoids the known-bad frequencies.&lt;/p&gt;
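&lt;p&gt;The channel-map idea fits in a few lines. A simplified sketch -- real AFH classifies channels from measured packet errors, but here we just mask everything within 10 MHz of an assumed WiFi center frequency:&lt;/p&gt;

```python
def afh_channel_map(wifi_center_mhz, guard_mhz=10):
    """Bluetooth channel k sits at 2402 + k MHz (k = 0..78).
    Return the channels still usable once everything under the
    interfering WiFi channel is masked out."""
    usable = []
    for k in range(79):
        if abs(2402 + k - wifi_center_mhz) > guard_mhz:
            usable.append(k)
    return usable

# WiFi channel 6 is centered at 2437 MHz
survivors = afh_channel_map(2437)   # 58 channels left to hop across
```

&lt;p&gt;The hop rate doesn't change; only the pattern does. (The spec also requires at least 20 channels stay in the map, a floor this sketch doesn't enforce.)&lt;/p&gt;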

&lt;p&gt;On devices that have both a WiFi and Bluetooth radio (which is every phone, laptop, and tablet), a more sophisticated mechanism kicks in: &lt;strong&gt;coexistence signaling&lt;/strong&gt;. The WiFi and Bluetooth controllers share a physical wire or a bus connection. When the WiFi radio is about to transmit, it signals the Bluetooth controller, which can defer its own transmission by a fraction of a millisecond. And vice versa. This is why your phone's WiFi and Bluetooth don't constantly interfere with each other even though the antennas are millimeters apart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0nys26b2cff5jluem5yq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0nys26b2cff5jluem5yq.jpeg" alt="WiFi and Bluetooth coexistence: Adaptive Frequency Hopping avoids WiFi channels in the 2.4 GHz band" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The 5 GHz and 6 GHz WiFi bands (802.11ac/ax) don't overlap with Bluetooth at all. If your WiFi network runs exclusively on 5 GHz, there's zero contention. This is the simplest fix for the person who Googles "Bluetooth interference WiFi" -- move your WiFi to 5 GHz and the two protocols never see each other.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Part That Gets Me
&lt;/h2&gt;

&lt;p&gt;I keep coming back to the naming.&lt;/p&gt;

&lt;p&gt;In 1997, Jim Kardach was reading a novel about a Viking king who unified Scandinavian tribes. He was sitting across from engineers whose companies couldn't agree on a wireless standard. He saw the parallel and threw out a name as a placeholder.&lt;/p&gt;

&lt;p&gt;That placeholder outlived every serious marketing attempt to replace it. It outlived IBM's exit from the consortium. It outlived the original standard it was created for -- Classic Bluetooth is slowly being superseded by BLE and LE Audio, but the name persists. The runic bind-rune logo persists. Harald Bluetooth Gormsson, dead for over a thousand years, has his initials on more devices than any human in history.&lt;/p&gt;

&lt;p&gt;Harald unified tribes. Bluetooth unifies devices. The metaphor was perfect in 1997, and it's even more perfect now that the protocol connects 5 billion devices annually.&lt;/p&gt;

&lt;p&gt;I wonder what he'd make of it. A 10th-century king who couldn't read Latin, who ruled a country smaller than South Carolina, whose greatest technological achievement was a runestone — and his initials are on 5 billion devices. His name is spoken more often now than it was during his own reign. Not by subjects or chroniclers, but by people saying "turn on Bluetooth" while fumbling with their car stereo.&lt;/p&gt;

&lt;p&gt;He'd probably understand the unification part. Getting Ericsson and Nokia and Intel to agree on a protocol isn't so different from getting Jutland and Zealand to stop fighting. Both require convincing proud, territorial powers that the shared standard serves them better than the private one. Both require someone stubborn enough to hold the table together until the deal is done.&lt;/p&gt;

&lt;p&gt;The frequency hopping, the piconets, the GATT profiles — he wouldn't understand any of it. But the politics? A king who unified warring tribes would recognise a standards body instantly. Same game. Same stakes. Smaller swords. Similar beards.&lt;/p&gt;

&lt;p&gt;Not bad for a placeholder.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.bluetooth.com/specifications/specs/core-specification/" rel="noopener noreferrer"&gt;Bluetooth Core Specification&lt;/a&gt;&lt;/strong&gt; — the full spec from the Bluetooth SIG. Start with Volume 1 (Architecture &amp;amp; Terminology Overview) before diving into the radio or L2CAP layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/The_Long_Ships_(novel)" rel="noopener noreferrer"&gt;The Long Ships&lt;/a&gt;&lt;/strong&gt; — the Frans G. Bengtsson novel that Jim Kardach was reading when he proposed "Bluetooth" as a placeholder name. A Viking adventure that's far better than it has any right to be.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://devzone.nordicsemi.com/guides/short-range-guides/b/bluetooth-low-energy" rel="noopener noreferrer"&gt;Nordic Semiconductor DevZone — BLE Tutorial&lt;/a&gt;&lt;/strong&gt; — Nordic makes the chips inside most BLE peripherals. Their tutorial walks through advertising, connections, GATT profiles, and power optimization with real hardware examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://wiki.wireshark.org/CaptureSetup/Bluetooth" rel="noopener noreferrer"&gt;Wireshark Bluetooth Capture Setup&lt;/a&gt;&lt;/strong&gt; — how to sniff Bluetooth traffic with Wireshark. Requires a compatible sniffer dongle, but seeing actual HCI packets demystifies the protocol faster than any spec chapter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.bluetooth.com/learn-about-bluetooth/recent-enhancements/le-audio/" rel="noopener noreferrer"&gt;Bluetooth LE Audio and the LC3 Codec&lt;/a&gt;&lt;/strong&gt; — the Bluetooth SIG's overview of LE Audio, including the LC3 codec that replaces SBC. Lower bitrate, better quality, and the foundation for Auracast broadcast audio.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri pairs his headphones every morning and thinks about a Viking king every single time. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;The one who (supposedly) commanded the tide to halt. Cnut wasn't trying to prove he had power over the sea -- he was demonstrating to his sycophantic courtiers that royal power has limits. The story is usually told backwards. A thousand years of getting the point wrong. Sounds like a naming committee. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What Happens When You Plug Something In</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:04:14 +0000</pubDate>
      <link>https://forem.com/nazq/what-happens-when-you-plug-something-in-3n50</link>
      <guid>https://forem.com/nazq/what-happens-when-you-plug-something-in-3n50</guid>
      <description>&lt;h1&gt;
  
  
  What Happens When You Plug Something In
&lt;/h1&gt;

&lt;h2&gt;
  
  
  USB: The Protocol That Ate Every Port
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~15 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You plugged in a USB drive. A notification popped up. You dragged some files over. You yanked the cable out without clicking "safely remove" and felt a tiny pang of guilt.&lt;/p&gt;

&lt;p&gt;That guilt is warranted, but not for the reason you think. Between the moment you pushed that connector in and the moment your OS mounted a filesystem, your computer conducted a multi-round negotiation involving device identity, power budgets, endpoint capabilities, and transfer scheduling — all over two twisted wires carrying differential signals at 480 million bits per second. The drive didn't announce itself. Your computer had to &lt;em&gt;ask&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That's the first thing most people get wrong about USB. It's not a peer-to-peer protocol. It's a polled, host-controlled bus where nothing speaks unless spoken to.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Brief History of Cable Hell
&lt;/h2&gt;

&lt;p&gt;Before USB, connecting a peripheral to a PC was an exercise in suffering. Serial ports (RS-232) needed specific baud rates, stop bits, and flow control settings. Parallel ports (DB-25) were faster but required fat cables and had timing problems. PS/2 ports worked for keyboards and mice but nothing else. SCSI gave you throughput but demanded termination resistors and SCSI IDs and the patience of a monk. I am no monk. Every peripheral type had its own connector, its own driver model, its own failure mode.&lt;/p&gt;

&lt;p&gt;In 1994, a consortium of seven companies — Compaq, DEC, IBM, Intel, Microsoft, NEC, and Nortel — decided this was insane. Ajay Bhatt at Intel led the architecture work. The goal: one connector, one protocol, hot-pluggable, self-describing, and capable of powering small devices. USB 1.0 shipped in January 1996.&lt;/p&gt;

&lt;p&gt;It took another two years and the iMac G3 for anyone to actually care. Apple dropped every legacy port on that machine and bet the entire peripheral story on USB. It was either visionary or reckless. It was both.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Physical Layer: Two Wires and a Clever Trick
&lt;/h2&gt;

&lt;p&gt;A USB 2.0 cable has four wires: VBUS (+5V power), GND, D+, and D-. The data travels on D+ and D- using &lt;strong&gt;differential signaling&lt;/strong&gt; — the receiver doesn't look at the voltage on either wire individually. It looks at the &lt;em&gt;difference&lt;/em&gt; between them.&lt;/p&gt;

&lt;p&gt;Why? Noise immunity. If electromagnetic interference hits your cable, it hits both wires roughly equally. The voltage on D+ goes up by 50mV, and the voltage on D- goes up by 50mV. The difference stays the same. The signal survives. This is the same trick used by Ethernet, HDMI, and every other protocol that needs to work in environments full of switching power supplies and WiFi radios.&lt;/p&gt;

&lt;p&gt;The bus has two data states: on a full-speed link, the "J state" is D+ high, D- low, and the "K state" is the reverse (low-speed links swap the two). Bits don't map directly onto states. The encoding scheme is &lt;strong&gt;NRZI&lt;/strong&gt; -- Non-Return-to-Zero Inverted -- in which a 0 bit causes a transition between J and K and a 1 bit holds the current state. To prevent long runs of 1s from losing clock sync, the protocol uses &lt;strong&gt;bit stuffing&lt;/strong&gt;: after six consecutive 1 bits, a 0 is inserted. The receiver strips it out. This means USB's actual data throughput is slightly less than the raw bit rate.&lt;/p&gt;
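&lt;p&gt;Both tricks are a few lines each. A sketch of the transmit side -- stuff a 0 after every six consecutive 1s, then NRZI-encode (a 0 toggles the line, a 1 holds it):&lt;/p&gt;

```python
def bit_stuff(bits):
    """Insert a 0 after every run of six consecutive 1 bits."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 6:
            out.append(0)   # forced transition keeps receiver clocks in sync
            run = 0
    return out

def nrzi_encode(bits, start_level=1):
    """NRZI: a 0 bit toggles the line level, a 1 bit holds it."""
    level, line = start_level, []
    for b in bits:
        if b == 0:
            level = 1 - level
        line.append(level)
    return line
```

&lt;p&gt;Feed in seven 1 bits and the stuffer emits eight -- that forced transition is exactly the overhead that shaves real throughput below the raw bit rate.&lt;/p&gt;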

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6w38io34fozon2xhg0e.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6w38io34fozon2xhg0e.jpeg" alt="USB differential signaling — D+ and D- waveforms with common-mode noise rejection" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;USB 3.0 (SuperSpeed) added a completely separate set of wires — two differential pairs for a full-duplex link — on top of the existing USB 2.0 wires. That's why USB 3.0 cables are thicker. That's also why a USB 3.0 device works in a USB 2.0 port at USB 2.0 speeds: the old wires are still there, doing the same job they always did.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Speed Naming Disaster
&lt;/h2&gt;

&lt;p&gt;USB's speed grades are a masterclass in how not to name things. The USB Implementers Forum — the standards body that governs USB — has renamed speeds multiple times, creating a layered mess where the marketing name, the spec name, and the thing developers actually say are three different strings.&lt;/p&gt;

&lt;p&gt;Here's what actually matters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;Raw Speed&lt;/th&gt;
&lt;th&gt;Spec Name&lt;/th&gt;
&lt;th&gt;What People Say&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;USB 1.0&lt;/td&gt;
&lt;td&gt;1.5 Mbit/s&lt;/td&gt;
&lt;td&gt;Low Speed&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB 1.1&lt;/td&gt;
&lt;td&gt;12 Mbit/s&lt;/td&gt;
&lt;td&gt;Full Speed&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB 2.0&lt;/td&gt;
&lt;td&gt;480 Mbit/s&lt;/td&gt;
&lt;td&gt;Hi-Speed&lt;/td&gt;
&lt;td&gt;"USB 2"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB 3.0&lt;/td&gt;
&lt;td&gt;5 Gbit/s&lt;/td&gt;
&lt;td&gt;SuperSpeed (now "USB 3.2 Gen 1")&lt;/td&gt;
&lt;td&gt;"USB 3"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB 3.1&lt;/td&gt;
&lt;td&gt;10 Gbit/s&lt;/td&gt;
&lt;td&gt;SuperSpeed+ (now "USB 3.2 Gen 2")&lt;/td&gt;
&lt;td&gt;"USB 3.1"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB 3.2&lt;/td&gt;
&lt;td&gt;20 Gbit/s&lt;/td&gt;
&lt;td&gt;SuperSpeed+ (now "USB 3.2 Gen 2x2")&lt;/td&gt;
&lt;td&gt;(confused silence)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB4&lt;/td&gt;
&lt;td&gt;40 Gbit/s&lt;/td&gt;
&lt;td&gt;USB4 Gen 3x2&lt;/td&gt;
&lt;td&gt;"Thunderbolt, sort of"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USB4 v2.0&lt;/td&gt;
&lt;td&gt;80 Gbit/s&lt;/td&gt;
&lt;td&gt;USB4 Gen 4&lt;/td&gt;
&lt;td&gt;¯\_(ツ)_/¯&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note that "Full Speed" is the &lt;em&gt;slow one&lt;/em&gt;. 12 Mbit/s. It was full speed in 1998. The name stuck forever. This is what happens when you name speeds after marketing adjectives instead of numbers.&lt;/p&gt;

&lt;p&gt;Also note that what was once called "USB 3.0" was retroactively renamed to "USB 3.2 Gen 1" — a spec version number that didn't exist when the product shipped. Cable and device manufacturers print whichever name they feel like. The result is that "USB 3.2" on a box could mean 5, 10, or 20 Gbit/s. Good luck out there!&lt;/p&gt;




&lt;h2&gt;
  
  
  Enumeration: The Descriptor Dance
&lt;/h2&gt;

&lt;p&gt;This is the part most developers never see, and it's the part that makes USB actually work.&lt;/p&gt;

&lt;p&gt;When you plug in a device, here's what happens:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Electrical detection.&lt;/strong&gt; The host sees a voltage change on D+ or D-. A full-speed device pulls D+ high through a 1.5kΩ resistor (a high-speed device attaches as full-speed first, then negotiates up with a chirp sequence). A low-speed device pulls D- high instead. The host now knows something is there, and it knows the speed class.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Reset.&lt;/strong&gt; The host drives both D+ and D- low for at least 10ms. This is a bus reset. It tells the device to abandon any previous state and start fresh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Default address.&lt;/strong&gt; After the reset, the device listens on address 0. Every USB device starts its life as address 0. Only one device can be at address 0 at a time — this is why there's a brief period after plugging in where the host won't detect a second new device.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Get Device Descriptor.&lt;/strong&gt; The host sends a control transfer to address 0, asking for the device's &lt;strong&gt;device descriptor&lt;/strong&gt; — an 18-byte structure containing the vendor ID, product ID, device class, number of configurations, and the USB spec version the device supports. This is the handshake. "Who are you?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Set Address.&lt;/strong&gt; The host assigns a unique address (1–127) to the device. From now on, the device only responds to that address. Address 0 is free for the next newcomer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Get Configuration Descriptor.&lt;/strong&gt; The host asks for the full configuration descriptor, which describes every interface and endpoint the device offers. A webcam might report a video interface with an isochronous endpoint and an audio interface with another isochronous endpoint. A USB drive reports a mass storage interface with two bulk endpoints (in and out).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Set Configuration.&lt;/strong&gt; The host picks a configuration (most devices have exactly one) and activates it.&lt;/p&gt;

&lt;p&gt;That's 7 steps, multiple round-trip transactions, all before a single byte of actual data moves. The whole process takes somewhere between 100ms and a few seconds, depending on the device and the OS.&lt;/p&gt;
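&lt;p&gt;The device descriptor from Step 4 is just an 18-byte little-endian struct. A sketch of pulling out the headline fields (offsets follow the USB 2.0 spec; the example bytes are made up):&lt;/p&gt;

```python
def parse_device_descriptor(buf):
    """Extract the headline fields from an 18-byte USB device
    descriptor. Multi-byte fields are little-endian."""
    assert len(buf) == 18 and buf[1] == 0x01   # bDescriptorType: DEVICE
    def le16(off):
        return int.from_bytes(buf[off:off + 2], "little")
    return {
        "bcdUSB": le16(2),        # spec version in BCD, 0x0200 = USB 2.0
        "bDeviceClass": buf[4],   # 0x00 = class declared per-interface
        "idVendor": le16(8),
        "idProduct": le16(10),
        "bNumConfigurations": buf[17],
    }

# made-up example: a USB 2.0 device, vendor 0x1209, product 0x0001, one config
raw = bytes([18, 1, 0x00, 0x02, 0, 0, 0, 64,
             0x09, 0x12, 0x01, 0x00, 0x00, 0x01, 1, 2, 3, 1])
info = parse_device_descriptor(raw)
```

&lt;p&gt;On Linux, &lt;code&gt;lsusb -v&lt;/code&gt; shows you these same fields for every enumerated device.&lt;/p&gt;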

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkcuj27be7xmwgpdwuty.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkcuj27be7xmwgpdwuty.jpeg" alt="USB enumeration sequence — 7-step host-device handshake from detection to configuration" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why some USB devices take a moment to "appear"&lt;/strong&gt; — your OS isn't being lazy. It's running a seven-step interrogation the device never saw coming. Nobody expects the Spanish Inquisition! 🦜&lt;/p&gt;




&lt;h2&gt;
  
  
  Device Classes: Why Your Webcam and Keyboard Share a Protocol
&lt;/h2&gt;

&lt;p&gt;USB defines &lt;strong&gt;device classes&lt;/strong&gt; — standardized interfaces that let any OS talk to any device of that type without a vendor-specific driver. This is the reason you can plug a keyboard into any computer and have it work immediately. The keyboard announces "I'm a HID device" during enumeration, and the OS loads its built-in HID driver.&lt;/p&gt;

&lt;p&gt;The important ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://nazquadri.dev/blog/the-layer-below/01-pressing-a-key/" rel="noopener noreferrer"&gt;HID (Human Interface Device)&lt;/a&gt;&lt;/strong&gt; — keyboards, mice, game controllers, touchscreens. Reports are small fixed-size packets sent on interrupt endpoints. The HID spec defines a report descriptor language that lets devices describe their own button layouts and axis mappings. It's surprisingly expressive and surprisingly painful to parse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mass Storage&lt;/strong&gt; — USB drives, external SSDs. Uses bulk transfers and speaks SCSI commands over USB (a protocol called BOT — Bulk-Only Transport). Your thumb drive is pretending to be a SCSI disk from the 1990s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDC (Communications Device Class)&lt;/strong&gt; — serial ports, network adapters. This is what Arduino boards use to show up as &lt;code&gt;/dev/ttyACM0&lt;/code&gt;. CDC-ACM (Abstract Control Model) emulates a serial port. CDC-ECM and CDC-NCM emulate Ethernet interfaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio&lt;/strong&gt; — microphones, speakers, DACs. Uses isochronous transfers for guaranteed timing. The class defines controls for volume, mute, sample rate selection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video&lt;/strong&gt; — webcams (UVC — USB Video Class). Also isochronous. Defines format negotiation so the host can request specific resolutions and frame rates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The beauty of the class system is substitutability. Any UVC-compliant webcam works with any UVC driver. That's not a given — it's an engineering achievement. Go try plugging a random printer into a random computer and watch it fail: the Printer device class is a thin spec, and most printers ship with vendor-specific protocols.&lt;/p&gt;




&lt;h2&gt;
  
  
  Endpoints and Pipes: Four Flavors of Data Transfer
&lt;/h2&gt;

&lt;p&gt;Every USB device exposes one or more &lt;strong&gt;endpoints&lt;/strong&gt; — numbered channels (0–15, each direction) that carry data in or out. Endpoint 0 is special: it's the &lt;strong&gt;control endpoint&lt;/strong&gt;, used for enumeration and device management. Every device must have it.&lt;/p&gt;

&lt;p&gt;The other endpoints use one of three transfer types, and which one a device picks determines everything about its performance characteristics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bulk transfers&lt;/strong&gt; — reliable, no timing guarantee. The host sends data when the bus is free. If a packet gets corrupted, it's retried. Used by mass storage devices and printers. You get correctness, but the data arrives "whenever." Bulk transfers get the leftover bandwidth after all other transfer types are served.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interrupt transfers&lt;/strong&gt; — small, periodic, guaranteed latency. The host polls the device at a fixed interval (between 1ms and 255ms for USB 2.0). The device either has data or it doesn't. Used by HID devices. Your keyboard gets polled every 1ms for its current key state. The word "interrupt" is misleading — this is still polling, not a hardware interrupt. The device can't yell at the host. It waits to be asked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isochronous transfers&lt;/strong&gt; — what an amazing word: &lt;em&gt;iso&lt;/em&gt;, same; &lt;em&gt;chronous&lt;/em&gt;, to do with time. These carry real-time feeds: guaranteed bandwidth, no retries. A fixed amount of bus time is reserved at a fixed interval. If a packet is corrupted, it's gone. No do-overs. Used by audio and video devices, because a retransmitted audio sample that arrives 5ms late is worse than a dropped sample. Your ears interpolate. Your audio stack doesn't have time to wait.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why audio uses isochronous transfers&lt;/strong&gt; — correctness is less important than timing. A glitch in audio is a brief click. A stall while waiting for a retransmission is a gap in playback. The protocol picks the lesser evil.&lt;/p&gt;
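&lt;p&gt;The reserved bandwidth is straightforward arithmetic. For stereo 16-bit audio over a full-speed isochronous endpoint, one packet per 1 ms frame:&lt;/p&gt;

```python
def iso_payload_per_frame(sample_rate_hz=44_100, channels=2, sample_bytes=2):
    """Bytes of audio generated per 1 ms USB frame. A non-integer
    rate like 44.1 kHz means the endpoint alternates between 44-
    and 45-sample packets to keep up on average."""
    return sample_rate_hz / 1000 * channels * sample_bytes

cd_audio = iso_payload_per_frame()          # 176.4 bytes per frame
pro_audio = iso_payload_per_frame(48_000)   # 192.0 bytes per frame
```

&lt;p&gt;The host reserves that slice of every frame whether or not the device fills it -- which is the whole point: the slot is there on time, every time.&lt;/p&gt;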




&lt;h2&gt;
  
  
  Hubs and the Tiered Star Topology
&lt;/h2&gt;

&lt;p&gt;USB's physical topology is a &lt;strong&gt;tiered star&lt;/strong&gt;. The host controller is at the root. Hubs branch outward. Devices connect to hubs (or directly to the host's root hub). The spec allows up to 5 levels of hubs and 127 devices total on one bus — a 7-bit address space, because address 0 is reserved for devices mid-enumeration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F338s65izanixahoxwrpd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F338s65izanixahoxwrpd.jpeg" alt="USB tiered star topology — root hub to 5 tiers, 127 devices" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every host controller has a &lt;strong&gt;root hub&lt;/strong&gt; built in — that's what your motherboard's USB ports connect to internally. When you plug a hub into a port, the hub itself goes through enumeration (it's a USB device too, class 09h). Then the hub monitors its downstream ports for new connections and reports them to the host.&lt;/p&gt;

&lt;p&gt;The host manages &lt;em&gt;all&lt;/em&gt; traffic. A hub doesn't make routing decisions. It's a repeater with port management. At USB 2.0 speeds, all devices on a hub share bandwidth. A hub connected to a 480 Mbit/s port gives each downstream device a time slice of that 480 Mbit/s, not 480 Mbit/s each.&lt;/p&gt;

&lt;p&gt;USB 3.0 changed this — a SuperSpeed hub is effectively two hubs in one package, carrying USB 2.0 and USB 3.0 traffic over separate wire pairs, so a slow device on one port doesn't starve a fast device on another. But the fundamental model — host-scheduled, star-topology, no peer-to-peer — remains.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Many Devices Can You Actually Connect?
&lt;/h3&gt;

&lt;p&gt;Your laptop has, say, 3 USB-C ports. Each is a root hub port. Plug a 7-port hub into each, and those hubs can cascade further. The math: 3 ports × 5 tiers of 7-port hubs = far more than 127. The address space is the hard ceiling, not the physical ports.&lt;/p&gt;

&lt;p&gt;But long before you hit 127 devices, reality intervenes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Power.&lt;/strong&gt; Each USB 2.0 port supplies 500mA at 5V (2.5W). A 7-port hub without its own power adapter divides that 2.5W across all downstream devices. Four bus-powered hubs deep and each device gets fractions of a watt. External hard drives spin down. Webcams refuse to enumerate. Your Keychron keyboard works fine because it draws 100mA — keyboards are cheap dates.&lt;/p&gt;
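&lt;p&gt;The division is brutal enough to sketch. A toy model that just splits a port's 2.5W budget evenly at each tier of bus-powered hubs (real hubs burn some power themselves, and the spec actually caps a bus-powered hub's downstream ports at 100mA each):&lt;/p&gt;

```python
def per_device_budget_mw(port_mw=2500, hub_ports=7, tiers=1):
    """Power left per leaf device after cascading bus-powered
    7-port hubs off a single 2.5 W USB 2.0 port. Simplified:
    assumes an even split and lossless hubs."""
    budget = float(port_mw)
    for _ in range(tiers):
        budget = budget / hub_ports
    return budget

one_deep = per_device_budget_mw(tiers=1)   # ~357 mW per device
two_deep = per_device_budget_mw(tiers=2)   # ~51 mW: spinning disks need not apply
```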

&lt;p&gt;&lt;strong&gt;Bandwidth.&lt;/strong&gt; All USB 2.0 devices behind a hub share 480 Mbit/s (realistically ~280 Mbit/s after protocol overhead). Two external drives on the same USB 2.0 hub will each get ~140 Mbit/s. Add a webcam streaming 1080p and the drives slow to a crawl. USB 3.x raises the ceiling to 5 Gbit/s and beyond, though devices behind the same hub still share the upstream link. And that only holds if every hub in the chain is USB 3.x. One USB 2.0 hub in the chain and everything downstream drops to 2.0 speeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enumeration storms.&lt;/strong&gt; Plug in a powered hub with 7 devices attached and the host controller has to enumerate all of them — sequentially. Each device goes through the descriptor dance: reset, address 0, Get Device Descriptor, Set Address, Get Configuration, Set Configuration. Seven devices × ~100ms each = nearly a second of the host controller doing nothing but paperwork. Plug in a hub with daisy-chained hubs below it and you can stall enumeration for several seconds. That's the pause you feel when you plug in a USB dock.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real limit for most people isn't 127 devices. It's power and bandwidth.&lt;/strong&gt; This is why every serious USB setup has powered hubs, and why Thunderbolt docks exist — they provide dedicated PCIe lanes and power delivery that USB's shared-bus model can't match.&lt;/p&gt;




&lt;h2&gt;
  
  
  USB-C: The Connector That Knows Too Much
&lt;/h2&gt;

&lt;p&gt;USB-C is not a speed. It's a &lt;strong&gt;connector&lt;/strong&gt; — a reversible 24-pin plug (remember fumbling with the old connector in the dark? 😖) that can carry USB 2.0, USB 3.x, USB4, Thunderbolt 3/4, DisplayPort, HDMI (via alt mode), analog audio, and power delivery up to 240W. All through the same physical port.&lt;/p&gt;

&lt;p&gt;This flexibility is also why USB-C cables are a minefield.&lt;/p&gt;

&lt;p&gt;The connector has 24 pins, but not all cables wire all 24. A USB-C cable that only carries USB 2.0 has 4 active data pins (D+/D-) plus power and ground. A USB 3.2 cable adds the SuperSpeed pairs. A Thunderbolt 4 cable wires everything and includes signal conditioning electronics in the connector itself (active cable). They all look identical from the outside. Buyer beware!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration Channel (CC) pins&lt;/strong&gt; — two pins on the connector used for cable detection, orientation (which way is "up" for the reversible plug), and capability negotiation. When you plug in a USB-C cable, the CC pins are the first thing that talks. They determine what the cable can carry before any data flows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;USB Power Delivery (PD)&lt;/strong&gt; runs over the CC pins as a separate protocol layer. It negotiates voltage and current between source and sink. The original USB spec allowed 5V at 100mA (500mW). USB PD 3.1 allows up to 48V at 5A (240W). That's a 480x increase in power delivery through a connector lineage that started as a peripheral attachment protocol. PD negotiation uses structured messages — the source advertises what it can provide (called PDOs — Power Data Objects), and the sink requests what it needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alternate Modes&lt;/strong&gt; let the USB-C connector carry non-USB protocols. DisplayPort alt mode repurposes the SuperSpeed lanes to carry DisplayPort signals. Thunderbolt alt mode does the same for Thunderbolt/PCIe. The CC pins negotiate which alt mode to use. This is why a single USB-C port on a laptop can drive a 4K display, charge the laptop, and connect a USB hub — different pin functions running simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikfbj7gvq3wo2682f5cr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikfbj7gvq3wo2682f5cr.jpeg" alt="USB-C 24-pin connector layout with color-coded pin groups" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why "some USB-C cables don't work"&lt;/strong&gt; — a cable that only wires USB 2.0 pins physically cannot carry a DisplayPort signal. The electronics aren't there. The connector fits, the protocol fails, and you blame the monitor.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why "Safely Remove" Exists
&lt;/h2&gt;

&lt;p&gt;You've yanked a USB drive out mid-use. Maybe nothing happened. Maybe you lost a file. The reason "safely remove" exists is a combination of write caching and in-flight transfers.&lt;/p&gt;

&lt;p&gt;When your OS writes to a USB mass storage device, it doesn't necessarily send every write immediately. The OS may &lt;strong&gt;write-cache&lt;/strong&gt; — hold dirty pages in RAM and flush them to the device in batches. This is faster (fewer small transactions on a bulk endpoint) but means that at any given moment, the device's on-disk state may be behind what the OS thinks it wrote.&lt;/p&gt;

&lt;p&gt;Clicking "safely remove" does three things: flushes all cached writes, completes any in-flight transfers, and then tells the device it's safe to power down. The OS unmounts the filesystem first, so no new writes can start.&lt;/p&gt;

&lt;p&gt;On Linux, the kernel's block layer handles this via &lt;code&gt;sync&lt;/code&gt; and cache flush commands sent as SCSI SYNCHRONIZE CACHE over the USB mass storage protocol. On macOS, the same logic runs through IOKit. On Windows, it's the "surprise removal" path if you yank without ejecting.&lt;/p&gt;

&lt;p&gt;Modern operating systems (macOS since Catalina, Windows 10 since 1809) default to a "quick removal" policy for USB drives — write caching is disabled by default, so every write goes to the device immediately. This makes safe removal less critical but makes writes slower. The trade-off: you can yank the cable guilt-free, but large file copies take longer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why copying to a USB drive feels slower than copying between internal disks&lt;/strong&gt; — beyond the raw interface speed difference, the default write policy adds per-transaction latency.&lt;/p&gt;




&lt;h2&gt;
  
  
  Power Delivery: From 500mA to Charging Laptops
&lt;/h2&gt;

&lt;p&gt;The original USB 1.0 spec in 1996 provided 5V at 100mA — 500mW, barely enough to power a mouse. A device could request a "high-power" configuration of 500mA (2.5W) during enumeration. This was elegant: the host knew exactly how much power each device needed because the device said so in its configuration descriptor.&lt;/p&gt;

&lt;p&gt;Then phones happened. People wanted to charge phones via USB. The Battery Charging Specification (BC 1.2) introduced "dedicated charging ports" that could supply 1.5A at 5V — 7.5W. Detection used the D+/D- lines (shorting them together signaled a charging port). It worked, sort of. It also created a mess of proprietary "quick charge" protocols where chargers and phones did voltage negotiation outside the USB spec.&lt;/p&gt;

&lt;p&gt;USB Power Delivery cleaned this up. PD is a protocol that runs on the CC pins of USB-C connectors, completely independent of the data lines. It supports multiple voltage levels (5V, 9V, 15V, 20V, and with PD 3.1's Extended Power Range, 28V, 36V, and 48V) and programmable current limits.&lt;/p&gt;

&lt;p&gt;A PD source advertises its capabilities. A PD sink requests what it needs. The negotiation happens in milliseconds. A laptop charger providing 20V at 5A (100W) and a phone requesting 9V at 3A (27W) use the same protocol, the same connector, and potentially the same cable.&lt;/p&gt;

&lt;p&gt;PD 3.1 Extended Power Range (EPR) pushes this to 48V at 5A — 240W. That's enough to power a high-end gaming laptop. From a protocol that started powering mice. The specification required new safety mechanisms: EPR cables must be electronically marked (the cable has a chip in its plug that reports its voltage rating to the source), and the source and sink continuously monitor for faults.&lt;/p&gt;
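&lt;p&gt;The advertisement itself is compact: each source capability is a 32-bit Power Data Object. For a fixed-supply PDO, the PD spec packs the voltage into bits 19:10 (50 mV units) and the maximum current into bits 9:0 (10 mA units). A decoding sketch, not a full PD stack (helper names are mine):&lt;/p&gt;

```c
/* Decode a USB PD "Fixed Supply" Power Data Object (PDO).
 * Bits 9:0 = max current in 10 mA units; bits 19:10 = voltage
 * in 50 mV units. Sketch only -- a real PD stack does much more. */
#include <stdint.h>

unsigned pdo_voltage_mv(uint32_t pdo) { return ((pdo >> 10) & 0x3FF) * 50u; }
unsigned pdo_current_ma(uint32_t pdo) { return (pdo & 0x3FF) * 10u; }

unsigned pdo_power_mw(uint32_t pdo) {
    /* e.g. a 20 V / 5 A capability advertises as 100 W */
    return (unsigned)((uint64_t)pdo_voltage_mv(pdo) * pdo_current_ma(pdo) / 1000u);
}
```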

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3vp6y6m8hq1c7j709f9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3vp6y6m8hq1c7j709f9.jpeg" alt="USB power delivery evolution — from 0.5W in 1996 to 240W in 2021" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Polled Bus: Everything Waits for the Host
&lt;/h2&gt;

&lt;p&gt;I keep coming back to this because it's the thing that surprises developers most: nothing on a USB bus talks unless the host asks it to.&lt;/p&gt;

&lt;p&gt;Your keyboard doesn't fire an interrupt when you press a key. The host controller sends an IN token packet to the keyboard's interrupt endpoint every 1ms. The keyboard either responds with data or responds with NAK (nothing to report). The host controller is doing this for &lt;em&gt;every device on the bus&lt;/em&gt;, continuously, interleaving transactions across all active endpoints.&lt;/p&gt;

&lt;p&gt;The host controller hardware (EHCI for USB 2.0, xHCI for USB 3.x) maintains a schedule — a frame-based timeline of transactions. Each USB frame is 1ms (USB 2.0) or 125 microseconds (USB 2.0 microframes / USB 3.x). The controller walks through the schedule every frame, issuing IN and OUT token packets, collecting responses, and reporting results to the OS via &lt;a href="https://nazquadri.dev/blog/the-layer-below/04-reading-a-file/" rel="noopener noreferrer"&gt;DMA&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from PCIe, where devices can initiate transactions (bus mastering). USB's host-controlled model is simpler, more predictable, and easier to secure — a malicious USB device can't flood the bus because it never gets to talk without permission. But it also means latency is bounded by the polling rate. You can't go faster than "one sample per frame."&lt;/p&gt;
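&lt;p&gt;A toy model makes the asymmetry concrete: the device never speaks first. (This is a deliberately simplified sketch, not a real controller interface; EHCI/xHCI do this in hardware, across every endpoint, every frame.)&lt;/p&gt;

```c
/* Toy model of USB's polled bus: the host walks its schedule each
 * frame and issues IN tokens; a device with nothing queued answers NAK. */
#include <stdbool.h>

typedef struct {
    bool has_data;   /* does the endpoint have something to report? */
    int  data;
} endpoint;

/* One IN transaction: returns true (and fills *out) on data, false on NAK. */
bool host_poll(endpoint *ep, int *out) {
    if (!ep->has_data) return false;   /* device NAKs: nothing to say */
    *out = ep->data;
    ep->has_data = false;              /* report consumed this frame */
    return true;
}
```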




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Every wire we've talked about today requires a physical connection. That cable between your keyboard and your computer is a guarantee — a dedicated channel, a known speed, a predictable latency.&lt;/p&gt;

&lt;p&gt;Bluetooth throws all of that away. It shares a radio band with WiFi, microwaves, and baby monitors. It hops between 79 channels, 1600 times per second, to avoid interference. It negotiates connections in a mesh where every device is both a peer and a potential relay. And somehow, your AirPods still manage to play music while your keyboard types and your mouse tracks — all on the same 2.4 GHz band. And then there's a Viking ...&lt;/p&gt;

&lt;p&gt;⣴⣶⣶⠶⢤⣄⠀⣤⣤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀&lt;br&gt;
⠈⠉⠛⢿⣆⠈⠻⢮⡙⢿⣯⠓⢤⣀⣀⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠙⣇⠀⠀⠹⣆⢹⣿⣿⣿⣿⣿⣿⣍⠛⠳⢦⣀⠀⠀⠀⠀⠀⠀⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠹⣆⠀⠀⠈⢻⣿⣿⣿⣿⣿⣿⣿⣷⡘⢆⠙⢷⣄⠀⠀⠀⠀⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠀⢹⡄⠀⠀⠀⠹⣿⣿⣿⣿⣿⣿⣿⣷⠘⣯⡀⠙⢷⡄⠀⠀⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠀⠈⢿⡀⢻⣀⠀⠈⠙⠛⢿⣿⠿⠿⠿⠷⠤⠤⠤⠴⠿⠦⠀⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠀⠀⣸⣷⡄⠙⢧⣄⠀⠀⠈⣿⣶⣶⣶⠖⠒⠒⠒⠒⠒⠒⠒⠂⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠀⢠⣿⣏⣿⣦⡀⠈⠀⣀⣼⣯⠙⣿⣇⠐⠿⠀⠀⠿⠂⠀⠺⠇⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠀⣸⣿⣿⣿⣿⣿⣿⣿⣿⡿⢿⣿⡿⣿⠷⣶⡶⢶⣶⣶⣶⣶⡶⠆&lt;br&gt;
⠀⠀⠀⠀⠀⣀⣿⣿⡯⢂⣿⣿⣿⣿⣫⡄⢠⡿⢡⡟⠀⠉⠙⣿⣿⣿⡿⣯⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠘⣿⣿⣿⣾⡿⠛⠙⠻⣿⣯⣶⠟⢠⠞⠀⠀⠉⠉⠛⠋⠀⢣⠹⣷⡀&lt;br&gt;
⠀⠀⠀⠀⢸⣿⣿⣿⡟⠉⠙⠢⠀⠘⠋⠀⢊⣁⣤⢷⣤⣠⣄⡀⠀⢠⣆⣀⣽⡿&lt;br&gt;
⠀⠀⠀⢀⣼⣿⣴⡿⠉⠓⠄⠀⣴⡿⠿⠿⠛⠋⠈⠀⠀⠀⢀⣤⡶⡾⠛⠛⣿⠀&lt;br&gt;
⠀⠀⠀⠀⢻⠿⣿⡉⠑⠀⠀⣼⣿⣷⠂⠢⣄⣀⠀⢀⣤⠆⡽⢀⣤⣦⣴⡶⠋⠀&lt;br&gt;
⠀⠀⠀⠀⠀⣾⢿⣿⣷⠀⠘⢿⣿⠳⣄⠀⣬⣿⣾⠟⢋⣄⣠⠟⠻⠶⢶⣆⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⣿⣼⣿⣿⡆⠀⠀⠙⣦⡙⣷⣾⠟⠛⢛⣛⣋⡁⠀⡀⢀⢀⣿⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⢸⣿⣿⣿⣿⠀⠀⠀⠘⣿⣿⠷⠚⠋⠉⠉⢻⣿⠿⣷⣿⣾⣿⠀⠀&lt;br&gt;
⠀⠀⠀⠀⠀⠚⠛⠉⠼⠟⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠈⠀⠀⠼⣿⠿⠋⠀⠀&lt;/p&gt;

&lt;p&gt;That's the next layer below.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.usb.org/documents" rel="noopener noreferrer"&gt;USB 2.0/3.2/4 Specifications&lt;/a&gt;&lt;/strong&gt; — the official specs from USB-IF. Dense but definitive; start with the architecture overview chapters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.beyondlogic.org/usbnutshell/usb1.shtml" rel="noopener noreferrer"&gt;USB in a NutShell&lt;/a&gt;&lt;/strong&gt; — Beyond Logic's classic tutorial that walks through USB concepts from endpoints to descriptors. The best free introduction to the protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man8/lsusb.8.html" rel="noopener noreferrer"&gt;lsusb(8) man page&lt;/a&gt;&lt;/strong&gt; — the Linux utility for listing USB devices and decoding their descriptors. Pair with &lt;code&gt;lsusb -v&lt;/code&gt; to see what your kernel actually negotiated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.usb.org/usbc" rel="noopener noreferrer"&gt;USB Type-C and USB Power Delivery&lt;/a&gt;&lt;/strong&gt; — USB-IF's overview of the USB-C connector spec and Power Delivery protocol. Explains why one cable shape now carries data, video, and 240W of power.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://nvmexpress.org/specifications/" rel="noopener noreferrer"&gt;NVM Express Base Specification&lt;/a&gt;&lt;/strong&gt; — the NVMe spec provides an instructive contrast: where USB polls for device status, NVMe uses submission and completion queues with MSI-X interrupts for near-zero-latency I/O.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri owns more USB cables than any reasonable person should and still can't find one that does Thunderbolt. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>malloc Is Not Free</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:03:58 +0000</pubDate>
      <link>https://forem.com/nazq/malloc-is-not-free-3jn2</link>
      <guid>https://forem.com/nazq/malloc-is-not-free-3jn2</guid>
      <description>&lt;h1&gt;
  
  
  malloc Is Not Free
&lt;/h1&gt;

&lt;h2&gt;
  
  
  You Called malloc(64). Here's What Actually Happened.
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You called &lt;code&gt;malloc(64)&lt;/code&gt;. You got a pointer back. You stored your data. You moved on.&lt;/p&gt;

&lt;p&gt;The kernel, the MMU hardware, the page table walker, and possibly the disk were all involved.&lt;/p&gt;

&lt;p&gt;Not necessarily in every malloc call — your allocator is pretty clever about batching all that work. But at some point, behind that innocent function call, something had to go talk to the operating system, the OS had to talk to the memory hardware, and the hardware had to talk to RAM. Understanding where that happens — and when — is the difference between "I wonder why this crashed" and "oh, of course it crashed."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bug That Changed How We Allocate
&lt;/h2&gt;

&lt;p&gt;Picture this: a service allocates a 512MB buffer at startup, does a quick sanity check, then proceeds with the real work. It runs fine in staging. In production, under load, it OOM-kills itself hours later.&lt;/p&gt;

&lt;p&gt;The buffer was allocated. No error. The pointer was valid. The problem was that &lt;em&gt;allocating&lt;/em&gt; memory and &lt;em&gt;having&lt;/em&gt; memory are two different things in Linux — and we'll explain exactly why in the next three sections.&lt;/p&gt;

&lt;p&gt;If you've ever seen a &lt;code&gt;SIGSEGV&lt;/code&gt; on a line that seemed obviously fine, or watched a program get killed by the OOM killer when it had "plenty of memory," or wondered why &lt;code&gt;valgrind&lt;/code&gt; sometimes finds bugs that &lt;code&gt;asan&lt;/code&gt; misses — this is the layer where that happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  The First Player: Your Allocator
&lt;/h2&gt;

&lt;p&gt;When you call &lt;code&gt;malloc(64)&lt;/code&gt;, the first thing that intercepts your request is not the kernel. It's a userspace library — &lt;strong&gt;ptmalloc2&lt;/strong&gt; if you're on glibc (the default on most Linux systems), or jemalloc if you're on Firefox or FreeBSD, or Daan Leijen's mimalloc (2019, Microsoft Research) if you're on something that prizes both speed and memory efficiency.&lt;/p&gt;

&lt;p&gt;The allocator's job is to be fast by being greedy. On its first run, it asks the kernel for a big chunk of memory — far more than your 64 bytes. Then it subdivides that chunk itself, handing out pieces to your program without involving the kernel at all.&lt;/p&gt;

&lt;p&gt;This pool of memory is called the &lt;strong&gt;heap&lt;/strong&gt;. It's not a fixed-size thing. It starts small and grows on demand. The allocator manages it with data structures (linked lists of free blocks, size-class bins, various clever tricks depending on which allocator you're using) and tries hard to satisfy your requests from memory it already has.&lt;/p&gt;

&lt;p&gt;Most &lt;code&gt;malloc&lt;/code&gt; calls never reach the kernel. That's the whole point.&lt;/p&gt;

&lt;p&gt;That's why Java, Python, and Rust programs appear to use far more RAM than you'd expect. Their allocators (and the JVM/runtime heap managers) request large chunks upfront from the kernel to amortize the syscall overhead. The virtual size looks enormous. The resident size — physical pages actually in RAM — is often much smaller.&lt;/p&gt;
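&lt;p&gt;On Linux you can watch that gap without any tooling: &lt;code&gt;/proc/self/statm&lt;/code&gt; reports both numbers, in pages. A small helper (the function is mine) to read them:&lt;/p&gt;

```c
/* Read virtual size and resident set size for the current process.
 * /proc/self/statm's first two fields are total program size and
 * RSS, both measured in pages (Linux-only). */
#include <stdio.h>

/* Returns 0 on success and fills both sizes, in pages. */
int read_statm(long *vsz_pages, long *rss_pages) {
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f) return -1;
    int n = fscanf(f, "%ld %ld", vsz_pages, rss_pages);
    fclose(f);
    return (n == 2) ? 0 : -1;
}
```

&lt;p&gt;Call it before and after a large &lt;code&gt;malloc&lt;/code&gt;: the virtual size jumps immediately, while resident size barely moves until you actually touch the pages.&lt;/p&gt;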

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5t2c88fxsqihjz7qsl1.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5t2c88fxsqihjz7qsl1.jpeg" alt="Userspace memory layout — code, data, heap, stack, kernel" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When the Pool Runs Dry
&lt;/h2&gt;

&lt;p&gt;But eventually, the allocator's pool fills up. You've handed out more than you initially reserved. Time to ask the kernel for more.&lt;/p&gt;

&lt;p&gt;The allocator has two ways to do this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;brk()&lt;/code&gt; / &lt;code&gt;sbrk()&lt;/code&gt;&lt;/strong&gt; — the old way, and still used for small allocations. These syscalls move the "program break" — a pointer that marks the end of the heap segment. Move it up, and the virtual address space between the old break and the new break is yours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;mmap()&lt;/code&gt;&lt;/strong&gt; — the modern way, used for large allocations (typically over 128KB, though the threshold is configurable). &lt;code&gt;mmap()&lt;/code&gt; asks the kernel for a new &lt;em&gt;anonymous&lt;/em&gt; mapping — a fresh region of virtual address space not backed by any file. (You saw &lt;code&gt;mmap&lt;/code&gt; doing file-backed work in &lt;a href="https://nazquadri.dev/blog/the-layer-below/04-reading-a-file/" rel="noopener noreferrer"&gt;&lt;em&gt;What Actually Happens When You Read a File&lt;/em&gt;&lt;/a&gt; — here it's the same syscall, but for memory that doesn't map to a file at all.)&lt;/p&gt;

&lt;p&gt;Both syscalls return to the allocator with a pointer to a new region of virtual address space. The allocator updates its internal bookkeeping, carves off the chunk you asked for, and returns it to you.&lt;/p&gt;
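&lt;p&gt;The &lt;code&gt;mmap()&lt;/code&gt; path is a one-liner to reproduce. This sketch requests an anonymous mapping the way an allocator would (the wrapper name is mine):&lt;/p&gt;

```c
/* What the allocator does when its pool runs dry: ask the kernel
 * for a fresh anonymous mapping. This hands back virtual address
 * space only -- no physical frame exists until a write faults one in. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

void *grow_heap(size_t bytes) {
    void *p = mmap(NULL, bytes,
                   PROT_READ | PROT_WRITE,        /* readable, writable */
                   MAP_PRIVATE | MAP_ANONYMOUS,   /* not backed by any file */
                   -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```

&lt;p&gt;The call returns almost instantly regardless of the size requested, precisely because nothing physical has happened yet.&lt;/p&gt;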

&lt;p&gt;Here's the thing: this all happens &lt;em&gt;extremely&lt;/em&gt; fast. The kernel updates a data structure and returns. No RAM has been touched. No physical memory has been allocated. The pointer you receive points to... nothing, yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Kernel's Comfortable Lie
&lt;/h2&gt;

&lt;p&gt;This is the part that bends everyone's brain the first time: &lt;strong&gt;virtual memory is a lie the kernel tells your program.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your process has a &lt;strong&gt;virtual address space&lt;/strong&gt; — on a 64-bit system, a staggering 128TB of addressable space, minimum. The kernel hands you slices of this virtual space freely. It's cheap. It's just numbers in a data structure.&lt;/p&gt;

&lt;p&gt;Physical RAM is different. Physical RAM is scarce. A machine with 16GB of RAM has 16GB of RAM, and sharing it across dozens of processes requires actual bookkeeping.&lt;/p&gt;

&lt;p&gt;The kernel resolves this tension with a policy called &lt;strong&gt;overcommit&lt;/strong&gt;. When your allocator calls &lt;code&gt;mmap()&lt;/code&gt; for 512MB, the kernel doesn't check whether 512MB of physical RAM is available. It checks whether 512MB of virtual address space is available (almost always yes), updates its virtual memory map, and returns success.&lt;/p&gt;

&lt;p&gt;The assumption is: you'll never actually touch all that memory. Most programs allocate a lot and use a fraction. The kernel is gambling on this, and statistically it wins.&lt;/p&gt;

&lt;p&gt;This is why you can &lt;code&gt;malloc(500GB)&lt;/code&gt; on a machine with 8GB of RAM, get a non-NULL pointer back, and not crash — until you actually try to use it.&lt;/p&gt;

&lt;p&gt;That's why &lt;code&gt;malloc&lt;/code&gt; returning non-NULL doesn't mean you have memory. The kernel said yes to the virtual address. It hasn't committed any RAM. You find out at first write, via a page fault — or, if RAM is truly exhausted, via the OOM killer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// This does NOT allocate 1GB of physical RAM. It reserves&lt;/span&gt;
&lt;span class="c1"&gt;// 1GB of virtual address space. Pages only become real on first write.&lt;/span&gt;
&lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// passes on most Linux systems with default overcommit settings&lt;/span&gt;
&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;           &lt;span class="c1"&gt;// *this* is when things get interesting&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Page Fault: Where Memory Gets Real
&lt;/h2&gt;

&lt;p&gt;You write to your newly allocated memory for the first time. The CPU translates the virtual address to a physical address using the &lt;strong&gt;page table&lt;/strong&gt; — a tree of mappings the kernel maintains for each process.&lt;/p&gt;

&lt;p&gt;The page table entry for your address says: "valid virtual address, but no physical frame assigned yet."&lt;/p&gt;

&lt;p&gt;The CPU raises a &lt;strong&gt;page fault&lt;/strong&gt;. This is a hardware exception — the CPU stops executing your instruction, saves its state, and jumps to the kernel's page fault handler. Your process is suspended, mid-instruction, while the kernel handles it.&lt;/p&gt;

&lt;p&gt;The page fault handler looks at the virtual address and decides what to do. For a freshly allocated anonymous page, the answer is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find a free physical &lt;strong&gt;frame&lt;/strong&gt; (a 4KB chunk of RAM)&lt;/li&gt;
&lt;li&gt;Zero it out — this is a security requirement enforced by the kernel. Without it, you'd get physical memory that still contains another process's data. POSIX mandates zeroing for &lt;code&gt;mmap&lt;/code&gt; with &lt;code&gt;MAP_ANONYMOUS&lt;/code&gt;, but Linux zeroes all new pages regardless, because handing out stale memory is how you get information leaks.&lt;/li&gt;
&lt;li&gt;Write the mapping into the page table: "this virtual address → this physical frame"&lt;/li&gt;
&lt;li&gt;Flush the relevant address translation cache entry (the &lt;strong&gt;TLB&lt;/strong&gt; — more on this in a moment)&lt;/li&gt;
&lt;li&gt;Return to the CPU, which re-executes the faulting instruction&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your process resumes, completely unaware that anything happened. From your code's perspective, you just wrote to a pointer. But you actually just caused a hardware exception, ran kernel code, and allocated physical RAM.&lt;/p&gt;

&lt;p&gt;This is demand paging — the kernel is lazy by design. It doesn't allocate physical memory until you demand it by accessing the page.&lt;/p&gt;
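&lt;p&gt;The kernel keeps score, and you can read it: &lt;code&gt;getrusage()&lt;/code&gt; exposes the per-process minor-fault counter (faults resolved without disk I/O, which is exactly these first-touch faults). The helper name is mine:&lt;/p&gt;

```c
/* How many minor page faults has this process taken so far?
 * Sample before and after touching a fresh region: the delta is
 * roughly one fault per 4KB page on first touch. */
#include <sys/resource.h>

long minor_faults_so_far(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    return ru.ru_minflt;
}
```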

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jb14kw0252elihvewfa.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jb14kw0252elihvewfa.jpeg" alt="Page fault sequence — the process never knows it happened" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Page Table: A Tree the Hardware Walks
&lt;/h2&gt;

&lt;p&gt;Every memory access your process makes — every load, every store, every function call — has to go through address translation. The CPU takes your virtual address and turns it into a physical address.&lt;/p&gt;

&lt;p&gt;This translation happens in hardware, in a component called the &lt;strong&gt;MMU&lt;/strong&gt; (Memory Management Unit). The MMU walks the &lt;strong&gt;page table&lt;/strong&gt; — a multi-level tree structure that the kernel maintains in RAM.&lt;/p&gt;

&lt;p&gt;On x86-64 with the common 4-level configuration, a 64-bit virtual address is split into five fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 63       48 47   39 38   30 29   21 20   12 11        0
┌───────────┬───────┬───────┬───────┬───────┬──────────┐
│  (unused) │  PML4 │  PDP  │  PD   │  PT   │  Offset  │
└───────────┴───────┴───────┴───────┴───────┴──────────┘
     16        9       9       9       9        12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each level is a 512-entry table stored in a physical page. The MMU starts at the PML4 (the root, whose physical address is stored in x86-64's &lt;code&gt;CR3&lt;/code&gt; register — ARM64 calls the equivalent &lt;code&gt;TTBR0_EL1&lt;/code&gt;, but the concept is identical), indexes into it with bits 47–39, follows the pointer to the next level, indexes again, and so on down until it has the physical address of the page. It then adds the 12-bit offset to get the final physical byte address. Every architecture with virtual memory does some version of this walk — the number of levels and register names change, but the tree structure is universal.&lt;/p&gt;

&lt;p&gt;That's four memory reads just to translate one address. On every. Single. Memory. Access.&lt;/p&gt;
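&lt;p&gt;The slicing itself is nothing but shifts and masks. Here's the walk's index extraction in C (function names are mine; the bit positions come from the diagram above):&lt;/p&gt;

```c
/* Slice a canonical x86-64 virtual address into the four table
 * indices plus page offset, exactly as the MMU does during a walk. */
#include <stdint.h>

unsigned pml4_index(uint64_t va) { return (va >> 39) & 0x1FF; }  /* bits 47-39 */
unsigned pdp_index (uint64_t va) { return (va >> 30) & 0x1FF; }  /* bits 38-30 */
unsigned pd_index  (uint64_t va) { return (va >> 21) & 0x1FF; }  /* bits 29-21 */
unsigned pt_index  (uint64_t va) { return (va >> 12) & 0x1FF; }  /* bits 20-12 */
unsigned page_off  (uint64_t va) { return va & 0xFFF; }          /* bits 11-0  */
```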

&lt;p&gt;Obviously, this would be catastrophically slow if it really happened that way. The hardware maintains a cache for recent translations: the &lt;strong&gt;&lt;a href="https://nazquadri.dev/blog/the-layer-below/15-ram/" rel="noopener noreferrer"&gt;TLB&lt;/a&gt;&lt;/strong&gt; (Translation Lookaside Buffer). Most address translations hit the TLB and cost a single cycle or two. When the kernel updates a page table entry — as it does during a page fault — it has to flush the affected TLB entries, or the CPU would use stale translations.&lt;/p&gt;

&lt;p&gt;TLB flushes are one reason context switches are expensive. When the kernel switches from your process to another, it loads the other process's page table root into &lt;code&gt;CR3&lt;/code&gt;, which invalidates (most of) the TLB. The next process has to re-warm the TLB from scratch.&lt;/p&gt;

&lt;p&gt;Spectre and Meltdown exploited the relationship between virtual memory and CPU caches to leak data across process boundaries. The attacks used speculative execution to access memory the process shouldn't see, then measured &lt;em&gt;cache&lt;/em&gt; timing — not TLB timing — to observe what was read. The cache side-channel is the key: a speculatively loaded cache line leaves a timing fingerprint even after the speculative execution is rolled back. I cover the full mechanism in &lt;a href="https://nazquadri.dev/blog/the-layer-below/11-how-code-runs/" rel="noopener noreferrer"&gt;&lt;em&gt;How Your Python Code Actually Runs&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk4bfdeihghvl1izfms9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk4bfdeihghvl1izfms9.jpeg" alt="4-level page table walk — four memory reads, or the TLB shortcut" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When There Are No Free Frames
&lt;/h2&gt;

&lt;p&gt;The page fault handler needs a free physical frame. What if there aren't any?&lt;/p&gt;

&lt;p&gt;This is where the kernel earns its keep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reclaim path.&lt;/strong&gt; The kernel's memory management subsystem maintains lists of physical frames sorted roughly by how recently they were accessed — this is the &lt;strong&gt;LRU&lt;/strong&gt; (Least Recently Used) approximation. When free frames run out, the kernel tries to reclaim some.&lt;/p&gt;

&lt;p&gt;Clean pages — pages whose content matches what's on disk (think: file-backed mmaps, shared libraries) — can be simply discarded. The data is already on disk. If the process touches that page again, it page-faults back in. No data lost, just latency.&lt;/p&gt;

&lt;p&gt;Dirty pages — pages with data that exists only in RAM — have to be written somewhere before the frame can be reused. If swap space is configured, they go to swap. The page table entry is updated to "swapped out, here's the location on disk." If the process touches that page again, it takes a much more expensive page fault (disk I/O), the page is read back from swap, and execution resumes.&lt;/p&gt;

&lt;p&gt;That's why &lt;code&gt;mlock()&lt;/code&gt; exists. If you need to guarantee that a page stays in RAM — real-time audio processing, cryptographic keys that must never touch swap — you call &lt;code&gt;mlock()&lt;/code&gt;. This forces the page fault to happen immediately and pins the physical frame.&lt;/p&gt;
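&lt;p&gt;In code that looks like this (wrapper names are mine; the wipe-before-unlock pattern is the standard discipline for key material):&lt;/p&gt;

```c
/* Pin a buffer in RAM so it can never be swapped out. mlock() also
 * forces the pages to be faulted in and their frames pinned up front. */
#include <string.h>
#include <sys/mman.h>

int lock_secret(void *buf, size_t len) {
    return mlock(buf, len);   /* may fail if RLIMIT_MEMLOCK is too low */
}

/* Wipe before unlocking, so the frame never holds the secret when
 * it's returned to the kernel's free list. */
int release_secret(void *buf, size_t len) {
    memset(buf, 0, len);
    return munlock(buf, len);
}
```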

&lt;p&gt;&lt;strong&gt;When even that isn't enough.&lt;/strong&gt; If the kernel has churned through its reclaim options and still can't find a free frame, it invokes the &lt;strong&gt;OOM killer&lt;/strong&gt;: the out-of-memory killer. This is a last-resort mechanism that selects a process — based on a scoring heuristic that considers memory usage, process priority, how long it's been running, and a few other factors — and kills it with &lt;code&gt;SIGKILL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The process has no say. There's no handler to catch it. One frame you're executing instructions. Next frame, black. No cut scene. No credits. As Henry Hill would say — "And that's that."&lt;/p&gt;

&lt;p&gt;If you've ever had a long-running service vanish with no error, no log entry, no exit code — check &lt;code&gt;/var/log/kern.log&lt;/code&gt; or &lt;code&gt;dmesg&lt;/code&gt; for "Out of memory: Kill process". There's a very good chance the OOM killer found you before you found it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fork and the Copy-On-Write Trick
&lt;/h2&gt;

&lt;p&gt;Demand paging enables one of the most useful cheap operations in Unix: &lt;code&gt;fork()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fork()&lt;/code&gt; duplicates the parent's entire virtual address space, but it doesn't copy any of the physical pages. Instead, it marks all pages copy-on-write (COW): both parent and child map to the same physical frames, but the page table entries are marked read-only. The first write to any page by either party triggers a page fault; the handler copies that one frame, and the two processes get independent copies.&lt;/p&gt;

&lt;p&gt;A thousand-page process can fork in microseconds.&lt;/p&gt;

&lt;p&gt;That's why web servers and shell pipelines can spawn child processes so cheaply — the kernel is not copying megabytes of RAM, it's updating page table entries.&lt;/p&gt;
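&lt;p&gt;You can watch copy-on-write do its thing in a dozen lines. The child's write below triggers the copy; the parent's view is untouched (the function name is mine):&lt;/p&gt;

```c
/* Demonstrate COW: after fork(), parent and child share physical
 * frames until one of them writes. The child's write copies just
 * that one frame; the parent never sees the change. */
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns what the parent sees after the child mutated its copy. */
int cow_demo(void) {
    int *val = malloc(sizeof *val);
    *val = 42;
    pid_t pid = fork();
    if (pid == 0) {          /* child: this write triggers the COW fault */
        *val = 99;           /* the frame is copied; parent unaffected */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    int seen = *val;         /* still 42 in the parent */
    free(val);
    return seen;
}
```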




&lt;h2&gt;
  
  
  The Stack Is Different
&lt;/h2&gt;

&lt;p&gt;One more thing worth knowing: the stack is handled differently from the heap.&lt;/p&gt;

&lt;p&gt;The stack starts at a fixed virtual address and grows downward (on x86). The kernel doesn't actually map all of it upfront — it only maps a few pages, and relies on page faults to extend it when needed. If your function goes deeper, it faults into a new page, the kernel checks that it's still within the stack's allowed region, and extends the mapping.&lt;/p&gt;

&lt;p&gt;This is also why stack overflows aren't caught by &lt;code&gt;malloc&lt;/code&gt; failure — you don't call any allocator function. You just write past the end of the stack's mapped region, hit a page that was deliberately left unmapped (the &lt;strong&gt;guard page&lt;/strong&gt;), and get a segfault. The guard page is not an accident; it's there specifically to catch stack overflows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;infinite_recurse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;   &lt;span class="c1"&gt;// touches a new stack page on each call&lt;/span&gt;
    &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// force the page to be mapped&lt;/span&gt;
    &lt;span class="n"&gt;infinite_recurse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// → SIGSEGV on guard page after ~8000-16000 frames (default 8MB stack)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What You Actually Control
&lt;/h2&gt;

&lt;p&gt;The system we've described is mostly automatic. But you have more control than you might think:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/proc/sys/vm/overcommit_memory&lt;/code&gt;&lt;/strong&gt; — set to &lt;code&gt;2&lt;/code&gt; for strict accounting: the kernel enforces a commit limit (swap plus a configurable fraction of RAM) and refuses allocations beyond it, so &lt;code&gt;malloc&lt;/code&gt; returns NULL up front instead of the OOM killer firing later. This is appropriate for databases and other programs that prefer a clean error to a sudden kill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;mlock()&lt;/code&gt; / &lt;code&gt;mlockall()&lt;/code&gt;&lt;/strong&gt; — pin pages in RAM, prevent swapping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;madvise()&lt;/code&gt;&lt;/strong&gt; — tell the kernel how you plan to use a memory region. &lt;code&gt;MADV_SEQUENTIAL&lt;/code&gt; lets it read ahead aggressively. &lt;code&gt;MADV_FREE&lt;/code&gt; tells it the pages' contents are no longer needed, so the frames can be reclaimed without being written to swap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;malloc_trim()&lt;/code&gt;&lt;/strong&gt; (glibc-specific) — tell the allocator to release unused heap memory back to the kernel. Useful for long-running services that allocate a lot and then don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LD_PRELOAD&lt;/code&gt; swap&lt;/strong&gt; — because allocators are userspace, you can swap them out entirely. Replace ptmalloc with jemalloc or mimalloc by setting &lt;code&gt;LD_PRELOAD&lt;/code&gt;. Companies have done this to get 10–30% memory savings and significant throughput improvements with zero code changes.&lt;/li&gt;
&lt;/ul&gt;
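&lt;p&gt;&lt;code&gt;madvise()&lt;/code&gt; is the easiest of these to try. A sketch using two hints (wrapper names are mine; it uses &lt;code&gt;MADV_DONTNEED&lt;/code&gt;, the stricter, more widely available cousin of &lt;code&gt;MADV_FREE&lt;/code&gt;):&lt;/p&gt;

```c
/* Advisory hints about a mapped region's access pattern. madvise()
 * requires a page-aligned address, and since it's advisory, a
 * failure here is rarely fatal to the program. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

int hint_sequential(void *addr, size_t len) {
    /* "I'll read this front to back" -> kernel may read ahead */
    return madvise(addr, len, MADV_SEQUENTIAL);
}

int hint_done_with(void *addr, size_t len) {
    /* "I no longer need these pages" -> frames reclaimed immediately */
    return madvise(addr, len, MADV_DONTNEED);
}
```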




&lt;h2&gt;
  
  
  The Full Picture
&lt;/h2&gt;

&lt;p&gt;Let's run &lt;code&gt;malloc(64)&lt;/code&gt; one more time, end to end:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your code calls &lt;code&gt;malloc(64)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The allocator checks its free list. If a suitable chunk exists, it returns a pointer immediately. Done.&lt;/li&gt;
&lt;li&gt;If not, the allocator calls &lt;code&gt;mmap()&lt;/code&gt; (or &lt;code&gt;brk()&lt;/code&gt;) to request more virtual address space from the kernel.&lt;/li&gt;
&lt;li&gt;The kernel updates the virtual memory area (VMA) list for your process. No physical RAM involved yet.&lt;/li&gt;
&lt;li&gt;The allocator carves off 64 bytes and returns the pointer.&lt;/li&gt;
&lt;li&gt;Your code writes to that pointer for the first time.&lt;/li&gt;
&lt;li&gt;The MMU translates the virtual address. Page table says: valid range, no physical frame.&lt;/li&gt;
&lt;li&gt;CPU raises a page fault.&lt;/li&gt;
&lt;li&gt;Kernel page fault handler runs. Finds a free physical frame. Zeroes it. Updates page table. Flushes TLB entry.&lt;/li&gt;
&lt;li&gt;CPU re-executes the faulting instruction. The write completes.&lt;/li&gt;
&lt;li&gt;Your program continues, completely unaware that hardware exceptions and kernel code were involved.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's the machinery behind a function call so common you type it without thinking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/mmap.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 mmap&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/brk.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 brk&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — The kernel interfaces your allocator actually uses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/mlock.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 mlock&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — How to pin memory in RAM if you need a guarantee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.kernel.org/doc/gorman/html/understand/" rel="noopener noreferrer"&gt;Understanding the Linux Virtual Memory Manager&lt;/a&gt;&lt;/strong&gt; — Mel Gorman's book, available free online. The definitive deep dive into everything above and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man5/proc_pid_maps.5.html" rel="noopener noreferrer"&gt;&lt;code&gt;/proc/&amp;lt;pid&amp;gt;/maps&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man5/proc_pid_maps.5.html" rel="noopener noreferrer"&gt;&lt;code&gt;/proc/&amp;lt;pid&amp;gt;/smaps&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — Watch your own process's virtual memory regions in real time. &lt;code&gt;smaps&lt;/code&gt; shows RSS (resident set size) per region — the difference between virtual and physical becomes very concrete very fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://lwn.net/Articles/250967/" rel="noopener noreferrer"&gt;What every programmer should know about memory&lt;/a&gt;&lt;/strong&gt; — Ulrich Drepper's 2007 paper. Still the best single document on the full memory hierarchy.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri once had the OOM killer murder his database at 3am and deserved it. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Event Loop You're Already Using</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:03:42 +0000</pubDate>
      <link>https://forem.com/nazq/the-event-loop-youre-already-using-5305</link>
      <guid>https://forem.com/nazq/the-event-loop-youre-already-using-5305</guid>
      <description>&lt;h1&gt;
  
  
  The Event Loop You're Already Using
&lt;/h1&gt;

&lt;h2&gt;
  
  
  select, poll, epoll, and the System Calls Behind Every Async Framework
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You wrote &lt;code&gt;await fetch(url)&lt;/code&gt;. Your Node.js server handled ten thousand simultaneous connections while it waited. Your CPU usage barely moved.&lt;/p&gt;

&lt;p&gt;Here's what actually happened: your code called into a JavaScript engine, which called libuv, which called &lt;code&gt;epoll_wait&lt;/code&gt;, which asked the kernel to wake it up when any of ten thousand &lt;a href="https://nazquadri.dev/blog/the-layer-below/03-file-descriptors/" rel="noopener noreferrer"&gt;file descriptors&lt;/a&gt; had data ready. The kernel said nothing for 40 milliseconds. Then it said: "three of them are ready." Your event loop woke up and processed exactly those three. The other 9,997 connections cost you nothing while they waited.&lt;/p&gt;

&lt;p&gt;That's the whole trick. One syscall. The kernel does the waiting. You do the work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters Beyond "Async Is Fast"
&lt;/h2&gt;

&lt;p&gt;You've heard that async I/O is efficient. You may have accepted that on faith, or from a benchmark someone posted. But without understanding the layer underneath, you're flying on instruments you don't know how to read.&lt;/p&gt;

&lt;p&gt;Here's the bug you'll eventually hit: you write some async Python, everything looks right, and it's still blocking. Your entire server stalls for 300 milliseconds every time a particular function runs. You add more workers. It keeps happening. The problem is a single synchronous call buried in a dependency — a &lt;code&gt;time.sleep&lt;/code&gt;, a blocking DNS lookup, an &lt;code&gt;open()&lt;/code&gt; on a network filesystem. That call doesn't yield to the event loop. It holds the thread hostage until it returns.&lt;/p&gt;

&lt;p&gt;Understanding the mechanism is the only way to understand why that's catastrophic — and why the fix isn't "just add async," it's "figure out where you're not actually doing non-blocking I/O."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem That Needed Solving
&lt;/h2&gt;

&lt;p&gt;Before we get to the solutions, let's understand what problem they solve.&lt;/p&gt;

&lt;p&gt;It's 1983. You're writing a server. A client connects. You &lt;code&gt;read()&lt;/code&gt; from the socket. If there's no data yet, your process blocks — it goes to sleep, the CPU runs something else, and you wake up when data arrives. This is called &lt;strong&gt;blocking I/O&lt;/strong&gt;, and for one client, it's totally fine.&lt;/p&gt;

&lt;p&gt;Scale it up. A thousand clients. Each &lt;code&gt;read()&lt;/code&gt; call could block. Your single-threaded process blocks on the first client who has nothing to say yet, while the other 999 who have data sit there waiting. The obvious fix is threads — one thread per client. But a thousand threads means a thousand stacks (8MB of reserved virtual address space each, by default on Linux), a thousand kernel scheduling contexts, and constant context-switching overhead.&lt;/p&gt;

&lt;p&gt;In 1983, you couldn't afford that. Four decades later, the math still gets ugly fast. A modern web server at scale handles hundreds of thousands of connections. You cannot have hundreds of thousands of threads.&lt;/p&gt;

&lt;p&gt;What you want is a way to say: "Here's a list of a hundred thousand file descriptors. Tell me when any of them have something interesting." One call. The kernel blocks until there's work. You wake up, process exactly what's ready, go back to sleep.&lt;/p&gt;

&lt;p&gt;That's the problem. Here are the solutions, in chronological order of how well they work.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;select&lt;/code&gt;: The 1983 Hammer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;select&lt;/code&gt;&lt;/strong&gt; was the first answer, and it arrived with 4.2BSD in 1983 — the Berkeley team's first attempt to solve the multiplexing problem in a standard way. The interface looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nfds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd_set&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;readfds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd_set&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;writefds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;fd_set&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;exceptfds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;timeval&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You give it three sets of file descriptors — ones you want to read, ones you want to write, ones where you care about exceptions — and a timeout. It blocks until something is ready or the timeout expires. When it returns, the sets have been modified in place to show you &lt;em&gt;which&lt;/em&gt; ones are ready.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;fd_set&lt;/code&gt; is a bitmask. On most systems, it's 1024 bits. That's your limit: 1024 file descriptors, max. (You can recompile with a larger &lt;code&gt;FD_SETSIZE&lt;/code&gt;, but you can't escape the O(n) scan.) If you need more, &lt;code&gt;select&lt;/code&gt; literally cannot help you.&lt;/p&gt;

&lt;p&gt;But it gets worse. Every time you call &lt;code&gt;select&lt;/code&gt;, you have to rebuild the set of descriptors you care about, pass it to the kernel, and the kernel has to walk every bit of that mask to figure out which ones changed. For 1000 connections, that's 1000 checks on every call, even if only one descriptor became ready.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;select&lt;/code&gt; is O(n) where n is the number of file descriptors you're watching, &lt;em&gt;regardless of how many are actually active&lt;/em&gt;. At scale, this becomes expensive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foni56is5sf4j52y258eh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foni56is5sf4j52y258eh.jpeg" alt="select() O(n) scan — checking all 1024 bits for 3 ready fds" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;poll&lt;/code&gt;: Slightly Less Wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;poll&lt;/code&gt;&lt;/strong&gt; first appeared in System V (SVR3, 1986) and was later standardized by POSIX: the same concept, without the 1024 limit. Instead of a bitmask, you pass an array of &lt;code&gt;struct pollfd&lt;/code&gt; structures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;pollfd&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt;   &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// the file descriptor&lt;/span&gt;
    &lt;span class="kt"&gt;short&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// what you're interested in&lt;/span&gt;
    &lt;span class="kt"&gt;short&lt;/span&gt; &lt;span class="n"&gt;revents&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// what actually happened (filled by kernel)&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;pollfd&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nfds_t&lt;/span&gt; &lt;span class="n"&gt;nfds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No arbitrary limit on the number of fds. Better event granularity. Same fundamental problem: you still rebuild the entire array on every call, pass it to the kernel, and the kernel still has to walk every entry to check what's ready.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;poll&lt;/code&gt; is O(n) in the same way &lt;code&gt;select&lt;/code&gt; is. It fixed the 1024 limit and cleaned up the API, but it didn't fix the performance cliff at high connection counts.&lt;/p&gt;

&lt;p&gt;Both &lt;code&gt;select&lt;/code&gt; and &lt;code&gt;poll&lt;/code&gt; have another problem: every call copies the entire list of descriptors from user space to kernel space. For 100,000 connections, that's 100,000 copies of a struct on every call, in a tight loop. The data movement alone becomes your bottleneck — an O(n) copy cost layered on top of the O(n) scan.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;epoll&lt;/code&gt;: The Linux Answer (And Why Linux Won)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;epoll&lt;/code&gt;&lt;/strong&gt; landed in Linux 2.5.44 in 2002. It rethinks the whole interface.&lt;/p&gt;

&lt;p&gt;Instead of passing the full list of descriptors on every call, you create an &lt;code&gt;epoll&lt;/code&gt; instance — a kernel-managed data structure that persists between calls — and add file descriptors to it once. Then you just ask "what's ready?", and the kernel has the state it needs already.&lt;/p&gt;

&lt;p&gt;Three syscalls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Create the epoll instance&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;epfd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;epoll_create1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Register a file descriptor with it (once, not every loop)&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;epoll_event&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EPOLLIN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sockfd&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="n"&gt;epoll_ctl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EPOLL_CTL_ADD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sockfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Wait for events (this is the blocking call in your event loop)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;epoll_wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAX_EVENTS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout_ms&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;code&gt;epoll_wait&lt;/code&gt; returns only the descriptors that are &lt;em&gt;actually ready&lt;/em&gt;. If you're watching 100,000 connections and 3 have data, &lt;code&gt;epoll_wait&lt;/code&gt; returns 3. You process 3. The kernel doesn't enumerate the other 99,997.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;epoll&lt;/code&gt; makes the waiting O(1) in the number of watched descriptors: registering or removing a descriptor is a one-time O(log n) operation (the interest set lives in a red-black tree), not a cost paid on every call. The difference between &lt;code&gt;select&lt;/code&gt; and &lt;code&gt;epoll&lt;/code&gt; at 100,000 connections is the difference between 100,000 operations per loop iteration and approximately 3.&lt;/p&gt;

&lt;p&gt;This is why Node.js became credible at high connection counts. Node runs on libuv, which uses &lt;code&gt;epoll&lt;/code&gt; on Linux. One thread, one event loop, one &lt;code&gt;epoll_wait&lt;/code&gt; call, and the kernel does the heavy lifting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwgxqmhpe4zutc8r8rl6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwgxqmhpe4zutc8r8rl6.jpeg" alt="select/poll vs epoll — copy everything vs return only what's ready" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;kqueue&lt;/code&gt;: BSD Did It Too
&lt;/h2&gt;

&lt;p&gt;If you're on macOS or FreeBSD, the equivalent is &lt;strong&gt;&lt;code&gt;kqueue&lt;/code&gt;&lt;/strong&gt;, which appeared in FreeBSD 4.1 in 2000 — actually two years before &lt;code&gt;epoll&lt;/code&gt;. Different API, same idea: persistent kernel state, O(1) wakeups, batch event delivery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;kq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kqueue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;kevent&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ident&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sockfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EVFILT_READ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EV_ADD&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;EV_ENABLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="n"&gt;kevent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Wait for events&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;kevent&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MAX_EVENTS&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kevent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAX_EVENTS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kqueue&lt;/code&gt; is more elegant than &lt;code&gt;epoll&lt;/code&gt; — a single syscall handles both registration and waiting — and it watches more than just file descriptors. You can use the same interface to watch for process exits, signals, timers, and file system changes. The event model is unified. But it's BSD-only, so it doesn't run on Linux.&lt;/p&gt;

&lt;p&gt;This is the proliferation problem that every cross-platform runtime faces. You want &lt;code&gt;epoll&lt;/code&gt; on Linux, &lt;code&gt;kqueue&lt;/code&gt; on macOS/BSD, and IOCP on Windows (which is a completely different model). The next section is about how each major runtime solves this.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;O_NONBLOCK&lt;/code&gt;: The Prerequisite Nobody Explains
&lt;/h2&gt;

&lt;p&gt;Here's the part that trips people up. Having an efficient waiting mechanism isn't enough. The individual I/O operations themselves have to be non-blocking, or the whole thing falls apart.&lt;/p&gt;

&lt;p&gt;By default, &lt;code&gt;read()&lt;/code&gt; blocks. If you call &lt;code&gt;read()&lt;/code&gt; on a socket with no data available, your thread sleeps until data arrives. &lt;code&gt;epoll&lt;/code&gt; told you the socket was ready, so in normal operation this doesn't happen — but edge cases and bugs can still get you here. More importantly, some operations — &lt;code&gt;connect()&lt;/code&gt;, &lt;code&gt;write()&lt;/code&gt; on a full buffer — can block even when &lt;code&gt;epoll&lt;/code&gt; says you're good to go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;O_NONBLOCK&lt;/code&gt;&lt;/strong&gt; is a flag you set on a file descriptor to change this behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fcntl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F_GETFL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;fcntl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F_SETFL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;O_NONBLOCK&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;O_NONBLOCK&lt;/code&gt; set, I/O on that descriptor never blocks (for sockets and pipes, at least; regular file reads are a separate story). If a &lt;code&gt;read()&lt;/code&gt; would have blocked (no data available), it returns immediately with &lt;code&gt;EAGAIN&lt;/code&gt; or &lt;code&gt;EWOULDBLOCK&lt;/code&gt;. If a &lt;code&gt;write()&lt;/code&gt; would have blocked (send buffer full), same thing.&lt;/p&gt;

&lt;p&gt;Your code then handles &lt;code&gt;EAGAIN&lt;/code&gt; by going back to the event loop — "okay, nothing ready, I'll wait for &lt;code&gt;epoll&lt;/code&gt; to tell me when to try again."&lt;/p&gt;

&lt;p&gt;Without &lt;code&gt;O_NONBLOCK&lt;/code&gt;, even a single blocking operation inside your event loop stalls everything. The whole async model breaks down. This is where that "buried synchronous call" bug comes from: some dependency opens a file descriptor, forgets to set &lt;code&gt;O_NONBLOCK&lt;/code&gt;, calls &lt;code&gt;read()&lt;/code&gt;, and your entire event loop freezes while it waits.&lt;/p&gt;

&lt;p&gt;"Use async all the way down" isn't style advice. It's a correctness requirement. Call &lt;code&gt;time.sleep(1)&lt;/code&gt; inside an event loop and you've blocked the only thread — no callbacks fire, no connections are served, the whole system goes dark for one second. Python's &lt;code&gt;asyncio.sleep&lt;/code&gt; exists for exactly this reason: it yields back to the event loop instead of blocking the thread. Same reason you never call &lt;code&gt;requests.get&lt;/code&gt; in async code — it's a blocking HTTP client that holds the thread hostage while it waits for a response. Use &lt;code&gt;aiohttp&lt;/code&gt; or &lt;code&gt;httpx&lt;/code&gt; instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Frameworks Plug In
&lt;/h2&gt;

&lt;p&gt;Let's trace how the high-level abstractions land on these syscalls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node.js / libuv
&lt;/h3&gt;

&lt;p&gt;libuv runs a loop: check timers, check I/O callbacks, call &lt;code&gt;epoll_wait&lt;/code&gt; (Linux) or &lt;code&gt;kqueue&lt;/code&gt; (macOS) with whatever timeout makes sense given pending timers. When it wakes up, it dispatches callbacks. Your &lt;code&gt;await fetch(url)&lt;/code&gt; eventually becomes a socket, which gets registered with &lt;code&gt;epoll&lt;/code&gt;, which wakes up libuv, which calls your callback, which resolves the Promise. The "event loop" you've heard about? It's this loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python asyncio
&lt;/h3&gt;

&lt;p&gt;Same pattern, different language. &lt;code&gt;asyncio&lt;/code&gt;'s default &lt;code&gt;SelectorEventLoop&lt;/code&gt; uses Python's &lt;code&gt;selectors&lt;/code&gt; module, which picks the best available backend for the platform — &lt;code&gt;epoll&lt;/code&gt; on Linux, &lt;code&gt;kqueue&lt;/code&gt; on macOS, &lt;code&gt;select&lt;/code&gt; as a last resort. On Linux, you're already on &lt;code&gt;epoll&lt;/code&gt; out of the box. &lt;code&gt;uvloop&lt;/code&gt; wraps libuv for even better performance, but the default is not as slow as people assume. Windows gets &lt;code&gt;ProactorEventLoop&lt;/code&gt; backed by IOCP. The &lt;code&gt;await&lt;/code&gt; keyword doesn't make I/O async — the &lt;em&gt;underlying selector&lt;/em&gt; does. &lt;code&gt;await&lt;/code&gt; just lets the event loop know your coroutine is willing to pause.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tokio (Rust)
&lt;/h3&gt;

&lt;p&gt;Tokio uses &lt;a href="https://docs.rs/mio/latest/mio/" rel="noopener noreferrer"&gt;mio&lt;/a&gt;, a thin safe wrapper around &lt;code&gt;epoll&lt;/code&gt;/&lt;code&gt;kqueue&lt;/code&gt;/IOCP. Its async runtime is more explicit about what it's doing than Node.js — you can see the reactor and the executor as separate components. The reactor watches file descriptors; the executor schedules tasks. An &lt;code&gt;.await&lt;/code&gt; in Tokio suspends a task and hands control back to the executor, which runs other tasks until the reactor reports that the descriptor is ready. Then the task is rescheduled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Go goroutines
&lt;/h3&gt;

&lt;p&gt;Go's runtime is the most opaque of the four. You write blocking-looking code — &lt;code&gt;conn.Read(buf)&lt;/code&gt; — and the runtime makes it non-blocking behind your back. When a goroutine would block on I/O, the runtime parks the goroutine, registers the descriptor with the network poller (which uses &lt;code&gt;epoll&lt;/code&gt; on Linux), and continues running other goroutines on the same OS thread. When the data arrives, the poller wakes up the parked goroutine. From your perspective, it blocked. In reality, &lt;code&gt;epoll_wait&lt;/code&gt; was called and the goroutine was context-switched away.&lt;/p&gt;

&lt;p&gt;That's why Go can have a million goroutines — they're not OS threads, and they don't block OS threads when they wait on I/O.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zq9m70xt51ya6157e6e.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zq9m70xt51ya6157e6e.jpeg" alt="Four runtimes, same primitives — epoll_wait, kevent, IOCP" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What The Kernel Is Actually Doing
&lt;/h2&gt;

&lt;p&gt;It's worth briefly understanding how &lt;code&gt;epoll&lt;/code&gt; does its job efficiently.&lt;/p&gt;

&lt;p&gt;When you register a file descriptor with &lt;code&gt;epoll_ctl&lt;/code&gt;, the kernel attaches a callback to the descriptor's wait queue. This is a list of sleeping tasks that should be woken up when the descriptor becomes ready.&lt;/p&gt;

&lt;p&gt;When a packet arrives, it follows a specific path: the NIC triggers a hardware interrupt, which causes the CPU to run the network driver's interrupt handler, which feeds the packet up through the kernel's network stack, which places data in the socket's receive buffer. At that point, the socket's wait queue callbacks fire — including the one &lt;code&gt;epoll&lt;/code&gt; registered. The callback adds the descriptor to &lt;code&gt;epoll&lt;/code&gt;'s ready list.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;epoll_wait&lt;/code&gt; returns, it's just reading off that ready list. No scanning. No iteration over your 100,000 connections. The work was done when the packet arrived, not when you asked.&lt;/p&gt;

&lt;p&gt;That's why &lt;code&gt;EAGAIN&lt;/code&gt; isn't an error. When &lt;code&gt;O_NONBLOCK&lt;/code&gt; is set and &lt;code&gt;read()&lt;/code&gt; returns -1 with &lt;code&gt;errno == EAGAIN&lt;/code&gt;, the socket is politely saying "nothing ready yet." Your event loop re-registers with &lt;code&gt;epoll&lt;/code&gt; and waits. Every async I/O library handles this for you, which is why you've probably never seen it directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Same Kernel, The Same Primitives
&lt;/h2&gt;

&lt;p&gt;The abstractions you use every day — async/await, goroutines, green threads, futures — are different UX choices on top of the same three or four kernel primitives. The kernel API hasn't changed much since &lt;code&gt;epoll&lt;/code&gt; landed in 2002. What's changed is how well we've wrapped it.&lt;/p&gt;

&lt;p&gt;Node's innovation wasn't non-blocking I/O. The kernel had that. Node's innovation was making it the &lt;em&gt;default&lt;/em&gt; — putting it in a single-threaded event loop and forcing the programming model to accommodate it. Python's &lt;code&gt;asyncio&lt;/code&gt; brought the same model to a language that was threading-first. Rust's Tokio gave you the same power with compile-time correctness guarantees. Go hid the whole thing behind a familiar synchronous-looking syntax.&lt;/p&gt;

&lt;p&gt;All roads lead to &lt;code&gt;epoll_wait&lt;/code&gt;. The question is just how many layers of abstraction are between you and the call.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man7/epoll.7.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 7 epoll&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the Linux epoll interface, with edge-triggered vs. level-triggered details (which we glossed over — it's worth reading)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/select.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 select&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/poll.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 poll&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the originals, for historical grounding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="http://www.kegel.com/c10k.html" rel="noopener noreferrer"&gt;The C10K Problem&lt;/a&gt;&lt;/strong&gt; — Dan Kegel's 2001 writeup that defined the problem and surveyed every solution available at the time. A historical artifact and still essential reading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.libuv.org/en/v1.x/design.html" rel="noopener noreferrer"&gt;libuv design overview&lt;/a&gt;&lt;/strong&gt; — how Node's I/O layer works, with diagrams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://tokio.rs/blog/2019-10-scheduler" rel="noopener noreferrer"&gt;Tokio internals&lt;/a&gt;&lt;/strong&gt; — the Tokio scheduler design post, remarkably readable&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri once blocked the entire event loop with a single &lt;code&gt;time.sleep(0.1)&lt;/code&gt; and spent two hours blaming the network. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What "Connected" Means in TCP</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:03:13 +0000</pubDate>
      <link>https://forem.com/nazq/what-connected-means-in-tcp-13jb</link>
      <guid>https://forem.com/nazq/what-connected-means-in-tcp-13jb</guid>
      <description>&lt;h1&gt;
  
  
  What "Connected" Means in TCP
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Three-Packet Handshake Between Two Kernels Who've Never Met
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You called &lt;code&gt;connect()&lt;/code&gt;. Your code moved on. You're "connected."&lt;/p&gt;

&lt;p&gt;Nothing physical connected. No wire was plugged in. No circuit was closed. Three packets flew across the network and landed in two kernel data structures — a hash table entry on your machine and a hash table entry on the server — and that's it. That's the whole "connection." A gentleman's agreement between two kernels who've never met, maintained by nothing more than both sides keeping their word.&lt;/p&gt;

&lt;p&gt;The moment either kernel loses that state — crash, memory pressure, a firewall that forgets to tell anyone — your "connection" evaporates. The wire is still there. The bytes stop flowing.&lt;/p&gt;

&lt;p&gt;Most of us debug socket code for years without understanding what &lt;code&gt;connect()&lt;/code&gt; actually does. Every mysterious hang is a debt being collected. Let's fix it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The State Machine Behind &lt;code&gt;connect()&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The kernel maintains a state machine for every TCP connection. You've probably seen a diagram of it in a textbook and immediately forgotten it. That's fair — the full diagram has eleven states and looks like a fire escape route — which it is, in a way. Every state represents a moment when one side might crash and the other needs a graceful exit. &lt;a href="https://datatracker.ietf.org/doc/html/rfc793" rel="noopener noreferrer"&gt;RFC 793&lt;/a&gt; defined all eleven in 1981.&lt;/p&gt;

&lt;p&gt;But here's the part that matters: when you call &lt;code&gt;connect()&lt;/code&gt;, your kernel doesn't "connect" to anything. It starts a negotiation. A very specific, three-packet negotiation called the &lt;strong&gt;handshake&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's what actually happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your machine                    Remote machine
     │                               │
     │  ──── SYN ──────────────────► │   "I want to talk. My sequence starts at X."
     │                               │
     │  ◄─── SYN-ACK ─────────────── │   "OK. Mine starts at Y. I got your X."
     │                               │
     │  ──── ACK ──────────────────► │   "Got it. Now we're both synchronized."
     │                               │
   ESTABLISHED                  ESTABLISHED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three packets. That's the whole handshake.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;SYN&lt;/strong&gt; packet contains a randomly chosen &lt;strong&gt;sequence number&lt;/strong&gt; — a 32-bit integer that will be used to label every byte your machine sends. The server responds with its own SYN (it needs to start its own sequence), plus an ACK for yours. Your ACK completes the circle.&lt;/p&gt;

&lt;p&gt;After those three packets, both kernels have entered the &lt;code&gt;ESTABLISHED&lt;/code&gt; state in their connection tables. The "connection" now exists.&lt;/p&gt;
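&lt;p&gt;You can watch the handshake as a distinct step from userspace. On a non-blocking socket, &lt;code&gt;connect()&lt;/code&gt; fires the SYN and returns immediately (usually with &lt;code&gt;EINPROGRESS&lt;/code&gt;), and the socket becomes writable once the kernel reaches ESTABLISHED. A sketch against a local listener:&lt;/p&gt;

```python
import errno
import select
import socket

# A listener on an ephemeral localhost port plays the remote machine.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.socket()
client.setblocking(False)

# On a non-blocking socket, connect() starts the SYN / SYN-ACK / ACK
# exchange and returns immediately, usually with EINPROGRESS.
rc = client.connect_ex(listener.getsockname())

# Writability is the kernel's signal that it reached ESTABLISHED.
_, writable, _ = select.select([], [client], [], 5)

# SO_ERROR == 0 means the handshake completed successfully.
err = client.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
print(rc in (0, errno.EINPROGRESS), err)  # True 0

client.close()
listener.close()
```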

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vef7ej9zumzdmd91bd6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vef7ej9zumzdmd91bd6.jpeg" alt="TCP three-way handshake — the connection lives here, not on the wire" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A Pair of Hash Table Entries
&lt;/h2&gt;

&lt;p&gt;What does that state actually look like? On Linux, every socket is a &lt;a href="https://nazquadri.dev/blog/the-layer-below/03-file-descriptors/" rel="noopener noreferrer"&gt;file descriptor&lt;/a&gt; — the same integer handle the kernel uses for files, pipes, and everything else. The kernel maintains a hash table of sockets keyed by a four-tuple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(source IP, source port, destination IP, destination port)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This four-tuple uniquely identifies a connection. Your machine can have thousands of connections to the same server on the same port — as long as the source port is different, they're different entries in the hash table.&lt;/p&gt;
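&lt;p&gt;You can inspect the four-tuple from userspace: &lt;code&gt;getsockname()&lt;/code&gt; returns the local half and &lt;code&gt;getpeername()&lt;/code&gt; the remote half. A sketch showing two connections to the same server port that differ only by source port:&lt;/p&gt;

```python
import socket

# A localhost listener plays the server.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(2)

# Two connections to the same (destination IP, destination port).
a = socket.create_connection(server.getsockname())
b = socket.create_connection(server.getsockname())

# The four-tuple, as the kernel keys it: local half + remote half.
tuple_a = a.getsockname() + a.getpeername()
tuple_b = b.getsockname() + b.getpeername()
print(tuple_a)
print(tuple_b)
# Same destination, different source port: two distinct table entries.

for s in (a, b, server):
    s.close()
```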

&lt;p&gt;When &lt;code&gt;connect()&lt;/code&gt; returns, there is one entry in your kernel's connection table and one in the server's. That's your "connection." It's not a circuit. It's not a reserved channel. It's two structs in two hash tables on two machines, agreeing to honor a sequence number protocol.&lt;/p&gt;

&lt;p&gt;When you're not sending anything, nothing is happening on the wire. The connection just... sits there. In RAM. Two kernels keeping state about each other.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Sequence Numbers Exist
&lt;/h2&gt;

&lt;p&gt;The original problem TCP solved is: the internet is unreliable. Packets get dropped. They arrive out of order. Routers duplicate them. They might take wildly different paths.&lt;/p&gt;

&lt;p&gt;IP doesn't care. IP's job is to route packets best-effort and move on. TCP's job is to build a reliable, ordered byte stream on top of that chaos.&lt;/p&gt;

&lt;p&gt;The way it does this is with &lt;strong&gt;sequence numbers&lt;/strong&gt;. Every byte you send has a position in the stream. If packet 3 arrives before packet 2, the receiver buffers it and waits. If packet 2 never arrives, the receiver asks for it again. If the same data arrives twice (duplicated in transit), the sequence number identifies the duplicate and it gets discarded.&lt;/p&gt;
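&lt;p&gt;The receiver's side of this can be sketched in a few lines: buffer segments by sequence number, deliver bytes only when the next expected position is present, and drop duplicates. A toy model, ignoring wraparound and partial overlaps:&lt;/p&gt;

```python
def reassemble(segments, initial_seq=0):
    """Toy model of receiver-side reassembly: segments is a list of
    (sequence_number, bytes) pairs arriving in any order, possibly
    duplicated. Not the kernel's implementation."""
    buffered = {}
    expected = initial_seq
    stream = b""
    for seq, data in segments:
        if expected > seq:
            continue  # duplicate of already-delivered data: discard
        buffered[seq] = data  # out-of-order data waits here
        # Deliver every contiguous chunk now available.
        while expected in buffered:
            chunk = buffered.pop(expected)
            stream += chunk
            expected += len(chunk)
    return stream

# The segment at offset 7 arrives early, and offset 0 is duplicated.
out = reassemble([(0, b"to "), (7, b"stream"), (3, b"the "), (0, b"to ")])
print(out)  # b'to the stream'
```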

&lt;p&gt;The sequence number you start with isn't zero. It's chosen randomly, for security: if it were predictable, an attacker could inject packets into your stream by guessing the next expected sequence number. This attack is called a &lt;strong&gt;blind injection attack&lt;/strong&gt; — Kevin Mitnick used predictable sequence numbers in his 1994 attack on Tsutomu Shimomura to hijack a trusted connection. &lt;a href="https://datatracker.ietf.org/doc/html/rfc6528" rel="noopener noreferrer"&gt;RFC 6528&lt;/a&gt; randomizes initial sequence numbers specifically to prevent this. The randomness is the defense.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmco8obe1jx08alw494s.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmco8obe1jx08alw494s.jpeg" alt="TCP byte stream reordering — packets arrive out of order, bytes don't" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Buffer Between You and the Wire
&lt;/h2&gt;

&lt;p&gt;Here's the piece of the mental model that surprises people most.&lt;/p&gt;

&lt;p&gt;When you call &lt;code&gt;send()&lt;/code&gt;, your bytes do not go on the wire. They go into a buffer in the kernel.&lt;/p&gt;

&lt;p&gt;The kernel's &lt;strong&gt;socket send buffer&lt;/strong&gt; sits between your application and the network stack. &lt;code&gt;send()&lt;/code&gt; copies your bytes there and returns immediately. The kernel sends them when it decides to — based on network conditions, the receiver's capacity, and a timer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# sock = socket already in ESTABLISHED state
&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# At this point: bytes are in kernel buffer.
# They have NOT left your machine.
# The remote end has NOT received them.
# send() returned anyway.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This has a corollary that catches people off guard: &lt;code&gt;send()&lt;/code&gt; can return &lt;em&gt;before the bytes leave your machine&lt;/em&gt;. It can return while your laptop is offline. It just means "I accepted your bytes." Delivery is a separate promise, made later, without your involvement.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;receive buffer&lt;/strong&gt; works the other way. When packets arrive, the kernel puts the data in the receive buffer and sends an ACK back to the sender — "got it" — before your application calls &lt;code&gt;recv()&lt;/code&gt;. Your code might be sleeping. The kernel is already acknowledging on your behalf.&lt;/p&gt;

&lt;p&gt;This decoupling is what makes TCP reliable without making your code complicated. The kernel manages retransmits, flow control, and reordering. You see a clean byte stream.&lt;/p&gt;

&lt;p&gt;That's why &lt;code&gt;send()&lt;/code&gt; returns before the data is delivered. There's a kernel buffer between &lt;code&gt;send()&lt;/code&gt; and the wire. &lt;code&gt;send()&lt;/code&gt; accepts bytes; it doesn't send them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Window Size and the Invisible Flow Controller
&lt;/h2&gt;

&lt;p&gt;The receiver tells the sender how much space it has in its receive buffer. This is the &lt;strong&gt;window size&lt;/strong&gt;, advertised in every TCP segment.&lt;/p&gt;

&lt;p&gt;If the receiver's buffer fills up — because the application is slow to call &lt;code&gt;recv()&lt;/code&gt; — the window size shrinks. When it hits zero, the sender stops sending. Entirely.&lt;/p&gt;

&lt;p&gt;This is called &lt;strong&gt;flow control&lt;/strong&gt;, and it's operating silently in every TCP connection you've ever used. Your &lt;code&gt;send()&lt;/code&gt; call doesn't hang because of the network. It hangs because the &lt;em&gt;application on the other end isn't reading fast enough&lt;/em&gt;, and that information has propagated backward through two kernel buffers and a TCP window advertisement.&lt;/p&gt;

&lt;p&gt;That's why your socket &lt;code&gt;send()&lt;/code&gt; sometimes blocks. The buffer is full. The buffer is full because the window is zero. The window is zero because the remote application is backed up. You're feeling the pressure of something happening three hops away.&lt;/p&gt;

&lt;p&gt;That's why backpressure propagates through database connection pools: a slow consumer in the application tier pushes all the way back to the TCP send buffer of the client.&lt;/p&gt;
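&lt;p&gt;You can feel that pressure from userspace. If the peer never calls &lt;code&gt;recv()&lt;/code&gt;, a non-blocking &lt;code&gt;send()&lt;/code&gt; loop eventually fails with &lt;code&gt;EAGAIN&lt;/code&gt; (a &lt;code&gt;BlockingIOError&lt;/code&gt; in Python) once both buffers fill. A self-contained sketch over loopback:&lt;/p&gt;

```python
import socket

# A connected pair over loopback; the "server" side never reads.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
client = socket.create_connection(listener.getsockname())
server, _ = listener.accept()

client.setblocking(False)
sent = 0
try:
    while True:
        # Each send() just copies bytes into the kernel send buffer...
        sent += client.send(b"x" * 65536)
except BlockingIOError:
    # ...until it fills, because the receive buffer filled, because
    # `server` never called recv(). That's backpressure, felt locally.
    pass

print(f"kernel accepted {sent} bytes before pushing back")
for s in (client, server, listener):
    s.close()
```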




&lt;h2&gt;
  
  
  Nagle's Algorithm: The Uninvited Optimizer
&lt;/h2&gt;

&lt;p&gt;In the 1980s, a network engineer named John Nagle noticed that people were sending single-character packets over slow serial links. Every keypress in a terminal session became a 41-byte TCP/IP frame — 40 bytes of headers, 1 byte of data. The network was clogging up with tiny packets.&lt;/p&gt;

&lt;p&gt;His fix was &lt;strong&gt;Nagle's algorithm&lt;/strong&gt;: don't send a small packet if there's outstanding unacknowledged data. Wait until you have a full packet's worth, or until your outstanding data gets acknowledged.&lt;/p&gt;

&lt;p&gt;It made a lot of sense in 1984. It still makes sense for bulk data transfers.&lt;/p&gt;

&lt;p&gt;It is a disaster for latency-sensitive protocols.&lt;/p&gt;

&lt;p&gt;The classic symptom: you're writing a client that sends a small request and waits for a response. The request is 10 bytes. Nagle buffers it because you sent a header 5 milliseconds ago that hasn't been ACK'd yet. Your round-trip time triples for no reason.&lt;/p&gt;

&lt;p&gt;The fix is &lt;code&gt;TCP_NODELAY&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setsockopt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IPPROTO_TCP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TCP_NODELAY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's why every high-performance network library sets &lt;code&gt;TCP_NODELAY&lt;/code&gt; by default now. Nagle's algorithm is opt-out behavior inherited from a world with 1200 baud modems.&lt;/p&gt;

&lt;p&gt;That's why Redis, PostgreSQL's wire protocol, and every low-latency RPC framework explicitly disable Nagle. The moment you're doing request/response over TCP, you want control over when the packet leaves.&lt;/p&gt;
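&lt;p&gt;It's worth checking what your platform and library stack actually left you with: the option reads back via &lt;code&gt;getsockopt&lt;/code&gt;. A quick sketch:&lt;/p&gt;

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Nagle is on by default: TCP_NODELAY reads back as 0.
before = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)

# Opt out of Nagle for this socket.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
after = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)

print(before, after)
sock.close()
```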




&lt;h2&gt;
  
  
  The Slow Close: &lt;code&gt;close()&lt;/code&gt; vs &lt;code&gt;shutdown()&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Here's where the state machine comes back to bite you.&lt;/p&gt;

&lt;p&gt;When you're done with a connection, you call &lt;code&gt;close()&lt;/code&gt;. What actually happens is more involved.&lt;/p&gt;

&lt;p&gt;TCP is a &lt;strong&gt;full-duplex&lt;/strong&gt; protocol. You have a stream going in each direction, independently. Closing a connection means closing both streams, but they don't have to close at the same time.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;close()&lt;/code&gt; closes the whole socket — both directions. If you haven't read all incoming data yet, that data is lost. This bites you when you're parsing an HTTP response: you &lt;code&gt;close()&lt;/code&gt; before reading the error body and get a connection reset instead of the error message you needed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;shutdown()&lt;/code&gt; is more precise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SHUT_WR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# I'm done sending. Remote can still send to me.
&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SHUT_RD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# I'm done receiving.
&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SHUT_RDWR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Both.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you &lt;code&gt;shutdown(SHUT_WR)&lt;/code&gt;, your kernel sends a &lt;strong&gt;FIN&lt;/strong&gt; packet to the remote end. That FIN means "I'm done sending data." The remote end can still send data back to you, and you can still receive it. Both sides have to send a FIN before the connection is truly closed.&lt;/p&gt;

&lt;p&gt;The four-packet close handshake (FIN → ACK → FIN → ACK) mirrors the three-packet open, but split across time: the two sides often close independently, because one side might have more to say.&lt;/p&gt;
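&lt;p&gt;The classic use of the half-close is "send request, signal end of request, read the full response": the client's &lt;code&gt;shutdown(SHUT_WR)&lt;/code&gt; delivers EOF to the server's read loop while the reverse stream stays open. A sketch with a thread playing the server:&lt;/p&gt;

```python
import socket
import threading

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)

def serve():
    conn, _ = listener.accept()
    request = b""
    while chunk := conn.recv(4096):  # recv() returns b"" at the client's FIN
        request += chunk
    conn.sendall(b"echo:" + request)  # our direction is still open
    conn.close()

t = threading.Thread(target=serve)
t.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"hello")
client.shutdown(socket.SHUT_WR)  # FIN: "I'm done sending"

response = b""
while chunk := client.recv(4096):  # but we can still receive
    response += chunk
print(response)  # b'echo:hello'

t.join()
client.close()
listener.close()
```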

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4ft83ad6uv4mklutkvs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4ft83ad6uv4mklutkvs.jpeg" alt="TCP connection close — TIME_WAIT waits 2xMSL" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TIME_WAIT: The Ghost That Haunts Your Port Numbers
&lt;/h2&gt;

&lt;p&gt;After the final ACK, the connection isn't immediately gone. The kernel enters &lt;strong&gt;TIME_WAIT&lt;/strong&gt; state and holds the four-tuple — source IP, source port, destination IP, destination port — for &lt;strong&gt;2 × MSL&lt;/strong&gt; (Maximum Segment Lifetime).&lt;/p&gt;

&lt;p&gt;On Linux, &lt;code&gt;TCP_TIMEWAIT_LEN&lt;/code&gt; is hardcoded to 60 seconds — Linux sets this constant directly rather than computing 2×MSL, though the 2×MSL formula from &lt;a href="https://datatracker.ietf.org/doc/html/rfc793" rel="noopener noreferrer"&gt;RFC 793&lt;/a&gt; describes the intent. This is not configurable via sysctl. The similarly named &lt;code&gt;tcp_fin_timeout&lt;/code&gt; controls something else entirely — the FIN_WAIT_2 timeout, not TIME_WAIT. Confusing the two is a common mistake, and one I've made more than once.&lt;/p&gt;

&lt;p&gt;Why? Because that final ACK might have been lost. If the remote end re-sends its FIN (because it didn't get the ACK), your kernel needs to be able to ACK it. If the connection were immediately gone, your kernel would send a RST instead, which is rude and could leave the remote end confused.&lt;/p&gt;

&lt;p&gt;There's a second reason: the internet is not instantaneous. Old duplicate packets from the &lt;em&gt;previous&lt;/em&gt; connection on this same four-tuple might still be in transit somewhere. TIME_WAIT prevents a new connection from misinterpreting those ancient packets as belonging to it.&lt;/p&gt;

&lt;p&gt;This matters when you're running a server that handles thousands of short-lived connections. Every client that disconnects leaves a TIME_WAIT entry behind. On a busy server, you can accumulate tens of thousands of these entries, each holding a port number hostage for 2 minutes.&lt;/p&gt;

&lt;p&gt;That's why servers run out of ports. Not because they're out of &lt;em&gt;listening&lt;/em&gt; ports. Because the ephemeral port range — the range the kernel uses for outbound connections — is full of TIME_WAIT ghosts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# See the carnage&lt;/span&gt;
ss &lt;span class="nt"&gt;-s&lt;/span&gt;
&lt;span class="c"&gt;# TIME-WAIT: 18432  ← these are connections waiting to die&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix on Linux is &lt;code&gt;SO_REUSEADDR&lt;/code&gt;, which lets you bind to an address/port combination that's still in TIME_WAIT. Most server frameworks set this automatically. (That handles the server-restart case; for ephemeral-port exhaustion on outbound connections, there's &lt;code&gt;net.ipv4.tcp_tw_reuse&lt;/code&gt;.) When you see mysterious "address already in use" errors after restarting a server, you've met TIME_WAIT in person.&lt;/p&gt;

&lt;p&gt;That's why your server runs out of ports after handling many short-lived connections. Not listening ports — ephemeral ports. The range is full of TIME_WAIT ghosts.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gentleman's Agreement
&lt;/h2&gt;

&lt;p&gt;Let's put it all together.&lt;/p&gt;

&lt;p&gt;When you call &lt;code&gt;connect()&lt;/code&gt;, you initiate a negotiation. Three packets create state in two kernel tables. The "connection" is those two table entries, nothing more. No circuit. No reserved bandwidth. No wire that's "yours."&lt;/p&gt;

&lt;p&gt;While the connection is open, two kernel buffers — yours and theirs — mediate every byte. &lt;code&gt;send()&lt;/code&gt; means "here, kernel, deal with this." &lt;code&gt;recv()&lt;/code&gt; means "give me whatever showed up." The network happens in between, managed entirely by kernel code you didn't write, running on a schedule you don't control.&lt;/p&gt;

&lt;p&gt;The receiver's window size limits how fast you can send. Nagle's algorithm may hold your packets hostage for milliseconds. Congestion control — a whole other topic we're glossing over here — may slow your throughput based on packet loss detected on the path between you.&lt;/p&gt;

&lt;p&gt;When you close the connection, the kernel sends FINs, waits for the remote side, and then holds the ghost of the connection for up to 60 seconds.&lt;/p&gt;

&lt;p&gt;None of this is visible to your code. You called &lt;code&gt;connect()&lt;/code&gt;. You called &lt;code&gt;send()&lt;/code&gt;. You called &lt;code&gt;recv()&lt;/code&gt;. The bytes arrived in order and the stream made sense.&lt;/p&gt;

&lt;p&gt;The "connection" is a gentleman's agreement. Both sides promise to track sequence numbers, ACK each other's bytes, respect each other's window sizes, and retransmit anything that goes unacknowledged. Neither side promises to stay alive, respond quickly, or tell the other if the application crashes. There's no heartbeat. No "are you still there?" The agreement holds until one side breaks it — or just goes silent and lets the retransmission timer expire.&lt;/p&gt;

&lt;p&gt;Every TCP connection on the planet is two structs in two hash tables, honouring a handshake that happened milliseconds or hours ago. The wire doesn't care. The agreement is all there is.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man7/tcp.7.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 7 tcp&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — Linux TCP socket options. Dense. Has everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man8/ss.8.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 8 ss&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — The socket statistics tool. Use &lt;code&gt;ss -s&lt;/code&gt; to see TIME_WAIT counts in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.amazon.com/TCP-Illustrated-Vol-Addison-Wesley-Professional/dp/0201633469" rel="noopener noreferrer"&gt;TCP Illustrated, Vol. 1&lt;/a&gt;&lt;/strong&gt; — Stevens. The definitive reference. Still accurate 30 years later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://beej.us/guide/bgnet/" rel="noopener noreferrer"&gt;Beej's Guide to Network Programming&lt;/a&gt;&lt;/strong&gt; — Free, practical, shows the real system calls. Where most of us actually learned sockets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc793" rel="noopener noreferrer"&gt;RFC 793&lt;/a&gt;&lt;/strong&gt; — The original TCP specification. Readable. Worth skimming just to see how much they got right in 1981.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri has mass produced more TIME_WAIT ghosts than any responsible engineer should. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Your DNS is Lying to You</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:02:58 +0000</pubDate>
      <link>https://forem.com/nazq/your-dns-is-lying-to-you-emd</link>
      <guid>https://forem.com/nazq/your-dns-is-lying-to-you-emd</guid>
      <description>&lt;h1&gt;
  
  
  Your DNS is Lying to You
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What Actually Happens Between a URL and the First Byte
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You typed &lt;code&gt;api.example.com&lt;/code&gt; into your browser — or &lt;code&gt;curl&lt;/code&gt;'d it, or your service tried to connect to it — and something happened. Some bytes arrived. You moved on.&lt;/p&gt;

&lt;p&gt;That something was DNS. And DNS is not a lookup table. It is a distributed, eventually consistent database with a 40-year-old trust model, deployed across millions of machines that have no obligation to agree with each other. When it goes wrong — and it does go wrong — the failure modes are some of the most maddening in all of networking, because the answer you get looks valid. It's just wrong.&lt;/p&gt;

&lt;p&gt;There are four distinct roles. Most people know one of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bug That Made This Click
&lt;/h2&gt;

&lt;p&gt;Picture this: a microservice can't connect to a dependency. Health checks pass. &lt;code&gt;curl&lt;/code&gt; works fine from your laptop. The service throws connection errors that make no sense.&lt;/p&gt;

&lt;p&gt;The service is running in a Docker container. Inside the container, &lt;code&gt;curl api.internal.corp&lt;/code&gt; returns a different IP than &lt;code&gt;dig api.internal.corp&lt;/code&gt; run from the same container.&lt;/p&gt;

&lt;p&gt;Different. IP.&lt;/p&gt;

&lt;p&gt;Same host. Same moment. Different tool. Different answer.&lt;/p&gt;

&lt;p&gt;We'll learn exactly why that's possible before the end of this post.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cast of Characters
&lt;/h2&gt;

&lt;p&gt;Before the mechanics, let's name the players. There are four distinct roles in the DNS resolution chain, and conflating them is the source of most confusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stub resolver&lt;/strong&gt; lives on your machine. It's not really a DNS server — it can't do much on its own. It's the code in libc (or your OS networking stack) that takes a hostname and says "I need an IP for this" by forwarding the question to someone who can actually answer it. On Linux, that's &lt;code&gt;getaddrinfo()&lt;/code&gt;. On macOS it goes through &lt;code&gt;mDNSResponder&lt;/code&gt;. Every DNS query your applications make starts here.&lt;/p&gt;
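&lt;p&gt;The stub resolver is one call away in Python: &lt;code&gt;socket.getaddrinfo()&lt;/code&gt; hands the name to the platform resolver and returns &lt;code&gt;(family, type, proto, canonname, sockaddr)&lt;/code&gt; tuples. A quick look, using &lt;code&gt;localhost&lt;/code&gt; so the answer never leaves the machine:&lt;/p&gt;

```python
import socket

# getaddrinfo() is the stub resolver's front door. For "localhost"
# the answer comes from /etc/hosts, so no network query happens.
results = socket.getaddrinfo("localhost", 443, proto=socket.IPPROTO_TCP)

# Each entry: (family, type, proto, canonname, sockaddr)
addresses = sorted({sockaddr[0] for *_, sockaddr in results})
print(addresses)  # e.g. ['127.0.0.1', '::1']
```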

&lt;p&gt;&lt;strong&gt;The recursive resolver&lt;/strong&gt; (also called a "full-service resolver" or sometimes misleadingly the "DNS server") does the actual work. This is the server your stub resolver talks to. Its job is to walk the DNS tree from the root all the way down to a definitive answer. Your ISP runs one. Google runs one at &lt;code&gt;8.8.8.8&lt;/code&gt;. Cloudflare at &lt;code&gt;1.1.1.1&lt;/code&gt;. Your office probably has one too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The authoritative nameserver&lt;/strong&gt; actually owns the answer. If you bought &lt;code&gt;example.com&lt;/code&gt; and set up your DNS records, you pointed your registrar at some authoritative nameservers. Those servers are the canonical source of truth for your zone. They don't do recursion — they just answer questions about records they own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The root nameservers&lt;/strong&gt; are where recursion starts when a recursive resolver has no cached answer. There are exactly 13 of them by IP — hundreds of physical machines behind those 13 addresses via anycast. Why 13? Because the original DNS protocol used 512-byte UDP packets, and 13 NS records was the maximum that fit 🤷‍♂️. They don't know where &lt;code&gt;api.example.com&lt;/code&gt; is, but they know who handles &lt;code&gt;.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kfu53y2eilvlnoxvxj6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kfu53y2eilvlnoxvxj6.jpeg" alt="DNS resolution chain — four layers of delegation" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Happens When You Type a URL
&lt;/h2&gt;

&lt;p&gt;Let's trace it. You type &lt;code&gt;https://api.example.com/v1/users&lt;/code&gt; and hit Enter.&lt;/p&gt;

&lt;p&gt;Your browser extracts the hostname: &lt;code&gt;api.example.com&lt;/code&gt;. It calls into the OS resolver. Before any network packet leaves your machine, the OS checks three things in order:&lt;/p&gt;

&lt;p&gt;First, &lt;code&gt;/etc/hosts&lt;/code&gt;. This is a flat text file that predates DNS by over a decade. It's checked before anything else, unconditionally. If &lt;code&gt;api.example.com&lt;/code&gt; appears in &lt;code&gt;/etc/hosts&lt;/code&gt;, the search is over — no network query happens at all. This is why adding entries to &lt;code&gt;/etc/hosts&lt;/code&gt; works for local development, and it's also why malware has occasionally modified it to redirect banking sites. It's also why many devs are confused when their DNS changes don't seem to take effect: they have a stale hosts file entry they forgot about from six months ago.&lt;/p&gt;
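&lt;p&gt;The precedence logic is simple enough to sketch: a hosts-file hit short-circuits everything else. The parser below is a toy that works on hosts-format text, not the libc implementation:&lt;/p&gt;

```python
def parse_hosts(text):
    """Toy parser for /etc/hosts-format text into {hostname: ip}."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), ip)
    return table

sample = """
127.0.0.1   localhost
# stale entry from six months ago:
10.0.0.99   api.example.com api
"""

hosts = parse_hosts(sample)
# A hosts hit means the resolver never asks the network at all.
print(hosts.get("api.example.com"))  # 10.0.0.99
```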

&lt;p&gt;Second, the local DNS cache. Your OS, and often a local daemon (systemd-resolved on modern Linux, mDNSResponder on macOS), keeps a cache of recent answers. If the cache has a fresh entry, done.&lt;/p&gt;

&lt;p&gt;Third, and only if neither of those had an answer: a query goes out to the recursive resolver specified in &lt;code&gt;/etc/resolv.conf&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/resolv.conf — the file that decides where your DNS queries go
nameserver 192.168.1.1      # your router, probably
nameserver 8.8.8.8          # fallback: Google
search corp.internal        # try appending this domain to short names
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;nameserver&lt;/code&gt; line is the only thing most developers know about &lt;code&gt;/etc/resolv.conf&lt;/code&gt;. The &lt;code&gt;search&lt;/code&gt; directive is where it gets interesting — and where short hostnames like &lt;code&gt;db&lt;/code&gt; can silently resolve to &lt;code&gt;db.corp.internal&lt;/code&gt;, which is either convenient or baffling depending on the day.&lt;/p&gt;

&lt;p&gt;That's why &lt;code&gt;db&lt;/code&gt; works on your laptop but fails in CI: one has a &lt;code&gt;search corp.internal&lt;/code&gt; entry and the other doesn't.&lt;/p&gt;
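&lt;p&gt;The expansion order for the &lt;code&gt;search&lt;/code&gt; directive, assuming the &lt;code&gt;resolv.conf&lt;/code&gt; above and the default &lt;code&gt;ndots:1&lt;/code&gt;, looks like this:&lt;/p&gt;

```plaintext
db                 → db.corp.internal.   (no dots in the name: search suffix tried first)
                   → db.                 (the literal name, only if the suffix fails)
db.corp.internal.  → db.corp.internal.   (trailing dot: fully qualified, search list skipped)
```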




&lt;h2&gt;
  
  
  The Recursive Resolver Earns Its Name
&lt;/h2&gt;

&lt;p&gt;Your query reaches the recursive resolver. Let's say it's a cold cache — never seen &lt;code&gt;api.example.com&lt;/code&gt; before.&lt;/p&gt;

&lt;p&gt;The resolver starts at the top.&lt;/p&gt;

&lt;p&gt;It queries one of the 13 root nameserver IPs (hardcoded into all resolver software as the "root hints"). The root server doesn't know &lt;code&gt;api.example.com&lt;/code&gt;. It responds with a referral: "I don't know, but &lt;code&gt;.com&lt;/code&gt; is handled by these nameservers."&lt;/p&gt;

&lt;p&gt;The resolver then queries a &lt;code&gt;.com&lt;/code&gt; Top Level Domain (TLD) nameserver. The TLD server doesn't know &lt;code&gt;api.example.com&lt;/code&gt;. It responds with a referral: "I don't know, but &lt;code&gt;example.com&lt;/code&gt; is handled by &lt;em&gt;these&lt;/em&gt; nameservers."&lt;/p&gt;

&lt;p&gt;The resolver then queries an authoritative nameserver for &lt;code&gt;example.com&lt;/code&gt;. This one knows. It returns an A record (IPv4) or AAAA record (IPv6) for &lt;code&gt;api.example.com&lt;/code&gt;, along with a TTL — a "time to live" value in seconds.&lt;/p&gt;

&lt;p&gt;The resolver caches the answer for &lt;code&gt;TTL&lt;/code&gt; seconds and returns it to your stub resolver. Your stub resolver hands it to &lt;code&gt;getaddrinfo()&lt;/code&gt;. Your browser gets an IP. The connection starts.&lt;/p&gt;

&lt;p&gt;That whole chain — root → TLD → authoritative — happened in the background, probably in under 100ms. On a warm cache, it's a single hop and maybe 5ms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query trace for api.example.com:
  → root nameserver (hardcoded IPs)
    ← "ask .com TLD at 192.5.6.30"
  → .com TLD nameserver (192.5.6.30)
    ← "ask ns1.example.com at 93.184.216.10"
  → ns1.example.com (93.184.216.10)
    ← "api.example.com A 198.51.100.42  TTL 300"
  → your stub resolver
    ← 198.51.100.42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three round trips on a completely cold cache; fewer if the resolver already had the &lt;code&gt;.com&lt;/code&gt; or &lt;code&gt;example.com&lt;/code&gt; delegation cached. And critically: the recursive resolver that did all this work is running on someone else's machine, which you do not control, and which has its own cache that it shares with everyone else who uses it.&lt;/p&gt;




&lt;h2&gt;
  
  
  TTL Is Not a Suggestion, It's Also Not a Guarantee
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;TTL&lt;/strong&gt; (Time to Live) is the record's expiry hint. If your A record has &lt;code&gt;TTL 300&lt;/code&gt;, it means "cache this for 300 seconds, then check again."&lt;/p&gt;

&lt;p&gt;Here's what TTL cannot do: it cannot tell resolvers that already have a cached answer to throw it away. When you update a DNS record, the old answer is still valid in every cache that holds it, until their individual TTLs expire. If your TTL was 24 hours (86400 seconds), some resolvers will be serving the old answer for up to 24 hours after your change.&lt;/p&gt;

&lt;p&gt;This is why "just flush DNS" is not a real answer to a propagation problem. You can flush your local machine's cache. You cannot flush Google's cache. You cannot flush your ISP's cache. You cannot flush the cache of the recursive resolver your user's mobile carrier uses.&lt;/p&gt;

&lt;p&gt;What you &lt;em&gt;can&lt;/em&gt; do: lower your TTL well before a migration. If you know you're moving an IP next Tuesday, set your TTL to 60 seconds on Friday. Let the short TTL propagate. Do the migration. The blast radius of stale caches is 60 seconds instead of 24 hours.&lt;/p&gt;

&lt;p&gt;What you &lt;em&gt;cannot&lt;/em&gt; do: change the TTL and have it take effect immediately. The TTL change itself has to propagate, and it propagates at the &lt;em&gt;old&lt;/em&gt; TTL.&lt;/p&gt;

&lt;p&gt;That's right. The new TTL doesn't matter until the old TTL expires and resolvers re-fetch the record. Plan accordingly.&lt;/p&gt;
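&lt;p&gt;Put together, a safe migration sequence looks something like this (illustrative timings):&lt;/p&gt;

```plaintext
Fri 09:00  record has TTL 86400. Lower the TTL to 60.
           (caches holding the old record keep it, and its 86400 TTL, for up to 24h)
Sat 09:00  every cache that re-fetched now carries the 60-second TTL
Tue 10:00  change the A record
Tue 10:01  worst-case staleness is 60 seconds, not 24 hours
```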

&lt;p&gt;That's why your DNS change isn't working yet. The old answer is cached at some resolver with a TTL of 3600 and there are 47 minutes left. Verify by querying the authoritative server directly: &lt;code&gt;dig @ns1.example.com api.example.com&lt;/code&gt; — that bypasses all caches and shows what the authoritative server has right now.&lt;/p&gt;




&lt;h2&gt;
  
  
  CNAME Chains and Why They're Weird
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;A record&lt;/strong&gt; maps a name to an IP. A &lt;strong&gt;CNAME record&lt;/strong&gt; maps a name to another name — a canonical alias.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;api.example.com  CNAME  loadbalancer.us-east-1.elb.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a resolver sees a CNAME, it has to resolve the target too. So &lt;code&gt;api.example.com&lt;/code&gt; → look up &lt;code&gt;loadbalancer.us-east-1.elb.amazonaws.com&lt;/code&gt; → that returns an A record. Two lookups, one query from your perspective.&lt;/p&gt;
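&lt;p&gt;In &lt;code&gt;dig&lt;/code&gt; output you see both steps in a single answer section. An illustrative sketch (the names, TTLs, and documentation-range IP are placeholders):&lt;/p&gt;

```plaintext
;; ANSWER SECTION:
api.example.com.                           300  IN  CNAME  loadbalancer.us-east-1.elb.amazonaws.com.
loadbalancer.us-east-1.elb.amazonaws.com.   60  IN  A      203.0.113.42
```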

&lt;p&gt;CNAMEs are used everywhere — CDNs, load balancers, cloud services — because they let you point a name at another name that the provider controls. When AWS moves your load balancer, they update their A record; your CNAME keeps working.&lt;/p&gt;

&lt;p&gt;The rule everyone forgets: &lt;strong&gt;you cannot have a CNAME at a zone apex.&lt;/strong&gt; The zone apex is the bare domain itself — &lt;code&gt;example.com&lt;/code&gt; with nothing in front of it. Why? Because CNAME has to be the only record for a name (it replaces the name entirely), but the zone apex needs SOA and NS records. You can't have a CNAME and also have NS records. The DNS spec doesn't allow it.&lt;/p&gt;

&lt;p&gt;This is why CDN and DNS providers invented &lt;strong&gt;CNAME flattening&lt;/strong&gt; (Cloudflare's term; Route 53 calls its equivalent alias records). When you point &lt;code&gt;example.com&lt;/code&gt; at &lt;code&gt;example.com.cdn.cloudflare.net&lt;/code&gt;, the provider does the CNAME lookup at query time and returns a flat A record to the client. From the outside, it looks like an A record. It's not. It's a CNAME that your DNS provider is silently expanding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsynjhb21lk9badtlsoxp.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsynjhb21lk9badtlsoxp.jpeg" alt="CNAME resolution — normal vs flattened" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This matters when you're debugging. If &lt;code&gt;dig example.com A&lt;/code&gt; returns an IP directly but you know you set up a CNAME at the root, the flattening is working. If it returns a CNAME at the apex, something's wrong with your provider config. To the application layer, a flattened record and a plain A record look identical.&lt;/p&gt;

&lt;p&gt;That's why you can't put a CNAME on &lt;code&gt;example.com&lt;/code&gt; itself — CNAME semantics conflict with the SOA and NS records that every zone apex must have. Your DNS provider works around this with record flattening, which looks like an A record to the outside world.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why &lt;code&gt;dig&lt;/code&gt; and &lt;code&gt;curl&lt;/code&gt; Give Different Answers
&lt;/h2&gt;

&lt;p&gt;Back to my Docker debugging story. &lt;code&gt;curl api.internal.corp&lt;/code&gt; returned a different IP than &lt;code&gt;dig api.internal.corp&lt;/code&gt;. How?&lt;/p&gt;

&lt;p&gt;Because they use different resolution paths.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dig&lt;/code&gt; is a DNS tool. It talks directly to a DNS resolver — by default, whatever is in &lt;code&gt;/etc/resolv.conf&lt;/code&gt;, or you can specify one with &lt;code&gt;@&lt;/code&gt;. It bypasses the OS resolver entirely, bypasses the local cache, and makes a raw DNS query.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;curl&lt;/code&gt; uses &lt;code&gt;getaddrinfo()&lt;/code&gt;. That function goes through the full OS name resolution stack, including &lt;code&gt;/etc/nsswitch.conf&lt;/code&gt; — the OS's routing table for name resolution, a priority-ordered list of where to look. On a typical Linux machine it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/nsswitch.conf
hosts:          files dns myhostname
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;files&lt;/code&gt; entry means &lt;code&gt;/etc/hosts&lt;/code&gt; runs first — and &lt;code&gt;curl&lt;/code&gt; reads it, while &lt;code&gt;dig&lt;/code&gt; does not.&lt;/p&gt;
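&lt;p&gt;You can see the split directly. &lt;code&gt;getent hosts&lt;/code&gt; resolves through the same &lt;code&gt;nsswitch.conf&lt;/code&gt; path that &lt;code&gt;getaddrinfo()&lt;/code&gt; uses, so &lt;code&gt;/etc/hosts&lt;/code&gt; entries show up in its answers; &lt;code&gt;dig&lt;/code&gt; never reads that file. A quick sketch (Linux only):&lt;/p&gt;

```shell
# Resolves via nsswitch.conf: "files" runs first, so this answer
# comes straight from /etc/hosts, with no DNS query at all.
getent hosts localhost

# A dig for the same name would send a real DNS query instead,
# and would never see anything you put in /etc/hosts.
```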

&lt;p&gt;Inside Docker, it gets more interesting. Docker injects its own nameserver into the container's &lt;code&gt;/etc/resolv.conf&lt;/code&gt;, pointing at Docker's internal resolver at &lt;code&gt;127.0.0.11&lt;/code&gt;. That resolver handles Docker network DNS (container names, service names). It may return different answers for internal names than an external DNS server would. &lt;code&gt;dig&lt;/code&gt; run without arguments still reads &lt;code&gt;/etc/resolv.conf&lt;/code&gt; — so inside the container, &lt;code&gt;dig api.internal.corp&lt;/code&gt; was querying Docker's resolver, not the corporate DNS. And Docker's resolver didn't know about the internal service.&lt;/p&gt;

&lt;p&gt;The rule: &lt;strong&gt;&lt;code&gt;dig&lt;/code&gt; shows you what a DNS query returns. &lt;code&gt;curl&lt;/code&gt; shows you what the application stack resolves.&lt;/strong&gt; They are not always the same query against the same server.&lt;/p&gt;

&lt;p&gt;When they disagree, the question is: which one matches what your application uses? Usually it's the &lt;code&gt;curl&lt;/code&gt; path, because your application also calls &lt;code&gt;getaddrinfo()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's why &lt;code&gt;dig&lt;/code&gt; and &lt;code&gt;curl&lt;/code&gt; gave different answers inside that Docker container: the two tools weren't asking the same resolver, and only one resolution path knew about the internal name.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Trust Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;DNS was designed in 1983. The original protocol has no cryptographic authentication. A resolver asks a question; an authoritative server answers. There's nothing in the original design that proves the answer came from the real authoritative server.&lt;/p&gt;

&lt;p&gt;This isn't a theoretical concern. &lt;strong&gt;DNS spoofing&lt;/strong&gt; and &lt;strong&gt;cache poisoning&lt;/strong&gt; are real attacks. An attacker who can intercept or forge DNS responses can redirect any hostname to any IP — transparently, with no visible error to the user.&lt;/p&gt;

&lt;p&gt;The fix is &lt;strong&gt;DNSSEC&lt;/strong&gt; — DNS Security Extensions. DNSSEC adds cryptographic signatures to DNS records. Authoritative servers sign their records with a private key; validators check those signatures against a public key published in the parent zone. The chain of trust runs from the root all the way down.&lt;/p&gt;

&lt;p&gt;DNSSEC deployment is... fragmented. The root zone is signed. Many TLDs are signed. Many individual domains are not. And critically, many recursive resolvers don't validate DNSSEC even if the zone is signed — they just pass the signatures along. Validation has to happen somewhere for it to matter, and the chain has a lot of links.&lt;/p&gt;

&lt;p&gt;This is why a zone sometimes refuses to resolve entirely (a &lt;code&gt;SERVFAIL&lt;/code&gt; from validating resolvers) even when you're confident the records are right: someone in the delegation chain has a misconfigured or expired signing key, and validators reject the answer rather than serve it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DoT&lt;/strong&gt; (DNS over TLS) and &lt;strong&gt;DoH&lt;/strong&gt; (DNS over HTTPS) are a different layer of protection. They encrypt the query in transit — so your ISP can't see what you're looking up, and a network attacker can't intercept the packet. But they don't solve the authoritative trust problem. You're still trusting the resolver at the other end.&lt;/p&gt;

&lt;p&gt;The honest summary: DNS's trust model is "trust whoever answers." DNSSEC tries to fix that. Its deployment is patchy. DoT/DoH protect the wire, not the answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Debugging Toolkit
&lt;/h2&gt;

&lt;p&gt;When DNS is misbehaving, you want to query different layers explicitly rather than guessing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What does the authoritative server say right now?&lt;/span&gt;
dig @ns1.example.com api.example.com A

&lt;span class="c"&gt;# What does your configured resolver return (including its cache)?&lt;/span&gt;
dig api.example.com A

&lt;span class="c"&gt;# What would an uncached query to a specific public resolver return?&lt;/span&gt;
dig @8.8.8.8 api.example.com A

&lt;span class="c"&gt;# Trace the full recursive delegation chain&lt;/span&gt;
dig +trace api.example.com A

&lt;span class="c"&gt;# Show what getaddrinfo() would return (follows /etc/hosts, nsswitch.conf)&lt;/span&gt;
&lt;span class="c"&gt;# getent is part of glibc-utils on Debian/Ubuntu; not available on macOS&lt;/span&gt;
getent hosts api.example.com

&lt;span class="c"&gt;# Check your resolver configuration&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /etc/resolv.conf
resolvectl status    &lt;span class="c"&gt;# systemd-resolved environments&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;+trace&lt;/code&gt; flag on &lt;code&gt;dig&lt;/code&gt; is particularly useful when something is broken in the delegation chain itself — wrong glue records, stale DS records, missing NS entries. It shows you every hop.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;dig @authoritative&lt;/code&gt; returns the right answer but your application gets something different, the problem is between your application and that authoritative server. Work backwards: your OS cache, your container's resolver, your corporate split-horizon DNS, your VPN's DNS override.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Thing Worth Holding Onto
&lt;/h2&gt;

&lt;p&gt;I have a love-hate relationship with DNS. It's 40 years old, it was designed when the internet was a few hundred hosts, the trust model was "everyone on the network is trustworthy," and the failure modes of a globally distributed cache were... well, not fully thought through.&lt;/p&gt;

&lt;p&gt;And yet it works. Hundreds of billions of queries a day, run by competing organisations with no central coordinator, and it almost never falls over. When it does fail, the failures are the worst kind: subtle. An answer that looks valid but is stale. A resolution path that bypasses the record you just updated. A CNAME chain that doesn't behave the way you modelled it. You don't get an error. You get the wrong IP, served confidently. Oh dear, maybe LLMs learned from DNS!&lt;/p&gt;

&lt;p&gt;Every DNS debugging session I've ever had ended the same way: I wasn't querying the layer I thought I was querying. The answer is always cached somewhere you forgot to check. And you do this so rarely you basically have to re-teach yourself the DNS stack each time. &lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://linux.die.net/man/1/dig" rel="noopener noreferrer"&gt;&lt;code&gt;man 1 dig&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — More flags than you'll ever need, but &lt;code&gt;+trace&lt;/code&gt;, &lt;code&gt;+short&lt;/code&gt;, and &lt;code&gt;@server&lt;/code&gt; will cover 90% of debugging sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.oreilly.com/library/view/dns-and-bind/0596100574/" rel="noopener noreferrer"&gt;DNS and BIND, 5th ed.&lt;/a&gt;&lt;/strong&gt; — The definitive reference. Dense. Worth having on a shelf.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://howdns.works/" rel="noopener noreferrer"&gt;How DNS Works (howdns.works)&lt;/a&gt;&lt;/strong&gt; — Comic-style visual walkthrough of the resolution chain. Good for sharing with someone who's new to it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.cloudflare.com/learning/dns/what-is-dns/" rel="noopener noreferrer"&gt;Cloudflare's DNS Learning Center&lt;/a&gt;&lt;/strong&gt; — Technically accurate, well-illustrated, and they have a vested interest in you understanding DNS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc1034" rel="noopener noreferrer"&gt;RFC 1034&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc1035" rel="noopener noreferrer"&gt;RFC 1035&lt;/a&gt;&lt;/strong&gt; — The original 1987 DNS specs. Remarkably readable for an RFC. Much of what's described here is in those two documents.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri once spent four hours debugging a DNS issue that turned out to be a single line he put into /etc/hosts in 2021. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Your Process Doesn't Exist Alone</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:02:42 +0000</pubDate>
      <link>https://forem.com/nazq/your-process-doesnt-exist-alone-1pnk</link>
      <guid>https://forem.com/nazq/your-process-doesnt-exist-alone-1pnk</guid>
      <description>&lt;h1&gt;
  
  
  Your Process Doesn't Exist Alone
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Sessions, Process Groups, and Why Ctrl-C Kills the Right Thing
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You pressed Ctrl-C and the program stopped. Exactly the right program. Not its parent. Not the shell you typed from. The one you were running.&lt;/p&gt;

&lt;p&gt;That probably felt unremarkable. It shouldn't. The kernel had to figure out which processes — out of a structured hierarchy on your machine — deserved that signal. It got it right every time you've ever tried. There is infrastructure specifically designed to make that work, and it involves three layers of grouping you've never had to think about.&lt;/p&gt;

&lt;p&gt;Let's look at what's actually happening.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Ctrl-C Has to Solve
&lt;/h2&gt;

&lt;p&gt;Here's a scenario. You type this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar &lt;/span&gt;czf archive.tgz big_directory/ | pv | gpg &lt;span class="nt"&gt;--encrypt&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; archive.tgz.gpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three processes. A pipeline. You get bored waiting and press Ctrl-C.&lt;/p&gt;

&lt;p&gt;Which one should die? All three. They're one operation from your perspective. The terminal agrees.&lt;/p&gt;

&lt;p&gt;Consider: you've spawned those three processes from a bash shell. Bash itself is running. Maybe there are other background jobs. Maybe there's a long-running process you started earlier with &lt;code&gt;&amp;amp;&lt;/code&gt;. When you press Ctrl-C, none of those should die.&lt;/p&gt;

&lt;p&gt;The kernel needs a way to know which processes are "the foreground thing you're doing right now" and which are everything else. The mechanism it uses is the &lt;strong&gt;process group&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Process Groups: The First Layer
&lt;/h2&gt;

&lt;p&gt;Every process belongs to a &lt;strong&gt;process group&lt;/strong&gt;, identified by a &lt;strong&gt;PGID&lt;/strong&gt; (process group ID). When you run a pipeline like &lt;code&gt;tar | pv | gpg&lt;/code&gt;, the shell puts all three into the same process group. When you run a single command, that command becomes a process group of one.&lt;/p&gt;

&lt;p&gt;The PGID is inherited through &lt;code&gt;fork()&lt;/code&gt;. When a process forks, the child starts in the same process group as the parent. To create a new process group, a process calls &lt;code&gt;setpgid()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A pipeline works like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Shell (PGID=1000)
    │
    ├─ fork() → tar   (PGID=1001, also leader since tar's PID=1001)
    ├─ fork() → pv    (PGID=1001, setpgid'd to join tar's group)
    └─ fork() → gpg   (PGID=1001, setpgid'd to join tar's group)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shell then tells the terminal: "the foreground process group is now 1001." When you press Ctrl-C, the kernel sees that sequence, generates &lt;a href="https://nazquadri.dev/blog/the-layer-below/05-signals/" rel="noopener noreferrer"&gt;&lt;code&gt;SIGINT&lt;/code&gt;&lt;/a&gt;, and delivers it to every process in group 1001. All three die. The shell — in group 1000 — is unaffected.&lt;/p&gt;

&lt;p&gt;That's why Ctrl-C kills the right thing. And &lt;strong&gt;that's why Ctrl-C sometimes doesn't work&lt;/strong&gt; — if a program installs a &lt;code&gt;SIGINT&lt;/code&gt; handler and doesn't propagate the signal to its children, or puts its children in a different process group, Ctrl-C kills the parent but leaves the children running. You've seen this: the prompt comes back, but there are still processes chewing through CPU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff62g5cgo09xymxrpdh8o.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff62g5cgo09xymxrpdh8o.jpeg" alt="Process groups and the terminal — SIGINT hits the pipeline, not the shell" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's one process in each group that the kernel considers the &lt;strong&gt;group leader&lt;/strong&gt; — the one whose PID equals the PGID. For a pipeline, bash usually makes the first process in the pipeline the leader. If the leader dies, the group doesn't disappear; the other members keep running. It's just a label.&lt;/p&gt;
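&lt;p&gt;You can inspect all three IDs with &lt;code&gt;ps&lt;/code&gt;. A minimal check on the current shell (for a typical interactive shell, PID, PGID, and SID are all equal, because it leads its own group and session):&lt;/p&gt;

```shell
# Print the shell's PID, process group ID, and session ID.
# Run a pipeline from an interactive shell and inspect its members
# the same way: each stage shares one PGID, distinct from the shell's.
ps -o pid,pgid,sid,comm -p $$
```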




&lt;h2&gt;
  
  
  Sessions: The Second Layer
&lt;/h2&gt;

&lt;p&gt;Process groups answer "which processes are this foreground job." But there's a bigger question: which foreground job is &lt;em&gt;active right now&lt;/em&gt;, and who owns the terminal?&lt;/p&gt;

&lt;p&gt;That's where &lt;strong&gt;sessions&lt;/strong&gt; come in.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;session&lt;/strong&gt; is a collection of process groups. All the process groups you create during a login session belong to the same session. When you open a terminal and start typing, every command you run — every pipeline, every background job, everything — is in the same session.&lt;/p&gt;

&lt;p&gt;Sessions have an ID too: the &lt;strong&gt;SID&lt;/strong&gt;. The first process to call &lt;code&gt;setsid()&lt;/code&gt; creates a new session and becomes the &lt;strong&gt;session leader&lt;/strong&gt;. Its PID becomes the SID.&lt;/p&gt;

&lt;p&gt;The session has one more piece: the &lt;strong&gt;controlling terminal&lt;/strong&gt;. This is the terminal device that delivers job control signals — &lt;code&gt;SIGINT&lt;/code&gt; from Ctrl-C, &lt;code&gt;SIGTSTP&lt;/code&gt; from Ctrl-Z, &lt;code&gt;SIGHUP&lt;/code&gt; when the terminal closes. Exactly one session owns a given terminal at a time, and the terminal knows which process group is in the foreground.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session (SID=999, controlling terminal: /dev/pts/3)
    │
    ├── Process Group 999 (shell — the session leader's group)
    │       └── bash (PID=999, PGID=999)
    │
    ├── Process Group 1001 (foreground: tar | pv | gpg)
    │       ├── tar
    │       ├── pv
    │       └── gpg
    │
    └── Process Group 1002 (background job: long-running-thing &amp;amp;)
            └── long-running-thing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The terminal keeps a pointer to the &lt;strong&gt;foreground process group&lt;/strong&gt;. &lt;code&gt;SIGINT&lt;/code&gt; goes there. &lt;code&gt;SIGTSTP&lt;/code&gt; goes there. When you run a command in the foreground, the shell calls &lt;code&gt;tcsetpgrp()&lt;/code&gt; to tell the terminal which group has the focus. When the command finishes, the shell takes it back.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;ps -ej&lt;/code&gt; and look at the SID and PGID columns. Every process is accounted for — including daemons, which show &lt;code&gt;?&lt;/code&gt; in the TTY column because they have no controlling terminal.&lt;/p&gt;
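&lt;p&gt;Against the session diagram above, &lt;code&gt;ps -ej&lt;/code&gt; output would look something like this (illustrative):&lt;/p&gt;

```plaintext
  PID  PGID   SID TTY      TIME     CMD
  999   999   999 pts/3    00:00:00 bash
 1001  1001   999 pts/3    00:00:04 tar
 1002  1001   999 pts/3    00:00:01 pv
 1003  1001   999 pts/3    00:00:00 gpg
 1450  1450  1450 ?        00:00:00 some-daemon    ← no controlling terminal
```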




&lt;h2&gt;
  
  
  The Controlling Terminal: What It Actually Is
&lt;/h2&gt;

&lt;p&gt;"Controlling terminal" sounds abstract. It's not.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;controlling terminal&lt;/strong&gt; is a file descriptor — specifically a TTY or PTY device — that a session is attached to. It's the device that knows how to translate "the user pressed Ctrl-C" into a signal, and "the user pressed Ctrl-Z" into a different signal.&lt;/p&gt;

&lt;p&gt;When a session leader opens a TTY device for the first time, that TTY becomes the controlling terminal for the session. The kernel records this association in both directions: the TTY knows its session, and the session knows its TTY.&lt;/p&gt;

&lt;p&gt;Here's what "controlling terminal" buys you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Job control signals.&lt;/strong&gt; Ctrl-C, Ctrl-Z, Ctrl-\ all generate signals via the controlling terminal's line discipline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIGHUP on terminal close.&lt;/strong&gt; When the controlling terminal's last master side is closed, the kernel sends &lt;code&gt;SIGHUP&lt;/code&gt; to the session leader and the foreground process group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background I/O protection.&lt;/strong&gt; If a background process tries to read from or write to the controlling terminal without being in the foreground group, it gets &lt;code&gt;SIGTTIN&lt;/code&gt; or &lt;code&gt;SIGTTOU&lt;/code&gt;. The process stops, you see "Stopped" in the shell, and you have to foreground it. Programs that reach for &lt;code&gt;/dev/tty&lt;/code&gt; directly (rather than stdout) are explicitly targeting the controlling terminal — and the kernel enforces access control on it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b9tf3vxbk17eu56dqyy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b9tf3vxbk17eu56dqyy.jpeg" alt="PTY device as the hub — sessions, process groups, and signals" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;nohup&lt;/code&gt; and &lt;code&gt;disown&lt;/code&gt;: Why They Exist
&lt;/h2&gt;

&lt;p&gt;Two commands make more sense once you know what a session is.&lt;/p&gt;

&lt;p&gt;When your SSH connection drops, the terminal closes. The kernel notices: the master end of the PTY is gone. It sends &lt;code&gt;SIGHUP&lt;/code&gt; to the session leader and the foreground process group. If your long-running job is in that session, it gets &lt;code&gt;SIGHUP&lt;/code&gt;. The default action for &lt;code&gt;SIGHUP&lt;/code&gt; is: die.&lt;/p&gt;

&lt;p&gt;This is not a bug. It's the intended behavior. The terminal is gone, so the processes that depended on it should clean up. The problem is that sometimes you &lt;em&gt;want&lt;/em&gt; a process to keep running after you log out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;nohup&lt;/code&gt;&lt;/strong&gt; solves this by making the child process ignore &lt;code&gt;SIGHUP&lt;/code&gt; before exec. It also redirects stdout and stderr to &lt;code&gt;nohup.out&lt;/code&gt; since the terminal won't exist to receive output. The process is still in your session, still has the same controlling terminal, it just won't die when &lt;code&gt;SIGHUP&lt;/code&gt; arrives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# nohup sets SIGHUP to SIG_IGN before exec, and redirects output&lt;/span&gt;
&lt;span class="nb"&gt;nohup &lt;/span&gt;long-running-job &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;disown&lt;/code&gt;&lt;/strong&gt; is a bash shell builtin that removes the job from the shell's job table. When the &lt;em&gt;terminal closes&lt;/em&gt; and the shell receives SIGHUP, it re-delivers SIGHUP to its job groups. &lt;code&gt;disown&lt;/code&gt; removes the job from that list, so the shell won't SIGHUP it on terminal close.&lt;/p&gt;

&lt;p&gt;Neither of these is a clean solution. The process is still in the session. &lt;code&gt;disown&lt;/code&gt; doesn't change the signal disposition at all, so if the kernel's own terminal-close &lt;code&gt;SIGHUP&lt;/code&gt; reaches the process directly (say, it's in the foreground process group when the terminal dies), it still dies. And in both cases the process keeps the closed terminal as its controlling terminal, which causes problems if it ever tries to read from stdin.&lt;/p&gt;

&lt;p&gt;These are bandaids. The real solution is what comes next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why &lt;code&gt;ssh remote-host some-command &amp;amp;&lt;/code&gt; is dangerous.&lt;/strong&gt; The &lt;code&gt;some-command&lt;/code&gt; process starts on the remote host inside your SSH session. When your SSH connection drops, &lt;code&gt;SIGHUP&lt;/code&gt; fires. If &lt;code&gt;some-command&lt;/code&gt; doesn't handle &lt;code&gt;SIGHUP&lt;/code&gt;, it dies. The solution is either &lt;code&gt;nohup&lt;/code&gt;, or — better — put it inside a tmux session on the remote host.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;setsid()&lt;/code&gt;: Cutting the Cord
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;real&lt;/em&gt; way to detach a process from a terminal is &lt;code&gt;setsid()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;setsid()&lt;/code&gt; creates a new session. The calling process becomes the session leader of a brand new, empty session. This new session has no controlling terminal — and importantly, a session only gets a controlling terminal when the session leader explicitly opens a TTY. Without &lt;code&gt;O_NOCTTY&lt;/code&gt;, opening a TTY device gives you a controlling terminal whether you want one or not. So if you never open a TTY, you never get one. No controlling terminal means no &lt;code&gt;SIGHUP&lt;/code&gt;, no job control signals, no &lt;code&gt;SIGTTIN&lt;/code&gt;/&lt;code&gt;SIGTTOU&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One constraint: &lt;code&gt;setsid()&lt;/code&gt; fails if the calling process is already a process group leader. This is why the daemonization recipe forks first — the child is not a group leader, so &lt;code&gt;setsid()&lt;/code&gt; succeeds.&lt;/p&gt;

&lt;p&gt;This is what daemonization does. The classic daemon recipe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Parent forks, then exits — child is now orphaned&lt;/span&gt;
&lt;span class="c1"&gt;// (child is not a process group leader, so setsid() will work)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Child calls setsid() — new session, no controlling terminal&lt;/span&gt;
&lt;span class="n"&gt;setsid&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Fork again so we're NOT the session leader&lt;/span&gt;
&lt;span class="c1"&gt;// (without O_NOCTTY, opening a TTY would give us a controlling terminal)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Now we're a non-session-leader process in a session with no&lt;/span&gt;
&lt;span class="c1"&gt;// controlling terminal. SIGHUP cannot reach us through the terminal.&lt;/span&gt;
&lt;span class="c1"&gt;// We are truly detached.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The double-fork: by forking a second time, we become a child of the session leader — still in the new session, but not the leader. A non-leader can never become a session leader, so it can never acquire a controlling terminal by opening a TTY.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why daemonization requires a double fork.&lt;/strong&gt; You can't get to truly "no controlling terminal" in a single step if you're a session leader. The second fork is what closes that loophole.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;setsid()&lt;/code&gt; is also what PTY supervisors call in the child process before exec. When a supervisor spawns a child with a PTY, the child calls &lt;code&gt;setsid()&lt;/code&gt; first, then opens the PTY slave. That &lt;code&gt;open()&lt;/code&gt; call is what gives the child its controlling terminal — the PTY slave. This is intentional. We &lt;em&gt;want&lt;/em&gt; the child to have a controlling terminal. We just want that terminal to be the PTY we control, not whatever terminal the supervisor was launched from.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why tmux and screen Survive Terminal Closure
&lt;/h2&gt;

&lt;p&gt;Now all the pieces are in place to understand something that confused most of us for years.&lt;/p&gt;

&lt;p&gt;You connect to a server over SSH. You start &lt;code&gt;tmux&lt;/code&gt;. You run a bunch of stuff inside tmux. Your SSH connection drops. You reconnect and attach to the same tmux session. Everything is still there.&lt;/p&gt;

&lt;p&gt;How?&lt;/p&gt;

&lt;p&gt;When you start tmux, it forks a &lt;strong&gt;tmux server&lt;/strong&gt;. That server calls &lt;code&gt;setsid()&lt;/code&gt;. It is now the session leader of its own session with no controlling terminal. It is not attached to your SSH terminal at all. When your SSH connection dies, &lt;code&gt;SIGHUP&lt;/code&gt; goes to the remote login shell's session and the processes in it. The tmux server is not in that session. It's untouched.&lt;/p&gt;

&lt;p&gt;The tmux server holds the master ends of PTYs for all your "windows." Your shells and programs run on the slave ends. Those PTY slaves are controlling terminals for &lt;em&gt;their&lt;/em&gt; sessions — not for yours.&lt;/p&gt;

&lt;p&gt;When you reconnect and run &lt;code&gt;tmux attach&lt;/code&gt;, a new tmux client process starts. It connects to the tmux server over a Unix socket. The server starts forwarding the PTY master output to the new client, and the client starts forwarding your keystrokes to the server. From your perspective, it looks like you reconnected. From the shell inside tmux's perspective, absolutely nothing changed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your SSH session (terminal = /dev/pts/0)
    │
    └── tmux client ─────────Unix socket────► tmux server (setsid, no ctrl terminal)
                                                    │
                                              PTY master ─────► /dev/pts/7
                                                                      │
                                               bash (ctrl terminal = /dev/pts/7)
                                                                      │
                                               your stuff, running fine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When your SSH terminal closes, the left side of this diagram disappears. The right side doesn't care.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why the right workflow on a remote server is &lt;code&gt;tmux&lt;/code&gt; first, then work inside tmux&lt;/strong&gt; — not &lt;code&gt;nohup&lt;/code&gt; or &lt;code&gt;disown&lt;/code&gt;. The session architecture guarantees survival. The workarounds don't. tmux isn't magic. It's &lt;code&gt;setsid()&lt;/code&gt; and a Unix socket.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Signal Routing Map
&lt;/h2&gt;

&lt;p&gt;Let's put the whole picture together. When you press a key combination in your terminal:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ctrl-C&lt;/strong&gt; → The terminal line discipline intercepts it. It sends &lt;code&gt;SIGINT&lt;/code&gt; to the &lt;strong&gt;foreground process group&lt;/strong&gt; of the terminal's session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ctrl-Z&lt;/strong&gt; → The terminal line discipline intercepts it. It sends &lt;code&gt;SIGTSTP&lt;/code&gt; to the &lt;strong&gt;foreground process group&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;Ctrl-\*&lt;/em&gt; → Same routing, different signal: &lt;code&gt;SIGQUIT&lt;/code&gt;. Kills with a core dump.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal close&lt;/strong&gt; → The kernel sends &lt;code&gt;SIGHUP&lt;/code&gt; to the &lt;strong&gt;session leader&lt;/strong&gt; and the &lt;strong&gt;foreground process group&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background process reads from terminal&lt;/strong&gt; → The kernel sends &lt;code&gt;SIGTTIN&lt;/code&gt; to that process's &lt;strong&gt;process group&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsull4katpjb7l4re2it.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsull4katpjb7l4re2it.jpeg" alt="Signal delivery reference — keys, signals, and targets" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice what's absent: there is no "kill just this one process from the terminal" mechanism. &lt;strong&gt;That's why daemons show &lt;code&gt;?&lt;/code&gt; in the TTY column of &lt;code&gt;ps aux&lt;/code&gt;&lt;/strong&gt; — a daemon has no controlling terminal, and &lt;code&gt;ps&lt;/code&gt; displays that as &lt;code&gt;?&lt;/code&gt;. The mysterious &lt;code&gt;?&lt;/code&gt; that used to look like noise is a direct signal: this process is properly detached. The terminal deals in groups. If you want to kill exactly one process, you use &lt;code&gt;kill(pid, signal)&lt;/code&gt; directly — by PID. The terminal doesn't do individual targeting.&lt;/p&gt;

&lt;p&gt;This is also why &lt;code&gt;kill "$pid"&lt;/code&gt; and &lt;code&gt;kill -- "-$pid"&lt;/code&gt; behave so differently — the plain PID signals one process, while the negated PID signals that process's entire group. And &lt;code&gt;jobs -p&lt;/code&gt; prints the PID of each job's process group &lt;em&gt;leader&lt;/em&gt;, so &lt;code&gt;kill $(jobs -p)&lt;/code&gt; hits only the leaders; to take down whole pipelines you need the negated PGIDs.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means for Supervising Processes
&lt;/h2&gt;

&lt;p&gt;A PTY supervisor is a process that holds the master end of a PTY and manages a child process's terminal lifecycle. When it forks the child, the child calls &lt;code&gt;setsid()&lt;/code&gt;. This creates a new session. The child then opens the PTY slave — which becomes its controlling terminal. The child and everything it forks lives in that session. Job control signals go to that session's process groups. &lt;code&gt;SIGHUP&lt;/code&gt; will go to that session if the supervisor closes the PTY master — and we can use that deliberately to tell the supervised process to clean up.&lt;/p&gt;

&lt;p&gt;The supervisor itself is in &lt;em&gt;its own&lt;/em&gt; session, not the child's. The child's terminal lifecycle is entirely under the supervisor's control.&lt;/p&gt;

&lt;p&gt;This is exactly why &lt;code&gt;nohup&lt;/code&gt; and &lt;code&gt;disown&lt;/code&gt; feel janky — they're trying to get the benefits of this architecture without actually building it. They leave the process in the wrong session and just ask it to ignore signals. The proper solution is to put the process in its own session from the start, owned by a supervisor that manages its lifecycle explicitly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Every process belongs to a &lt;strong&gt;process group&lt;/strong&gt; (PGID). Shells put pipeline members in the same group.&lt;/li&gt;
&lt;li&gt;Ctrl-C and Ctrl-Z deliver signals to the &lt;strong&gt;foreground process group&lt;/strong&gt;, not just one process.&lt;/li&gt;
&lt;li&gt;Every process group belongs to a &lt;strong&gt;session&lt;/strong&gt; (SID). A terminal is the &lt;strong&gt;controlling terminal&lt;/strong&gt; of exactly one session.&lt;/li&gt;
&lt;li&gt;When a terminal closes, &lt;code&gt;SIGHUP&lt;/code&gt; goes to the session leader and foreground process group.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nohup&lt;/code&gt; ignores &lt;code&gt;SIGHUP&lt;/code&gt;. &lt;code&gt;disown&lt;/code&gt; removes the job from the shell's cleanup list. Both are workarounds.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;setsid()&lt;/code&gt; creates a new session with no controlling terminal. This is how daemons, &lt;code&gt;tmux&lt;/code&gt;, and PTY supervisors properly detach.&lt;/li&gt;
&lt;li&gt;tmux survives disconnection because the server calls &lt;code&gt;setsid()&lt;/code&gt; at startup and communicates via a Unix socket — it was never in your terminal's session.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/setsid.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 setsid&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/setpgid.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 setpgid&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man3/tcsetpgrp.3.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 3 tcsetpgrp&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the system calls that manage all of this directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man7/credentials.7.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 7 credentials&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — Linux's full documentation on PID, PGID, SID and how they interact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man1/ps.1.html" rel="noopener noreferrer"&gt;&lt;code&gt;ps -ej&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — run this in a terminal with a few jobs running and watch the SID/PGID columns. Everything above becomes concrete immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.linusakesson.net/programming/tty/" rel="noopener noreferrer"&gt;The TTY Demystified&lt;/a&gt;&lt;/strong&gt; — Linus Akesson's deep dive, already recommended in earlier posts. The section on job control is directly relevant here.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri once killed a production job by closing his laptop. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Signals: The Kernel's Text Messages</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:02:26 +0000</pubDate>
      <link>https://forem.com/nazq/signals-the-kernels-text-messages-27f</link>
      <guid>https://forem.com/nazq/signals-the-kernels-text-messages-27f</guid>
      <description>&lt;h1&gt;
  
  
  Signals: The Kernel's Text Messages
&lt;/h1&gt;

&lt;h2&gt;
  
  
  kill -9 Isn't What You Think It Is
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~10 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You've been saying "force kill" for years. You type &lt;code&gt;kill -9 1234&lt;/code&gt; when a process won't die, and you picture the operating system reaching in with a fist and crushing it.&lt;/p&gt;

&lt;p&gt;That's not what happens. What happens is the kernel sends the process a message. The message contains exactly one piece of information: the number 9.&lt;/p&gt;

&lt;p&gt;That's it. A number. The process gets a signal, and the signal is SIGKILL — signal 9. For a running process, termination is essentially immediate. But there's a case where even SIGKILL can't immediately kill a process: &lt;strong&gt;uninterruptible sleep&lt;/strong&gt; — when a process is blocked in kernel code that cannot be safely interrupted. That delay, and what causes it, turns out to matter enormously.&lt;/p&gt;

&lt;p&gt;Signals are one of the oldest IPC mechanisms in Unix — older than sockets, older than most of the other things you'd reach for when you want two processes to communicate. They're asynchronous, they arrive at arbitrary times, they can be caught or ignored or blocked, and some of them mean three different things depending on context. They're worth understanding.&lt;/p&gt;

&lt;p&gt;Let's look at the machinery.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a Signal Actually Is
&lt;/h2&gt;

&lt;p&gt;A signal is not a byte written to a pipe. It's not a network packet. It's a tiny notification the kernel delivers to a process, completely out of band from whatever the process is currently doing.&lt;/p&gt;

&lt;p&gt;When a signal arrives, the kernel interrupts the process mid-execution — whatever instruction it was running — and delivers the signal. The process then does one of three things, depending on how it has configured its &lt;strong&gt;signal disposition&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Default action.&lt;/strong&gt; The kernel handles it. Most signals kill the process; some stop it; a few are ignored by default. The process doesn't have a say.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catch it.&lt;/strong&gt; The process has installed a signal handler — a function. The kernel calls that function instead. When the handler returns, the process resumes whatever it was doing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignore it.&lt;/strong&gt; The process has said "I don't care about this signal." The kernel checks, shrugs, and moves on.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvey9zzzemuxvt9ie6oh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvey9zzzemuxvt9ie6oh.jpeg" alt="Signal disposition — default, caught, ignored" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The disposition model is per-process, per-signal. A process can catch SIGTERM, ignore SIGUSR1, and take the default for everything else. Most processes don't configure most signals — they just take the defaults. Think back to how many times you've intentionally installed a signal handler: rarely, and usually for something idiosyncratic. That's been my experience, at least.&lt;/p&gt;

&lt;p&gt;And two signals can never be caught or ignored: &lt;strong&gt;SIGKILL (9)&lt;/strong&gt; and &lt;strong&gt;SIGSTOP (19)&lt;/strong&gt;. The kernel handles them unconditionally. No signal handler, no &lt;code&gt;SIG_IGN&lt;/code&gt;, no blocking mask. The process cannot intercept these — but as you'll see below, that doesn't mean SIGKILL takes effect instantly in every case.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Signals You Already Know (And What's Actually Happening)
&lt;/h2&gt;

&lt;p&gt;Let's start with the familiar ones and fill in the gaps.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGTERM (15)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kill &amp;lt;pid&amp;gt;&lt;/code&gt; with no flag sends SIGTERM. This is the polite version. The default action is termination, but the process can catch it — and well-behaved servers do. They use the SIGTERM handler to flush logs, close database connections, finish in-flight requests, and exit cleanly.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;systemd&lt;/code&gt; stops a service, it sends SIGTERM first. Waits a few seconds. Then, if the process is still running, sends SIGKILL. That's the &lt;code&gt;TimeoutStopSec&lt;/code&gt; setting you've probably seen in unit files. The grace period between the two is intentional: give the process a chance to clean up before the kernel pulls the plug.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGKILL (9)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kill -9&lt;/code&gt; or &lt;code&gt;kill -SIGKILL&lt;/code&gt;. Not caught. Not ignored. Not blocked. The moment the kernel delivers SIGKILL to a running process, it's marked for immediate termination. The kernel doesn't call any signal handler, doesn't run atexit functions, doesn't flush stdio buffers.&lt;/p&gt;

&lt;p&gt;This is why &lt;code&gt;kill -9&lt;/code&gt; leaves zombie processes, half-written files, and unreleased locks behind. It's not "force kill" — it's "unconditional termination, no cleanup allowed."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why &lt;code&gt;kill -9&lt;/code&gt; sometimes can't kill a process&lt;/strong&gt;: &lt;strong&gt;uninterruptible sleep&lt;/strong&gt;, shown in &lt;code&gt;ps&lt;/code&gt; as state &lt;code&gt;D&lt;/code&gt;. This is a process that's blocked in a kernel code path that cannot be safely interrupted — usually waiting on a disk I/O that can't be cancelled, or an NFS mount that's gone stale. The kernel won't deliver SIGKILL until the process wakes from that sleep. Sometimes it never wakes. That's how you get unkillable processes, and the only fix is rebooting or waiting for the I/O to resolve.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGINT (2)
&lt;/h3&gt;

&lt;p&gt;You press Ctrl-C. SIGINT happens. The default action is termination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why Ctrl-C kills a whole pipeline.&lt;/strong&gt; The kernel's terminal driver sends SIGINT to the &lt;strong&gt;entire foreground process group&lt;/strong&gt; — not a single process. Every process in the group gets it simultaneously. When you run &lt;code&gt;cat huge_file | grep pattern | wc -l&lt;/code&gt;, all three processes are in the same foreground process group. They all get SIGINT. They all die. The pipeline collapses cleanly. I explain why it hits the whole group — and what a process group even is — in &lt;a href="https://nazquadri.dev/blog/the-layer-below/06-sessions-and-groups/" rel="noopener noreferrer"&gt;&lt;em&gt;Sessions and Process Groups&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGPIPE
&lt;/h3&gt;

&lt;p&gt;Run &lt;code&gt;python script.py | head -n 10&lt;/code&gt; and sometimes you'll see &lt;code&gt;BrokenPipeError&lt;/code&gt;. Here's what happened: &lt;code&gt;head&lt;/code&gt; read its 10 lines and exited. The pipe closed. The Python script was still writing. The kernel detected that the write end of a pipe has no readers, and sent &lt;strong&gt;SIGPIPE&lt;/strong&gt; to the still-writing process. The default action is termination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why &lt;code&gt;BrokenPipeError&lt;/code&gt; exists&lt;/strong&gt; — it's a caught SIGPIPE converted to an exception. Programs that ignore SIGPIPE (&lt;code&gt;signal(SIGPIPE, SIG_IGN)&lt;/code&gt;) will instead get an error return from &lt;code&gt;write()&lt;/code&gt;. Either way, the kernel is telling you: nobody's reading anymore, stop writing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Signals You Probably Don't Know
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SIGTSTP (20): Ctrl-Z Does This
&lt;/h3&gt;

&lt;p&gt;You press Ctrl-Z. The shell reports &lt;code&gt;[1]+  Stopped&lt;/code&gt;. The process is still alive, just suspended.&lt;/p&gt;

&lt;p&gt;What happened: the terminal driver sent SIGTSTP to the foreground process group. The default action is to &lt;strong&gt;stop&lt;/strong&gt; the process — it's taken off the kernel's run queue and no longer scheduled for CPU time. Its memory pages stay allocated, but it's frozen mid-execution.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fg&lt;/code&gt; sends SIGCONT to resume it. &lt;code&gt;bg&lt;/code&gt; also sends SIGCONT, but tells the shell to let it run in the background.&lt;/p&gt;

&lt;p&gt;Programs can catch SIGTSTP. Vim does this. When you Ctrl-Z out of vim, it saves its terminal state, restores the original terminal settings, &lt;em&gt;then&lt;/em&gt; stops itself. When you &lt;code&gt;fg&lt;/code&gt; back in, it gets SIGCONT, re-enters raw mode, and redraws the screen. That's not magic — it's a signal handler.&lt;/p&gt;

&lt;p&gt;The difference between SIGTSTP and SIGSTOP: SIGTSTP can be caught — a process can intercept it and do cleanup before stopping. SIGSTOP cannot — the kernel stops it unconditionally. The full job control cycle — &lt;code&gt;&amp;amp;&lt;/code&gt;, &lt;code&gt;fg&lt;/code&gt;, &lt;code&gt;bg&lt;/code&gt;, and how the shell orchestrates it — is in &lt;a href="https://nazquadri.dev/blog/the-layer-below/06-sessions-and-groups/" rel="noopener noreferrer"&gt;&lt;em&gt;Sessions and Process Groups&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGWINCH: Your Terminal Is Resizing
&lt;/h3&gt;

&lt;p&gt;Every time you drag the corner of your terminal window, a signal fires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SIGWINCH&lt;/strong&gt; — "window change" — is delivered to the foreground process group when the terminal dimensions change. The kernel's terminal driver detects that the master side has reported a new size, and sends the signal.&lt;/p&gt;

&lt;p&gt;Let that sink in for a moment. Window resizing is a signal event. The kernel is involved.&lt;/p&gt;

&lt;p&gt;The process catches SIGWINCH, calls &lt;code&gt;ioctl(STDOUT_FILENO, TIOCGWINSZ, &amp;amp;ws)&lt;/code&gt; to get the new dimensions, and redraws its UI. This is how vim reflows when you resize the window. How htop re-renders its columns. How your shell prompt re-wraps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscn4lgx851um94k24gdo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscn4lgx851um94k24gdo.jpeg" alt="SIGWINCH — terminal resize signal flow" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're writing a terminal application, you need to catch SIGWINCH or your UI will look wrong after any resize.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGCHLD: Your Child Died (Or Did Something Else)
&lt;/h3&gt;

&lt;p&gt;When a child process changes state — exits, is stopped, is resumed — the kernel sends &lt;strong&gt;SIGCHLD&lt;/strong&gt; to its parent.&lt;/p&gt;

&lt;p&gt;The default action is to ignore it. Most processes don't care.&lt;/p&gt;

&lt;p&gt;But shells care deeply. When bash gets SIGCHLD, it checks which child changed state, updates its job table, and prints &lt;code&gt;[1]+  Done  sleep 10&lt;/code&gt; or whatever. When a daemon forks children to handle requests, it catches SIGCHLD so it can call &lt;code&gt;waitpid()&lt;/code&gt; and reap the zombies before they accumulate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why zombie processes happen.&lt;/strong&gt; When a process exits, it doesn't fully disappear — it leaves a small record in the kernel's process table, holding the exit code. The parent is expected to call &lt;code&gt;wait()&lt;/code&gt; or &lt;code&gt;waitpid()&lt;/code&gt; to retrieve that exit code, at which point the zombie is cleaned up. If the parent never calls wait, the zombie just sits there. If the parent exits, zombies get reparented to init (PID 1), which calls wait in a loop.&lt;/p&gt;

&lt;p&gt;SIGCHLD is the kernel radioing your base camp: "One of your people turned." You can go collect what's left — the exit code — by calling &lt;code&gt;waitpid()&lt;/code&gt;. If you don't, they shamble around the process table indefinitely. The kernel literally calls them zombies. State &lt;code&gt;Z&lt;/code&gt; in &lt;code&gt;ps&lt;/code&gt;. No cure. No kill signal works — they're already dead. The only way to clear them is to collect the exit code, or let the parent die too, at which point &lt;code&gt;init&lt;/code&gt; adopts the orphans and reaps them. Rick Grimes would call that a mercy kill. The kernel calls it &lt;code&gt;wait()&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIGALRM: How Timeouts Work
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;alarm(30)&lt;/code&gt; sets a timer. In 30 seconds, the kernel delivers &lt;strong&gt;SIGALRM&lt;/strong&gt; to the process. The default action is termination, but programs catch it to implement timeouts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's how shell scripts time out operations.&lt;/strong&gt; The crude bash version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kill myself after 30 seconds if still running&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;30 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &amp;amp; long-running-command
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or the cleaner way, using the &lt;code&gt;timeout&lt;/code&gt; command (coreutils):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;timeout &lt;/span&gt;30 long-running-command
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, &lt;code&gt;timeout&lt;/code&gt; does essentially what &lt;code&gt;alarm()&lt;/code&gt; does — sets a timer, catches the signal, kills the child. The real pattern in C is &lt;code&gt;alarm()&lt;/code&gt; with a signal handler that cancels or interrupts the operation. Many C programs use this to implement read timeouts, connection timeouts, and watchdog behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  SIGHUP: The Signal That Means Two Different Things
&lt;/h2&gt;

&lt;p&gt;This one deserves its own section because it's genuinely confusing until you understand the history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SIGHUP&lt;/strong&gt; — "hangup" — originally meant that your physical serial terminal had disconnected. The modem connection dropped. The carrier signal was gone. The line was dead.&lt;/p&gt;

&lt;p&gt;When that happened, the kernel sent SIGHUP to the session leader (usually the shell) of the terminal session. The shell would then die, taking its job-controlled children with it. This was the correct behavior: if your terminal is gone, there's no point keeping the session running.&lt;/p&gt;

&lt;p&gt;SIGHUP means "your terminal disconnected." The default action is termination.&lt;/p&gt;

&lt;p&gt;There's a second meaning, and it came from necessity.&lt;/p&gt;

&lt;p&gt;By the mid-1980s, Unix system administrators had noticed: SIGHUP is a signal you can catch. And servers don't have physical terminals to disconnect. So SIGHUP became the conventional signal to tell a daemon "reload your configuration without restarting."&lt;/p&gt;

&lt;p&gt;nginx, sshd, Apache, rsyslog — they all catch SIGHUP and re-read their config files. You'll see this in documentation all the time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="nt"&gt;-HUP&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /var/run/nginx.pid&lt;span class="si"&gt;)&lt;/span&gt;    &lt;span class="c"&gt;# reload nginx config&lt;/span&gt;
&lt;span class="c"&gt;# or, in modern times:&lt;/span&gt;
systemctl reload nginx                  &lt;span class="c"&gt;# which does the same thing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;That's why &lt;code&gt;nginx -s reload&lt;/code&gt; works&lt;/strong&gt; — the nginx binary locates the master process PID, sends SIGHUP, and exits. The master process catches SIGHUP, re-reads its config, and gracefully replaces its workers. It's not a special protocol. It's just a signal.&lt;/p&gt;

&lt;p&gt;The same signal means "your terminal died, please exit" to an interactive process, and "please reload your config" to a daemon. Context entirely determines the correct behavior.&lt;/p&gt;

&lt;p&gt;It's terrible API design, but it's been working for 50 years.&lt;/p&gt;




&lt;h2&gt;
  
  
  Process Groups and Why Signals Go Sideways
&lt;/h2&gt;

&lt;p&gt;Signals don't always go where you expect. When Ctrl-C kills an entire pipeline but not your shell, that's not the signal mechanism — that's &lt;strong&gt;process groups&lt;/strong&gt;, and the kernel's rules about who receives what. I cover the full process group and session architecture in &lt;a href="https://nazquadri.dev/blog/the-layer-below/06-sessions-and-groups/" rel="noopener noreferrer"&gt;&lt;em&gt;Sessions and Process Groups&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  nohup
&lt;/h2&gt;

&lt;p&gt;You've probably used &lt;code&gt;nohup&lt;/code&gt; to keep a process alive after logout. How it actually works — and why tmux is better — is in &lt;a href="https://nazquadri.dev/blog/the-layer-below/06-sessions-and-groups/" rel="noopener noreferrer"&gt;&lt;em&gt;Sessions and Process Groups&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Signals and the PTY
&lt;/h2&gt;

&lt;p&gt;Signals and file descriptors are not separate systems. They interlock at the terminal level.&lt;/p&gt;

&lt;p&gt;When a PTY supervisor — a process that holds the master end of a PTY and manages a child's terminal session — wants to send Ctrl-C to the child, there are two approaches. The wrong one: &lt;code&gt;kill(child_pid, SIGINT)&lt;/code&gt;. The right one: write the byte &lt;code&gt;0x03&lt;/code&gt; (the byte Ctrl-C generates) to the PTY master.&lt;/p&gt;

&lt;p&gt;The kernel's line discipline, running inside the PTY, processes that byte and generates SIGINT for the foreground process group inside the PTY's session. The child sees a real Ctrl-C — one that came through the terminal, the way Ctrl-C is supposed to arrive. Job control works correctly. Signal handlers see the right source.&lt;/p&gt;
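&lt;p&gt;You can watch the line discipline do this translation with Python's &lt;code&gt;pty&lt;/code&gt; module — a quick Linux sketch, not production supervisor code. The child restores the default SIGINT disposition (Python normally converts SIGINT into &lt;code&gt;KeyboardInterrupt&lt;/code&gt;), then the parent writes &lt;code&gt;0x03&lt;/code&gt; to the master and observes that the child died of a real SIGINT:&lt;/p&gt;

```python
import os
import pty
import signal
import time

pid, master_fd = pty.fork()
if pid == 0:
    # Child: pty.fork() made us a session leader with the pty slave
    # as our controlling terminal. Restore the default SIGINT
    # disposition so the signal actually terminates us.
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    time.sleep(30)
    os._exit(0)

time.sleep(0.3)                  # give the child time to start
os.write(master_fd, b"\x03")     # the byte Ctrl-C generates
_, status = os.waitpid(pid, 0)
os.close(master_fd)

# The line discipline turned 0x03 into SIGINT for the
# foreground process group inside the pty's session.
killed_by_sigint = os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGINT
print(killed_by_sigint)
```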

&lt;p&gt;When the supervisor wants to stop the whole session cleanly, it sends SIGTERM to the child's process group (using the negative PGID), waits for exit by catching SIGCHLD, then cleans up resources. SIGKILL is the fallback if the process ignores SIGTERM and the timeout expires.&lt;/p&gt;
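&lt;p&gt;The group-wide SIGTERM step looks something like this sketch — start a child in its own session (so it leads its own process group), then signal the whole group rather than one pid:&lt;/p&gt;

```python
import os
import signal
import subprocess
import time

# start_new_session=True puts the child in its own session and
# process group, the way a supervisor isolates what it manages.
child = subprocess.Popen(["sleep", "30"], start_new_session=True)
time.sleep(0.1)

# Signal the entire group: kill(-pgid, SIGTERM) under the hood.
os.killpg(os.getpgid(child.pid), signal.SIGTERM)
status = child.wait()
print(status)   # negative value: terminated by that signal number
```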

&lt;p&gt;&lt;strong&gt;That's why well-written supervisors write control bytes to the PTY master instead of sending signals directly&lt;/strong&gt; — it respects the terminal abstraction that the child process expects.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mental Model
&lt;/h2&gt;

&lt;p&gt;Signals are the kernel's notification system for processes. Asynchronous, lightweight, and limited — they carry one piece of information, the signal number, and optionally a few extra bytes in the &lt;code&gt;siginfo_t&lt;/code&gt; structure. They're not designed for data transfer. They're designed for control: "stop what you're doing," "your child just exited," "your window changed size," "someone is asking you to reload your config."&lt;/p&gt;

&lt;p&gt;The disposition model — default, catch, ignore — gives each process control over how it responds. Except for SIGKILL and SIGSTOP, which the kernel reserves for itself. Those two bypass the disposition system entirely.&lt;/p&gt;
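&lt;p&gt;All three dispositions, and the kernel's refusal to let you touch SIGKILL, fit in a few lines (a self-signaling sketch; SIGUSR1/SIGUSR2 are just convenient user-defined signals):&lt;/p&gt;

```python
import os
import signal

caught = []

def handler(signum, frame):
    caught.append(signum)

signal.signal(signal.SIGUSR1, handler)          # catch
signal.signal(signal.SIGUSR2, signal.SIG_IGN)   # ignore

os.kill(os.getpid(), signal.SIGUSR1)   # handler runs
os.kill(os.getpid(), signal.SIGUSR2)   # silently dropped

try:
    signal.signal(signal.SIGKILL, handler)      # the kernel refuses
    refused = False
except OSError:
    refused = True

print(caught == [signal.SIGUSR1], refused)
```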

&lt;p&gt;Process groups determine the blast radius — which processes a signal actually reaches. That mechanism, along with session architecture, is in &lt;a href="https://nazquadri.dev/blog/the-layer-below/06-sessions-and-groups/" rel="noopener noreferrer"&gt;&lt;em&gt;Sessions and Process Groups&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And SIGHUP is historically weird. Accept it and move on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;p&gt;Here's what we've covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A signal is a small out-of-band notification delivered by the kernel to a process.&lt;/li&gt;
&lt;li&gt;Disposition is per-process, per-signal: default action, catch with a handler, or ignore.&lt;/li&gt;
&lt;li&gt;SIGKILL (9) and SIGSTOP (19) cannot be caught or ignored — the kernel handles them unconditionally.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kill -9&lt;/code&gt; can fail to immediately kill a process in uninterruptible sleep (&lt;code&gt;D&lt;/code&gt; state).&lt;/li&gt;
&lt;li&gt;The terminal driver sends SIGINT on Ctrl-C and SIGTSTP on Ctrl-Z to the entire foreground process group.&lt;/li&gt;
&lt;li&gt;SIGPIPE fires when you write to a pipe with no readers — that's the &lt;code&gt;BrokenPipeError&lt;/code&gt; you've seen.&lt;/li&gt;
&lt;li&gt;SIGWINCH fires whenever the terminal is resized — programs catch it to redraw their UI.&lt;/li&gt;
&lt;li&gt;SIGCHLD notifies a parent when a child changes state; catching it is how you avoid zombie processes.&lt;/li&gt;
&lt;li&gt;SIGHUP means "terminal disconnected" to interactive programs, and "reload config" to daemons. Both are real.&lt;/li&gt;
&lt;li&gt;Process groups, &lt;code&gt;nohup&lt;/code&gt;, session architecture, and &lt;code&gt;setsid()&lt;/code&gt; are covered in the next post: &lt;a href="https://nazquadri.dev/blog/the-layer-below/06-sessions-and-groups/" rel="noopener noreferrer"&gt;&lt;em&gt;Sessions and Process Groups&lt;/em&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man7/signal.7.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 7 signal&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the complete Linux signal reference. Every signal, its default action, whether it can be caught.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/sigaction.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 sigaction&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the correct way to install a signal handler (not the older &lt;code&gt;signal(2)&lt;/code&gt;, which has subtle portability problems).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/kill.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 kill&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/waitpid.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 waitpid&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the syscalls for sending signals and reaping children.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.apuebook.com/" rel="noopener noreferrer"&gt;Advanced Programming in the UNIX Environment&lt;/a&gt;&lt;/strong&gt; — Stevens and Rago, Chapter 10 on signals. Still the best treatment of the full signal model in print.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri occasionally sends SIGTERM to processes that deserve a chance to clean up first. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What Actually Happens When You Read a File</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:02:11 +0000</pubDate>
      <link>https://forem.com/nazq/what-actually-happens-when-you-read-a-file-2cd4</link>
      <guid>https://forem.com/nazq/what-actually-happens-when-you-read-a-file-2cd4</guid>
      <description>&lt;h1&gt;
  
  
  What Actually Happens When You Read a File
&lt;/h1&gt;

&lt;h2&gt;
  
  
  14 Things That Happen Before You Get Your Bytes
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You wrote &lt;code&gt;data = open('file.txt').read()&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Two function calls. That's all you did. Python is happy. Your bytes are there. The whole thing took 0.317 milliseconds and you moved on with your life.&lt;/p&gt;

&lt;p&gt;Under the hood, at least four separate processors woke up to serve you. One of them is inside your storage drive. Another is a dedicated interrupt controller. A third is the flash memory controller that lives alongside the NAND chips themselves, running its own firmware, maintaining its own data structures in its own RAM. And then there's your CPU, which you thought was doing all the work.&lt;/p&gt;

&lt;p&gt;Let's go look at what actually happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Comfortable Lie
&lt;/h2&gt;

&lt;p&gt;The mental model most of us carry is: open a file, read the bytes, done. The file is "on disk." Reading it means "getting it from disk." Maybe there's a cache involved. Simple.&lt;/p&gt;

&lt;p&gt;The truth is more interesting. Here's the actual stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python f.read()
  └── C stdlib buffering
        └── read(2) syscall
              └── kernel VFS layer
                    └── filesystem driver (ext4, xfs, btrfs...)
                          └── block I/O scheduler
                                └── NVMe driver
                                      └── NVMe controller (separate ARM/RISC-V SoC)
                                            └── Flash Translation Layer
                                                  └── NAND flash cells
                                                        └── DMA → system RAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You've probably hit the effects without knowing the cause: the first read of a file takes longer than subsequent reads. Large files on SSDs can stall with weird latency spikes. Programs that read the same file concurrently don't always step on each other. None of this makes sense until you know what's underneath.&lt;/p&gt;




&lt;h2&gt;
  
  
  You Haven't Even Left Python Yet
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# this IS a syscall — the kernel is already working
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;        &lt;span class="c1"&gt;# and now it works harder
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;open()&lt;/code&gt; in Python isn't just creating a file object. CPython calls down to the &lt;code&gt;open(2)&lt;/code&gt; syscall immediately — the kernel resolves the path, walks the directory tree, loads the inode, checks permissions, and allocates a file descriptor. By the time &lt;code&gt;open()&lt;/code&gt; returns, the kernel has done real work. What it &lt;em&gt;hasn't&lt;/em&gt; done is read any data. The file is open. The bytes are still on disk.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;f.read()&lt;/code&gt; is where the data moves. Python's file object calls through its buffered I/O layer, which calls &lt;code&gt;read(2)&lt;/code&gt; — the actual POSIX syscall that triggers the chain of events this post is about.&lt;/p&gt;
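&lt;p&gt;You can see those two syscalls without Python's buffered I/O layer in the way by calling the thin &lt;code&gt;os&lt;/code&gt;-module wrappers directly — each of these is essentially one syscall (a quick sketch; the temp file is just scaffolding):&lt;/p&gt;

```python
import os
import tempfile

# Scaffolding: create a file to read.
fd, path = tempfile.mkstemp()
os.write(fd, b"raw bytes")
os.close(fd)

# The thin wrappers, one syscall each:
fd = os.open(path, os.O_RDONLY)   # open(2): path walk, inode load, fd allocation
data = os.read(fd, 4096)          # read(2): the subject of this post
os.close(fd)
os.remove(path)
print(data)
```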

&lt;p&gt;Now we're at the syscall boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Syscall — Crossing Into Kernel Space
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;read(2)&lt;/code&gt; syscall is the moment your program stops being in charge.&lt;/p&gt;

&lt;p&gt;Your code executes a special CPU instruction — &lt;code&gt;syscall&lt;/code&gt; on x86-64 — that switches the processor from user mode to kernel mode. The CPU saves your current register state, changes privilege level, and jumps to the kernel's syscall handler. Your process is now blocked, waiting. The kernel is driving.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyj17rm3rthoyqkd7tw2x.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyj17rm3rthoyqkd7tw2x.jpeg" alt="Syscall boundary — Ring 3 to Ring 0" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The kernel looks up the file descriptor you passed. &lt;a href="https://nazquadri.dev/blog/the-layer-below/03-file-descriptors/" rel="noopener noreferrer"&gt;File descriptors&lt;/a&gt; are integers — small ones, starting at 0. Behind that integer is a &lt;code&gt;struct file&lt;/code&gt; in kernel memory, containing things like the current read position, flags, and most importantly: a pointer to the VFS inode for this file.&lt;/p&gt;




&lt;h2&gt;
  
  
  The VFS — The Kernel Doesn't Know What a Filesystem Is
&lt;/h2&gt;

&lt;p&gt;It does, of course. But in the name of abstraction, the kernel doesn't operate in terms of ext4 or XFS at this level.&lt;/p&gt;

&lt;p&gt;The kernel has a &lt;strong&gt;Virtual Filesystem (VFS)&lt;/strong&gt; layer — an abstraction that sits above all actual filesystem implementations and defines a common interface. Every filesystem registers itself by providing a set of function pointers: here's how to look up a file by name, here's how to read an inode, here's how to iterate a directory.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;read(2)&lt;/code&gt; lands in the kernel, it calls through the VFS. The VFS looks at the inode for your file and calls that filesystem's &lt;code&gt;read&lt;/code&gt; function. If you're on ext4, the ext4 driver handles it. If you're on btrfs, btrfs handles it. Your process has no idea which one it is.&lt;/p&gt;

&lt;p&gt;This is why you can mount a USB drive formatted with FAT32 and &lt;code&gt;open()&lt;/code&gt; files on it with exactly the same Python code. Same syscall. Same VFS interface. Different driver underneath.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;read(fd)
  → kernel VFS layer
    → ext4_file_read_iter()   # or xfs_file_read_iter(), etc.
      → generic_file_read_iter()
        → page cache lookup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Page Cache — The Kernel's Memory is Not Your Memory
&lt;/h2&gt;

&lt;p&gt;Before any I/O happens, the kernel checks the &lt;strong&gt;page cache&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The page cache is a giant in-memory buffer the kernel maintains for file data. Every file read that came from disk is stored here. Every file write passes through here before hitting disk (in writeback mode). The page cache is the reason the second &lt;code&gt;read()&lt;/code&gt; of the same file is instant: the bytes are already sitting in kernel memory.&lt;/p&gt;

&lt;p&gt;Pages are 4KB chunks. The kernel asks: does the page cache contain the pages that cover the byte range this &lt;code&gt;read()&lt;/code&gt; asked for?&lt;/p&gt;

&lt;p&gt;If yes: copy the bytes from kernel memory into the userspace buffer your process provided. Done. No I/O. The whole chain from syscall to return took maybe 1–2 microseconds.&lt;/p&gt;

&lt;p&gt;If no: we need to go get them. This is where it gets interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why warm caches are fast and cold caches are slow.&lt;/strong&gt; A page cache hit skips everything below — the filesystem, the block layer, the NVMe controller, NAND sensing, DMA, all of it. Irrelevant when your bytes are already in kernel memory. This is why spinning up a long-running server process is slow at first — nothing is cached — and then blazing fast once the working set is warm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why &lt;code&gt;mmap()&lt;/code&gt; exists.&lt;/strong&gt; Mapping a file into virtual memory cuts out the copy step. The page cache pages are mapped directly into your process's virtual address space. When you access them, the CPU's MMU handles the translation. No &lt;code&gt;read()&lt;/code&gt; syscall, no copy into a userspace buffer — the page cache IS your buffer. For large files read once, this is often faster. For small files read repeatedly, the syscall overhead doesn't matter and &lt;code&gt;read()&lt;/code&gt; is fine. As an aside: when I first discovered &lt;code&gt;CreateFileMapping()&lt;/code&gt; (the Windoze equivalent of &lt;code&gt;mmap&lt;/code&gt;) back in the late 90s, I was amazed. I started using it for everything until one of my mentors pled with me to stop ... but I did not 😂.&lt;/p&gt;
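&lt;p&gt;Here's the mechanism in miniature — map a file and slice it like memory, no &lt;code&gt;read()&lt;/code&gt; per access (a small sketch; the temp file is just setup):&lt;/p&gt;

```python
import mmap
import os
import tempfile

# Setup: a small file to map.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello page cache")
os.close(fd)

f = open(path, "rb")
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Slicing the map touches the page cache pages directly
# through the MMU; there is no read() syscall per access.
data = m[0:5]

m.close()
f.close()
os.remove(path)
print(data)
```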

&lt;p&gt;&lt;strong&gt;That's why databases use &lt;code&gt;O_DIRECT&lt;/code&gt;.&lt;/strong&gt; PostgreSQL and other databases bypass the page cache entirely with &lt;code&gt;O_DIRECT&lt;/code&gt;, writing directly to the block layer. They maintain their own buffer pools and don't want the kernel caching their data twice. The kernel's cache is designed for general workloads; a database has better information about which pages to keep warm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrnbtvnagfd1ja34bnyc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrnbtvnagfd1ja34bnyc.jpeg" alt="Page cache — cache hit vs cache miss" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Filesystem's Job — Names Don't Mean Anything
&lt;/h2&gt;

&lt;p&gt;The filesystem translates the filename into block addresses.&lt;/p&gt;

&lt;p&gt;Inside the filesystem, files are identified by &lt;strong&gt;inodes&lt;/strong&gt;. The inode is a data structure that stores everything about a file except its name: permissions, timestamps, owner, size, and most importantly — where the data actually lives on disk.&lt;/p&gt;

&lt;p&gt;The filename is just a pointer to an inode. The inode contains block addresses. The block addresses are where the data is.&lt;/p&gt;

&lt;p&gt;For ext4, the inode uses a tree of "extent" records. An extent says "this file's data from byte offset X to offset Y is stored at block address Z on disk." Large files have multiple extents. Highly fragmented files have many extents that the filesystem has to chase down.&lt;/p&gt;

&lt;p&gt;For a small file written in one go, you might get one extent: "all your bytes are at block 12845." For a file that's been written in pieces over months, you might get dozens of extents spread across the disk.&lt;/p&gt;

&lt;p&gt;The filesystem resolves all of this and hands the block layer a list of logical block addresses to fetch.&lt;/p&gt;
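&lt;p&gt;The "names are just pointers to inodes" claim is directly observable: a hard link gives one inode a second name, and &lt;code&gt;os.stat()&lt;/code&gt; shows both names resolving to the same inode number (a small sketch using a throwaway directory):&lt;/p&gt;

```python
import os
import shutil
import tempfile

tmp = tempfile.mkdtemp()
a = os.path.join(tmp, "a.txt")
b = os.path.join(tmp, "b.txt")

with open(a, "w") as f:
    f.write("same bytes")

os.link(a, b)   # a second name for the same inode

# Two names, one inode: the filename is just a pointer.
same = os.stat(a).st_ino == os.stat(b).st_ino
print(same)

shutil.rmtree(tmp)
```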




&lt;h2&gt;
  
  
  The Block Layer — The Scheduler You Never Knew You Had
&lt;/h2&gt;

&lt;p&gt;Between the filesystem and the storage driver, there's a &lt;strong&gt;block I/O layer&lt;/strong&gt; with a scheduler.&lt;/p&gt;

&lt;p&gt;For traditional spinning hard drives, the scheduler tries to minimize seek time by reordering requests so the drive head doesn't thrash back and forth. For NVMe SSDs, the scheduler does something different: it batches requests and submits them in parallel. NVMe was designed for SSDs and supports up to 65,535 I/O queues, each up to 65,536 commands deep. The bottleneck isn't seek time; it's queue depth and controller parallelism.&lt;/p&gt;

&lt;p&gt;For NVMe on modern Linux, the default scheduler is often &lt;code&gt;none&lt;/code&gt; — the kernel trusts the device's own queuing. Check your system with &lt;code&gt;cat /sys/block/nvme0n1/queue/scheduler&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Your single &lt;code&gt;read()&lt;/code&gt; call might generate one I/O request. A larger read, or a file with many extents, might generate several requests, all in flight simultaneously.&lt;/p&gt;




&lt;h2&gt;
  
  
  The NVMe Controller — A Separate Computer
&lt;/h2&gt;

&lt;p&gt;Here's where most people's mental model breaks down completely.&lt;/p&gt;

&lt;p&gt;Your NVMe drive is not a passive block device. It's a computer.&lt;/p&gt;

&lt;p&gt;Inside your SSD, there's a dedicated &lt;strong&gt;NVMe controller&lt;/strong&gt;: an ARM (or MIPS, or RISC-V) processor running its own firmware, with its own RAM (typically up to 2GB for caching and FTL mapping tables on consumer drives; cheap drives use HMB — Host Memory Buffer — borrowing from your system RAM instead), its own instruction cache, its own operating loop. It boots when your machine powers on. It's running right now, independent of your CPU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbop8l9v0cl041urerh7l.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbop8l9v0cl041urerh7l.jpeg" alt="NVMe SSD exploded view" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The NVMe protocol is how your CPU's storage driver talks to this controller. When Linux wants to read blocks 12845–12850, it writes a command into a &lt;strong&gt;submission queue&lt;/strong&gt; in memory — a region that both the CPU and the NVMe controller can see, via PCIe. The controller polls or is notified of this command, picks it up, and starts processing it.&lt;/p&gt;

&lt;p&gt;The CPU posts the command and goes to sleep, waiting for a completion notification. The NVMe controller is now in charge.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Flash Translation Layer — Another Layer of Lies
&lt;/h2&gt;

&lt;p&gt;The NVMe controller doesn't directly address NAND flash cells. There's one more indirection: the &lt;strong&gt;Flash Translation Layer (FTL)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;NAND flash has two deeply inconvenient properties that make it unsuitable for direct use as a block device:&lt;/p&gt;

&lt;p&gt;First, you can only write to a NAND cell by erasing it first. Erasing happens in "erase blocks" — units typically 128KB to several megabytes. You can't erase individual bytes. If you want to update 4KB of data in the middle of an erase block, you have to read the whole erase block, erase it, modify the relevant part, and write it all back.&lt;/p&gt;

&lt;p&gt;Second, NAND cells wear out. Each erase cycle damages the cell's insulator a little. Consumer NAND is typically rated for 1,000–3,000 program-erase cycles. After that, the cell starts to lose charge and eventually holds incorrect data. If you wrote to the same physical location every time, it would wear out while the rest of the drive was fresh.&lt;/p&gt;

&lt;p&gt;The FTL solves both problems. It maintains a mapping table — a giant lookup table in the controller's RAM — that maps the &lt;strong&gt;logical block address&lt;/strong&gt; the host sees to the &lt;strong&gt;physical block address&lt;/strong&gt; where data actually lives in the NAND. Every write goes to a fresh location, and the FTL updates the mapping. The old physical location is marked as invalid, to be reclaimed during garbage collection.&lt;/p&gt;
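&lt;p&gt;A toy model makes the mapping-table behavior concrete — every rewrite lands on a fresh physical page and the old one becomes garbage to collect later (purely illustrative; a real FTL adds wear-leveling, caching, and persistence):&lt;/p&gt;

```python
# Toy Flash Translation Layer: logical block -> physical page mapping.
class ToyFTL:
    def __init__(self, pages):
        self.mapping = {}            # logical block -> physical page
        self.free = list(range(pages))
        self.invalid = set()         # stale pages awaiting garbage collection

    def write(self, lba):
        old = self.mapping.get(lba)
        if old is not None:
            self.invalid.add(old)    # never overwrite NAND in place
        page = self.free.pop(0)      # always write to a fresh page
        self.mapping[lba] = page
        return page

ftl = ToyFTL(pages=8)
ftl.write(12845)      # first write lands on physical page 0
ftl.write(12845)      # rewrite goes to page 1; page 0 is now invalid
print(ftl.mapping[12845], ftl.invalid)
```

TRIM, in this model, is the host telling the FTL to move a logical block's page straight into &lt;code&gt;invalid&lt;/code&gt; instead of the FTL having to guess.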

&lt;p&gt;&lt;strong&gt;That's why TRIM matters.&lt;/strong&gt; Without TRIM, the FTL doesn't know which logical blocks are no longer in use. It has to do garbage collection under load, pausing your writes while it erases blocks it &lt;em&gt;thinks&lt;/em&gt; might be reclaimable. The OS tells the drive what it deleted. The drive's FTL updates its tables. Without TRIM, your SSD gets slower over time as it accumulates "dead" mappings it can't safely reclaim.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why SSDs slow down near full.&lt;/strong&gt; The FTL runs out of fresh blocks and has to garbage-collect under load. Write amplification increases, and your writes start waiting on internal erase cycles. "90% full" means the FTL has 10% of its physical space to maneuver in.&lt;/p&gt;

&lt;p&gt;For a read, the FTL translates: "you want logical block 12845" → "that's at physical NAND page 0xAF3C14." Now the controller can address the NAND directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  NAND Sensing — Bits Are Charges
&lt;/h2&gt;

&lt;p&gt;NAND flash stores bits as electrical charge on a floating gate — a conductor surrounded by insulating oxide, sandwiched inside a transistor.&lt;/p&gt;

&lt;p&gt;A programmed cell holds trapped electrons on the floating gate, which raises the transistor's threshold voltage; it reads as a 0. An erased cell holds little or no charge and reads as a 1. Sensing the charge means applying a reference voltage to the gate and measuring whether the cell conducts. The reference voltage is carefully calibrated — and for MLC or TLC NAND (2 or 3 bits per cell), there are multiple voltage thresholds to distinguish, because each cell holds multiple charge levels representing different bit patterns.&lt;/p&gt;
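&lt;p&gt;For intuition, here's a toy MLC read: one cell stores 2 bits as one of four charge levels, and sensing amounts to finding which reference voltages the cell's threshold exceeds. The voltages and the Gray-coded mapping below are illustrative numbers, not any real part's datasheet:&lt;/p&gt;

```python
import bisect

REFS = [1.0, 2.0, 3.0]            # hypothetical reference voltages
GRAY = ["11", "10", "00", "01"]   # illustrative Gray-coded level-to-bits map

def sense(vth):
    # bisect counts how many reference voltages the cell's
    # threshold voltage exceeds; that count is the charge level.
    return GRAY[bisect.bisect(REFS, vth)]

print(sense(0.5), sense(2.5))   # erased-ish cell, then a mid-level cell
```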

&lt;p&gt;The NAND controller reads a full page at once — typically 4KB to 16KB. It applies the sense voltage across thousands of cells in parallel, latches the results into a page register, and then applies &lt;strong&gt;ECC (Error Correcting Code)&lt;/strong&gt; to detect and correct any errors.&lt;/p&gt;

&lt;p&gt;NAND cells are lossy. Fresh cells might have error rates of 1 bit per billion. Near end-of-life, that might be 1 bit per thousand. ECC is mandatory — without it, you'd get bit flips constantly.&lt;/p&gt;

&lt;p&gt;After ECC, you have the corrected page data in the controller's page register. The relevant portion is your file's blocks.&lt;/p&gt;




&lt;h2&gt;
  
  
  DMA — The CPU Doesn't Touch the Data
&lt;/h2&gt;

&lt;p&gt;When the NVMe controller transfers data from its page register to your system RAM, your CPU doesn't touch the data. You read that right. And it's not some exotic server-grade hardware optimisation; it works on that 2018 14-inch laptop you still carry around.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct Memory Access (DMA)&lt;/strong&gt; lets the NVMe controller (and other peripherals) write directly to system RAM over the PCIe bus, without interrupting the CPU. The DMA engine is a separate piece of hardware — often on the CPU die, but logically separate — that handles these memory transfers while the CPU does other things (or sleeps, in this case, since your process is blocked).&lt;/p&gt;

&lt;p&gt;The NVMe controller was told the physical address of the DMA target buffer when Linux submitted the I/O command. Now it uses the PCIe bus's memory write transactions to transfer the data directly into that buffer. No CPU instruction reads any of those bytes.&lt;/p&gt;

&lt;p&gt;The data travels: NAND → controller page register → PCIe bus → DMA engine → system RAM.&lt;/p&gt;

&lt;p&gt;Your CPU set up the transfer and will be notified when it's done. That's it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fai6c781nm7pwtl25sw8c.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fai6c781nm7pwtl25sw8c.jpeg" alt="NAND to system RAM — the CPU sleeps while DMA works" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Interrupt — MSI-X
&lt;/h2&gt;

&lt;p&gt;When the DMA transfer completes, the NVMe controller needs to tell the CPU.&lt;/p&gt;

&lt;p&gt;It does this with an &lt;strong&gt;MSI-X interrupt&lt;/strong&gt; — a "Message Signaled Interrupt eXtended." Instead of a physical signal on an interrupt pin (the old way), the controller writes a small message to a specific memory address. The interrupt controller (usually the APIC on x86 systems — another separate piece of hardware) sees this write and delivers the interrupt to the appropriate CPU core.&lt;/p&gt;

&lt;p&gt;The CPU's interrupt handler wakes up, runs the NVMe completion handler, marks the I/O as complete, and wakes the process that was blocked waiting for this data.&lt;/p&gt;

&lt;p&gt;MSI-X is worth naming because it enables one of the key performance features of NVMe: &lt;strong&gt;multiple independent queues mapped to different CPU cores&lt;/strong&gt;. Old storage interrupts all went to one CPU, which then had to fan out work. With MSI-X, each NVMe queue can interrupt a different core. The NVMe controller talks directly to all the CPUs in parallel.&lt;/p&gt;




&lt;h2&gt;
  
  
  Back in the Kernel — Assembly
&lt;/h2&gt;

&lt;p&gt;The page cache is now populated with the data from disk. The kernel copies the bytes from the page cache into the userspace buffer your process provided in the &lt;code&gt;read()&lt;/code&gt; syscall. The syscall returns.&lt;/p&gt;

&lt;p&gt;Your process unblocks. The return value is the number of bytes read. Control returns to Python's &lt;code&gt;f.read()&lt;/code&gt; implementation, then to your code.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data&lt;/code&gt; has your bytes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Processor Count
&lt;/h2&gt;

&lt;p&gt;Let's count the processors involved in that single &lt;code&gt;read()&lt;/code&gt; call:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Your CPU&lt;/strong&gt; — executed the syscall, set up DMA, blocked, received the interrupt, ran the completion handler, copied bytes to userspace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The NVMe controller&lt;/strong&gt; — ARM/RISC-V SoC inside the drive, executed the FTL lookup, sent NAND commands, supervised the DMA transfer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The NAND flash controller&lt;/strong&gt; — embedded within or tightly coupled to the NAND chips, handled the page read, applied ECC, transferred data to the controller's page register&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The interrupt controller (APIC)&lt;/strong&gt; — delivered the MSI-X completion interrupt to the correct CPU core&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Four processors, minimum. More if you count the DMA engine as a separate compute element, which it arguably is.&lt;/p&gt;

&lt;p&gt;For one &lt;code&gt;read()&lt;/code&gt; call. That took 0.3 milliseconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;We started with two lines of Python. Here's what they were hiding: the VFS, a filesystem driver, a block scheduler, an NVMe command queue, a Flash Translation Layer, NAND physics, DMA hardware, and interrupt routing.&lt;/p&gt;

&lt;p&gt;None of those layers are particularly complicated on their own. It's the stack of them, the fact that each one is a complete independent system with its own logic and failure modes, that makes "reading a file" more interesting than it looks.&lt;/p&gt;

&lt;p&gt;The next time you hit an unexpected latency spike on a read, or notice that a fresh server is slower than a warmed-up one, or wonder why SSDs slow down when they're 90% full — you know where to look.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://man7.org/linux/man-pages/man2/read.2.html" rel="noopener noreferrer"&gt;&lt;code&gt;man 2 read&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — The read syscall. Start here, then follow the rabbit hole.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html" rel="noopener noreferrer"&gt;Linux Kernel Documentation: The Page Cache&lt;/a&gt;&lt;/strong&gt; — Authoritative, dense, worth it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.anandtech.com/show/9603/inside-the-ssd-revolution-3d-nand-and-the-future-of-flash/2" rel="noopener noreferrer"&gt;Flash Memory Guide, AnandTech&lt;/a&gt;&lt;/strong&gt; — Excellent deep dive into NAND internals, written for engineers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/linux-nvme/nvme-cli" rel="noopener noreferrer"&gt;&lt;code&gt;nvme list&lt;/code&gt; / &lt;code&gt;nvme smart-log /dev/nvme0&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — Poke at your own NVMe controller. The SMART data it returns is illuminating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://nvmexpress.org/specifications/" rel="noopener noreferrer"&gt;NVM Express Base Specification&lt;/a&gt;&lt;/strong&gt; — The actual protocol spec. Freely available. Section 3 covers the submission/completion queue model.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri occasionally stares at perf trace output and wonders how many processors it takes to screw in a lightbulb. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Parallel Lanes Nobody Uses</title>
      <dc:creator>Naz Quadri</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:00:29 +0000</pubDate>
      <link>https://forem.com/nazq/the-parallel-lanes-nobody-uses-1n35</link>
      <guid>https://forem.com/nazq/the-parallel-lanes-nobody-uses-1n35</guid>
      <description>&lt;h1&gt;
  
  
  The Parallel Lanes Nobody Uses
&lt;/h1&gt;

&lt;h2&gt;
  
  
  SIMD and the Eight-Lane Highway You've Been Driving Solo
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reading time: ~13 minutes&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You ran ripgrep across a 2GB log file and it finished in half a second. grep would have taken ten. You called &lt;code&gt;np.array * 2&lt;/code&gt; and it finished before the function call overhead had time to register.&lt;/p&gt;

&lt;p&gt;Here's what actually happened: your CPU has 256-bit registers that can process 8 floats simultaneously. Those tools used all eight lanes of an eight-lane highway. Your Python for-loop uses one.&lt;/p&gt;

&lt;p&gt;This is what your CPU can actually do.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fundamental Idea
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SIMD&lt;/strong&gt; stands for Single Instruction, Multiple Data. It's not a clever trick. It's a first-class feature of every CPU you've used in the last twenty years.&lt;/p&gt;

&lt;p&gt;The idea is direct. A normal CPU instruction operates on one value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ADD rax, rbx      # add one 64-bit integer to one other 64-bit integer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A SIMD instruction operates on a packed vector of values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VADDPS ymm0, ymm1, ymm2   # add eight 32-bit floats at once
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Eight additions. One instruction. One cycle of throughput — the add itself takes a few cycles of latency, but the core can issue a new one (often two) every cycle.&lt;/p&gt;

&lt;p&gt;The register &lt;code&gt;ymm0&lt;/code&gt; is 256 bits wide. You pack 8 floats (each 32 bits) into it and treat the whole thing as a single operand. The arithmetic unit is physically wider — eight adders in parallel — and the instruction wires them all to fire simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr17g808ymejybtwzzifp.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr17g808ymejybtwzzifp.jpeg" alt="Scalar vs SIMD — 8 instructions vs 1 instruction" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not a metaphor. It's silicon.&lt;/p&gt;




&lt;h2&gt;
  
  
  How We Got Here: The Register Zoo
&lt;/h2&gt;

&lt;p&gt;The story of SIMD is a story of Intel and AMD racing to add bigger and bigger registers while pretending backward compatibility wasn't getting worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MMX (1996)&lt;/strong&gt; — Intel introduced the first SIMD extension in the Pentium MMX. Eight 64-bit registers (&lt;code&gt;mm0&lt;/code&gt;–&lt;code&gt;mm7&lt;/code&gt;) for integer operations. The catch: those registers were aliased to the &lt;em&gt;mantissa fields&lt;/em&gt; of the x87 ST(0)–ST(7) floating-point registers. Switching between MMX and x87 FP required executing &lt;code&gt;EMMS&lt;/code&gt; to reset the x87 tag word first. (I'm simplifying the aliasing here — the full story involves how x87 tracks "empty" register slots.) Programmers used it. Suffered for it. Moved on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSE (1999)&lt;/strong&gt; — Streaming SIMD Extensions. Eight new 128-bit registers (&lt;code&gt;xmm0&lt;/code&gt;–&lt;code&gt;xmm7&lt;/code&gt;), finally independent of the FPU stack. Supported 4 single-precision floats or integer variants. Used heavily for 3D graphics and audio in the early 2000s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSE2 (2001)&lt;/strong&gt; — Added double-precision floats and 128-bit integer operations. x86-64 made SSE2 mandatory, so as of 64-bit mode you can assume it exists. This is the baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSE3, SSSE3, SSE4.1, SSE4.2 (2004–2007)&lt;/strong&gt; — A string of incremental additions. String comparison instructions, dot products, population counts. Useful but baroque. The naming got embarrassing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AVX (2011)&lt;/strong&gt; — Intel widened the registers to 256 bits (&lt;code&gt;ymm0&lt;/code&gt;–&lt;code&gt;ymm15&lt;/code&gt;). Now you could do 8 floats or 4 doubles at once. The &lt;code&gt;ymm&lt;/code&gt; registers are actually the full-width versions of the &lt;code&gt;xmm&lt;/code&gt; registers — &lt;code&gt;xmm0&lt;/code&gt; is the lower 128 bits of &lt;code&gt;ymm0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AVX2 (2013)&lt;/strong&gt; — Extended AVX to integer operations and added gather instructions (load scattered values from memory into a vector register). Available on Intel Haswell and later, AMD Ryzen. This is the register set most production code targets today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AVX-512 (2017)&lt;/strong&gt; — 512-bit registers (&lt;code&gt;zmm0&lt;/code&gt;–&lt;code&gt;zmm31&lt;/code&gt;). 16 floats or 8 doubles at once. Intel pushed this hard in server chips; it's common in the data center. Desktop support is inconsistent — Intel disabled AVX-512 on Alder Lake because its efficiency cores don't implement the instructions, so a thread migrating from a P-core to an E-core mid-computation would fault; early firmware let you re-enable it with the E-cores turned off, and later steppings fused it off entirely. (The instructions are also power-hungry — early server parts downclocked noticeably under sustained 512-bit workloads.) AMD added AVX-512 starting with Zen 4. The instruction set is 300+ pages of documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8pg5z82lmqrqsoe8566.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8pg5z82lmqrqsoe8566.jpeg" alt="SIMD register width evolution — MMX to AVX-512" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The registers kept doubling. The theoretical throughput kept doubling. Most application code never noticed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the Compiler Sometimes Does This For You
&lt;/h2&gt;

&lt;p&gt;Modern compilers — GCC, Clang, MSVC, and &lt;code&gt;rustc&lt;/code&gt; (which uses LLVM) — can &lt;strong&gt;auto-vectorize&lt;/strong&gt; loops. This is when the compiler looks at your scalar loop and emits SIMD instructions for it without you asking.&lt;/p&gt;

&lt;p&gt;This works well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The loop has no data dependencies between iterations (iteration N doesn't use the result of iteration N-1)&lt;/li&gt;
&lt;li&gt;The data is contiguous in memory (array, not linked list)&lt;/li&gt;
&lt;li&gt;The compiler can prove there's no aliasing (the input and output arrays don't overlap)&lt;/li&gt;
&lt;li&gt;The trip count is known or the compiler can generate a scalar fallback for the remainder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple sum-of-squares is a textbook case the compiler handles automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;sum_squares&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compile with &lt;code&gt;--release&lt;/code&gt; targeting AVX2 and... the multiply vectorizes (&lt;code&gt;vmulps&lt;/code&gt;) but the sum stays scalar (&lt;code&gt;vaddss&lt;/code&gt;). Wait, what?&lt;/p&gt;

&lt;p&gt;Floating-point addition isn't associative — &lt;code&gt;(a + b) + c&lt;/code&gt; can give a different result from &lt;code&gt;a + (b + c)&lt;/code&gt; due to rounding. The compiler won't reorder your additions without permission, which means it can't pack 8 sums into a single &lt;code&gt;vaddps&lt;/code&gt;. Switch to integers and the story changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;sum_squares_i32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you get &lt;code&gt;vpmulld&lt;/code&gt; and &lt;code&gt;vpaddd&lt;/code&gt; on &lt;code&gt;ymm&lt;/code&gt; registers — 8 integers at once, fully vectorized. Integer addition is associative, so LLVM can reorder freely. &lt;a href="https://rust.godbolt.org/z/Y694e8jvr" rel="noopener noreferrer"&gt;See both versions side by side on Compiler Explorer →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the kind of thing that makes auto-vectorization both powerful and frustrating. The compiler is doing the right thing — it won't change your program's semantics — but it means the "just write clean code and the compiler will vectorize it" advice has a large asterisk on it.&lt;/p&gt;
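&lt;p&gt;If you do want a vectorized float sum without telling the compiler to relax floating-point semantics globally, the standard workaround is to reassociate by hand: keep several independent partial sums so there is no single dependency chain. A minimal sketch — note this changes the rounding order, so results can differ from a strict left-to-right sum in the last bits:&lt;/p&gt;

```rust
/// Sum of squares with four independent accumulators.
/// Manually reassociating gives LLVM room to keep the partial
/// sums in separate vector lanes: no accumulator depends on
/// another, so the additions can proceed in parallel.
pub fn sum_squares_unrolled(a: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let chunks = a.chunks_exact(4);
    let rem = chunks.remainder(); // tail when len % 4 != 0
    for c in chunks {
        acc[0] += c[0] * c[0];
        acc[1] += c[1] * c[1];
        acc[2] += c[2] * c[2];
        acc[3] += c[3] * c[3];
    }
    let mut total = acc[0] + acc[1] + acc[2] + acc[3];
    for &x in rem {
        total += x * x;
    }
    total
}

fn main() {
    // 1 + 4 + 9 + 16 + 25 = 55
    println!("{}", sum_squares_unrolled(&[1.0, 2.0, 3.0, 4.0, 5.0]));
}
```

&lt;p&gt;Production kernels widen this to 8 or 16 accumulators; the compiler does the same thing internally when you pass it a fast-math flag, which is exactly why that flag changes results.&lt;/p&gt;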

&lt;p&gt;This breaks down further the moment things get complicated. Add a branch inside the loop: the compiler has to use masked operations or give up. Use a data structure it can't prove is contiguous: it has to generate both a vectorized path and a scalar fallback, with a runtime check. Access non-contiguous memory: it has to use gather instructions, which are slower than you'd hope. Add any function call it can't inline: it bails entirely.&lt;/p&gt;

&lt;p&gt;Rust's ownership model actually helps here — slices guarantee contiguous memory and the borrow checker proves non-aliasing at compile time. That's information the auto-vectorizer can use. In C, the compiler has to assume two &lt;code&gt;float*&lt;/code&gt; arguments might alias unless you annotate with &lt;code&gt;restrict&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The compiler's auto-vectorizer is optimistic but conservative. You can inspect the emitted SIMD with &lt;code&gt;cargo rustc --release -- --emit asm&lt;/code&gt;, or use &lt;a href="https://godbolt.org/" rel="noopener noreferrer"&gt;Compiler Explorer&lt;/a&gt; to see exactly what LLVM generated. Read that output. It's educational in a way that is sometimes painful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Intrinsics: Taking the Wheel
&lt;/h2&gt;

&lt;p&gt;When auto-vectorization isn't enough, you can write SIMD code directly using &lt;strong&gt;intrinsics&lt;/strong&gt; — functions in Rust's &lt;code&gt;std::arch&lt;/code&gt; module that map one-to-one to specific CPU instructions.&lt;/p&gt;

&lt;p&gt;This is not assembly. You're still writing Rust. You're just telling the compiler exactly which instruction to emit. The ISA-specific code lives inside &lt;code&gt;unsafe&lt;/code&gt; blocks, making it explicit where you're stepping outside the compiler's guarantees:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[cfg(target_arch&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"x86_64"&lt;/span&gt;&lt;span class="nd"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;arch&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;x86_64&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="cd"&gt;/// Add two float slices element-wise using AVX.&lt;/span&gt;
&lt;span class="cd"&gt;/// Handles lengths that aren't a multiple of 8 with a scalar tail.&lt;/span&gt;
&lt;span class="nd"&gt;#[target_feature(enable&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"avx"&lt;/span&gt;&lt;span class="nd"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;unsafe&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;add_arrays&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="nf"&gt;.min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;va&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_mm256_loadu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="nf"&gt;.as_ptr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;   &lt;span class="c1"&gt;// load 8 floats&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;vb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_mm256_loadu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="nf"&gt;.as_ptr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;   &lt;span class="c1"&gt;// load 8 floats&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;vc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_mm256_add_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;va&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vb&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                &lt;span class="c1"&gt;// add all 8&lt;/span&gt;
        &lt;span class="nf"&gt;_mm256_storeu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="nf"&gt;.as_mut_ptr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;vc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// store 8 floats&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// scalar tail for remainder (if n % 8 != 0)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;__m256&lt;/code&gt; type is a 256-bit vector. &lt;code&gt;_mm256_loadu_ps&lt;/code&gt; loads 8 unaligned single-precision floats. &lt;code&gt;_mm256_add_ps&lt;/code&gt; adds them. One call, one instruction. The &lt;code&gt;#[target_feature(enable = "avx")]&lt;/code&gt; attribute tells the compiler this function requires AVX — calling it on hardware without AVX is undefined behavior, which is why the function is &lt;code&gt;unsafe&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Intrinsics code is not fun to write. The naming convention (&lt;code&gt;_mm256_loadu_ps&lt;/code&gt; vs &lt;code&gt;_mm256_load_ps&lt;/code&gt; vs &lt;code&gt;_mm512_loadu_ps&lt;/code&gt;) requires memorizing a taxonomy. The &lt;a href="https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html" rel="noopener noreferrer"&gt;Intel Intrinsics Guide&lt;/a&gt; is the reference — it lists every intrinsic, the instruction it maps to, the latency, and the throughput. You'll spend time there.&lt;/p&gt;

&lt;p&gt;The upside over C: Rust's type system catches width mismatches at compile time. If you accidentally pass an &lt;code&gt;__m128&lt;/code&gt; where an &lt;code&gt;__m256&lt;/code&gt; is expected, that's a type error, not a silent runtime bug. The &lt;code&gt;unsafe&lt;/code&gt; boundary also makes it easy to audit — every line that touches raw SIMD is visually contained.&lt;/p&gt;

&lt;p&gt;For a higher-level alternative, Rust's portable SIMD API (&lt;code&gt;std::simd&lt;/code&gt;) provides type-safe, architecture-independent vector types like &lt;code&gt;f32x8&lt;/code&gt;. It's available on nightly and progressing toward stable. When it lands, it will be the preferred way to write explicit SIMD without &lt;code&gt;unsafe&lt;/code&gt; or platform-specific intrinsics.&lt;/p&gt;
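&lt;p&gt;Until &lt;code&gt;std::simd&lt;/code&gt; stabilizes, a rough stand-in on stable Rust is a fixed-size array processed elementwise — compilers commonly lower loops like this to single vector instructions. A sketch; the vectorization claim depends on &lt;code&gt;--release&lt;/code&gt; and the target features, so check the assembly rather than taking my word for it:&lt;/p&gt;

```rust
/// A stable-Rust stand-in for nightly `std::simd::f32x8`: a fixed-size
/// array updated elementwise. The trip count (8) is a compile-time
/// constant with no branches or aliasing, so with --release and AVX2
/// enabled LLVM typically lowers this whole loop to one vaddps.
fn add_lanes(a: [f32; 8], b: [f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

fn main() {
    let sum = add_lanes([1.0; 8], [2.0; 8]);
    println!("{:?}", sum); // every lane is 3.0
}
```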

&lt;p&gt;Most application programmers don't write intrinsics. But the programmers who write the libraries you depend on — numpy, simdjson, ripgrep — absolutely do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where SIMD Actually Lives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  String Search
&lt;/h3&gt;

&lt;p&gt;Finding a byte in a buffer. You do it constantly, you never think about it, and it's the single operation where SIMD makes the most visceral difference. A naive loop checks one byte at a time. SIMD checks 32 at a time: &lt;code&gt;_mm256_cmpeq_epi8&lt;/code&gt; compares 32 bytes simultaneously, and &lt;code&gt;_mm256_movemask_epi8&lt;/code&gt; condenses the result into a 32-bit mask of which positions matched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;memchr&lt;/code&gt;&lt;/strong&gt; — the fundamental byte-search operation — is implemented with SIMD at every level: glibc's C implementation, and Rust's &lt;code&gt;memchr&lt;/code&gt; crate (which we'll get to in a moment). The function you call every day is already vectorized.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ripgrep&lt;/strong&gt; is fast partly because of SIMD-accelerated &lt;code&gt;memchr&lt;/code&gt;. The &lt;a href="https://github.com/BurntSushi/memchr" rel="noopener noreferrer"&gt;memchr crate&lt;/a&gt; by Andrew Gallant implements &lt;code&gt;memchr&lt;/code&gt; and &lt;code&gt;memmem&lt;/code&gt; using SSE2 and AVX2. The core idea for substring search is a SIMD &lt;strong&gt;prefilter&lt;/strong&gt;: use vector compares to find candidate positions in bulk, then verify each candidate. (The related Teddy algorithm, in the companion aho-corasick crate, applies the same idea to multiple patterns at once.) When ripgrep is blazing through a 2GB log file, it's pushing 32 bytes at a time through vectorized comparisons. This is why it outperforms grep by 5–10x on many workloads. It's not magic. It's lanes.&lt;/p&gt;

&lt;p&gt;That's also why string search benchmarks look bizarre to anyone who hasn't seen SIMD before. A loop that calls &lt;code&gt;find&lt;/code&gt; in a hot path and a SIMD-accelerated version can differ by 8x with identical O() complexity. The algorithm doesn't tell you the constant factor.&lt;/p&gt;
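&lt;p&gt;The compare-many-bytes-and-extract-a-mask idea doesn't even require vector registers. Here's a sketch of the classic SWAR ("SIMD within a register") zero-byte trick on a plain &lt;code&gt;u64&lt;/code&gt; — the same shape that fallback paths in &lt;code&gt;memchr&lt;/code&gt;-style implementations use, shown here as an illustration rather than anyone's actual production code:&lt;/p&gt;

```rust
use std::convert::TryInto;

/// SWAR byte search: test 8 bytes per iteration. XOR with the spread
/// needle turns matching bytes into 0x00, then the classic has-zero
/// trick `(x - 0x01..) & !x & 0x80..` sets the high bit of every zero
/// lane. Borrow propagation can also flag lanes *above* the first
/// match, but the lowest set bit is always a true match, and that's
/// the one trailing_zeros finds.
fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    let spread = u64::from_le_bytes([needle; 8]); // needle in all 8 lanes
    let mut i = 0;
    while i + 8 <= haystack.len() {
        let w = u64::from_le_bytes(haystack[i..i + 8].try_into().unwrap());
        let x = w ^ spread; // matching lanes become zero
        let found = x.wrapping_sub(LO) & !x & HI;
        if found != 0 {
            return Some(i + (found.trailing_zeros() / 8) as usize);
        }
        i += 8;
    }
    // scalar tail for the last len % 8 bytes
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}

fn main() {
    println!("{:?}", find_byte(b"the quick brown fox", b'q')); // Some(4)
}
```

&lt;p&gt;Same constant-factor story in miniature: identical O(n), eight lanes instead of one.&lt;/p&gt;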

&lt;h3&gt;
  
  
  JSON Parsing
&lt;/h3&gt;

&lt;p&gt;In 2019 Geoff Langdale and Daniel Lemire published a &lt;a href="https://arxiv.org/abs/1902.08318" rel="noopener noreferrer"&gt;paper&lt;/a&gt; showing that JSON parsing is fundamentally a SIMD problem, giving birth to &lt;strong&gt;simdjson&lt;/strong&gt;. The bottleneck in parsing isn't the logic — it's scanning through bytes looking for structural characters (&lt;code&gt;{&lt;/code&gt;, &lt;code&gt;}&lt;/code&gt;, &lt;code&gt;[&lt;/code&gt;, &lt;code&gt;]&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;,&lt;/code&gt;, &lt;code&gt;"&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;simdjson processes 64 bytes at a time using AVX-512 (or 32 with AVX2). It classifies every byte simultaneously — is this a structural character? Whitespace? A quote? — using bitwise SIMD operations to produce bitmasks. Then it uses those bitmasks to drive parsing without a byte-at-a-time loop.&lt;/p&gt;
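&lt;p&gt;The bitmask-driven idea is easier to see in scalar code. This sketch is a deliberately simplified classifier — it ignores strings and escaping, which real simdjson handles with additional masks — but the shape is the same: one bit per byte, then walk the set bits instead of re-scanning:&lt;/p&gt;

```rust
/// Stage-one sketch: build a bitmask over a block of up to 64 input
/// bytes, bit i set iff byte i is a JSON structural character.
/// (simdjson builds these masks with vector compares, 32-64 bytes
/// per instruction; here a plain loop, for clarity.)
fn structural_mask(block: &[u8]) -> u64 {
    assert!(block.len() <= 64);
    let mut mask = 0u64;
    for (i, &b) in block.iter().enumerate() {
        if matches!(b, b'{' | b'}' | b'[' | b']' | b':' | b',' | b'"') {
            mask |= 1 << i;
        }
    }
    mask
}

/// The parser then iterates set bits — trailing_zeros plus
/// clear-lowest-bit — visiting only structural positions.
fn structural_positions(block: &[u8]) -> Vec<usize> {
    let mut mask = structural_mask(block);
    let mut out = Vec::new();
    while mask != 0 {
        out.push(mask.trailing_zeros() as usize);
        mask &= mask - 1; // clear lowest set bit
    }
    out
}

fn main() {
    println!("{:?}", structural_positions(br#"{"a":1}"#)); // [0, 1, 3, 4, 6]
}
```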

&lt;p&gt;The result: simdjson parses JSON at 2–3 GB/s on a modern CPU. The fastest pure-scalar parser does maybe 300–500 MB/s. The 6x difference is entirely SIMD.&lt;/p&gt;

&lt;p&gt;That's why simdjson exists. That's why it's in MongoDB, ClickHouse, and dozens of other systems that care about throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  Image Processing
&lt;/h3&gt;

&lt;p&gt;Every pixel is independent. Every channel is independent. This is SIMD's dream workload — no data dependencies, no branches, just arithmetic on contiguous arrays of bytes. SSE2 processes 16 8-bit channel values at once with saturating addition (&lt;code&gt;u8x16::saturating_add&lt;/code&gt; in portable SIMD). OpenCV, libjpeg-turbo, libpng — they all have SIMD paths for their hot loops. When Photoshop applies a filter to a 24-megapixel image in under a second, this is why.&lt;/p&gt;

&lt;h3&gt;
  
  
  ML Inference
&lt;/h3&gt;

&lt;p&gt;This is the one that matters most right now.&lt;/p&gt;

&lt;p&gt;Neural network inference is fundamentally matrix multiplication: take a weight matrix, multiply by an input vector, pass through an activation function. Repeat. The core operation — multiply-accumulate on large matrices — is exactly what SIMD was built for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AVX2's fused multiply-add&lt;/strong&gt; (&lt;code&gt;_mm256_fmadd_ps&lt;/code&gt; via &lt;code&gt;std::arch&lt;/code&gt;, or &lt;code&gt;f32x8::mul_add&lt;/code&gt; in portable SIMD) does a*b + c on 8 floats in one instruction. For a naive matrix multiply loop, this is an 8x multiplier before you've thought about anything else. Add tiling for cache efficiency and you're in the range of what high-performance BLAS libraries actually do.&lt;/p&gt;
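&lt;p&gt;In Rust you can reach FMA without intrinsics: &lt;code&gt;f32::mul_add&lt;/code&gt; computes &lt;code&gt;a * b + c&lt;/code&gt; with a single rounding and compiles to an FMA instruction when the target supports one. A sketch — note the sequential fold is a dependency chain, so this demonstrates the instruction, not vectorization; real kernels combine FMA with multiple independent accumulators, as in the sum example earlier:&lt;/p&gt;

```rust
/// Dot product via fused multiply-add. f32::mul_add(x, y, acc)
/// computes x * y + acc with one rounding step; on an FMA-capable
/// target it lowers to a single instruction (without one it falls
/// back to a libm call, which is slower than separate mul + add).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .fold(0.0f32, |acc, (&x, &y)| x.mul_add(y, acc))
}

fn main() {
    // 1*4 + 2*5 + 3*6 = 32
    println!("{}", dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]));
}
```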

&lt;p&gt;&lt;strong&gt;AVX-512 with VNNI&lt;/strong&gt; (Vector Neural Network Instructions, 2019) goes further — it adds instructions specifically for quantized integer dot products used in 8-bit inference. A single &lt;code&gt;vpdpbusd&lt;/code&gt; instruction (exposed as &lt;code&gt;_mm512_dpbusd_epi32&lt;/code&gt; in intrinsics) performs 64 8-bit multiply-accumulates — four per 32-bit lane, across 16 lanes — in one instruction. llama.cpp, the library that lets you run large language models on consumer hardware, has hand-written AVX2 and AVX-512 kernels for its matrix multiplication. When you run a local model on your laptop, those kernels are running in tight loops for every token you generate.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mindset Shift
&lt;/h2&gt;

&lt;p&gt;Here's the insight that changes how you write code even if you never touch an intrinsic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SIMD forces you to think in batches, not items.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scalar code says: "for each element, do this." SIMD code says: "take 8 elements, do this to all of them at once, advance 8." The data structure implications are real.&lt;/p&gt;

&lt;h3&gt;
  
  
  Arrays of Structures vs Structures of Arrays
&lt;/h3&gt;

&lt;p&gt;Consider a particle system. You might model it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Particle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// position&lt;/span&gt;
    &lt;span class="n"&gt;vx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vz&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// velocity&lt;/span&gt;
    &lt;span class="n"&gt;mass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;particles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Particle&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_capacity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;strong&gt;AoS&lt;/strong&gt; — Array of Structures. Each particle's data is packed together. Intuitive. Natural.&lt;/p&gt;

&lt;p&gt;The goal: update all x positions — &lt;code&gt;x += vx * dt&lt;/code&gt; — for every particle.&lt;/p&gt;

&lt;p&gt;The problem: consecutive &lt;code&gt;x&lt;/code&gt; values are 28 bytes apart — one full struct stride. When you load a SIMD vector of 8 &lt;code&gt;x&lt;/code&gt; values, you also pull in &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;z&lt;/code&gt;, &lt;code&gt;vx&lt;/code&gt;, &lt;code&gt;vy&lt;/code&gt;, &lt;code&gt;vz&lt;/code&gt;, &lt;code&gt;mass&lt;/code&gt; — data you don't need. Your &lt;a href="https://nazquadri.dev/blog/the-layer-below/15-ram/" rel="noopener noreferrer"&gt;cache lines&lt;/a&gt; are full of noise. Your SIMD registers require a scatter-gather to populate.&lt;/p&gt;

&lt;p&gt;The SIMD-friendly layout is &lt;strong&gt;SoA&lt;/strong&gt; — Structure of Arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Particles&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With SoA, all &lt;code&gt;x&lt;/code&gt; values are contiguous. Loading &lt;code&gt;&amp;amp;particles.x[i..i+8]&lt;/code&gt; gives 8 consecutive &lt;code&gt;x&lt;/code&gt; values, ready to go. Loading &lt;code&gt;&amp;amp;particles.vx[i..i+8]&lt;/code&gt; gives the matching 8 &lt;code&gt;vx&lt;/code&gt; values. One fused multiply-add updates 8 particles. No scatter-gather. No cache waste.&lt;/p&gt;

&lt;p&gt;This is not a micro-optimization. The difference in a physics simulation inner loop can be 4–8x. The code is otherwise identical.&lt;/p&gt;

&lt;p&gt;That's why SoA and AoS matter — two data structures with identical asymptotic complexity and identical logical content. One is auto-vectorizable. One isn't. The difference can be 8x. Nobody mentioned this in algorithms class.&lt;/p&gt;

&lt;p&gt;This also explains why entity-component systems (ECS) — used in game engines like Unity DOTS and Bevy — look structurally odd until you see SIMD. ECS stores component data in contiguous arrays per component type, not per entity. That's SoA. The performance difference for physics and animation simulations is why the pattern exists.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1q3j6oukwox4dfajwt4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1q3j6oukwox4dfajwt4.jpeg" alt="AoS vs SoA — scattered access vs contiguous SIMD loads" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Alignment
&lt;/h3&gt;

&lt;p&gt;SIMD instructions have opinions about memory alignment. &lt;strong&gt;Aligned&lt;/strong&gt; loads — &lt;code&gt;_mm256_load_ps&lt;/code&gt; — require the address to be 32-byte aligned (the address mod 32 == 0). &lt;strong&gt;Unaligned&lt;/strong&gt; loads — &lt;code&gt;_mm256_loadu_ps&lt;/code&gt; — work on any address, but may be slower on older hardware.&lt;/p&gt;

&lt;p&gt;On modern CPUs (Intel Skylake and later, AMD Zen 2 and later), unaligned loads are as fast as aligned loads — as long as they don't cross a 64-byte cache line boundary. So the practical recipe is: align your arrays (so loads don't straddle cache lines), but use &lt;code&gt;_mm256_loadu_ps&lt;/code&gt; in the code — it costs nothing on aligned data and never faults on unaligned data.&lt;/p&gt;
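&lt;p&gt;You can see why this matters with plain Rust, no intrinsics required (the 32 here is the AVX2 register width in bytes; the printed addresses will vary run to run):&lt;/p&gt;

```rust
// Sketch: inspect the alignment of a heap-allocated f32 buffer.
// Vec's allocator only guarantees align_of::<f32>() = 4 bytes, so
// 32-byte alignment is a matter of luck unless you enforce it.
fn main() {
    let data = vec![0.0f32; 1024];
    let addr = data.as_ptr() as usize;

    println!("base address mod 32 = {}", addr % 32);
    println!("base address mod 64 = {} (cache line)", addr % 64);

    // A 32-byte (8-float) load starting at offset o within a 64-byte
    // cache line crosses into the next line only if o + 32 > 64.
    let crosses = (addr % 64) > 32;
    println!("first 8-wide load splits a cache line: {}", crosses);
}
```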

&lt;p&gt;In Rust, you control alignment with &lt;code&gt;#[repr(align(32))]&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[repr(C,&lt;/span&gt; &lt;span class="nd"&gt;align(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="nd"&gt;))]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AlignedBlock&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the equivalent of C's &lt;code&gt;__attribute__((aligned(32)))&lt;/code&gt; or &lt;code&gt;alignas(32)&lt;/code&gt;. It means: "I plan to load this with SIMD and I want the first element to be register-friendly."&lt;/p&gt;
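&lt;p&gt;You can verify the attribute with &lt;code&gt;std::mem::align_of&lt;/code&gt; and a live address (the &lt;code&gt;AlignedBlock&lt;/code&gt; type is repeated from the snippet above so this sketch is self-contained):&lt;/p&gt;

```rust
// Sketch: confirm that #[repr(C, align(32))] actually produces
// 32-byte-aligned instances.
#[repr(C, align(32))]
struct AlignedBlock {
    data: [f32; 8],
}

fn main() {
    assert_eq!(std::mem::align_of::<AlignedBlock>(), 32);
    assert_eq!(std::mem::size_of::<AlignedBlock>(), 32); // 8 * 4 bytes, no padding needed

    let block = AlignedBlock { data: [0.0; 8] };
    let addr = &block as *const AlignedBlock as usize;
    assert_eq!(addr % 32, 0); // the compiler honors the requested alignment
    println!("AlignedBlock at {:#x}, aligned: {}", addr, addr % 32 == 0);
}
```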




&lt;h2&gt;
  
  
  You Don't Need to Write Intrinsics
&lt;/h2&gt;

&lt;p&gt;The practical message is not "go rewrite your code in intrinsics." It's shorter:&lt;/p&gt;

&lt;p&gt;Write in a way the compiler can vectorize. Keep your hot loops simple and branch-free. Lay your data out contiguously in the access order you need it. Prefer SoA over AoS in performance-critical code. Reach for libraries (numpy, simdjson, BLAS, any vectorized BLAS-backed ML framework) before reaching for intrinsics.&lt;/p&gt;
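&lt;p&gt;As a sketch of "keep your hot loops simple and branch-free": replacing a data-dependent &lt;code&gt;if&lt;/code&gt; with a branchless select keeps the loop in a form auto-vectorizers handle well (function names here are illustrative):&lt;/p&gt;

```rust
// Branchy version: the `if` introduces control flow the vectorizer
// must convert to a mask -- some compilers give up here.
fn clamp_branchy(v: &mut [f32], limit: f32) {
    for x in v.iter_mut() {
        if *x > limit {
            *x = limit;
        }
    }
}

// Branch-free version: f32::min maps to a single instruction (vminps
// on x86) and vectorizes trivially.
fn clamp_branchless(v: &mut [f32], limit: f32) {
    for x in v.iter_mut() {
        *x = x.min(limit);
    }
}

fn main() {
    let mut a = vec![0.5, 2.0, -1.0, 3.5];
    let mut b = a.clone();
    clamp_branchy(&mut a, 1.0);
    clamp_branchless(&mut b, 1.0);
    assert_eq!(a, b); // same result; only the second reliably vectorizes
    println!("{:?}", b); // [0.5, 1.0, -1.0, 1.0]
}
```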

&lt;p&gt;That's why numpy is fast and a Python for-loop isn't. numpy's inner loops are SIMD-vectorized C. When you call &lt;code&gt;arr * 2&lt;/code&gt;, numpy dispatches to a vectorized multiply kernel operating on the entire array in chunks of 8 or 16 elements. Your Python for-loop multiplies one element per bytecode interpretation cycle.&lt;/p&gt;

&lt;p&gt;When two seemingly equivalent implementations show an 8x performance difference, this is frequently why. Not cache (though that's related). Not branch prediction (though that matters too). The data layout didn't let the CPU use seven of its eight lanes.&lt;/p&gt;

&lt;p&gt;If you do need explicit SIMD, Rust gives you options before you reach for raw intrinsics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;std::simd&lt;/code&gt;&lt;/strong&gt; — Rust's portable SIMD API (nightly, progressing toward stable). Type-safe vector types like &lt;code&gt;f32x8&lt;/code&gt; that compile to the best available instructions on any architecture. This is the future.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://crates.io/crates/wide" rel="noopener noreferrer"&gt;&lt;code&gt;wide&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — a stable crate providing portable SIMD types today. Good for production code that can't wait for &lt;code&gt;std::simd&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://crates.io/crates/pulp" rel="noopener noreferrer"&gt;&lt;code&gt;pulp&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — runtime CPU feature detection with safe SIMD dispatch.&lt;/li&gt;
&lt;/ul&gt;
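&lt;p&gt;All three expose the same mental model: fixed-width lanes processed in lockstep, plus a scalar tail. A dependency-free sketch of that shape in plain stable Rust, using &lt;code&gt;chunks_exact_mut(8)&lt;/code&gt; as a stand-in for an &lt;code&gt;f32x8&lt;/code&gt; lane type (the real &lt;code&gt;std::simd&lt;/code&gt;/&lt;code&gt;wide&lt;/code&gt; types compile to actual vector registers; this version merely mirrors the structure):&lt;/p&gt;

```rust
// Sketch of the lane model: process 8 floats at a time, then handle
// the remainder as scalars -- the same shape portable-SIMD code takes.
fn scale(v: &mut [f32], factor: f32) {
    let mut chunks = v.chunks_exact_mut(8);
    for lane in &mut chunks {
        // A real f32x8 would do these 8 multiplies in one vmulps.
        for x in lane.iter_mut() {
            *x *= factor;
        }
    }
    // Scalar tail for lengths not divisible by 8.
    for x in chunks.into_remainder() {
        *x *= factor;
    }
}

fn main() {
    let mut v: Vec<f32> = (0..11).map(|i| i as f32).collect();
    scale(&mut v, 2.0);
    println!("{:?}", v); // 8 elements via the lane loop, 3 via the tail
}
```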

&lt;p&gt;For C++ codebases, &lt;a href="https://github.com/google/highway" rel="noopener noreferrer"&gt;highway&lt;/a&gt; (Google's portable SIMD abstraction) serves a similar role. Don't write raw &lt;code&gt;_mm256_*&lt;/code&gt; calls unless you've exhausted the higher-level options — though in Rust, at least the type system will catch width mismatches at compile time instead of letting you discover them at midnight.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the CPU Looks Like Now
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;One instruction:
  ADD rax, rbx
  → adds two 64-bit integers
  → uses 64 bits of register space

One SIMD instruction:
  VADDPS ymm0, ymm1, ymm2
  → adds eight 32-bit floats
  → uses 256 bits of register space
  → eight physical adders firing simultaneously

Your loop over 8 million floats:
  Scalar:  8,000,000 add instructions
  AVX2:    1,000,000 add instructions (8x fewer)
  AVX-512: 500,000 add instructions (16x fewer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lanes are there. They've been there since 1999, getting wider every few years. Every calculation you've ever run in a Python loop touched one lane of a machine that had eight available.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html" rel="noopener noreferrer"&gt;Intel Intrinsics Guide&lt;/a&gt;&lt;/strong&gt; — The reference. Every intrinsic, its instruction, latency, and throughput. Searchable by operation type. Directly maps to Rust's &lt;code&gt;std::arch&lt;/code&gt; function names.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/rust-lang/rust/issues/86656" rel="noopener noreferrer"&gt;Rust &lt;code&gt;std::simd&lt;/code&gt; tracking issue&lt;/a&gt;&lt;/strong&gt; — The portable SIMD API's path to stabilization. Good overview of the design and current status.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://doc.rust-lang.org/std/arch/index.html" rel="noopener noreferrer"&gt;&lt;code&gt;std::arch&lt;/code&gt; module docs&lt;/a&gt;&lt;/strong&gt; — Rust's platform intrinsics. Every &lt;code&gt;_mm256_*&lt;/code&gt; function from the Intel guide has a corresponding Rust binding here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/BurntSushi/memchr" rel="noopener noreferrer"&gt;memchr crate (Rust)&lt;/a&gt;&lt;/strong&gt; — Andrew Gallant's SIMD-accelerated byte/substring search. Read the source and the README for a clear explanation of the Teddy algorithm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://crates.io/crates/wide" rel="noopener noreferrer"&gt;&lt;code&gt;wide&lt;/code&gt; crate&lt;/a&gt;&lt;/strong&gt; — Portable SIMD types on stable Rust. A practical alternative while &lt;code&gt;std::simd&lt;/code&gt; stabilizes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/abs/1902.08318" rel="noopener noreferrer"&gt;simdjson paper&lt;/a&gt;&lt;/strong&gt; — Lemire et al., 2019. "Parsing Gigabytes of JSON per Second." The original paper. Section 3 explains the SIMD classification step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.akkadia.org/drepper/cpumemory.pdf" rel="noopener noreferrer"&gt;"What Every Programmer Should Know About Memory" — Ulrich Drepper&lt;/a&gt;&lt;/strong&gt; — Section 6 covers SIMD and its interaction with the cache hierarchy. This was the reference when AVX didn't exist yet; the principles are unchanged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.agner.org/optimize/" rel="noopener noreferrer"&gt;Agner Fog's optimization manuals&lt;/a&gt;&lt;/strong&gt; — Table of instruction latencies and throughputs for every SIMD instruction on every microarchitecture. Dense. Invaluable if you're actually tuning.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm writing a book about what makes developers irreplaceable in the age of AI. &lt;a href="https://nazquadri.dev/book" rel="noopener noreferrer"&gt;Join the early access list →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Naz Quadri once hand-wrote AVX2 intrinsics for a function the Rust compiler had already vectorised better. He blogs at &lt;a href="https://nazquadri.dev" rel="noopener noreferrer"&gt;nazquadri.dev&lt;/a&gt;. Rabbit holes all the way down 🐇🕳️.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
