<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dayul Lee</title>
    <description>The latest articles on Forem by Dayul Lee (@lukyday007).</description>
    <link>https://forem.com/lukyday007</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3907102%2F0410295f-6d9a-4d67-83e9-b3508e712838.png</url>
      <title>Forem: Dayul Lee</title>
      <link>https://forem.com/lukyday007</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lukyday007"/>
    <language>en</language>
    <item>
      <title>Network Part 5 - CDN, WebSocket, and Idempotency: When the Parts Meet Traffic</title>
      <dc:creator>Dayul Lee</dc:creator>
      <pubDate>Wed, 06 May 2026 08:55:05 +0000</pubDate>
      <link>https://forem.com/lukyday007/network-part-5-cdn-websocket-and-idempotency-when-the-parts-meet-traffic-3581</link>
      <guid>https://forem.com/lukyday007/network-part-5-cdn-websocket-and-idempotency-when-the-parts-meet-traffic-3581</guid>
      <description>&lt;p&gt;Published: May 6, 2026&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In Part 2, the TCP handshake turned out to be a negotiation fee charged on every connection. &lt;br&gt;In Part 3, HTTP evolved three times to escape the constraints of that fee. &lt;br&gt;In Part 4, load balancers split traffic based on how much they were willing to know.&lt;/p&gt;

&lt;p&gt;But real services don't experience these problems one at a time. Loading a product page hits RTT, multiplexing, and CDN placement all at once. Processing a single payment triggers TCP reliability, application-layer retries, and idempotency in the same breath. The parts have been laid out. Now they meet traffic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;This post covers how concepts from Network Parts 1–4 combine in practice. Reading each concept in the part where it first appeared will make the trade-offs in every scenario click faster.&lt;/p&gt;

&lt;p&gt;Parts 1 through 4 examined each component in isolation — one layer, one protocol, one routing decision at a time. That was necessary. You can't diagnose a bottleneck if you don't know where to look.&lt;/p&gt;

&lt;p&gt;But isolation is not how systems operate. A single user action — loading a page, sending a message, completing a payment — passes through multiple layers simultaneously, and the bottleneck can form at any one of them. The question is no longer "what does this component do?" It's &lt;strong&gt;"which component is the constraint right now, and what combination resolves it?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Four scenarios. Four different bottlenecks. Four different answers — all assembled from parts we've already seen.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Image-Heavy Pages — When Two Constraints Hit at Once
&lt;/h3&gt;

&lt;p&gt;Picture an e-commerce product page. A single scroll loads 80 high-resolution images. Each image is a separate HTTP request. Each request pays the cost of the &lt;em&gt;TCP handshake&lt;/em&gt;. The bottleneck here forms across two layers: the Network Layer / L3 (the routing path that determines RTT) and the Application Layer / L7 (the protocol that determines how requests are delivered).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The first bottleneck is distance — an L3 constraint.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Seoul to a US server costs roughly 150ms per round trip. That's &lt;em&gt;RTT&lt;/em&gt; — a fixed cost locked to physics, determined by the routing path at L3. 80 images × 150ms, fetched one at a time, = 12 seconds of pure network delay, even before the server starts processing. The bottleneck isn't computation. It's geography.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;CDN (Content Delivery Network)&lt;/strong&gt; resolves this by caching static content on edge servers physically close to users. Seoul users hit a Seoul edge server. São Paulo users hit a São Paulo edge server. &lt;em&gt;RTT&lt;/em&gt; gets reduced — not by making the connection faster, but by shortening the L3 path itself.&lt;/p&gt;

&lt;p&gt;In the language of the &lt;em&gt;Theory of Constraints (TOC)&lt;/em&gt;: when RTT is the constraint, the strategy isn't to optimize what happens after the packet arrives. It's to move the server closer to the user so the packet has less distance to travel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The second bottleneck is sequential delivery — an L7 constraint.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even with a nearby CDN, 80 images over HTTP/1.1 means &lt;em&gt;Head-of-Line Blocking&lt;/em&gt;. One slow image stalls everything behind it.&lt;/p&gt;

&lt;p&gt;HTTP/2 &lt;em&gt;multiplexing&lt;/em&gt; breaks those 80 requests into interleaved frames over a single connection. Small thumbnails slip through between chunks of a large hero image. The connection stays alive with &lt;em&gt;Keep-Alive&lt;/em&gt;, and the queue disappears with &lt;em&gt;multiplexing&lt;/em&gt;. The fix happens entirely at L7 — nothing below it changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Without CDN + HTTP/1.1]

Client
   ├─ Request image1 → (150ms) → Response
   ├─ Request image2 → (150ms) → Response
   ├─ Request image3 → (150ms) → Response
   │
   ├─ ...
   │
   └─ Request image80 → (150ms) → Response

80 images × 150ms RTT × sequential 
= 12,000ms (12s)
= painfully slow

------------------------------------------

[With CDN + HTTP/2]

Client
   ├─────── Single Connection ──────┐
   │                                │
   │   ┌  image1  ┐  ┌  image2  ┐   │
   │   ├──────────┤  ├──────────┤   │
   │   ├  image3  ┤  ├  image4  ┤   │
   │   ├──────────┤  ├──────────┤   │
   │   ├   ...    ┤  ├   ...    ┤   │
   │   └  image80 ┘  └──────────┘   │
   │                                │
   │ (sent &amp;amp; received concurrently) │
   └────────────────────────────────┘

80 images × ~5ms RTT  × multiplexed 
≈ tens of ms
= fast
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
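
&lt;p&gt;As a sanity check, the arithmetic above fits in a few lines of Python. This is a back-of-envelope sketch: the 150ms and 5ms figures are the diagram's illustrative values, not measurements, and transfer time is ignored.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IMAGES = 80

def http1_sequential(rtt_ms):
    # One request at a time: every image pays a full round trip.
    return IMAGES * rtt_ms

def http2_multiplexed(rtt_ms):
    # All 80 requests interleave on one connection: roughly one
    # round trip total (transfer time ignored, as in the diagram).
    return rtt_ms

print(http1_sequential(150))  # 12000 ms: distant origin, HTTP/1.1
print(http2_multiplexed(5))   # 5 ms: nearby CDN edge, HTTP/2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;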



&lt;p&gt;If static content dominates and users are geographically distributed, CDN + HTTP/2 is the first combination to consider. &lt;br&gt;
If content is dynamic and users are concentrated in a single region, CDN adds little.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Theory of Constraints (TOC)&lt;/em&gt;, applied: &lt;strong&gt;when two constraints sit on different layers, the fix must address both. Solving one while ignoring the other moves the bottleneck — it doesn't remove it.&lt;/strong&gt;&lt;/p&gt;


  &lt;em&gt;TCP handshake cost, RTT, Keep-Alive&lt;/em&gt; → Network Part 2&lt;br&gt;
  &lt;em&gt;Head-of-Line Blocking, multiplexing&lt;/em&gt; → Network Part 3&lt;br&gt;
  &lt;em&gt;Theory of Constraints (TOC)&lt;/em&gt; → Network Part 1


&lt;p&gt; &lt;/p&gt;
&lt;h3&gt;
  
  
  Real-Time Messaging — Where the Connection Cost Moves
&lt;/h3&gt;

&lt;p&gt;A chat application needs messages to arrive instantly. A notification system needs to push updates without the client asking. Both share the same structural problem: &lt;strong&gt;the server needs to talk first.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;HTTP was designed the other way around. The client asks, the server answers. If the server has something new, it has no way to say so — it has to wait for the next question. This is an L7 constraint — the request-response model doesn't support server-initiated communication. The three approaches below each work around this at L7, but their costs cascade down to the Transport Layer / L4, where connection count and port limits live.&lt;/p&gt;

&lt;p&gt;Three approaches exist. Each pays the connection cost in a different place.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;Connection Cost&lt;/th&gt;
&lt;th&gt;Server Resources&lt;/th&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Long Polling&lt;/td&gt;
&lt;td&gt;Per message&lt;/td&gt;
&lt;td&gt;Low (short-lived)&lt;/td&gt;
&lt;td&gt;Client → Server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSE&lt;/td&gt;
&lt;td&gt;Per session&lt;/td&gt;
&lt;td&gt;Medium (one-way)&lt;/td&gt;
&lt;td&gt;Server → Client&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebSocket&lt;/td&gt;
&lt;td&gt;Per session&lt;/td&gt;
&lt;td&gt;High (persistent)&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Where does the difference come from? One at a time.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long Polling — One response, one reconnection, a new contract every time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The client sends a request and the server holds it open — not responding until there's something new to say. When a response finally arrives, the client immediately sends another request. The connection is alive, but it's rebuilt every round.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Long Polling — Reconnect Loop]

Client                            Server
  │                                 │
  │ ───── Request ────────────────&amp;gt; │
  │                                 │
  │        (waiting...)             │
  │        (holding...)             │
  │ &amp;lt;──── Response ───────────────  │  "New message"
  │                                 │
  │ ───── Request ────────────────&amp;gt; │  ← immediately reconnect
  │                                 │
  │        (waiting...)             │
  │ &amp;lt;──── Response ───────────────  │  "Another message"
  │                                 │
  │ ───── Request ────────────────&amp;gt; │
  │        (repeat forever)         │

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From &lt;em&gt;Transaction Cost Theory&lt;/em&gt;: Long Polling is the &lt;em&gt;TCP handshake&lt;/em&gt; problem in disguise. Every response-request cycle is a new negotiation. The contract doesn't carry over. &lt;em&gt;The most effective way to reduce transaction costs is to reduce the number of transactions&lt;/em&gt; — Long Polling does the opposite. It multiplies them.&lt;/p&gt;

&lt;p&gt;The overhead is real. Each reconnection carries HTTP headers, potentially a new TCP handshake (if &lt;em&gt;Keep-Alive&lt;/em&gt; expires), and a fresh slot in the server's connection pool. At scale — tens of thousands of users waiting for messages — the negotiation fee alone saturates the server.&lt;/p&gt;
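
&lt;p&gt;The reconnect loop is easy to see in code. A minimal client-side sketch in Python, assuming a hypothetical &lt;code&gt;/messages&lt;/code&gt; endpoint and the third-party &lt;code&gt;requests&lt;/code&gt; library; every pass through the loop is a fresh HTTP transaction.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests  # third-party: pip install requests

URL = "https://example.com/messages"  # hypothetical long-poll endpoint

def handle(message):
    print("received:", message)

while True:
    try:
        # The server holds this request open until it has something to say.
        resp = requests.get(URL, timeout=60)
        handle(resp.json())
    except requests.exceptions.Timeout:
        pass  # nothing arrived within 60s: just ask again
    # Either way the loop re-sends: fresh headers, a fresh pool slot,
    # and possibly a fresh TCP handshake if Keep-Alive expired.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;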

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSE (Server-Sent Events) — One channel, one direction, held open indefinitely.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SSE holds a single HTTP connection open indefinitely. The server pushes data down whenever it has something new. The client never reconnects — it just listens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SSE — One-Way Stream]

Client                             Server
  │                                 │
  │ ───── Request (subscribe) ────&amp;gt; │
  │                                 │
  │                                 │
  │ &amp;lt;──── Event Stream ───────────  │  "Message 1"
  │ &amp;lt;──── Event Stream ───────────  │  "Message 2"
  │ &amp;lt;──── Event Stream ───────────  │  "Update"
  │ &amp;lt;──── Event Stream ───────────  │  "Notification"
  │                                 │
  │        (connection stays open)  │
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;em&gt;Keep-Alive&lt;/em&gt; logic applied to real-time delivery. One handshake, many messages. The negotiation fee is paid once and amortized across every subsequent event.&lt;/p&gt;

&lt;p&gt;The trade-off: SSE is one-directional. The server talks, the client listens. For a notification feed — stock price alerts, live scores, deployment status updates — that's exactly right. For a chat application where the client also needs to send messages back through the same channel, SSE falls short. The client would need a separate HTTP request for every outbound message, reintroducing the per-message cost that SSE was designed to avoid.&lt;/p&gt;
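
&lt;p&gt;The one-way channel is simple enough to sketch with nothing but Python's standard library. The five-message loop below stands in for a real event source; the wire format is the standard SSE one, each event terminated by a blank line.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from http.server import BaseHTTPRequestHandler, HTTPServer
import time

class SSEHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # One subscribe request, held open: the response never completes.
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.send_header("Cache-Control", "no-cache")
        self.end_headers()
        for n in range(1, 6):
            # SSE wire format: "data: ..." plus a blank line per event.
            self.wfile.write(f"data: Message {n}\n\n".encode())
            self.wfile.flush()
            time.sleep(1)

HTTPServer(("localhost", 8000), SSEHandler).serve_forever()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;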

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebSocket — One connection, both directions, permanently.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;WebSocket starts as an HTTP request, then upgrades the connection to a persistent, full-duplex channel. Both sides can send data at any time. No re-negotiation. No new connections. The contract is signed once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[WebSocket — Full Duplex Communication]

Client                            Server
  │                                │
  │ ─── HTTP Upgrade ───────────&amp;gt;  │
  │ &amp;lt;── 101 Switching ───────────  │
  │                                │
  │════════════════════════════════│
  │    Persistent Bidirectional    │
  │════════════════════════════════│
  │                                │
  │ ───────── "Hey" ────────────&amp;gt;  │
  │ &amp;lt;──────── "Hi" ──────────────  │
  │ ─────── "Got it" ───────────&amp;gt;  │
  │ &amp;lt;──────── "News" ────────────  │
  │            ...                 │
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Transaction cost: near zero per message. The entire negotiation overhead is front-loaded into a single upgrade handshake.&lt;/p&gt;

&lt;p&gt;But the cost doesn't vanish — it moves. Each WebSocket connection holds a persistent TCP socket open on the server. For connections the server itself originates (or that funnel through a proxy), the roughly &lt;em&gt;28,000 usable ports&lt;/em&gt; from Part 1 are the ceiling; for inbound connections, the binding limits are sockets, file descriptors, and memory. &lt;em&gt;TIME_WAIT&lt;/em&gt; doesn't apply to connections that never close, but each socket stays occupied for as long as the connection lives. A chat service with 50,000 concurrent users needs 50,000 open sockets — permanently.&lt;/p&gt;

&lt;p&gt;The connection cost went from &lt;strong&gt;per-message&lt;/strong&gt; (Long Polling) to &lt;strong&gt;per-session&lt;/strong&gt; (WebSocket). Cheaper per interaction, but the resource commitment is continuous.&lt;/p&gt;
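
&lt;p&gt;A minimal sketch of the per-session model, assuming the third-party &lt;code&gt;websockets&lt;/code&gt; package (v11+ API). The upgrade handshake happens once inside &lt;code&gt;serve&lt;/code&gt;; every message after that rides the same socket in either direction.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
import websockets  # third-party: pip install websockets

async def handler(ws):
    # One upgrade handshake bought this socket; it now carries
    # every message, both directions, until the session ends.
    async for message in ws:
        await ws.send(f"echo: {message}")

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # hold the server open forever

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;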

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;The question remains the same: &lt;strong&gt;what's more expensive — renegotiating constantly, or holding the line open?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the bottleneck is message frequency (a constant stream of messages on every connection), WebSocket wins — the per-message cost of Long Polling would be devastating. &lt;br&gt;
If the bottleneck is connection count (millions of users, infrequent updates), Long Polling may be more efficient, since it releases resources between interactions; SSE, being one-directional, is still cheaper to hold open than a full-duplex socket.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Theory of Constraints (TOC)&lt;/em&gt;, applied: &lt;strong&gt;find which resource saturates first — message throughput or connection count — and choose accordingly.&lt;/strong&gt;&lt;/p&gt;


  &lt;em&gt;Transaction Cost Theory, TCP handshake, Keep-Alive, TIME_WAIT&lt;/em&gt; → Network Part 2&lt;br&gt;
  &lt;em&gt;L4 port limit (28,000)&lt;/em&gt; → Network Part 1&lt;br&gt;
  &lt;em&gt;Theory of Constraints (TOC)&lt;/em&gt; → Network Part 1


&lt;p&gt; &lt;/p&gt;
&lt;h3&gt;
  
  
  Global Routing — Closing the Information Gap
&lt;/h3&gt;

&lt;p&gt;A service with users in Seoul, London, and São Paulo runs all its servers in &lt;code&gt;us-east-1&lt;/code&gt;. Seoul to Virginia is roughly 150ms &lt;em&gt;RTT&lt;/em&gt;. London is around 80ms. São Paulo, roughly 180ms.&lt;/p&gt;

&lt;p&gt;Every request from every user pays that cost. Not once — on every interaction. The server could respond in 5ms, but the user waits 150ms before the response even starts its return trip. &lt;strong&gt;The bottleneck isn't the server. It's the L3 path between the server and the user.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DNS — operating at L7 — is where the routing decision is made. &lt;em&gt;DNS round-robin&lt;/em&gt; can't solve the distance problem. It rotates IPs without knowing where the client is. A Seoul user might get routed to Virginia while a server in Tokyo sits idle. This is &lt;em&gt;Information Asymmetry&lt;/em&gt; — the DNS server lacks the information the routing decision requires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GeoDNS closes that information gap.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a DNS query arrives, GeoDNS reads the query's source IP at L7 (usually the client's resolver, which tends to sit near the client, or the client's own subnet when EDNS Client Subnet is forwarded), infers a geographic location, and returns the IP of the nearest server. This decision determines the L3 routing path — L7's information changes L3's cost. Seoul users get the Tokyo server. London users get the Frankfurt server. São Paulo users get the São Paulo server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Traditional DNS Round Robin]

              ┌────────────┐
              │    DNS     │
              │ (No logic) │
              └─────┬──────┘
                    │
      ┌─────────────┼─────────────┐
      │             │             │
 Seoul User    London User     SP User
      │             │             │
      └──────┬──────┴──────┬──────┘
             ▼             ▼
       ┌────────────┬────────────┐
       │            │            │
   192.168.1.1  192.168.1.2  192.168.1.3
   (Virginia)   (Virginia)   (Virginia)
       │            │            │
     150ms         80ms        180ms

-----------------------------------------------

[GeoDNS (Location-Aware Routing)]

              ┌─────────────┐
              │     DNS     │
              │ (Geo Logic) │
              └──────┬──────┘
                     │
       ┌─────────────┼─────────────┐
       │             │             │
  Seoul User    London User     SP User
       │             │             │
       ▼             ▼             ▼
     (Asia)       (Europe)   (South America)
       │             │             │
 ┌──────────┐  ┌───────────┐  ┌───────────┐
 │   Tokyo  │  │ Frankfurt │  │ São Paulo │
 │ 10.0.1.1 │  │  10.0.2.1 │  │ 10.0.3.1  │
 └──────────┘  └───────────┘  └───────────┘
       │             │             │
     30ms          15ms           10ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;em&gt;DNS round-robin&lt;/em&gt; was blind rotation, GeoDNS is informed routing. The information that was missing — the client's location — is now part of the decision. &lt;em&gt;How you handle the information gap determines the outcome.&lt;/em&gt; GeoDNS handles it by acquiring the one piece of information that matters most: &lt;strong&gt;where the user is.&lt;/strong&gt;&lt;/p&gt;
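
&lt;p&gt;The decision itself is small enough to sketch. A toy Python resolver using the regions and addresses from the diagram above; real GeoDNS products consult an IP-geolocation database rather than a hard-coded lookup, and the sample client IP here is invented.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Toy GeoDNS: each region maps to its nearest server
# (values taken from the diagram above).
NEAREST = {
    "Asia":          "10.0.1.1",  # Tokyo
    "Europe":        "10.0.2.1",  # Frankfurt
    "South America": "10.0.3.1",  # São Paulo
}

def geo_lookup(client_ip):
    # Stand-in for a real IP-geolocation database lookup.
    return {"198.51.100.7": "Asia"}.get(client_ip, "Europe")

def resolve(client_ip):
    # The one piece of information round-robin lacked: where the user is.
    return NEAREST[geo_lookup(client_ip)]

print(resolve("198.51.100.7"))  # 10.0.1.1: a Seoul user gets Tokyo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;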

&lt;p&gt;&lt;strong&gt;Edge servers extend this logic beyond static content.&lt;/strong&gt; A CDN caches images and files. An edge server can run L7 computation — authentication checks, personalization logic, API responses — close to the user. Where CDN shortened L3's distance, edge servers move L7's processing itself toward the user.&lt;/p&gt;

&lt;p&gt;If users span two or more continents, GeoDNS + edge servers is the combination that structurally reduces RTT. &lt;br&gt;
If traffic fits comfortably in a single region, it only adds operational complexity.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Information Asymmetry&lt;/em&gt;, applied: &lt;strong&gt;the routing decision is only as good as the information it has. Close the gap, and the cost drops. Ignore it, and geography wins by default.&lt;/strong&gt;&lt;/p&gt;


  &lt;em&gt;RTT&lt;/em&gt; → Network Part 2&lt;br&gt;
  &lt;em&gt;DNS round-robin, Information Asymmetry&lt;/em&gt; → Network Part 4


&lt;p&gt; &lt;/p&gt;
&lt;h3&gt;
  
  
  Payment Retries — Where TCP's Trust Ends
&lt;/h3&gt;

&lt;p&gt;A user clicks "Pay." The request reaches the server. The server charges the card. Then the response is lost — a network timeout somewhere between the server and the client.&lt;/p&gt;

&lt;p&gt;The client sees: "Request failed." The user clicks "Pay" again. A second request arrives at the server. Without protection, the card is charged twice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TCP's guarantee, revisited.&lt;/strong&gt; &lt;em&gt;TCP purchases reliability at the cost of speed.&lt;/em&gt; The handshake ensures packets arrive, in order, without loss. But TCP's guarantee operates at the transport layer. It promises that bytes will be delivered. It says nothing about what happens after the application processes those bytes.&lt;/p&gt;

&lt;p&gt;The timeout above isn't a TCP failure. TCP delivered the request successfully. The server processed it. The response was lost on its way back. L4's TCP did its job. L7 was left unprotected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client                        Server
|                              |
|  ——— "Charge $50" ————————&amp;gt;  |  ✓ TCP delivered
|                              |  ✓ Server charged the card
|  &amp;lt;—— Response ———————— ✕     |  ✗ Response lost in transit
|                              |
|  (timeout — user retries)    |
|                              |
|  ——— "Charge $50" ————————&amp;gt;  |  ✓ TCP delivered again
|                              |  ✗ Server charges the card AGAIN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;An idempotency key solves this at L7.&lt;/strong&gt; The client generates a unique key for each intended action and attaches it to the request. If the same key arrives twice, the server recognizes it as a retry and returns the original result without re-executing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client                        Server
|                              |
|  ——— "Charge $50"            |
|      key: abc-123 ————————&amp;gt;  |  ✓ First time seeing abc-123
|                              |  ✓ Charges the card, stores result
|  &amp;lt;—— Response ———————— ✕     |  ✗ Response lost
|                              |
|  ——— "Charge $50"            |
|      key: abc-123 ————————&amp;gt;  |  → abc-123 already processed
|                              |  → Returns stored result, no re-charge
|  &amp;lt;—— Response ————————————   |  ✓ Client receives confirmation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key distinction: &lt;strong&gt;TCP guarantees at-least-once delivery. The application needs exactly-once execution.&lt;/strong&gt; Idempotency keys bridge that gap.&lt;/p&gt;
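
&lt;p&gt;A minimal server-side sketch of that bridge in Python. The in-memory dict stands in for a durable store, and &lt;code&gt;charge_card&lt;/code&gt; is a hypothetical payment call; a real system would check and record the key atomically, inside a transaction.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Idempotency at L7: same key, same result, exactly one execution.
processed = {}  # maps idempotency key to its stored result

def charge_card(amount):
    # Hypothetical payment call: the side effect we must not repeat.
    return {"status": "charged", "amount": amount}

def handle_payment(idempotency_key, amount):
    if idempotency_key in processed:
        # A retry: return the original result without re-executing.
        return processed[idempotency_key]
    result = charge_card(amount)
    processed[idempotency_key] = result
    return result

first = handle_payment("abc-123", 50)  # charges the card once
retry = handle_payment("abc-123", 50)  # returns stored result, no re-charge
assert first == retry
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;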

&lt;p&gt;This is where network-layer concepts meet application-layer design. The trust that the &lt;em&gt;TCP handshake&lt;/em&gt; purchases extends only to the L4 boundary. Beyond that, reliability is L7's responsibility.&lt;/p&gt;

&lt;p&gt;Any write operation where network retries can occur — payments, orders, reservations — needs an idempotency key. Read-only APIs do not.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Transaction Cost Theory&lt;/em&gt;, applied: &lt;strong&gt;TCP's contract covers L4. Guaranteeing execution at L7 requires a separate contract — and the idempotency key is that contract's cost.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;TCP handshake, TCP's trust cost&lt;/em&gt; → Network Part 2&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/" rel="noopener noreferrer"&gt;Brave New Geek: You Cannot Have Exactly-Once Delivery&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  The Bottom Line
&lt;/h3&gt;

&lt;p&gt;Four scenarios. Four bottlenecks. Four different combinations of the same building blocks.&lt;/p&gt;

&lt;p&gt;Image loading paid the cost of distance and the cost of queuing — at two different layers, simultaneously. Real-time messaging moved the negotiation fee from per-message to per-session, trading frequency for commitment. Global routing closed the information gap that DNS couldn't see. Payment retries revealed the boundary where TCP's trust expires and the application must build its own.&lt;/p&gt;

&lt;p&gt;Every scenario asked the same question this series has been asking from the start: &lt;strong&gt;where is the bottleneck, and what are you willing to trade to clear it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer was never the same twice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There is no universal architecture. There is only the architecture that matches the constraint you're facing right now.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next up: the network delivered the request. Now the server has to process it — and the first thing it touches is the database. The idempotency key from payment retries — the guarantee that the same request executes only once — can't be enforced without a database transaction. Every query, every write, every transaction ultimately hits one physical constraint: disk I/O. That's where the Database series begins.&lt;/p&gt;

</description>
      <category>network</category>
      <category>cdn</category>
      <category>websocket</category>
      <category>idempotency</category>
    </item>
    <item>
      <title>Network Part 4 - The Load Balancer as a Traffic Information Decision</title>
      <dc:creator>Dayul Lee</dc:creator>
      <pubDate>Fri, 01 May 2026 09:04:36 +0000</pubDate>
      <link>https://forem.com/lukyday007/network-part-4-where-to-split-why-to-read-37n1</link>
      <guid>https://forem.com/lukyday007/network-part-4-where-to-split-why-to-read-37n1</guid>
      <description>&lt;p&gt;Published: April 29, 2026&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;October 4, 2021. Facebook, Instagram, and WhatsApp went completely dark for roughly six hours — all at once. The servers were fine. No bad deploy. A single command run during routine maintenance withdrew every one of Facebook's BGP routes. The internet forgot how to reach Facebook's data centers. Traffic had nowhere to go. Facebook ceased to exist on the internet.&lt;/p&gt;

&lt;p&gt;The servers were running. The load balancers were healthy. Everything was fine. Requests just couldn't get in. That's what happens when traffic distribution breaks at the routing layer. No matter how well-built the system behind the load balancer is — if requests can't reach it, none of it matters.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Not all load balancers work the same way. Some look only at the outside of a packet and route it fast. Others open the packet, read what's inside, and decide based on the contents. In Part 1, the trade-off was clear: L4 is fast because it stays ignorant, L7 is precise because it pays to know. Load balancers face the same choice. Which layer do you split traffic at?&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  DNS Round Robin — Blind by Design
&lt;/h3&gt;

&lt;p&gt;The most primitive form of load balancing starts at DNS. Register multiple server IPs under one domain, and hand out a different IP in rotation for each incoming request. That's &lt;strong&gt;DNS round-robin&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://www.cloudflare.com/learning/dns/glossary/round-robin-dns/" rel="noopener noreferrer"&gt;Cloudflare Learning: What is round-robin DNS?&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;ref. &lt;a href="https://www.cloudflare.com/learning/performance/what-is-dns-load-balancing/" rel="noopener noreferrer"&gt;Cloudflare Learning: What is DNS load balancing?&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                ┌───────────────┐
                │     Client    │
                └───────────────┘
                        ↓
              "What's example.com?"
                        ↓
    ┌──────────────────────────────────────┐
    │               DNS Server             │
    │  (Returns a different IP each time)  │
    └──────────────────────────────────────┘
        ┌───────────────┼───────────────┐
  [1st request]   [2nd request]    [3rd request]
       ↙                ↓                ↘
 ┌────────────┐   ┌────────────┐   ┌────────────┐
 │  Server A  │   │ Server B   │   │  Server C  │
 │192.168.0.1 │   │192.168.0.2 │   │192.168.0.3 │
 └────────────┘   └────────────┘   └────────────┘

[Structural limits]

✗ Blind to server state
  → DNS keeps returning Server A even when it's overloaded
✗ Can't detect failures
  → DNS keeps responding with Server B's IP even after it goes down
✗ TTL caching
  → once a client receives an IP, it keeps hitting that server until TTL expires
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DNS round-robin looks balanced in theory.&lt;/p&gt;

&lt;p&gt;Imagine a theme park with three parking lots — A, B, and C. The navigation app at the entrance sends cars in rotation: first to A, next to B, then C. Arithmetically balanced.&lt;/p&gt;

&lt;p&gt;But the navigation app doesn't check the server every time. It trusts the answer it got for a fixed window. "This information is valid for 10 minutes" — timer starts, server goes unchecked. That's &lt;strong&gt;TTL (Time-To-Live)&lt;/strong&gt;: an expiration date on information.&lt;/p&gt;

&lt;p&gt;This is where the breakdown happens. Picture a convoy of tourist buses arriving back-to-back. The first bus gets "go to Lot A." Every bus behind it copies that answer without checking — their devices already have it cached. The server is ready to send the next convoy to Lots B and C. But the buses aren't asking anymore. Lot A is jammed. Lots B and C sit empty.&lt;/p&gt;

&lt;p&gt;Economist George Akerlof described this structure in his 1970 paper "The Market for Lemons" as &lt;strong&gt;Information Asymmetry&lt;/strong&gt;. In the used car market, sellers know the defects; buyers don't. That gap alone distorts the entire market.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://en.wikipedia.org/wiki/The_Market_for_Lemons" rel="noopener noreferrer"&gt;Information Asymmetry&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;DNS round-robin works the same way. The DNS server knows Server A is overloaded. The client won't find out until TTL expires. The distribution gets skewed — not because caching is broken, but because of a structural disconnect between the party that has the information and the party that needs it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNS round-robin looks like load balancing. In practice, it's blind rotation.&lt;/strong&gt;&lt;/p&gt;
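
&lt;p&gt;The whole mechanism fits in a few lines, which is exactly the problem. A toy Python sketch of the rotation; notice that nothing in it can see load, health, or a client's cache.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from itertools import cycle

# The DNS server's entire "logic": rotate through the registered IPs.
rotation = cycle(["192.168.0.1", "192.168.0.2", "192.168.0.3"])

def resolve(domain):
    # No server state, no health checks, no idea who is asking.
    return next(rotation)

for _ in range(4):
    print(resolve("example.com"))
# .1, .2, .3, .1 ... and a client that cached .1 keeps hitting .1
# until its TTL expires, without ever asking again.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;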

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  L4 Load Balancer — Fast by Choice
&lt;/h3&gt;

&lt;p&gt;The L4 load balancer follows the same philosophy introduced in Part 1. It doesn't open the packet. It reads only the destination address (IP) and port number on the envelope, and decides where to send it from there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Transport Layer]

            ┌───────────────────┐
            │   Client Request  │
            └───────────────────┘      
                      ↓
      ┌─────────────────────────────────┐
      │         L4 Load Balancer        │
      │                                 │
      │         ✓ IP address            │
      │       ✓ Check Port number       │
      │        ✗ Packet content         │
      └─────────────────────────────────┘    
        ↙             ↓             ↘
  ┌───────────┐ ┌───────────┐ ┌───────────┐
  │ Server A  │ │ Server B  │ │ Server C  │
  └───────────┘ └───────────┘ └───────────┘      
  ( Based on IP hash or least connections )        
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No content inspection means fast decisions. Millions of concurrent connections, handled. It fits environments where large numbers of clients open simple TCP connections simultaneously — game servers, for example.&lt;/p&gt;

&lt;p&gt;L4 is a strategy that accepts the information gap. It makes routing decisions without knowing what's inside the packet. Where DNS round-robin failed because it lacked information, L4 turns that same ignorance into a deliberate choice. DNS misdistributes because it doesn't know. L4 trades knowing for speed.&lt;/p&gt;
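
&lt;p&gt;The decision L4 makes can be sketched with the two fields it actually reads. A toy Python version of IP-hash routing; real balancers hash the full connection tuple and offer least-connections as well, but the content-blindness is the same.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SERVERS = ["Server A", "Server B", "Server C"]

def l4_route(src_ip, src_port):
    # Only what's on the envelope: address and port. The payload is
    # never opened, so the choice is cheap and content-blind.
    return SERVERS[hash((src_ip, src_port)) % len(SERVERS)]

# The same connection always hashes to the same backend (within a run).
print(l4_route("203.0.113.7", 51320))
print(l4_route("203.0.113.7", 51320))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;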

&lt;p&gt;The limits follow from that choice. No content visibility means no URL-based routing. You can't send &lt;code&gt;/api/payments&lt;/code&gt; to the payments cluster and &lt;code&gt;/api/products&lt;/code&gt; to the product cluster. You can't read cookies, so cookie-based session persistence isn't possible.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  L7 Load Balancer — Informed by Design
&lt;/h3&gt;

&lt;p&gt;The L7 load balancer reads the packet. HTTP headers, URL paths, cookies, request body. It opens the envelope, reads the letter, and routes it to whoever handles that specific content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Application Layer]

              ┌───────────────────┐
              │   Client Request  │
              └───────────────────┘
                        ↓
            ┌────────────────────────┐
            │    L7 Load Balancer    │
            │                        │
            │  ✓ IP address / port   │
            │  ✓ HTTP method / URL   │
            │  ✓ Host header         │
            │  ✓ Cookies / body      │
            └────────────────────────┘
          ↙             ↓             ↘
    ┌───────────┐  ┌──────────┐   ┌──────────┐
    │  Payment  │  │  Product │   │   User   │
    │  Server   │  │  Server  │   │  Server  │
    └───────────┘  └──────────┘   └──────────┘
               Routing based on URL path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reading the URL means &lt;code&gt;/api/payments&lt;/code&gt; goes to the payments server and &lt;code&gt;/api/products&lt;/code&gt; goes to the product server. Reading cookies enables &lt;strong&gt;session persistence&lt;/strong&gt; — if a user's cart data lives only on Server A, L7 reads the user ID from the cookie and keeps sending that user back to Server A.&lt;/p&gt;
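
&lt;p&gt;The same decision, sketched with the envelope open. The path prefixes mirror the examples above; the cookie name is invented for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLUSTERS = {
    "/api/payments": "payment-cluster",
    "/api/products": "product-cluster",
}

def l7_route(path, cookies):
    # Cookie first: session persistence pins a user to one server.
    if "server_affinity" in cookies:
        return cookies["server_affinity"]
    # Then the URL: content-based routing.
    for prefix, cluster in CLUSTERS.items():
        if path.startswith(prefix):
            return cluster
    return "default-cluster"

print(l7_route("/api/payments/123", {}))                       # payment-cluster
print(l7_route("/api/cart", {"server_affinity": "server-a"}))  # server-a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;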

&lt;p&gt;L7 is a strategy that pays to close the information gap. Transaction Cost Theory from Part 2 applies here too. Acquiring information has a cost. Parsing headers, inspecting URLs, reading cookies — all of it is the price of knowing. In exchange for paying that cost, L7 can make decisions L4 simply cannot.&lt;/p&gt;

&lt;p&gt;The trade-off is structural. Every request gets parsed and interpreted. That overhead is categorically higher than L4. As traffic scales, the cost compounds.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  L4 vs L7 — Where the Bottleneck Is
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;L4 Load Balancer&lt;/th&gt;
      &lt;th&gt;L7 Load Balancer&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Sees&lt;/td&gt;
      &lt;td&gt;IP address, port number&lt;/td&gt;
      &lt;td&gt;HTTP headers, URL, cookies, request body&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Speed&lt;/td&gt;
      &lt;td&gt;Fast&lt;/td&gt;
      &lt;td&gt;Slower&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Routes by&lt;/td&gt;
      &lt;td&gt;Connection count, IP hash&lt;/td&gt;
      &lt;td&gt;URL path, cookies, headers&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Can do&lt;/td&gt;
      &lt;td&gt;Simple TCP distribution&lt;/td&gt;
      &lt;td&gt;Content-based routing&lt;br&gt;A/B testing&lt;br&gt;Session persistence&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Common use&lt;/td&gt;
      &lt;td&gt;
        &lt;span&gt;Game servers&lt;/span&gt;
        &lt;span&gt;High-volume streaming&lt;/span&gt;
      &lt;/td&gt;
      &lt;td&gt;
        &lt;span&gt;API Gateway&lt;/span&gt;
        &lt;span&gt;Microservices&lt;/span&gt;
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Goldratt's Theory of Constraints from Network Part 1 applies directly here. The constraint is never fixed — &lt;strong&gt;it's wherever the system is closest to 100% saturation&lt;/strong&gt;. The question of which OSI layer is the bottleneck becomes the question of which load balancer to use.&lt;/p&gt;

&lt;p&gt;Concurrent connections approaching the limit: L4. Requests that need to be routed based on their content: L7. In practice, many production systems layer both — L4 receives traffic first and distributes it across server groups, then L7 handles fine-grained routing within each group.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://www.haproxy.com/blog/layer-4-and-layer-7-proxy-mode" rel="noopener noreferrer"&gt;HAProxy Blog: Layer 4 and Layer 7 Proxy Mode&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Own the Information, or Let It Go
&lt;/h3&gt;

&lt;p&gt;Three systems. Same problem. Three different answers.&lt;/p&gt;

&lt;p&gt;DNS round-robin failed because it couldn't see server state. It rotated blind. The information gap distorted the distribution.&lt;/p&gt;

&lt;p&gt;L4 chose to give up information. It makes decisions without knowing the contents — and converts that ignorance directly into speed. The gap becomes an asset.&lt;/p&gt;

&lt;p&gt;L7 chose to buy information. It pays in parsing time and gets precision in return. The gap gets closed at a cost.&lt;/p&gt;

&lt;p&gt;What Akerlof showed wasn't that information gaps are inherently bad — it's that &lt;strong&gt;how you handle the gap is what determines the outcome&lt;/strong&gt;. Used car markets that ignored the gap collapsed. Markets that bridged it with warranties survived.&lt;/p&gt;

&lt;p&gt;Load balancers work the same way. Ignore the gap and you get DNS. Accept it and you get L4. Close it and you get L7.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The question isn't whether the information gap exists. It's what you do with it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the same structure that's run through every part of this series. Goldratt asked where the constraint is. Coase and Williamson explained the conditions under which paying transaction costs makes sense. Akerlof showed how information gaps split behavior. Different names across four parts — but the same question underneath.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where is the bottleneck right now, and what are you willing to give up to clear it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  The Bottom Line
&lt;/h3&gt;

&lt;p&gt;DNS round-robin assigns slots without knowing server state. Information asymmetry distorts the distribution. L4 gives up information and gets speed. L7 acquires information and gets precision. Each approach makes a different call on where to absorb the cost.&lt;/p&gt;

&lt;p&gt;Which layer you split traffic at isn't a technical preference — it's a trade-off decision. The same question this series has been asking from Part 1. Where is the constraint, and what do you give up to resolve it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Know where the bottleneck is, and you'll know where to split. Know how to handle the information gap, and you'll know how to split.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next up: everything covered so far — OSI layers, TCP handshake costs, HTTP evolution, load balancing — comes together in real systems. Four scenarios: an e-commerce platform, a real-time messaging service, a globally distributed user base, and a payment system. Where does the bottleneck form, and which choices resolve it?&lt;/p&gt;

</description>
      <category>network</category>
      <category>dns</category>
      <category>l4</category>
      <category>l7</category>
    </item>
    <item>
      <title>Network Part 3 - The Evolution of HTTP and the Cost of Every Trade-off</title>
      <dc:creator>Dayul Lee</dc:creator>
      <pubDate>Fri, 01 May 2026 08:57:17 +0000</pubDate>
      <link>https://forem.com/lukyday007/network-part-3-the-evolution-of-http-and-the-cost-of-every-trade-off-i7i</link>
      <guid>https://forem.com/lukyday007/network-part-3-the-evolution-of-http-and-the-cost-of-every-trade-off-i7i</guid>
      <description>&lt;p&gt;Published: April 25, 2026&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When the past answer becomes the present problem, we call it Path Dependency. TCP was designed in 1981. It solved the right problems for its time. By the time HTTP was carrying the modern web, that 40-year-old foundation was starting to show its age.&lt;/p&gt;

&lt;p&gt;Keep-Alive solved the contract problem. One connection, many requests. Cheaper by design. But the queue was still single-file. Fix the engine, and suddenly the road is the problem. That road was TCP.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  HTTP/1.1 — One Lane, No Exceptions
&lt;/h3&gt;

&lt;p&gt;HTTP/1.1 had one rigid rule: one connection handles one request at a time, in order.&lt;/p&gt;

&lt;p&gt;Keep-Alive meant you didn't have to renegotiate a new contract for every exchange. But the delivery itself was still sequential. One large image delayed at the front of the queue, and every lightweight text file behind it had to wait. The connection was alive — it just couldn't move two things at once. This is &lt;strong&gt;Head-of-Line Blocking (HOLB)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Browsers tried to route around it by opening up to six parallel connections per server. But that just brought back the port exhaustion and handshake overhead from Part 2. The problem wasn't solved. It was transferred — into a different form of cost.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  HTTP/2 — Same Road, Different Lane
&lt;/h3&gt;

&lt;p&gt;Released in 2015, HTTP/2 attacked HOL Blocking at the application layer with &lt;strong&gt;multiplexing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of sending whole files in sequence, HTTP/2 breaks requests and responses into small frames and interleaves them over a single connection. Small payloads can slip through between chunks of a large one. On paper, it looked like true parallel processing.&lt;/p&gt;

&lt;p&gt;But TCP was still underneath. And TCP is obsessed with order.&lt;/p&gt;

&lt;p&gt;If a single packet is lost in transit, TCP halts everything — including frames from entirely unrelated requests — until that packet is retransmitted and order is restored. HTTP/2 had widened the lanes at the application layer. The transport layer's old rules froze them anyway.&lt;/p&gt;

&lt;p&gt;This is Path Dependency in action. &lt;strong&gt;The choice to build HTTP on top of TCP meant every improvement had to work within TCP's constraints.&lt;/strong&gt; HTTP/2 was a sustaining innovation — the best possible improvement within the existing system. But it couldn't break the structural ceiling, because the ceiling was TCP itself.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://web.dev/articles/performance-http2" rel="noopener noreferrer"&gt;web.dev: HTTP/2&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  HTTP/3 — Cutting the Path
&lt;/h3&gt;

&lt;p&gt;Google made the call: ditch TCP entirely.&lt;/p&gt;

&lt;p&gt;But this is where a common misconception takes hold. Dropping TCP didn't mean dropping reliability. It meant replacing everything TCP did with something that did it better.&lt;/p&gt;

&lt;p&gt;The new foundation is UDP. Unlike TCP, UDP makes no guarantees — no ordering, no retransmission, no reliability. It just fires packets and moves on. Google built QUIC directly on top of that bare foundation — taking on everything TCP and TLS used to handle (retransmission, connection management, encryption), but doing it per stream. One stream stalls, the rest keep moving. HOL Blocking, gone at the root.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐        ┌─────────────────┐
│      Before     │        │      After      │
├─────────────────┤        ├─────────────────┤
│   HTTP/1.1·2    │        │     HTTP/3      │
│        ↕        │    →   │        ↕        │
│       TCP       │        │      QUIC       │
│        ↕        │        │        ↕        │
│       IP        │        │       UDP       │
└─────────────────┘        └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why HTTP/3 is called "UDP-based." QUIC sits on top of UDP, and HTTP/3 runs on top of QUIC. TCP wasn't abandoned — its role was replaced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why YouTube and Zoom were already on UDP&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Real-time streaming services made this call long ago. When a packet drops during a video call, waiting for TCP to retransmit it freezes the screen. It's better to skip that moment and move to the next frame. When continuity matters more than completeness, TCP's reliability becomes a liability.&lt;/p&gt;

&lt;p&gt;QUIC brought that same instinct to general web traffic. A lost packet stalls only the stream it belongs to. Everything else keeps moving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;0-RTT — Cutting the cost of security too&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 1.5 RTT cost from Part 2 was just the TCP handshake. Add HTTPS, and TLS negotiation stacks on top. One connection, up to 3 RTT before a single byte of data moves.&lt;/p&gt;

&lt;p&gt;QUIC has TLS 1.3 built in — connection and security negotiation happen simultaneously. Return visits skip the handshake entirely. The 1.5 RTT from Part 2 collapses to zero.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────┬────────────────────┐
│     TCP + TLS       │        QUIC        │
├─────────────────────┼────────────────────┤
│    TCP   1.5 RTT    │ First visit 1 RTT  │
│    TLS   1.5 RTT    │ Return visit 0RTT  │
│    ──────────────   │                    │
│    Total  3 RTT     │                    │
└─────────────────────┴────────────────────┘

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
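
&lt;p&gt;At the 150ms Seoul-to-US RTT from Part 2, the table turns into concrete waiting time. A quick sketch of what each column costs before the first byte of application data moves:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RTT_MS = 150  # Seoul to a US server, from Part 2

def handshake_cost(round_trips):
    # Time spent on negotiation before any payload flows.
    return round_trips * RTT_MS

print(handshake_cost(3.0))  # 450.0 ms: TCP (1.5 RTT) + TLS (1.5 RTT)
print(handshake_cost(1.0))  # 150.0 ms: QUIC, first visit
print(handshake_cost(0.0))  #   0.0 ms: QUIC, return visit (0-RTT)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;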



&lt;p&gt;The trade-off is real. By abandoning TCP, the application now owns packet loss handling, connection reliability, and security. Simpler to use. Far more complex underneath.&lt;/p&gt;

&lt;p&gt;This is what the Innovator's Dilemma calls disruptive innovation. Not an improvement on what existed — a replacement of it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://www.chromium.org/quic/" rel="noopener noreferrer"&gt;Chromium: QUIC&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;ref. &lt;a href="https://www.rfc-editor.org/rfc/rfc9000" rel="noopener noreferrer"&gt;RFC 9000: QUIC&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  How the Three Versions Compare
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP/1.1  — Requests must wait in line
time →  t1     t2     t3     t4     t5
R1     [====] [====]
R2                   [====] [====]
R3                                 [====]

R2 can't start until R1 finishes. R3 can't start until R2 finishes.
──────────────────────────────────────────
HTTP/2  — One lost packet freezes everything
time →  t1     t2     t3     t4     t5
R1      [====] [====] [====]
R2      [====] ✕ ← packet lost
R3      [====]        ← waiting...  ← waiting...

When ✕ occurs, R1, R2, and R3 all freeze.
──────────────────────────────────────────
HTTP/3  — Only the affected stream pauses
time →  t1     t2     t3     t4     t5
R1      [====] [====] [====] [====]
R2      [====] ✕ ← packet lost        [retransmit]
R3      [====] [====] [====] [====] ← keeps moving

When ✕ occurs, only R2 pauses. R1 and R3 continue.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each version made a different call on where to absorb the cost.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Path Dependency and the Innovator's Dilemma
&lt;/h3&gt;

&lt;p&gt;The evolution of HTTP is a case study in two management concepts that explain why the right answer takes so long to arrive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path Dependency.&lt;/strong&gt; Every router and firewall on the internet was optimized for TCP. Even when better alternatives existed, leaving TCP behind wasn't just a technical decision — it was an infrastructure negotiation with the entire global network. A choice made in 1981 constrained technical decisions well into the 2020s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Innovator's Dilemma.&lt;/strong&gt; HTTP/1.1 to HTTP/2 was sustaining innovation — improving performance while staying within the existing system. Safe, but structurally limited. HTTP/3 was disruptive innovation — abandoning the standard entirely and rebuilding on a new foundation. Risky, but the only way to break the ceiling.&lt;/p&gt;

&lt;p&gt;Where the two concepts meet is HTTP/3 itself. Path Dependency explains why it took 40 years. The Innovator's Dilemma explains why it had to happen at all.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://www.hbs.edu/faculty/Pages/item.aspx?num=46" rel="noopener noreferrer"&gt;Clayton Christensen: The Innovator's Dilemma&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  The Bottom Line
&lt;/h3&gt;

&lt;p&gt;HTTP/1.1 trimmed the negotiation fee. HTTP/2 increased the transaction density. HTTP/3 rejected the legacy constraints entirely to win back speed. The evolution of protocols isn't a search for the right answer. It's a history of choosing the best trade-off for the era.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improve along the path, or change the path itself. That question doesn't only apply to protocols.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next up: even with HTTP/3 handling requests efficiently, traffic still has to land somewhere. When tens of thousands of requests arrive at once, something has to decide where they go. That's the load balancer — and the choice between L4 and L7 turns out to be another trade-off worth understanding.&lt;/p&gt;

</description>
      <category>largescale</category>
      <category>network</category>
      <category>backend</category>
      <category>http</category>
    </item>
    <item>
      <title>Network Part 2 - The Cost of a TCP Handshake</title>
      <dc:creator>Dayul Lee</dc:creator>
      <pubDate>Fri, 01 May 2026 08:50:01 +0000</pubDate>
      <link>https://forem.com/lukyday007/prologue-what-is-large-scale-processing-1jd9</link>
      <guid>https://forem.com/lukyday007/prologue-what-is-large-scale-processing-1jd9</guid>
      <description>&lt;p&gt;Published: April 13, 2026&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Picture a ticket drop. Tens of thousands of people clicking at the same moment. The server collapses — before a single byte of real data has moved. That's the strange part. Nothing was actually exchanged yet. So what wore the server out?&lt;/p&gt;

&lt;p&gt;There's work that has to happen before the data can flow. Until that work is done, the transaction hasn't started. It can't.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Every transaction has a setup cost. Verifying the other party. Aligning on terms. Confirming readiness on both sides. The more a transaction depends on trust, the more that setup costs.&lt;/p&gt;

&lt;p&gt;Networks are no different. Before data can move reliably, both sides have to establish a shared understanding — that packets will arrive, that order will be preserved, that nothing will go missing without a response. None of that is free. It takes time. And that time adds up faster than most people expect.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  TCP's Call — Contract Before Data
&lt;/h3&gt;

&lt;p&gt;TCP made a deliberate choice: &lt;strong&gt;no data moves until a contract is in place.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both sides exchange signals to confirm they're ready. Until that exchange completes, not a single byte of actual payload is transmitted. The entire window is spent on process. On paperwork.&lt;/p&gt;

&lt;p&gt;That's the handshake. Trust purchased at the cost of speed.&lt;/p&gt;

&lt;p&gt;The deeper problem is that this contract doesn't carry over. Every new connection starts from scratch. One user, one handshake — manageable. Ten thousand users hitting the server simultaneously — the setup cost alone is enough to bring it down, before the real work has even begun.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client                            Server
  |                                 |
  |  ———————————— SYN ——————————&amp;gt;   |  "Can we connect?"
  |                                 |
  |  &amp;lt;————————— SYN-ACK —————————   |  "Yes. Are you ready?"
  |                                 |
  |  ———————————— ACK ——————————&amp;gt;   |  "Ready. Let's go."
  |                                 |
  |       [ Data transfer ]         |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three signals. Only then does data flow. SYN and SYN-ACK are the negotiation. ACK is the signature. The actual transaction — the data — doesn't start until all three are done.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  The Bill Comes Twice
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Opening the connection — RTT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The time it takes to complete the handshake is tied to physical distance. Seoul to a US server: roughly 150ms per round trip. That's RTT — Round Trip Time.&lt;/p&gt;

&lt;p&gt;TCP needs at least 1.5 round trips before the server receives the first byte of data. From the moment a user clicks to the moment the server registers what was sent: &lt;strong&gt;225ms (150ms × 1.5)&lt;/strong&gt; is already gone.&lt;/p&gt;

&lt;p&gt;One request, 225ms. Ten thousand concurrent users, ten thousand instances of that cost. No amount of server-side optimization touches it. RTT is a fixed cost — locked to physics, not infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing the connection — TIME_WAIT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The bill doesn't stop when the connection ends. TCP holds a closed port in TIME_WAIT for up to two minutes. The reason is defensive: late-arriving packets from the old connection shouldn't collide with a new one using the same port.&lt;/p&gt;

&lt;p&gt;From Part 1: roughly 28,000 ports are available. A server handling 500 connections per second will accumulate 60,000 TIME_WAIT ports (500/s × 120s) before the two-minute window clears. That's more than double the limit. New connections stop being possible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;After connection closes:

Port 5001  [TIME_WAIT ——————————— 2 min ———————————]
Port 5002  [TIME_WAIT ——————————— 2 min ———————————]
Port 5003  [TIME_WAIT ——————————— 2 min ———————————]
  ...
Port 5028  [TIME_WAIT ——————————— 2 min ———————————]

→ 28,000 ports exhausted. No new connections accepted.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
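
&lt;p&gt;The arithmetic is worth writing out once. A sketch using the series' working numbers (the 28,000-port figure from Part 1 and the two-minute TIME_WAIT window):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PORTS_AVAILABLE = 28_000   # working figure from Part 1
TIME_WAIT_SECONDS = 120    # how long a closed port stays reserved
CLOSES_PER_SECOND = 500    # connections the server closes each second

# Every port closed within the last two minutes is still unavailable.
ports_stuck = CLOSES_PER_SECOND * TIME_WAIT_SECONDS
print(ports_stuck)                      # 60000
print(ports_stuck &amp;gt; PORTS_AVAILABLE)    # True: exhausted before the window clears
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;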



&lt;p&gt;RTT is the cost of opening. TIME_WAIT is the cost of closing. TCP charges on both ends.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/" rel="noopener noreferrer"&gt;Cloudflare Learning: What is round-trip time?&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;ref. &lt;a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/" rel="noopener noreferrer"&gt;Cloudflare Learning: What is TCP/IP?&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Transaction Cost Theory and Keep-Alive
&lt;/h3&gt;

&lt;p&gt;In 1937, Ronald Coase famously asked why firms exist at all if markets are so efficient. His answer was simple: markets aren't frictionless. Every transaction demands a hidden tax: the cost of searching for partners, negotiating terms, and signing contracts. TCP is a network implementation of that idea. Every connection comes with a negotiation fee. And Transaction Cost Theory points to exactly one solution.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://www.ubs.com/microsites/nobel-perspectives/en/laureates/oliver-williamson.html" rel="noopener noreferrer"&gt;UBS Nobel Perspectives: Oliver Williamson — Transaction Cost Theory&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most effective way to reduce transaction costs is to reduce the number of transactions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not faster contracts — fewer contracts. That's the logic behind HTTP Keep-Alive. Instead of opening and closing a connection for every request, Keep-Alive holds the connection open across multiple requests. The handshake cost gets distributed — not paid once per request, but once per session.&lt;/p&gt;
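
&lt;p&gt;In application code, that logic is often one object away. A sketch with the third-party &lt;code&gt;requests&lt;/code&gt; library, where a &lt;code&gt;Session&lt;/code&gt; reuses one underlying connection across requests instead of renegotiating each time:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests  # third-party: pip install requests

# Without a session: each call may open, handshake, and close
# its own connection.
for path in ["/a", "/b", "/c"]:
    requests.get("https://example.com" + path)

# With a session: one connection, kept alive, shared by all three
# requests. The handshake is paid once per session, not per request.
with requests.Session() as s:
    for path in ["/a", "/b", "/c"]:
        s.get("https://example.com" + path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;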

&lt;p&gt;Keep-Alive solved the contract problem. But it didn't solve everything. Even inside a persistent connection, there was still a hard rule: requests had to be handled in the order they arrived. Fix the negotiation fee, and suddenly the queue itself becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;That's where the next part picks up.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  The Bottom Line
&lt;/h3&gt;

&lt;p&gt;The TCP handshake is the price of trust. RTT is what you pay to open a connection. TIME_WAIT is what you owe after closing one. In the language of Transaction Cost Theory, TCP is a protocol that charges a negotiation fee on every single connection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The optimization insight isn't "connect faster." It's "connect less." HTTP has been moving in that direction ever since.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next up: HTTP/1.1 solved the connection frequency problem with Keep-Alive. Then it ran into a different wall. One queue, no passing. Fix the engine — and suddenly the road is one lane.&lt;/p&gt;

</description>
      <category>largescale</category>
      <category>network</category>
      <category>backend</category>
      <category>tcp</category>
    </item>
    <item>
      <title>Network Part 1 - The OSI Model as a Fault Map</title>
      <dc:creator>Dayul Lee</dc:creator>
      <pubDate>Fri, 01 May 2026 07:09:21 +0000</pubDate>
      <link>https://forem.com/lukyday007/network-part-1-the-osi-model-as-a-fault-map-2846</link>
      <guid>https://forem.com/lukyday007/network-part-1-the-osi-model-as-a-fault-map-2846</guid>
      <description>&lt;p&gt;Published: March 27, 2026&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In a previous post, we watched a single DNS misconfiguration on one AWS server bring 3,500 companies across 60 countries to a standstill. DNS lives at Layer 7. The failure started there.&lt;/p&gt;

&lt;p&gt;This kind of thing repeats. On June 21, 2022, a misconfigured BGP route at Cloudflare knocked 19 of its data centers offline, cutting off roughly half the traffic on its network. No server was overloaded. No deployment had gone wrong. Packets simply lost their way. This time, the failure was at Layer 3.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both incidents share one thing: it took far too long to find the cause. Because no one knew which layer had failed.&lt;/p&gt;

&lt;p&gt;The OSI model is not a taxonomy for networking textbooks. &lt;strong&gt;It's a fault map — a way to pinpoint exactly where a system breaks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/" rel="noopener noreferrer"&gt;Cloudflare Blog: Cloudflare outage on June 21, 2022&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;



&lt;h3&gt;
  
  
  Why the Layers Don't Talk to Each Other
&lt;/h3&gt;

&lt;p&gt;Before the fault map makes sense, this question needs an answer. Why does the OSI model split into 7 layers at all? Wouldn't it be more efficient if each layer could see what the others were doing?&lt;/p&gt;

&lt;p&gt;In 1968, software engineer Melvin Conway proposed what has since become foundational in systems design:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure."&lt;/em&gt; — Conway's Law&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The OSI model is that principle applied to network architecture. Each layer communicates only through a defined interface. Internal implementation stays private. Layer 4 has no idea whether Layer 7 is speaking HTTP or gRPC. Layer 3 doesn't know — or care — whether Layer 4 is TCP or UDP.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;deliberate ignorance&lt;/strong&gt;. And that ignorance buys two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freedom to change:&lt;/strong&gt; Migrating from HTTP/1.1 to HTTP/2 happens entirely within Layer 7. Everything below stays untouched. The layers are decoupled by design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault isolation:&lt;/strong&gt; A routing failure at Layer 3 has no bearing on your application logic at Layer 7. The blast radius is contained to one layer.&lt;/li&gt;
&lt;/ul&gt;
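
&lt;p&gt;Both payoffs come from the same place: what each layer is even allowed to see. A Python sketch, with hypothetical field values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The same in-flight request, viewed from three layers.
l3_view = {"src_ip": "10.0.0.5", "dst_ip": "93.184.216.34"}      # enough to route
l4_view = {"src_port": 51234, "dst_port": 443, "proto": "TCP"}   # enough to track a connection
l7_view = {"method": "GET", "path": "/checkout", "host": "shop.example"}  # full context

# L4 logic can be written, tested, and replaced without ever touching l7_view,
# and vice versa. That boundary is what makes "which layer broke?" answerable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;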

&lt;p&gt;That's why the Cloudflare outage could be called "a Layer 3 problem" immediately. Without the layered design, the cause would have been buried somewhere in the full stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each layer chose not to know the others. That's exactly what makes it possible to know which layer broke.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://martinfowler.com/bliki/ConwaysLaw.html" rel="noopener noreferrer"&gt;Martin Fowler: Conway's Law&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;



&lt;h3&gt;
  
  
  Every Layer Has Its Own Breaking Point
&lt;/h3&gt;

&lt;p&gt;Goldratt's Theory of Constraints is direct: the output of any system is capped by its weakest link. Networks are no exception. But the &lt;em&gt;nature&lt;/em&gt; of the bottleneck changes depending on which layer you're looking at.&lt;/p&gt;

&lt;p&gt;Packets travel down from L7 to L1 on the sender's side — each layer wrapping the data in its own envelope. On the receiving end, they unwrap back up from L1 to L7. Seven layers. Seven handoffs. Under high-volume traffic, one of those handoffs will crack first. The question is which one, and why.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;L4 — Speed was the goal. Awareness was the price.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Layer 4 is deliberately blind to content. It sees an IP address, a port number, a protocol — TCP or UDP — and nothing else. It never opens the packet. Think of it as a courier that delivers sealed envelopes without knowing what's inside. That's why it's fast.&lt;/p&gt;

&lt;p&gt;But that choice has structural consequences. Every TCP connection occupies a port. Port numbers top out at 65,535 — with a realistic working range of around 28,000. Once concurrent connections hit that ceiling, the system stops accepting new ones. No exceptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L4's bottleneck is connection count.&lt;/strong&gt; Ticketing drops, flash sales, live-streamed events — any scenario where thousands of users connect simultaneously runs straight into this wall.&lt;/p&gt;
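
&lt;p&gt;On Linux, that ceiling is directly inspectable. A Python sketch (the sysctl path is the standard Linux interface; the exact range varies by distribution):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The kernel's ephemeral port range: ports available for outbound connections.
# On many distributions the default is 32768-60999, roughly 28,000 ports.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())

print(f"usable ephemeral ports: {high - low + 1}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;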



&lt;p&gt;&lt;strong&gt;L7 — Awareness was the goal. Speed was the price.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Layer 7 sees everything: HTTP headers, URL paths, cookies, request bodies. It reads the packet, understands the context, and makes decisions accordingly. That's enormously powerful.&lt;/p&gt;

&lt;p&gt;But that knowledge is expensive. Parsing takes time. Authentication takes time. Decompression, routing logic, business rules — they all stack. The per-request Logic Latency at L7 is higher than anywhere below it by design. As traffic scales, those costs don't just add — they compound.&lt;/p&gt;

&lt;p&gt;L4 stays blind and stays fast. L7 stays aware and pays for it. Neither is a flawed design. They made different trade-offs.&lt;/p&gt;



&lt;p&gt;Pull back across the layers, and the picture looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         Rate of Saturation → 100%
L7  [████████████░░]  Logic Latency spikes   ← felt first
L4  [███████░░░░░░░]  Concurrency ceiling
L3  [█████░░░░░░░░░]  Routing overhead
L1  [███░░░░░░░░░░░]  Throughput saturation  ← when this goes, everything goes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;L7 hits the wall first. L1 going down means nothing gets through at all. Under high-volume load, there's only one question that matters: &lt;strong&gt;which layer is closest to 100% Saturation right now?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to resolve L4 and L7 bottlenecks in practice — that's Part 4 (Load Balancers).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://sre.google/sre-book/monitoring-distributed-systems/" rel="noopener noreferrer"&gt;Google SRE Book: Monitoring Distributed Systems&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;ref. &lt;a href="https://www.rfc-editor.org/rfc/rfc793" rel="noopener noreferrer"&gt;RFC 793: Transmission Control Protocol&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;



&lt;h3&gt;
  
  
  The Bottom Line
&lt;/h3&gt;

&lt;p&gt;The OSI model isn't a protocol classification system. Each layer is an independent failure candidate with its own breaking point. And the reason those layers exist in the first place is itself a trade-off — give up awareness to gain speed, or give up speed to gain awareness.&lt;/p&gt;

&lt;p&gt;The layer where Saturation hits 100% first is the constraint. The boundaries between layers are what make that constraint findable — and fixable — without touching everything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineers who understand this don't panic when something breaks. They don't touch the whole system. They ask which layer. Then they fix that layer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next up: Layer 4, up close. We'll look at the hidden cost of TCP's 3-way handshake — the process every connection must complete before a single byte of real data moves. Under load, that turns out to be anything but cheap.&lt;/p&gt;

</description>
      <category>osi7</category>
      <category>network</category>
      <category>backend</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Prologue - What is Large-scale Processing?</title>
      <dc:creator>Dayul Lee</dc:creator>
      <pubDate>Fri, 01 May 2026 06:58:27 +0000</pubDate>
      <link>https://forem.com/lukyday007/prologue-what-is-large-scale-processing-ma1</link>
      <guid>https://forem.com/lukyday007/prologue-what-is-large-scale-processing-ma1</guid>
      <description>&lt;p&gt;Published: March 18, 2026&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;3 AM, October 2025. A single DNS configuration error on an AWS server brought Snapchat, Roblox, and McDonald's to a standstill. 3,500 companies across 60 countries were stopped cold by one small crack.&lt;/p&gt;

&lt;p&gt;Systems are far more fragile than we think. Large-scale processing isn't a trend about boosting server specs. It's the engineering discipline that keeps services alive at the edge of their limits.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;So where does "large-scale" actually begin? 10,000 users? A million? That's the wrong question. Large-scale isn't a number. &lt;strong&gt;It's the moment a system hits the ceiling of its available resources.&lt;/strong&gt; That's why what's a normal Tuesday for Amazon can be a catastrophe for a growing startup.&lt;/p&gt;

&lt;p&gt;This series is about how to detect that ceiling, understand why systems break, and build things that hold.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  The Signals Before a System Breaks
&lt;/h3&gt;

&lt;p&gt;Systems don't collapse without warning. There are always signs. The Google SRE team calls them the &lt;strong&gt;Four Golden Signals&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https:https://sre.google/sre-book/monitoring-distributed-systems/" rel="noopener noreferrer"&gt;Google SRE: Monitoring Distributed Systems&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; How long does it take to handle a request? A gap between successful and failed response times is often the first sign something's wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic:&lt;/strong&gt; How much demand is hitting the system right now? Think RPS — requests per second.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors:&lt;/strong&gt; How many requests are failing? Explicit 500s, silent wrong responses — both count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Saturation:&lt;/strong&gt; How "full" is the system? This is the most direct signal of large-scale stress. When latency starts climbing, saturation is usually already on its way up.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any one of these looks off, the system is already approaching its limit.&lt;/p&gt;
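
&lt;p&gt;What measuring the four signals can look like, as a minimal Python sketch over one monitoring window (the record shape and capacity figure are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class Req:                 # hypothetical per-request record
    latency_ms: float
    status: int

def golden_signals(window: list[Req], window_s: float, capacity_rps: float):
    latencies = sorted(r.latency_ms for r in window)
    p99 = latencies[int(0.99 * (len(window) - 1))]                 # Latency
    rps = len(window) / window_s                                   # Traffic
    errors = sum(r.status &gt;= 500 for r in window) / len(window)    # Errors
    saturation = rps / capacity_rps                                # Saturation
    return p99, rps, errors, saturation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;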

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  So What Exactly is "Large"?
&lt;/h3&gt;

&lt;p&gt;The Golden Signals tell you the &lt;em&gt;state&lt;/em&gt; of a system. But to actually fix things, you need to understand the &lt;em&gt;nature&lt;/em&gt; of the load. The same word — "large-scale" — means something completely different depending on what's overwhelming the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traffic (Too many requests)&lt;/strong&gt;&lt;br&gt;
How many requests per unit time? How many connections can the system hold?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TPS / QPS:&lt;/strong&gt; Transactions or queries per second. The real measure of system throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency:&lt;/strong&gt; Simultaneous active connections. The deciding factor during flash sales or ticketing rushes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Volume (Too much data)&lt;/strong&gt;&lt;br&gt;
How large is the data, and how fast does it need to move?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; Data transferred per second (MB/s). The usual bottleneck in video streaming or large file uploads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Complexity (Too hard to process)&lt;/strong&gt;&lt;br&gt;
How much computation does a single request require? How many systems does it touch?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logic Latency:&lt;/strong&gt; The more complex the logic, the slower the response — and the faster saturation spikes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real outages usually involve all three at once. But if you can't separate the causes, you can't fix them.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Where Business Thinking Meets Engineering
&lt;/h3&gt;

&lt;p&gt;Picture a factory floor. One slow machine holds up the entire line. It doesn't matter how fast everything else runs.&lt;/p&gt;

&lt;p&gt;Eliyahu M. Goldratt formalized this as the &lt;strong&gt;Theory of Constraints (TOC)&lt;/strong&gt;: &lt;em&gt;"The throughput of any system is determined by its weakest link — the constraint."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://www.lean.org/lexicon-terms/theory-of-constraints/" rel="noopener noreferrer"&gt;Lean Enterprise Institute: TOC&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Servers work the same way. &lt;strong&gt;The point where Saturation hits 100% first — that's the Bottleneck.&lt;/strong&gt; Large-scale engineering is about finding which component saturates first as traffic grows, then eliminating that constraint with the right strategy.&lt;/p&gt;
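
&lt;p&gt;In code, finding the constraint is one question: which component is closest to full? A Python sketch with hypothetical utilization numbers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Saturation per component, as a fraction of capacity (hypothetical numbers).
saturation = {"web": 0.42, "app": 0.61, "db": 0.93, "cache": 0.37}

bottleneck = max(saturation, key=saturation.get)
print(bottleneck)  # 'db' -- the only place where optimization moves the needle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;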

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  When You Hit a Wall
&lt;/h3&gt;

&lt;p&gt;Once you've found the bottleneck, you need to increase capacity. There are two ways to do it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://www.geeksforgeeks.org/overview-of-scaling-vertical-and-horizontal-scaling/" rel="noopener noreferrer"&gt;GeeksforGeeks: Vertical and Horizontal Scaling&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vertical Scaling (Scale-up):&lt;/strong&gt; Upgrade the single node — more CPU, more RAM. Fast to implement, but there's a ceiling. And it's expensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Scaling (Scale-out):&lt;/strong&gt; Add more nodes and distribute the load. More complex, but theoretically limitless.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  [Single Server]         [Multiple Servers]

  ┌─────────────┐         ┌───┐ ┌───┐ ┌───┐
  │   CPU ↑↑↑   │         │ S │ │ S │ │ S │
  │   RAM ↑↑↑   │    →    │ 1 │ │ 2 │ │ 3 │
  │   SSD ↑↑↑   │         └───┘ └───┘ └───┘
  └─────────────┘           Load Balancer

      Scale-up                Scale-out
    (Has limits)        (Infinitely expandable)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scale-up buys simplicity at the cost of a ceiling.&lt;br&gt;
Scale-out removes the ceiling at the cost of complexity.&lt;br&gt;
Neither is the right answer. There's only the right trade-off for the constraint you're solving.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  What's Next
&lt;/h3&gt;

&lt;p&gt;At the end of the day, large-scale processing is &lt;strong&gt;Strategic Bottleneck Management&lt;/strong&gt; — controlling &lt;strong&gt;Latency&lt;/strong&gt; and &lt;strong&gt;Errors&lt;/strong&gt; by managing &lt;strong&gt;Saturation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ref. &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next up: a single HTTP request makes its way to a server by passing through 7 layers — the &lt;strong&gt;OSI model&lt;/strong&gt;. We'll trace that journey and see exactly where large-scale traffic creates bottlenecks at each layer, and what engineers have done about it.&lt;/p&gt;

</description>
      <category>largescale</category>
      <category>network</category>
      <category>backend</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
