<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: shubham pandey (Connoisseur)</title>
    <description>The latest articles on Forem by shubham pandey (Connoisseur) (@shubham_pandeyconnoisse).</description>
    <link>https://forem.com/shubham_pandeyconnoisse</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3812229%2F56d04a32-f228-4078-aad7-a44ab3c427a9.png</url>
      <title>Forem: shubham pandey (Connoisseur)</title>
      <link>https://forem.com/shubham_pandeyconnoisse</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shubham_pandeyconnoisse"/>
    <language>en</language>
    <item>
      <title>Designing Google Maps / Location Service at Scale: A System Design Deep Dive — Question by Question</title>
      <dc:creator>shubham pandey (Connoisseur)</dc:creator>
      <pubDate>Wed, 25 Mar 2026 08:15:04 +0000</pubDate>
      <link>https://forem.com/shubham_pandeyconnoisse/designing-google-maps-location-service-at-scalea-system-design-deep-dive-question-by-question-a7e</link>
      <guid>https://forem.com/shubham_pandeyconnoisse/designing-google-maps-location-service-at-scalea-system-design-deep-dive-question-by-question-a7e</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Google Maps gives 1 billion users turn-by-turn navigation, real-time traffic, and instant place search at the same time. On the surface it is just a map with directions. Underneath it is one of the most sophisticated distributed systems ever built — spanning graph algorithms, real-time data pipelines, tile rendering, geographic search, and live traffic computation all working together seamlessly. This post walks through every challenge question by question, including wrong turns and how to navigate out of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 1: Representing Earth's Road Network
&lt;/h2&gt;

&lt;p&gt;Interview Question: The entire road network of Earth has billions of roads and intersections. Finding the shortest path between two points on a graph this size seems computationally impossible in under 1 second. How do you represent the road network and what algorithm finds the shortest path?&lt;/p&gt;

&lt;p&gt;Solution: A weighted directed graph with Dijkstra's algorithm as the foundation.&lt;/p&gt;

&lt;p&gt;Road network as a graph:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every intersection is a node — roughly 50 million nodes globally&lt;/li&gt;
&lt;li&gt;Every road segment between intersections is a directed edge — roughly 100 million edges&lt;/li&gt;
&lt;li&gt;Each edge has weights — distance, speed limit, road type, turn restrictions&lt;/li&gt;
&lt;li&gt;One-way roads are single directed edges — two-way roads are two edges in opposite directions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dijkstra's algorithm finds the shortest path by exploring nodes in order of cumulative cost from the source. It guarantees the optimal path in any graph with non-negative edge weights.&lt;/p&gt;
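&lt;p&gt;A minimal sketch of the algorithm in Python — the three-node graph literal below is illustrative, not real map data:&lt;/p&gt;

```python
import heapq

def dijkstra(graph, source, target):
    """Shortest path over {node: [(neighbor, weight), ...]} with non-negative weights."""
    settled = {}                      # node -> final shortest distance from source
    pq = [(0, source)]                # min-heap of (cost so far, node)
    while pq:
        d, u = heapq.heappop(pq)
        if u in settled:              # stale heap entry: u was already settled cheaper
            continue
        settled[u] = d
        if u == target:
            return d
        for v, w in graph.get(u, []):
            if v not in settled:
                heapq.heappush(pq, (d + w, v))
    return None                       # target unreachable

roads = {"A": [("B", 4), ("C", 1)], "C": [("B", 2)], "B": []}
```

&lt;p&gt;Here dijkstra(roads, "A", "B") settles C first and finds the cheaper two-hop path through it rather than the direct edge.&lt;/p&gt;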

&lt;p&gt;Why Dijkstra alone fails at Earth scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Graph has 50 million nodes and 100 million edges&lt;/li&gt;
&lt;li&gt;Dijkstra complexity is O(E log V) — roughly 100 million × log₂(50 million) ≈ 2.6 billion operations&lt;/li&gt;
&lt;li&gt;On a single machine this takes several minutes per query&lt;/li&gt;
&lt;li&gt;Google Maps responds in under 1 second for 1 billion daily queries&lt;/li&gt;
&lt;li&gt;Dijkstra on raw Earth graph is completely unworkable at this scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Dijkstra is the correct algorithmic foundation but cannot scale to Earth's road network without significant optimization. The graph structure and search space must be dramatically reduced.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 2: Fast Routing with Contraction Hierarchies
&lt;/h2&gt;

&lt;p&gt;Interview Question: Dijkstra exploring all roads equally for a Mumbai to Delhi query wastes enormous computation on tiny side streets that could never be part of the optimal route. How do you make routing dramatically faster?&lt;/p&gt;

&lt;p&gt;Navigation: The key insight is that humans navigate hierarchically — for long distances you immediately think of major highways, not local streets. The routing algorithm should do the same. Structure the road network into importance levels and only search relevant levels for each query distance.&lt;/p&gt;

&lt;p&gt;Solution: Contraction Hierarchies — hierarchical road network with pre-computed shortcuts.&lt;/p&gt;

&lt;p&gt;Road hierarchy levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Level 1 — Local roads — walking distance queries&lt;/li&gt;
&lt;li&gt;Level 2 — City roads — intra-city queries&lt;/li&gt;
&lt;li&gt;Level 3 — State highways — inter-city queries&lt;/li&gt;
&lt;li&gt;Level 4 — National highways — long distance queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Routing by distance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mumbai to Delhi — 1400km — search Level 4 national highways only — find path instantly&lt;/li&gt;
&lt;li&gt;Mumbai to Pune — 150km — Level 4 has no direct path — search Level 3 state highways — find path&lt;/li&gt;
&lt;li&gt;Street to nearby mall — search Level 2 city roads — find path&lt;/li&gt;
&lt;li&gt;Walking to neighbor — search Level 1 local roads — find path&lt;/li&gt;
&lt;/ul&gt;
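&lt;p&gt;The level-selection rule above can be sketched as a lookup keyed by query distance — the breakpoints below are illustrative guesses, not Google's actual tuning:&lt;/p&gt;

```python
import bisect

# lowest road level worth searching, by straight-line query distance (km)
LEVEL_BREAKS_KM = [2, 30, 300]
LEVELS = ["local", "city", "state_highway", "national_highway"]

def search_level(distance_km):
    """Walking queries stay on local roads; Mumbai to Delhi goes straight to highways."""
    return LEVELS[bisect.bisect(LEVEL_BREAKS_KM, distance_km)]
```

&lt;p&gt;search_level(1400) resolves to national highways and search_level(150) to state highways, matching the query examples above.&lt;/p&gt;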

&lt;p&gt;Pre-computed shortcuts:&lt;br&gt;
Google does not compute hierarchies at query time. Shortcuts between important nodes are pre-computed offline during map processing. The algorithm contracts less important nodes and creates direct shortcut edges between important nodes that bypass them. Query time just looks up these shortcuts rather than exploring the full graph.&lt;/p&gt;

&lt;p&gt;Result: Route query time drops from several minutes to milliseconds. A Mumbai to Delhi query explores only a few thousand highway nodes instead of 50 million total nodes.&lt;/p&gt;

&lt;p&gt;Key Insight: Contraction Hierarchies exploit the natural importance hierarchy of road networks. Long distance routing never touches local roads. Pre-computed shortcuts transform an intractable global graph problem into a fast hierarchical lookup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 3: Real Time Traffic Data Pipeline
&lt;/h2&gt;

&lt;p&gt;Interview Question: 500 million Android phones with Google Maps open send GPS location every few seconds. Google must process this to detect traffic patterns in near real time. How do you design the pipeline that converts billions of location updates into real time traffic conditions?&lt;/p&gt;

&lt;p&gt;Solution: Kafka streaming pipeline with Traffic Computation Service and Redis storage.&lt;/p&gt;

&lt;p&gt;Full traffic pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;500 million phones send GPS coordinates every few seconds&lt;/li&gt;
&lt;li&gt;Kafka receives billions of location events per day&lt;/li&gt;
&lt;li&gt;Traffic Computation Service consumes from Kafka&lt;/li&gt;
&lt;li&gt;Groups location updates by road segment using map matching&lt;/li&gt;
&lt;li&gt;Computes average speed of all phones on each road segment&lt;/li&gt;
&lt;li&gt;Compares against normal free flow speed for that segment&lt;/li&gt;
&lt;li&gt;Classifies traffic status — FREE FLOW above 80 percent of speed limit, SLOW between 40 and 80 percent, CONGESTED below 40 percent&lt;/li&gt;
&lt;li&gt;Updates Redis with current traffic status per segment&lt;/li&gt;
&lt;li&gt;Pushes traffic updates to affected users via WebSocket&lt;/li&gt;
&lt;/ul&gt;
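&lt;p&gt;The classification step reduces to a small pure function over the percentages already listed — a sketch, with the crowd's speeds averaged per segment:&lt;/p&gt;

```python
import bisect
from statistics import mean

STATUSES = ["CONGESTED", "SLOW", "FREE_FLOW"]
BREAKS = [0.4, 0.8]                # fractions of the free-flow speed

def classify_segment(observed_speeds_kmh, free_flow_kmh):
    """Average the crowd's speeds, compare against free flow, bucket into a status."""
    ratio = mean(observed_speeds_kmh) / free_flow_kmh
    return STATUSES[bisect.bisect(BREAKS, ratio)]
```

&lt;p&gt;classify_segment([12, 15, 10], 60) lands well under 40 percent of free flow and comes back CONGESTED.&lt;/p&gt;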

&lt;p&gt;Redis traffic data structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Key is road segment ID&lt;/li&gt;
&lt;li&gt;Value contains current speed, traffic status, and last updated timestamp&lt;/li&gt;
&lt;li&gt;Sub-millisecond reads for route computation and map rendering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;WebSocket push to users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic status changes on a segment&lt;/li&gt;
&lt;li&gt;Notification Service identifies users currently navigating through that segment&lt;/li&gt;
&lt;li&gt;Pushes rerouting suggestion via WebSocket instantly&lt;/li&gt;
&lt;li&gt;User sees updated route without refreshing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Kafka decouples data collection from processing. Traffic Computation Service processes billions of events asynchronously without ever blocking the phones sending data. Redis provides sub-millisecond traffic status reads for real time route computation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 4: GPS Noise Filtering for Accurate Traffic
&lt;/h2&gt;

&lt;p&gt;Interview Question: Phone GPS has 5 to 15 meter accuracy. A stationary phone looks identical to a traffic jam. A phone moving slowly through a school zone looks like congestion. With billions of noisy GPS points how do you ensure traffic computation reflects actual road conditions?&lt;/p&gt;

&lt;p&gt;Solution: Three layer noise filtering pipeline.&lt;/p&gt;

&lt;p&gt;Layer 1 — Crowd sourced statistical averaging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single phone showing slow speed on a segment — could be anything — ignore&lt;/li&gt;
&lt;li&gt;Minimum threshold of phones needed before declaring traffic status — typically 5 to 10 phones&lt;/li&gt;
&lt;li&gt;Statistical outlier trimming removes readings that deviate significantly from segment average&lt;/li&gt;
&lt;li&gt;Genuine congestion shows up across many phones simultaneously — impossible to fake with noise&lt;/li&gt;
&lt;/ul&gt;
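&lt;p&gt;Layer 1 in miniature — a minimum-crowd check plus simple outlier trimming; the thresholds and the trim-one-from-each-end policy are illustrative simplifications:&lt;/p&gt;

```python
from statistics import mean

MIN_PHONES = 5     # fewer reports than this: emit no traffic signal at all

def segment_speed(speeds_kmh):
    """Crowd-sourced estimate: require a minimum crowd, drop the extreme readings."""
    if len(speeds_kmh) in range(MIN_PHONES):      # i.e. fewer than MIN_PHONES samples
        return None
    trimmed = sorted(speeds_kmh)[1:-1]            # trim the single lowest and highest
    return mean(trimmed)
```

&lt;p&gt;A lone stopped phone returns no signal at all, and one GPS-glitched 200 km/h reading cannot drag the average up.&lt;/p&gt;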

&lt;p&gt;Layer 2 — Red light detection versus genuine jam:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All phones on segment stopped for 30 to 60 seconds then resuming normal speed — red light — ignore, not congestion&lt;/li&gt;
&lt;li&gt;All phones stopped for 5 or more minutes with sustained slow movement — genuine traffic jam — flag as congested&lt;/li&gt;
&lt;li&gt;Pattern recognition on stop duration and subsequent speed distinguishes red lights from jams reliably&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Layer 3 — Map matching from GPS coordinates to road segments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raw GPS coordinates snapped to nearest road segment using road network graph&lt;/li&gt;
&lt;li&gt;Eliminates readings from parallel footpaths, buildings, parking lots&lt;/li&gt;
&lt;li&gt;Only road-snapped coordinates contribute to traffic computation&lt;/li&gt;
&lt;li&gt;Same Viterbi map matching algorithm from Uber design applies here&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: No single GPS reading is trusted. Only crowd sourced patterns across many phones over time produce reliable traffic signals. Statistical averaging, temporal pattern recognition, and map matching together filter noise to near zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 5: Map Tile Rendering
&lt;/h2&gt;

&lt;p&gt;Interview Question: Earth has billions of geographic features. At different zoom levels users see different detail. How do you structure map data so you serve only the small portion a user is currently viewing at their specific zoom level?&lt;/p&gt;

&lt;p&gt;Solution: Tile pyramid with pre-rendered cached tiles.&lt;/p&gt;

&lt;p&gt;Tile pyramid structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;World map divided into a grid of square tiles — each 256 by 256 pixels&lt;/li&gt;
&lt;li&gt;At zoom level 1 — 4 tiles cover entire Earth&lt;/li&gt;
&lt;li&gt;Each zoom level quadruples the number of tiles&lt;/li&gt;
&lt;li&gt;At zoom level 15 — roughly 1 billion tiles — individual streets visible&lt;/li&gt;
&lt;li&gt;At zoom level 20 — roughly 1 trillion tiles — individual buildings visible&lt;/li&gt;
&lt;li&gt;Each tile identified by three coordinates — zoom level, x position, y position&lt;/li&gt;
&lt;/ul&gt;
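&lt;p&gt;The zoom/x/y addressing is the standard Web Mercator tile scheme used by most slippy maps, which fits in a few lines:&lt;/p&gt;

```python
import math

def tile_for(lat_deg, lon_deg, zoom):
    """Map a latitude/longitude to the (zoom, x, y) tile that contains it."""
    n = 2 ** zoom                                  # tiles per axis at this zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat_deg))) / math.pi) / 2.0 * n)
    return zoom, x, y
```

&lt;p&gt;The client computes this for the viewport corners, requests exactly the covered tiles from the CDN, and nothing else.&lt;/p&gt;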

&lt;p&gt;Serving tiles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User opens Google Maps centered on Mumbai at zoom level 15&lt;/li&gt;
&lt;li&gt;App calculates which tile coordinates cover current viewport — typically 9 to 12 tiles&lt;/li&gt;
&lt;li&gt;Requests only those specific tiles from CDN&lt;/li&gt;
&lt;li&gt;Downloads maybe 500KB of tile images instead of petabytes&lt;/li&gt;
&lt;li&gt;User pans map — new tiles requested only for newly visible area&lt;/li&gt;
&lt;li&gt;Zoom in — higher zoom level tiles requested for same area&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pre-rendering and CDN caching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google pre-renders all tiles at all zoom levels offline during map processing&lt;/li&gt;
&lt;li&gt;Tiles stored on CDN servers globally&lt;/li&gt;
&lt;li&gt;Tile request hits nearest CDN server — response in tens of milliseconds&lt;/li&gt;
&lt;li&gt;Base map tiles change infrequently — roads and buildings rarely move&lt;/li&gt;
&lt;li&gt;CDN cache TTL can be days or weeks for base tiles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: The tile pyramid transforms a petabyte scale global dataset into millions of small independent cacheable images. Users only ever download the tiny fraction of tiles visible in their current viewport.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 6: Real Time Traffic Overlay Without Re-rendering Tiles
&lt;/h2&gt;

&lt;p&gt;Interview Question: You have pre-rendered static tiles cached on CDN. Traffic conditions change every few minutes. Re-rendering and re-caching billions of tiles every few minutes is too slow and expensive. How do you overlay dynamic traffic data on static tiles?&lt;/p&gt;

&lt;p&gt;Solution: Two layer rendering — static base tiles plus dynamic vector data rendered client side.&lt;/p&gt;

&lt;p&gt;Layer 1 — Static base tiles from CDN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-rendered map imagery showing roads, buildings, parks, labels&lt;/li&gt;
&lt;li&gt;Never changes — cached with long TTL on CDN&lt;/li&gt;
&lt;li&gt;Served instantly from nearest CDN server&lt;/li&gt;
&lt;li&gt;No re-rendering ever needed for base map&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Layer 2 — Dynamic traffic vector data from server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not images — just data — road segment IDs mapped to traffic status&lt;/li&gt;
&lt;li&gt;Tiny JSON payload — a few kilobytes for entire city&lt;/li&gt;
&lt;li&gt;Updated every few minutes from Redis traffic data&lt;/li&gt;
&lt;li&gt;Pushed to app via WebSocket when significant changes occur&lt;/li&gt;
&lt;/ul&gt;
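&lt;p&gt;A sketch of what such a payload might look like — the field names and segment IDs are illustrative, not Google's actual wire format:&lt;/p&gt;

```python
import json

traffic_payload = {
    "updated_at": "2026-03-25T08:15:00Z",
    "segments": {
        "seg_184421": {"speed_kmh": 12, "status": "CONGESTED"},
        "seg_184422": {"speed_kmh": 44, "status": "SLOW"},
        "seg_184423": {"speed_kmh": 78, "status": "FREE_FLOW"},
    },
}
wire_bytes = json.dumps(traffic_payload).encode()   # a few hundred bytes, not an image
```

&lt;p&gt;Even thousands of segments serialize to kilobytes — a fraction of a single map tile image.&lt;/p&gt;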

&lt;p&gt;Client side rendering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;App fetches base tiles from CDN&lt;/li&gt;
&lt;li&gt;App fetches traffic vector data from server&lt;/li&gt;
&lt;li&gt;App renders colored traffic overlay on top of base tiles locally&lt;/li&gt;
&lt;li&gt;Green segments for free flow, yellow for slow, red for congested&lt;/li&gt;
&lt;li&gt;Traffic changes — server pushes tiny update — app re-renders only affected segments&lt;/li&gt;
&lt;li&gt;Base tiles never invalidated — CDN cache always fresh&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Base tile CDN cache never invalidated by traffic changes&lt;/li&gt;
&lt;li&gt;Traffic updates are tiny data payloads not images&lt;/li&gt;
&lt;li&gt;Client renders overlay locally — zero server rendering cost per user&lt;/li&gt;
&lt;li&gt;Smooth real time traffic visualization with minimal bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Separating static geographic data from dynamic traffic data allows each to be optimized independently. Static tiles cached forever on CDN. Dynamic data pushed as tiny updates. Client side rendering combines both seamlessly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 7: Location Search — Text Plus Proximity
&lt;/h2&gt;

&lt;p&gt;Interview Question: User searches "Italian restaurants near me." Results must consider both text relevance and geographic proximity simultaneously. A Starbucks 200 meters away is more relevant than a better rated one 20km away. How do you design location search?&lt;/p&gt;

&lt;p&gt;Solution: Two phase search combining Redis GEORADIUS and Elasticsearch.&lt;/p&gt;

&lt;p&gt;Phase 1 — Geographic filter via Redis GEO:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User location known from GPS&lt;/li&gt;
&lt;li&gt;GEORADIUS places user_longitude user_latitude 2 km returns all place IDs within the radius&lt;/li&gt;
&lt;li&gt;Returns maybe 500 place IDs — fast O(log n) proximity query&lt;/li&gt;
&lt;li&gt;Pure geographic filter — no text matching yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Phase 2 — Text search and ranking via Elasticsearch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pass 500 place IDs to Elasticsearch as filter&lt;/li&gt;
&lt;li&gt;Elasticsearch searches name and category fields for "Italian restaurant"&lt;/li&gt;
&lt;li&gt;Typo tolerance — "Italain" still matches "Italian"&lt;/li&gt;
&lt;li&gt;Synonym matching — "eatery" matches "restaurant"&lt;/li&gt;
&lt;li&gt;Returns 50 matching places with text relevance scores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Final ranking combining three signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Final Score = 0.4 × Text Relevance + 0.4 × Proximity Score + 0.2 × Rating Score&lt;/li&gt;
&lt;li&gt;Top 10 results returned to user&lt;/li&gt;
&lt;/ul&gt;
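&lt;p&gt;The blended scoring step is a one-liner per place — the example assumes each signal has already been normalized to the 0–1 range:&lt;/p&gt;

```python
def rank_places(places, w_text=0.4, w_prox=0.4, w_rating=0.2):
    """Blend text relevance, proximity, and rating into one sortable score."""
    def score(p):
        return w_text * p["text"] + w_prox * p["proximity"] + w_rating * p["rating"]
    return sorted(places, key=score, reverse=True)

candidates = [
    {"name": "Trattoria Nearby", "text": 0.9, "proximity": 0.95, "rating": 0.70},
    {"name": "Famous But Far",   "text": 0.9, "proximity": 0.10, "rating": 1.00},
]
```

&lt;p&gt;Here the nearby place wins (0.88 versus 0.60) even though the far one is better rated — exactly the behavior the weighting is meant to produce.&lt;/p&gt;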

&lt;p&gt;Why this weighting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text relevance and proximity equally important for location search&lt;/li&gt;
&lt;li&gt;Rating matters but a highly rated place far away should not outrank a good place nearby&lt;/li&gt;
&lt;li&gt;Weights tunable based on query type — "best restaurant in Mumbai" weights rating higher&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Location search is a two dimensional problem — text relevance and geographic proximity. Redis GEO handles the geographic dimension efficiently. Elasticsearch handles the text dimension. A scoring function combines both into a single ranked result list.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 8: Keeping Data Stores in Sync
&lt;/h2&gt;

&lt;p&gt;Interview Question: Place data lives in three stores — PostgreSQL as source of truth, Redis GEO for proximity queries, and Elasticsearch for text search. Millions of place updates happen daily — new businesses opening, closing, changing addresses. How do you keep all three in sync?&lt;/p&gt;

&lt;p&gt;Solution: Kafka event driven async synchronization.&lt;/p&gt;

&lt;p&gt;Update flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New restaurant opens — Place Service writes to PostgreSQL — source of truth committed&lt;/li&gt;
&lt;li&gt;Place Service publishes event to Kafka instantly&lt;/li&gt;
&lt;li&gt;Three independent consumers read from Kafka in parallel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consumer 1 — Redis GEO updater:&lt;br&gt;
GEOADD places longitude latitude placeID — place immediately available for proximity queries&lt;/p&gt;

&lt;p&gt;Consumer 2 — Elasticsearch indexer:&lt;br&gt;
Index new place document with name, category, rating, and metadata — immediately searchable&lt;/p&gt;

&lt;p&gt;Consumer 3 — CDN cache invalidator:&lt;br&gt;
Identify map tiles containing this location — invalidate cached tiles — trigger re-render of affected tiles — new business appears on map&lt;/p&gt;
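&lt;p&gt;The fan-out pattern in miniature — an in-memory bus stands in for Kafka, and three callbacks stand in for the real consumers (all names and data shapes are illustrative):&lt;/p&gt;

```python
class EventBus:
    """Toy stand-in for a Kafka topic: every subscriber sees every event."""
    def __init__(self):
        self.consumers = []
    def subscribe(self, handler):
        self.consumers.append(handler)
    def publish(self, event):
        for handler in self.consumers:
            handler(event)

redis_geo, es_index, stale_tiles = {}, {}, set()

bus = EventBus()
bus.subscribe(lambda e: redis_geo.update({e["place_id"]: (e["lon"], e["lat"])}))  # GEOADD stand-in
bus.subscribe(lambda e: es_index.update({e["place_id"]: e["name"]}))              # ES indexer stand-in
bus.subscribe(lambda e: stale_tiles.add(e["tile"]))                               # CDN invalidator stand-in

bus.publish({"place_id": "p1", "name": "New Cafe", "lon": 72.83, "lat": 18.92,
             "tile": (15, 23013, 14620)})
```

&lt;p&gt;One publish, three independent side effects — the producer never knows who is listening.&lt;/p&gt;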

&lt;p&gt;Failure handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL write succeeds — Kafka event published&lt;/li&gt;
&lt;li&gt;Redis consumer fails — Kafka retains event — consumer retries automatically on restart&lt;/li&gt;
&lt;li&gt;Elasticsearch consumer fails — same retry pattern&lt;/li&gt;
&lt;li&gt;All three stores eventually consistent with PostgreSQL source of truth&lt;/li&gt;
&lt;li&gt;No data loss — Kafka durably retains events, and each consumer resumes from its last committed offset on restart&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Kafka decouples the Place Service from all downstream data stores. A single write to PostgreSQL propagates asynchronously to Redis, Elasticsearch, and CDN without the Place Service knowing or caring about any of them. Failures in any consumer self-heal via Kafka retry.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Architecture Summary
&lt;/h2&gt;

&lt;p&gt;Road network — Weighted directed graph with 50 million nodes and 100 million edges&lt;br&gt;
Fast routing — Contraction Hierarchies with pre-computed shortcuts per road level&lt;br&gt;
Traffic pipeline — Kafka streaming to Traffic Computation Service to Redis&lt;br&gt;
GPS noise filtering — Crowd sourced averaging plus red light detection plus map matching&lt;br&gt;
Map rendering — Pre-rendered tile pyramid cached on global CDN&lt;br&gt;
Traffic overlay — Static base tiles plus dynamic vector data rendered client side&lt;br&gt;
Location search — Two phase Redis GEORADIUS plus Elasticsearch with combined scoring&lt;br&gt;
Data sync — Kafka event driven updates to Redis, Elasticsearch, and CDN in parallel&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Google Maps is a system where every layer has a fundamentally different performance characteristic. Routing needs millisecond graph traversal. Traffic needs near real time stream processing. Map rendering needs globally distributed static caching. Search needs both geographic and text indexing simultaneously. Data sync needs eventual consistency across multiple specialized stores.&lt;/p&gt;

&lt;p&gt;The recurring theme is that no single data store or algorithm solves all problems. The right architecture uses each tool for what it does best — Redis for proximity and traffic state, Elasticsearch for text search, CDN for static tiles, Kafka for event propagation, and pre-computed hierarchical graphs for fast routing. The skill is knowing which tool fits which problem.&lt;/p&gt;

&lt;p&gt;Happy building. 🚀&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>architecture</category>
      <category>interview</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Designing Google Drive / Dropbox at Scale: A System Design Deep Dive — Question by Question</title>
      <dc:creator>shubham pandey (Connoisseur)</dc:creator>
      <pubDate>Mon, 23 Mar 2026 05:54:02 +0000</pubDate>
      <link>https://forem.com/shubham_pandeyconnoisse/designing-google-drive-dropbox-at-scalea-system-design-deep-dive-question-by-question-3oh4</link>
      <guid>https://forem.com/shubham_pandeyconnoisse/designing-google-drive-dropbox-at-scalea-system-design-deep-dive-question-by-question-3oh4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Google Drive seems simple — upload a file and access it anywhere. But at 1 billion users, it hides some of the most elegant distributed systems engineering in the industry. Resumable uploads, intelligent deduplication, delta sync, real time collaboration, and offline catch up all working seamlessly together. This post walks through every challenge question by question, including wrong turns and how to navigate out of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 1: Resumable Uploads
&lt;/h2&gt;

&lt;p&gt;Interview Question: User uploads a 5GB video file. Halfway through the upload their internet connection drops. Without smart design they must restart the entire upload from scratch. How do you design uploads so they resume exactly where they left off?&lt;/p&gt;

&lt;p&gt;Navigation: The key insight is that you never need to treat a large file as a single atomic upload. If you split it into smaller independent pieces and track which pieces succeeded, you only need to retry the failed pieces.&lt;/p&gt;

&lt;p&gt;Solution: Chunked upload with client side state tracking and checksum validation.&lt;/p&gt;

&lt;p&gt;Upload flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Client splits 5GB file into chunks — typically 5MB each — producing roughly 1000 chunks&lt;/li&gt;
&lt;li&gt;Client maintains upload state locally tracking each chunk as PENDING, UPLOADED, or FAILED&lt;/li&gt;
&lt;li&gt;Client uploads chunks sequentially or in parallel&lt;/li&gt;
&lt;li&gt;Connection drops — client knows exactly which chunks succeeded from local state&lt;/li&gt;
&lt;li&gt;Connection restores — client resumes from first failed or pending chunk&lt;/li&gt;
&lt;li&gt;All chunks uploaded — server merges into single complete file&lt;/li&gt;
&lt;li&gt;Server computes checksum of merged file and compares with client computed checksum&lt;/li&gt;
&lt;li&gt;Checksum match — file integrity confirmed, upload complete&lt;/li&gt;
&lt;li&gt;Checksum mismatch — corruption detected, affected chunks re-uploaded&lt;/li&gt;
&lt;/ul&gt;
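&lt;p&gt;The chunking and resume logic can be sketched in a few lines — the send argument is a placeholder for the real network call:&lt;/p&gt;

```python
import hashlib

CHUNK_SIZE = 5 * 1024 * 1024           # 5 MB, as above

def split_chunks(data, size=CHUNK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]

def resume_upload(chunks, state, send):
    """Send only chunks not yet marked UPLOADED; safe to call after every reconnect."""
    for i, chunk in enumerate(chunks):
        if state.get(i) != "UPLOADED":
            send(i, chunk)
            state[i] = "UPLOADED"

def checksum(data):
    return hashlib.sha256(data).hexdigest()   # compared client vs server after merge
```

&lt;p&gt;Calling resume_upload again after a dropped connection retransmits only the chunks that never landed.&lt;/p&gt;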

&lt;p&gt;Zero re-uploading of already completed chunks regardless of how many times the connection drops.&lt;/p&gt;

&lt;p&gt;Key Insight: Chunking transforms a fragile all-or-nothing upload into a resumable checkpoint based process. Client side state tracking means the server never needs to tell the client where to resume — the client already knows.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 2: Storage Deduplication
&lt;/h2&gt;

&lt;p&gt;Interview Question: User A uploads a 5GB video file. User B — a completely different person — uploads the exact same 5GB file. Google Drive naively stores two complete 5GB copies. With 1 billion users uploading popular files millions of times, this wastes petabytes of storage. How do you detect identical files and avoid storing duplicates?&lt;/p&gt;

&lt;p&gt;Solution: Content addressable storage using file hashing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Client computes SHA256 hash of the file before uploading&lt;/li&gt;
&lt;li&gt;Client sends hash to server first&lt;/li&gt;
&lt;li&gt;Server checks hash against metadata database&lt;/li&gt;
&lt;li&gt;Hash exists — file already stored — create pointer to existing file — skip upload entirely&lt;/li&gt;
&lt;li&gt;Hash not found — proceed with upload — store file — save hash to metadata database&lt;/li&gt;
&lt;/ul&gt;
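&lt;p&gt;Content-addressable storage in miniature — plain dicts stand in for the blob store and the metadata database:&lt;/p&gt;

```python
import hashlib

blob_store = {}      # hash -> file bytes
user_files = {}      # (user, filename) -> hash pointer

def save_file(user, filename, data):
    """Returns True when the upload was skipped because the content already existed."""
    digest = hashlib.sha256(data).hexdigest()
    dedup_hit = digest in blob_store
    if not dedup_hit:
        blob_store[digest] = data           # first copy: actually store the bytes
    user_files[(user, filename)] = digest   # every upload gets a pointer
    return dedup_hit
```

&lt;p&gt;Two users saving identical bytes produce two pointers but a single stored blob.&lt;/p&gt;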

&lt;p&gt;Storage savings at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 million users upload same movie trailer — 500MB each&lt;/li&gt;
&lt;li&gt;Without deduplication — 1 million copies — 500TB of storage&lt;/li&gt;
&lt;li&gt;With deduplication — 1 copy plus 1 million pointers — 500MB total&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Content addressable storage uses the file's own content as its address. Identical content produces identical hash — identical hash means content already exists — no need to store it twice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 3: Chunk Level Deduplication
&lt;/h2&gt;

&lt;p&gt;Interview Question: Computing SHA256 of a 5GB file takes several seconds. Can deduplication be done more efficiently — and can it save even more storage?&lt;/p&gt;

&lt;p&gt;Navigation: Since files are already split into chunks for resumable uploads, compute hash per chunk rather than per file. Two completely different files might share identical chunks — same embedded image, same opening credits, same boilerplate header.&lt;/p&gt;

&lt;p&gt;Solution: Chunk level hash based deduplication with pre-upload hash check.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Client computes hash for every chunk before uploading anything&lt;/li&gt;
&lt;li&gt;Client sends all chunk hashes to server in one request&lt;/li&gt;
&lt;li&gt;Server checks each hash against chunk database&lt;/li&gt;
&lt;li&gt;Server responds with which chunks already exist and which need uploading&lt;/li&gt;
&lt;li&gt;Client uploads only the chunks the server does not already have&lt;/li&gt;
&lt;/ul&gt;
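&lt;p&gt;The pre-upload negotiation reduces to a set difference over chunk hashes — a sketch of the server's side of the exchange:&lt;/p&gt;

```python
import hashlib

def chunk_hashes(chunks):
    return [hashlib.sha256(c).hexdigest() for c in chunks]

def chunks_to_upload(client_hashes, server_known_hashes):
    """Server-side answer: only hashes it has never seen need the actual bytes."""
    return [h for h in client_hashes if h not in server_known_hashes]
```

&lt;p&gt;If the server already knows 950 of 1000 hashes, the response names just the 50 chunks worth sending.&lt;/p&gt;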

&lt;p&gt;Example result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1000 chunk file&lt;/li&gt;
&lt;li&gt;Server already has 950 chunks from other files&lt;/li&gt;
&lt;li&gt;Client uploads only 50 new chunks — 250MB instead of 5GB&lt;/li&gt;
&lt;li&gt;Upload completes in seconds instead of minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This technique — uploading without uploading the chunks that already exist — caused a famous controversy when Dropbox implemented it in 2011. Users believed their files were being fully uploaded but Dropbox was silently skipping chunks it already had. The technique is legitimate but raised important transparency questions.&lt;/p&gt;

&lt;p&gt;Key Insight: Chunk level deduplication saves more storage than file level deduplication and dramatically reduces upload time. A 5GB file might require uploading only a few hundred megabytes of genuinely new data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 4: Deduplication Security — Hash Probing Attack
&lt;/h2&gt;

&lt;p&gt;Interview Question: Cross-user chunk deduplication leaks information. How?&lt;/p&gt;

&lt;p&gt;The attack — Hash Probing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attacker has a known file — say contraband content&lt;/li&gt;
&lt;li&gt;Attacker computes SHA256 hash of that file&lt;/li&gt;
&lt;li&gt;Attacker sends hash to Google Drive server without uploading the file&lt;/li&gt;
&lt;li&gt;Server responds — chunk already exists, no upload needed&lt;/li&gt;
&lt;li&gt;Attacker now knows someone on Google Drive has that exact file&lt;/li&gt;
&lt;li&gt;Identified a user possessing specific content without downloading anything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is called a Hash Probing Attack — using the deduplication mechanism as a detection oracle. Dropbox was caught vulnerable to this attack in 2011 and quietly changed their approach.&lt;/p&gt;

&lt;p&gt;Solution: Salted hash with userID — deduplicate within user only.&lt;/p&gt;

&lt;p&gt;Wrong approach — per user deduplication without salt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User A and User B upload same file — two separate copies stored&lt;/li&gt;
&lt;li&gt;Eliminates cross-user privacy risk but wastes storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Better approach — salted hash:&lt;br&gt;
chunk_hash = SHA256(chunk_data + userID)&lt;/p&gt;
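&lt;p&gt;The salted scheme is two lines of code, and both of its key properties are easy to demonstrate:&lt;/p&gt;

```python
import hashlib

def salted_chunk_hash(chunk_bytes, user_id):
    # salt with the owning user's ID so hashes are never comparable across users
    return hashlib.sha256(chunk_bytes + user_id.encode()).hexdigest()

chunk = b"identical chunk content"
hash_a1 = salted_chunk_hash(chunk, "user_a")   # user A, first upload
hash_a2 = salted_chunk_hash(chunk, "user_a")   # user A, duplicate upload
hash_b = salted_chunk_hash(chunk, "user_b")    # user B, same content
```

&lt;p&gt;hash_a1 equals hash_a2, so User A's duplicates still deduplicate; hash_b differs, so a probing attacker learns nothing across users.&lt;/p&gt;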

&lt;ul&gt;
&lt;li&gt;Same chunk from User A produces different hash than User B&lt;/li&gt;
&lt;li&gt;Hash probing impossible — attacker cannot predict salted hash without knowing userID&lt;/li&gt;
&lt;li&gt;User A uploads same file twice — same salted hash — deduplicated to one copy&lt;/li&gt;
&lt;li&gt;Cross-user deduplication eliminated — privacy preserved&lt;/li&gt;
&lt;li&gt;Within-user deduplication fully preserved — storage still saved for same user's duplicate files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alternative — Convergent Encryption:&lt;br&gt;
Derive each chunk's encryption key from the chunk's own content hash, then encrypt before upload. Identical plaintext still encrypts to identical ciphertext, so deduplication keeps working on encrypted data while the provider never sees plaintext — though known-file probing remains possible, so it is usually paired with other defenses.&lt;/p&gt;

&lt;p&gt;Key Insight: Cross-user deduplication leaks information about what other users have stored. Salted hashing with userID preserves within-user deduplication while making cross-user hash probing attacks impossible.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 5: Delta Sync — Only Upload What Changed
&lt;/h2&gt;

&lt;p&gt;Interview Question: User edits a 100MB PowerPoint file — changes a single slide — maybe 50KB of actual changes. Without smart design Google Drive uploads the entire 100MB file again on every save. With 1 billion users constantly editing files this wastes petabytes of unnecessary uploads per day. How do you sync only the actual changes?&lt;/p&gt;

&lt;p&gt;Solution: Delta sync using chunk hash comparison.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File already split into chunks from upload&lt;/li&gt;
&lt;li&gt;Client maintains hash of every chunk locally&lt;/li&gt;
&lt;li&gt;User saves edited file&lt;/li&gt;
&lt;li&gt;Client recomputes hash for every chunk&lt;/li&gt;
&lt;li&gt;Compares new hashes against stored hashes&lt;/li&gt;
&lt;li&gt;Unchanged chunk — same hash — skip entirely&lt;/li&gt;
&lt;li&gt;Changed chunk — different hash — upload only this chunk&lt;/li&gt;
&lt;li&gt;Server updates metadata with new chunk hash&lt;/li&gt;
&lt;li&gt;Other devices notified to download only the changed chunk&lt;/li&gt;
&lt;/ul&gt;
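&lt;p&gt;The hash-comparison step can be sketched as a pairwise diff over the old and new chunk hash lists — appended chunks count as changed:&lt;/p&gt;

```python
from itertools import zip_longest

def changed_chunks(old_hashes, new_hashes):
    """Indices of chunks whose content changed (or is new) after an edit."""
    return [i for i, (old, new) in enumerate(zip_longest(old_hashes, new_hashes))
            if new is not None and new != old]
```

&lt;p&gt;One edited slide changes one chunk hash, so only that index goes back over the network.&lt;/p&gt;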

&lt;p&gt;Result for 100MB PowerPoint with one changed slide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1000 chunks total&lt;/li&gt;
&lt;li&gt;1 chunk changed — 100KB&lt;/li&gt;
&lt;li&gt;Upload 100KB instead of 100MB&lt;/li&gt;
&lt;li&gt;99.9 percent bandwidth saving on every incremental edit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Delta sync combined with chunk level hashing means editing a large file costs almost nothing in bandwidth. Only genuinely new bytes ever travel over the network.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 6: Real Time Sync Notifications
&lt;/h2&gt;

&lt;p&gt;Interview Question: File changes on laptop. Phone needs to know instantly. How does the server notify the phone — and what happens if the phone is offline when the change happens?&lt;/p&gt;

&lt;p&gt;Solution: Three tier notification strategy based on device state.&lt;/p&gt;

&lt;p&gt;Tier 1 — App actively open — WebSocket persistent connection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Drive app open on phone — WebSocket connection maintained&lt;/li&gt;
&lt;li&gt;File changes on laptop — change event published to Kafka&lt;/li&gt;
&lt;li&gt;Notification Service consumes from Kafka&lt;/li&gt;
&lt;li&gt;Pushes file change notification to phone via WebSocket instantly&lt;/li&gt;
&lt;li&gt;Sub 100ms notification delivery — seamless real time sync experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tier 2 — App closed or backgrounded — FCM push notification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phone app not running — no WebSocket connection&lt;/li&gt;
&lt;li&gt;Notification Service sends FCM push notification&lt;/li&gt;
&lt;li&gt;FCM wakes up app — app connects and syncs changed chunks&lt;/li&gt;
&lt;li&gt;Standard mobile push notification flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tier 3 — Device offline — Change Log with ordered event storage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phone offline for hours or days&lt;/li&gt;
&lt;li&gt;Every file change stored as ordered event in Change Log on server&lt;/li&gt;
&lt;li&gt;Phone comes back online — app sends last sync timestamp to server&lt;/li&gt;
&lt;li&gt;Server returns all changes since that timestamp in chronological order&lt;/li&gt;
&lt;li&gt;App applies changes sequentially — fully caught up regardless of how long it was offline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: WebSocket for active app, FCM for background app, and Change Log for offline devices covers every possible device state. No file change is ever missed regardless of connectivity.&lt;/p&gt;
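&lt;p&gt;The three-tier routing can be sketched as a single dispatch function. All names here are illustrative stand-ins, with plain lists and dicts playing the roles of the WebSocket registry, FCM, and the Change Log:&lt;/p&gt;

```python
def notify(device, event, websockets, fcm_send, change_log):
    # Tier 3 first: every change lands in the Change Log no matter what,
    # so an offline device can always catch up later.
    change_log.append(event)
    ws = websockets.get(device)
    if ws is not None:                    # Tier 1: app open, push over WebSocket
        ws.append(event)
        return "websocket"
    if fcm_send(device, event):           # Tier 2: app backgrounded, FCM wakes it
        return "fcm"
    return "change_log"                   # Tier 3 only: device fully offline

websockets = {"phone": []}
log = []
fcm_down = lambda device, event: False    # pretend the device has no FCM token

assert notify("phone", {"file": "deck.pptx"}, websockets, fcm_down, log) == "websocket"
assert websockets["phone"] == [{"file": "deck.pptx"}]
assert notify("tablet", {"file": "deck.pptx"}, websockets, fcm_down, log) == "change_log"
assert len(log) == 2     # every change is logged regardless of tier
```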




&lt;h2&gt;
  
  
  Challenge 7: Change Log Retention and Long Term Offline Recovery
&lt;/h2&gt;

&lt;p&gt;Interview Question: User has Google Drive on 5 devices. Tablet not used for 6 months. Thousands of missed changes. Do you replay 6 months of individual chunk changes — and how long do you keep the Change Log?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Keep Change Log forever and replay all events for any offline device.&lt;/p&gt;

&lt;p&gt;Why It Fails: 1 billion users making constant edits generates petabytes of change events over years. Replaying 6 months of events for a returning device is wasteful when a simpler full state sync achieves the same result more efficiently.&lt;/p&gt;

&lt;p&gt;Solution: 30 day TTL on Change Log with full state sync fallback.&lt;/p&gt;

&lt;p&gt;Device offline less than 30 days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change Log has all events within retention window&lt;/li&gt;
&lt;li&gt;Device connects — sends last sync timestamp&lt;/li&gt;
&lt;li&gt;Server replays all missed changes in order&lt;/li&gt;
&lt;li&gt;Device fully synced with minimal data transfer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Device offline more than 30 days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change Log expired via TTL — events gone&lt;/li&gt;
&lt;li&gt;App performs full state sync instead&lt;/li&gt;
&lt;li&gt;Client sends current file metadata hashes to server&lt;/li&gt;
&lt;li&gt;Server compares with current state&lt;/li&gt;
&lt;li&gt;Server returns list of files that differ&lt;/li&gt;
&lt;li&gt;Client downloads only differing files — not all files&lt;/li&gt;
&lt;li&gt;Device fully synced regardless of how long it was offline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: 30 day TTL bounds Change Log storage to a predictable size. Devices offline longer than retention window fall back to full state sync — which is actually more efficient than replaying months of stale intermediate events.&lt;/p&gt;
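&lt;p&gt;The retention decision itself is one comparison. A sketch, with the 30 day window from above hard-coded:&lt;/p&gt;

```python
from datetime import datetime, timedelta

CHANGE_LOG_TTL = timedelta(days=30)

def catch_up_plan(last_sync: datetime, now: datetime) -> str:
    # Within the retention window the log still holds every missed event;
    # beyond it the events are gone, so compare file metadata hashes instead.
    if now - last_sync <= CHANGE_LOG_TTL:
        return "replay_change_log"
    return "full_state_sync"

now = datetime(2026, 6, 1)
assert catch_up_plan(datetime(2026, 5, 20), now) == "replay_change_log"
assert catch_up_plan(datetime(2025, 11, 1), now) == "full_state_sync"
```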




&lt;h2&gt;
  
  
  Challenge 8: Real Time Collaborative Editing
&lt;/h2&gt;

&lt;p&gt;Interview Question: User A and User B both edit the same Google Doc simultaneously. User A types "Hello" at position 10. User B simultaneously types "World" at position 10. Both changes arrive at the server at the same millisecond. How does Google Docs resolve this without asking users to manually resolve conflicts?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Lock the section being edited so only one user can type at a time.&lt;/p&gt;

&lt;p&gt;Why It Fails: Locking blocks collaborators from typing while someone else holds the lock. With 10 million concurrent editors, lock contention creates a terrible experience. Users stare at frozen cursors waiting for locks to release. Google Docs never blocks you — you can always type freely.&lt;/p&gt;

&lt;p&gt;Solution: Operational Transformation — OT algorithm.&lt;/p&gt;

&lt;p&gt;Core insight: Instead of sending the final text, send the operation — what changed and where.&lt;/p&gt;

&lt;p&gt;User A sends: INSERT "Hello " at position 10&lt;br&gt;
User B sends: INSERT "World " at position 10&lt;/p&gt;

&lt;p&gt;Both arrive at the server simultaneously. The server applies User A's operation first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original document: "The quick brown fox"&lt;/li&gt;
&lt;li&gt;After User A: "The quick Hello brown fox"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now User B's operation says INSERT "World " at position 10 — but position 10 has shifted because User A inserted 6 characters before it.&lt;/p&gt;

&lt;p&gt;OT transforms User B's operation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original position: 10&lt;/li&gt;
&lt;li&gt;User A inserted 6 characters at position 10&lt;/li&gt;
&lt;li&gt;Transformed position: 10 plus 6 equals 16&lt;/li&gt;
&lt;li&gt;Transformed operation: INSERT "World " at position 16&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Final document: "The quick Hello World brown fox"&lt;/p&gt;

&lt;p&gt;Both users see identical document. No conflict popup. No blocking. Fully seamless.&lt;/p&gt;
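&lt;p&gt;A minimal insert-versus-insert transform in Python, covering only the concurrent INSERT case. The inserted strings carry their trailing space so the arithmetic matches the rendered text; real OT also transforms deletes, breaks position ties deterministically, and chains transforms across many pending operations:&lt;/p&gt;

```python
def transform_insert(op, applied):
    # Shift op's position to account for an insert that was applied first.
    pos, text = op
    a_pos, a_text = applied
    if a_pos <= pos:
        pos += len(a_text)            # the earlier insert pushed us right
    return (pos, text)

doc = "The quick brown fox"
op_a = (10, "Hello ")
op_b = (10, "World ")

# Server applies A first, then B transformed against A.
doc = doc[:op_a[0]] + op_a[1] + doc[op_a[0]:]
op_b2 = transform_insert(op_b, op_a)
doc = doc[:op_b2[0]] + op_b2[1] + doc[op_b2[0]:]
assert doc == "The quick Hello World brown fox"
```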

&lt;p&gt;Modern alternative — CRDT Conflict Free Replicated Data Types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every character assigned a globally unique ID — not just a position number&lt;/li&gt;
&lt;li&gt;Position derived from character relationships — not absolute index&lt;/li&gt;
&lt;li&gt;Insertions and deletions commute — order of application does not matter&lt;/li&gt;
&lt;li&gt;Used by Figma, Notion, and modern collaborative tools&lt;/li&gt;
&lt;li&gt;More robust than OT for complex multi-user scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Operational Transformation allows simultaneous edits by transforming operations relative to each other rather than preventing conflicts. The result is the seamless real time collaboration experience users expect — no locks, no conflict popups, no blocked cursors.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Architecture Summary
&lt;/h2&gt;

&lt;p&gt;Resumable uploads — Client side chunk state tracking with checksum validation&lt;br&gt;
Storage deduplication — Chunk level SHA256 hash based content addressable storage&lt;br&gt;
Dedup security — Salted hash with userID prevents cross-user hash probing attacks&lt;br&gt;
Delta sync — Upload only changed chunks via hash comparison&lt;br&gt;
Real time notifications — WebSocket for active app, FCM for backgrounded app&lt;br&gt;
Offline catch up — Change Log with 30 day TTL, full state sync beyond retention&lt;br&gt;
Collaborative editing — Operational Transformation with position adjustment&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Google Drive is a masterclass in applying the same core techniques recursively at every layer. Chunking solves resumable uploads, deduplication, delta sync, and parallel processing all at once. Hashing solves content addressability, change detection, and integrity validation simultaneously. TTL solves Change Log retention the same way it solved cache eviction, lock expiry, and presence detection in every other design in this series.&lt;/p&gt;

&lt;p&gt;The most important lesson is that elegant systems reuse simple primitives everywhere. Once you understand chunking and hashing deeply, an enormous range of distributed systems problems become variations of the same theme.&lt;/p&gt;

&lt;p&gt;Happy building. 🚀&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Designing Netflix / Video Streaming at Scale A System Design Deep Dive — Question by Question</title>
      <dc:creator>shubham pandey (Connoisseur)</dc:creator>
      <pubDate>Sun, 22 Mar 2026 05:40:56 +0000</pubDate>
      <link>https://forem.com/shubham_pandeyconnoisse/designing-netflix-video-streaming-at-scalea-system-design-deep-dive-question-by-question-4a4f</link>
      <guid>https://forem.com/shubham_pandeyconnoisse/designing-netflix-video-streaming-at-scalea-system-design-deep-dive-question-by-question-4a4f</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Netflix serves 4K video to 200 million subscribers simultaneously without buffering. On the surface it is just playing a video file. Underneath it is one of the most sophisticated distributed systems ever built — spanning global content delivery, adaptive streaming, parallel encoding pipelines, geo licensing, and personalized recommendations all working together seamlessly. This post walks through every challenge question by question, including wrong turns and how to navigate out of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 1: Global Video Delivery
&lt;/h2&gt;

&lt;p&gt;Interview Question: If you had one central server in the US storing all Netflix movies, a user in Mumbai, a user in London, and a user in Tokyo all click play simultaneously. What is the most fundamental problem they all face?&lt;/p&gt;

&lt;p&gt;Navigation: Serving 100GB video files from a single central server to users across the world means every user pays the cost of geographic distance — high latency, slow start times, and saturated long distance network links. The solution is obvious once you frame it correctly — video data needs to physically live close to the user.&lt;/p&gt;

&lt;p&gt;Solution: Content Delivery Network — CDN with regional servers.&lt;/p&gt;

&lt;p&gt;Netflix built their own CDN called Open Connect. They place servers called Open Connect Appliances directly inside ISP data centers worldwide. Instead of video traveling from a US central server to Mumbai, it travels from a Mumbai ISP server a few milliseconds away.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Central server stores master copy of all content&lt;/li&gt;
&lt;li&gt;Regional CDN servers cache popular content close to users&lt;/li&gt;
&lt;li&gt;Mumbai user streams from Mumbai CDN server — low latency&lt;/li&gt;
&lt;li&gt;London user streams from London CDN server — low latency&lt;/li&gt;
&lt;li&gt;Tokyo user streams from Tokyo CDN server — low latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Video data must live geographically close to the user. A CDN is not an optimization — it is a fundamental requirement for global video streaming at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 2: Smart CDN Cache Management
&lt;/h2&gt;

&lt;p&gt;Interview Question: CDN storage is expensive. You cannot cache every title on every regional server. How do you decide what content to cache on which regional server — and what happens when a user requests a title not cached on their nearest CDN?&lt;/p&gt;

&lt;p&gt;Solution: Two caching strategies working together — push and pull.&lt;/p&gt;

&lt;p&gt;Push based caching — proactive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New blockbuster release approaching launch day&lt;/li&gt;
&lt;li&gt;Netflix proactively pushes content to all relevant CDN servers before launch&lt;/li&gt;
&lt;li&gt;Day one release — content already cached everywhere — zero cache misses on launch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pull based caching — reactive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User requests title not cached on nearest CDN server&lt;/li&gt;
&lt;li&gt;CDN pulls content from central server on first request&lt;/li&gt;
&lt;li&gt;Caches it locally with TTL for future requests&lt;/li&gt;
&lt;li&gt;All subsequent users in that region hit the cache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unpopular titles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rarely requested content never gets cached&lt;/li&gt;
&lt;li&gt;Served directly from central server&lt;/li&gt;
&lt;li&gt;CDN storage reserved for content that justifies caching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dynamic TTL based on viewing patterns:&lt;br&gt;
A fixed TTL for all content is naive. A show extremely popular during its launch week but dead 3 months later should not hold CDN space indefinitely.&lt;/p&gt;

&lt;p&gt;Solution: Viewing Analytics Service collects watch events and publishes to Kafka. TTL Management Service consumes from Kafka, computes viewing frequency per title per region, and dynamically adjusts TTL accordingly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avengers launch week — 10 million views — TTL 30 days&lt;/li&gt;
&lt;li&gt;Avengers 3 months later — 1000 views — TTL 3 days&lt;/li&gt;
&lt;li&gt;Obscure documentary — 10 views — TTL 0, evict immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All TTL adjustments happen asynchronously — never blocking the streaming experience.&lt;/p&gt;

&lt;p&gt;Key Insight: Smart CDN management combines proactive push for known blockbusters, reactive pull for long tail content, and dynamic TTL adjustment based on real viewing patterns. Static caching strategies waste expensive CDN storage.&lt;/p&gt;
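&lt;p&gt;The TTL mapping reduces to a threshold table. The view-count cutoffs below are invented for illustration; a real TTL Management Service would tune them per title per region from the Kafka stream:&lt;/p&gt;

```python
from datetime import timedelta

def cdn_ttl(weekly_views: int) -> timedelta:
    # Invented thresholds; the real service derives them from viewing analytics.
    if weekly_views >= 1_000_000:
        return timedelta(days=30)   # blockbuster: keep it pinned
    if weekly_views >= 1_000:
        return timedelta(days=3)    # fading title: short lease
    return timedelta(0)             # long tail: evict, serve from origin

assert cdn_ttl(10_000_000) == timedelta(days=30)   # launch-week Avengers
assert cdn_ttl(1_000) == timedelta(days=3)         # 3 months later
assert cdn_ttl(10) == timedelta(0)                 # obscure documentary
```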




&lt;h2&gt;
  
  
  Challenge 3: Video Chunking and Instant Playback
&lt;/h2&gt;

&lt;p&gt;Interview Question: A 4K HDR movie is 100GB. A user has a 50 Mbps connection. Downloading 100GB takes 4.4 hours. But Netflix starts playing in under 3 seconds. How is this physically possible?&lt;/p&gt;

&lt;p&gt;Navigation: The user does not need all 100GB before playback starts. They only need the first few seconds of video. If you split the video into small chunks and start playing the first chunk while the rest download in the background, playback starts almost instantly.&lt;/p&gt;

&lt;p&gt;Solution: Video Chunking — split video into 2 to 10 second chunks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 hour movie split into roughly 2000 individual chunks&lt;/li&gt;
&lt;li&gt;Each chunk is 2 to 10 seconds of video&lt;/li&gt;
&lt;li&gt;User needs only the first chunk to start playing — a few megabytes&lt;/li&gt;
&lt;li&gt;Subsequent chunks download in background while current chunk plays&lt;/li&gt;
&lt;li&gt;Playback starts in under 3 seconds regardless of file size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: You never need the whole file to start playing. Chunking transforms an impossible 100GB download problem into a trivial few megabyte first chunk problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 4: Adaptive Bitrate Streaming
&lt;/h2&gt;

&lt;p&gt;Interview Question: Netflix stores the same chunk at 5 different quality levels. Why — and who decides which quality to request?&lt;/p&gt;

&lt;p&gt;Netflix stores every chunk at multiple quality levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4K HDR — 8 Mbps — 4MB per chunk&lt;/li&gt;
&lt;li&gt;1080p — 4 Mbps — 2MB per chunk&lt;/li&gt;
&lt;li&gt;720p — 2 Mbps — 1MB per chunk&lt;/li&gt;
&lt;li&gt;480p — 1 Mbps — 0.5MB per chunk&lt;/li&gt;
&lt;li&gt;360p — 0.5 Mbps — 0.25MB per chunk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different users have different bandwidth. The same user has different bandwidth at different moments — strong WiFi at home, weak signal in the kitchen, mobile data on the commute.&lt;/p&gt;

&lt;p&gt;With a single quality video: connection degrades → buffering → terrible experience.&lt;br&gt;
With multiple quality versions: connection degrades → seamlessly switch to lower quality chunk → no buffering.&lt;/p&gt;

&lt;p&gt;The client decides — not the server. The Netflix app runs an ABR Algorithm — Adaptive Bitrate Algorithm — that continuously monitors download speed of recent chunks and current buffer level, then decides which quality chunk to request next.&lt;/p&gt;

&lt;p&gt;ABR decision logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buffer above 30 seconds and speed above 20 Mbps — request 4K chunk&lt;/li&gt;
&lt;li&gt;Buffer above 15 seconds and speed above 8 Mbps — request 1080p chunk&lt;/li&gt;
&lt;li&gt;Buffer below 10 seconds and speed dropping — request 720p chunk&lt;/li&gt;
&lt;li&gt;Buffer below 5 seconds — request 360p immediately to prevent buffering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quality adjustments are seamless and invisible to the user. The app makes hundreds of these decisions per viewing session.&lt;/p&gt;

&lt;p&gt;Key Insight: Multiple quality versions per chunk combined with client side adaptive bitrate selection eliminates buffering under any network condition. The client always has the right quality for current bandwidth.&lt;/p&gt;
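&lt;p&gt;The decision logic above translates almost directly into code. This is a sketch only; production ABR algorithms use throughput prediction and buffer-based control and are considerably subtler:&lt;/p&gt;

```python
def next_quality(buffer_s: float, mbps: float) -> str:
    # Buffer emergencies win over bandwidth: anything beats a stall.
    if buffer_s < 5:
        return "360p"
    if buffer_s < 10:
        return "720p"
    if buffer_s > 30 and mbps > 20:
        return "4K"
    if buffer_s > 15 and mbps > 8:
        return "1080p"
    return "720p"                 # safe middle ground between the tiers

assert next_quality(40, 25) == "4K"       # healthy buffer, fast link
assert next_quality(20, 10) == "1080p"
assert next_quality(8, 6) == "720p"       # buffer shrinking, step down
assert next_quality(3, 1) == "360p"       # about to stall, drop hard
```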




&lt;h2&gt;
  
  
  Challenge 5: Parallel Video Encoding Pipeline
&lt;/h2&gt;

&lt;p&gt;Interview Question: A raw master file could be 1TB of uncompressed footage. Netflix needs to create thousands of chunks at 5 quality levels each. Encoding a 3 hour movie sequentially on one machine could take days. Netflix adds thousands of new titles every year. How do you process them fast enough?&lt;/p&gt;

&lt;p&gt;Navigation: The key insight is that chunks are independent of each other. You do not need to encode chunk 1 before encoding chunk 2. If you can encode all chunks simultaneously across thousands of machines, a process that took days takes minutes.&lt;/p&gt;

&lt;p&gt;Solution: Parallel chunk encoding across a distributed encoding farm.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raw master file arrives&lt;/li&gt;
&lt;li&gt;Chunking Service splits movie into 2000 independent chunks&lt;/li&gt;
&lt;li&gt;Each chunk published as a job to Kafka job queue&lt;/li&gt;
&lt;li&gt;2000 encoding machines each pick up one job&lt;/li&gt;
&lt;li&gt;Every chunk encodes simultaneously across the farm&lt;/li&gt;
&lt;li&gt;Each machine produces 5 quality versions of its chunk&lt;/li&gt;
&lt;li&gt;All 2000 chunks complete — Merge Service reassembles into final encoded movie&lt;/li&gt;
&lt;li&gt;CDN Distribution pushes encoded content to regional servers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 3 hour movie that took days on one machine now takes minutes across 2000 machines.&lt;/p&gt;

&lt;p&gt;Key Insight: Chunking solves two problems simultaneously — instant playback for users and parallel encoding for Netflix. The same chunk boundaries that enable streaming also enable massively parallel encoding.&lt;/p&gt;
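&lt;p&gt;Because the jobs are independent, the fan-out is a plain parallel map. Sketched here with a thread pool and a fake encoder standing in for the Kafka-fed encoding farm:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

QUALITIES = ["4K", "1080p", "720p", "480p", "360p"]

def encode_chunk(job):
    # Fake encoder: one output per quality level for this chunk.
    chunk_id, raw = job
    return chunk_id, {q: f"{q}:{raw}" for q in QUALITIES}

chunks = [(i, f"raw-{i}") for i in range(2000)]   # independent jobs

# Chunk independence is what makes this a plain parallel map; in production
# each job would go to a separate machine via the Kafka job queue.
with ThreadPoolExecutor(max_workers=32) as pool:
    encoded = dict(pool.map(encode_chunk, chunks))

assert len(encoded) == 2000
assert encoded[847]["720p"] == "720p:raw-847"
```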




&lt;h2&gt;
  
  
  Challenge 6: Fault Tolerant Encoding Pipeline
&lt;/h2&gt;

&lt;p&gt;Interview Question: Your encoding farm has 2000 machines encoding chunks simultaneously. One machine fails mid encoding. 1999 chunks complete successfully but chunk 847 is lost. The entire movie is incomplete. At scale machine failures happen constantly. How do you make the pipeline fault tolerant?&lt;/p&gt;

&lt;p&gt;Navigation: The failed chunk needs to be detected and retried on a different machine automatically. This requires a coordinator that tracks the state of every chunk job and reassigns failed jobs without human intervention.&lt;/p&gt;

&lt;p&gt;Solution: Kafka job queue with Coordinator Service tracking chunk states.&lt;/p&gt;

&lt;p&gt;Every chunk job has a state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PENDING — waiting to be picked up by a worker&lt;/li&gt;
&lt;li&gt;IN PROGRESS — currently being encoded by a worker&lt;/li&gt;
&lt;li&gt;COMPLETED — successfully encoded&lt;/li&gt;
&lt;li&gt;FAILED — encoding failed, needs retry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Coordinator Service monitors all job states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Job stuck IN PROGRESS too long — worker crashed — set back to PENDING — reassigned to healthy worker&lt;/li&gt;
&lt;li&gt;All 2000 jobs COMPLETED — trigger Merge Service automatically&lt;/li&gt;
&lt;li&gt;Job fails repeatedly — alert engineering team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TTL on IN PROGRESS state — same pattern from WhatsApp and Stock Exchange designs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worker picks up chunk — job marked IN PROGRESS with TTL of expected encoding time plus buffer&lt;/li&gt;
&lt;li&gt;Worker crashes — TTL expires — job automatically returns to PENDING — reassigned&lt;/li&gt;
&lt;li&gt;No manual intervention needed — pipeline self heals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: A job queue with explicit state tracking and TTL based failure detection makes distributed encoding pipelines self healing. Individual machine failures never block movie processing.&lt;/p&gt;
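&lt;p&gt;A toy coordinator showing the lease-style TTL. Time is passed in explicitly so the self-healing step is easy to see; the names and structure are illustrative, not Netflix's actual service:&lt;/p&gt;

```python
class Coordinator:
    # Job states: PENDING -> IN_PROGRESS (with lease deadline) -> COMPLETED.
    # An expired lease means the worker died; the job silently requeues.
    def __init__(self, jobs, lease_s):
        self.lease_s = lease_s
        self.state = {j: ("PENDING", None) for j in jobs}

    def claim(self, job, now):
        self.state[job] = ("IN_PROGRESS", now + self.lease_s)

    def complete(self, job):
        self.state[job] = ("COMPLETED", None)

    def sweep(self, now):
        for job, (st, deadline) in self.state.items():
            if st == "IN_PROGRESS" and now > deadline:
                self.state[job] = ("PENDING", None)   # self-heal: requeue

    def all_done(self):
        return all(st == "COMPLETED" for st, _ in self.state.values())

c = Coordinator(jobs=[846, 847, 848], lease_s=60)
c.claim(846, now=0); c.complete(846)
c.claim(847, now=0)            # this worker crashes and never reports back
c.claim(848, now=0); c.complete(848)
c.sweep(now=100)               # 847's lease expired at t=60
assert c.state[847] == ("PENDING", None)
assert not c.all_done()        # the merge step must wait for 847's retry
```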




&lt;h2&gt;
  
  
  Challenge 7: Geo Licensing Checks
&lt;/h2&gt;

&lt;p&gt;Interview Question: Netflix operates in 190 countries with different licensing rules per title per country. When a user clicks play Netflix must instantly verify if content is licensed in their country. This check must happen in milliseconds. How do you design it?&lt;/p&gt;

&lt;p&gt;Solution: Redis Set per title with DynamoDB fallback and fail open strategy.&lt;/p&gt;

&lt;p&gt;Data structure — Redis Set per title:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Key is title ID&lt;/li&gt;
&lt;li&gt;Value is Redis Set containing all countries where title is licensed&lt;/li&gt;
&lt;li&gt;SISMEMBER titleID countryCode returns true or false in O(1)&lt;/li&gt;
&lt;li&gt;10000 titles times 190 countries — entirely manageable in Redis memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Licensing updates — eventual consistency is acceptable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Licensing changes happen a few times per day not in milliseconds&lt;/li&gt;
&lt;li&gt;DynamoDB stores licensing rules as source of truth&lt;/li&gt;
&lt;li&gt;Async background job syncs DynamoDB to Redis every few minutes&lt;/li&gt;
&lt;li&gt;Eventual consistency is perfectly fine — nobody needs sub-second licensing propagation&lt;/li&gt;
&lt;li&gt;This is deliberate under-engineering — Kafka streaming for licensing updates would be overkill&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fallback strategy — graceful degradation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis available — O(1) licensing check — instant response&lt;/li&gt;
&lt;li&gt;Redis down — fall back to DynamoDB — slightly slower but always available&lt;/li&gt;
&lt;li&gt;Both down — fail open — allow playback — minor licensing risk accepted&lt;/li&gt;
&lt;/ul&gt;
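&lt;p&gt;The fallback chain is a pair of try blocks. Sketched with in-memory dicts standing in for Redis and DynamoDB; a lookup failure here stands in for both a store being down and a key being missing:&lt;/p&gt;

```python
def is_licensed(title_id, country, redis_sets, dynamo_rows):
    # Fast path: set membership check, the stand-in for SISMEMBER.
    try:
        return country in redis_sets[title_id]
    except Exception:
        pass                                  # Redis down or key missing
    # Slow path: the source of truth.
    try:
        return country in dynamo_rows[title_id]
    except Exception:
        return True                           # fail open: never go dark

redis_sets = {"stranger-things": {"US", "IN", "JP"}}
dynamo = {"stranger-things": {"US", "IN", "JP"}, "dark": {"DE"}}

assert is_licensed("stranger-things", "IN", redis_sets, dynamo) is True
assert is_licensed("dark", "DE", redis_sets, dynamo) is True     # Redis miss -> DynamoDB
assert is_licensed("unknown", "US", redis_sets, dynamo) is True  # both miss -> fail open
```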

&lt;p&gt;Fail open philosophy: Netflix prioritizes user experience over minor licensing violations. Blocking 200 million users from watching anything during a 30 second Redis outage causes massive revenue loss and reputational damage. Serving unlicensed content to a small number of users for 30 seconds is an acceptable tradeoff.&lt;/p&gt;

&lt;p&gt;This is called Graceful Degradation — system degrades to a slower but functional state rather than failing completely.&lt;/p&gt;

&lt;p&gt;Key Insight: Redis Set gives O(1) geo licensing checks. DynamoDB fallback means Redis downtime is never user facing. Fail open philosophy ensures Netflix never goes dark over a licensing check infrastructure failure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 8: Personalized Recommendations
&lt;/h2&gt;

&lt;p&gt;Interview Question: Netflix shows every user a completely personalized homepage based on watch history, ratings, similar users, trending content, and time of day patterns. Generating this in real time for 200 million users simultaneously seems impossible. How do you make personalized recommendations appear instantly?&lt;/p&gt;

&lt;p&gt;Navigation: Recommendations do not need to be real time. Your taste does not change between 2pm and 2:05pm. Generating recommendations once per day and caching them is indistinguishable from real time generation — but vastly cheaper and faster.&lt;/p&gt;

&lt;p&gt;Solution: Offline precomputed recommendations with DynamoDB plus Redis cache.&lt;/p&gt;

&lt;p&gt;Offline ML pipeline runs continuously in background:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyzes watch history of all 200 million users&lt;/li&gt;
&lt;li&gt;Runs collaborative filtering — users with similar taste to you watched X&lt;/li&gt;
&lt;li&gt;Runs content based filtering — you liked action movies so here are more&lt;/li&gt;
&lt;li&gt;Computes personalized top 100 recommendations per user per region&lt;/li&gt;
&lt;li&gt;Stores results in DynamoDB&lt;/li&gt;
&lt;li&gt;Invalidates Redis cache so fresh recommendations load on next session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;User opens Netflix app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch precomputed recommendations from Redis cache — instant O(1) read&lt;/li&gt;
&lt;li&gt;Cache miss — load from DynamoDB — populate Redis — return results&lt;/li&gt;
&lt;li&gt;No ML computation at request time — pure database read&lt;/li&gt;
&lt;li&gt;Homepage loads in under 200ms&lt;/li&gt;
&lt;/ul&gt;
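&lt;p&gt;This read path is the classic cache-aside pattern. A sketch with dicts standing in for Redis and DynamoDB:&lt;/p&gt;

```python
def get_recommendations(user_id, redis_cache, dynamo_table):
    # Cache hit: O(1) read, no ML work at request time.
    recs = redis_cache.get(user_id)
    if recs is None:
        recs = dynamo_table[user_id]   # precomputed offline by the ML pipeline
        redis_cache[user_id] = recs    # warm the cache for the next app open
    return recs

cache = {}
table = {"u42": ["Dark", "Mindhunter", "Ozark"]}

assert get_recommendations("u42", cache, table) == ["Dark", "Mindhunter", "Ozark"]
assert cache["u42"] == ["Dark", "Mindhunter", "Ozark"]   # cache now warm
```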

&lt;p&gt;Cache invalidation strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML pipeline reruns — new recommendations computed — DynamoDB updated&lt;/li&gt;
&lt;li&gt;Redis cache invalidated per user&lt;/li&gt;
&lt;li&gt;Next app open — cache miss — fresh recommendations loaded from DynamoDB — cached again&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Precomputing recommendations offline transforms an impossibly complex real time ML problem into a trivial database read. User taste changes slowly — daily recomputation is indistinguishable from real time for the user experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Architecture Summary
&lt;/h2&gt;

&lt;p&gt;Global video delivery — CDN with regional servers, Netflix Open Connect inside ISPs&lt;br&gt;
CDN cache management — Push for blockbusters, pull on demand, dynamic TTL via Kafka&lt;br&gt;
Video chunking — 2 to 10 second chunks for instant playback start&lt;br&gt;
Adaptive bitrate — 5 quality versions per chunk, client side ABR algorithm&lt;br&gt;
Encoding pipeline — Parallel chunk encoding across distributed farm&lt;br&gt;
Fault tolerant encoding — Kafka job queue with coordinator and TTL based failure detection&lt;br&gt;
Geo licensing — Redis Set per title with DynamoDB fallback and fail open strategy&lt;br&gt;
Personalized recommendations — Offline ML pipeline stored in DynamoDB plus Redis cache&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Netflix is a masterclass in knowing when to compute eagerly and when to compute lazily. Recommendations are precomputed because real time ML at 200 million users is impossible. Chunks are encoded in parallel because sequential encoding is too slow. Licensing checks use eventual consistency because sub-second propagation is unnecessary overkill.&lt;/p&gt;

&lt;p&gt;The recurring theme across every layer is that the right architecture matches the actual requirements — not a theoretical ideal. Netflix does not need real time recommendations. It does not need synchronous licensing updates. It does not need to serve the full 100GB file before playback. Recognizing what you do not need is just as important as knowing what you do.&lt;/p&gt;

&lt;p&gt;Happy building. 🚀&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>interview</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Redis</title>
      <dc:creator>shubham pandey (Connoisseur)</dc:creator>
      <pubDate>Sun, 22 Mar 2026 04:39:13 +0000</pubDate>
      <link>https://forem.com/shubham_pandeyconnoisse/redis-4750</link>
      <guid>https://forem.com/shubham_pandeyconnoisse/redis-4750</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;From a frustrated Sicilian hacker in 2009 to the backbone of every scaled system you've ever used — here's everything Redis does and why it works.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The World Before Redis — and Why It Was Broken&lt;/li&gt;
&lt;li&gt;Inside Redis: The Engine That Shouldn't Work (But Does)&lt;/li&gt;
&lt;li&gt;Persistence: RDB vs AOF — How Redis Survives a Crash&lt;/li&gt;
&lt;li&gt;Replication &amp;amp; Sentinel — Redis Gets Serious About Availability&lt;/li&gt;
&lt;li&gt;Redis Cluster &amp;amp; Sharding — Going Horizontal&lt;/li&gt;
&lt;li&gt;Pub/Sub &amp;amp; Streams — Redis as a Message Bus&lt;/li&gt;
&lt;li&gt;The Verdict: When Redis Wins and When It Doesn't&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. The World Before Redis — and Why It Was Broken
&lt;/h2&gt;

&lt;p&gt;It's 2009. Salvatore Sanfilippo — &lt;em&gt;antirez&lt;/em&gt; on the internet — is trying to build a real-time web analytics tool called LLOOGG. Every page view needs to be recorded. Every user session needs a capped log. The data structure? A list. The operation? Append to end, pop from front when it exceeds N entries.&lt;/p&gt;

&lt;p&gt;He tries PostgreSQL. He tries MySQL. They work — for a few requests per second. Then they don't. Disk seeks kill him. Row locking kills him. The impedance mismatch between &lt;em&gt;"I need a capped list"&lt;/em&gt; and &lt;em&gt;"here's your B-tree index"&lt;/em&gt; kills him.&lt;/p&gt;

&lt;p&gt;So he does what engineers do when they're desperate enough: he writes his own database. In C. In a weekend. That database is Redis.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Timeline
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pre-2009 — The Dark Ages&lt;/strong&gt;&lt;br&gt;
Every team runs RDBMS for everything. Session storage in MySQL. Cache in MySQL. Rate limiting in MySQL. The database is both the source of truth and the punching bag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;~2003 — Memcached Arrives&lt;/strong&gt;&lt;br&gt;
Brad Fitzpatrick at LiveJournal builds Memcached to solve the read-heavy problem. Key-value, in-memory, fast. But it's a dumb cache — no persistence, no data structures, no atomicity. You can't atomically say &lt;em&gt;"increment this counter, creating it if it doesn't exist."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2009 — Redis Ships&lt;/strong&gt;&lt;br&gt;
antirez open-sources Redis. It's Memcached with a brain — data structures, atomic operations, optional persistence. Hacker News loses its mind. Within months, Twitter and GitHub are running it in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2010–2015 — The Takeover&lt;/strong&gt;&lt;br&gt;
Redis Sentinel ships for high availability. Redis Cluster ships for horizontal scaling. Redis goes from "interesting toy" to "required infrastructure." Salvatore joins VMware, then Pivotal, to work on it full-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2020–Present — Redis Ltd. &amp;amp; Forks&lt;/strong&gt;&lt;br&gt;
Redis Labs (now Redis Ltd.) introduces commercial licensing for modules. The community forks into Valkey (Linux Foundation) and Redict. The core is still C. The architecture is still single-threaded. The principles haven't changed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The real insight:&lt;/strong&gt; The world before Redis wasn't lacking a faster database. It was lacking a database that matched how engineers actually think about data — as lists, sets, counters, and queues — not just rows and columns.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. Inside Redis: The Engine That Shouldn't Work (But Does)
&lt;/h2&gt;

&lt;p&gt;Redis is &lt;strong&gt;single-threaded&lt;/strong&gt;. In an era of 64-core servers, this sounds insane. Every database textbook tells you concurrency is how you scale. Redis ignores the textbook — and it's faster than most multi-threaded systems at their own game.&lt;/p&gt;

&lt;p&gt;Here's why: the bottleneck was never the CPU. It was the disk. Redis lives in RAM. When your data fits in memory, the CPU processes commands in nanoseconds. Context switching, mutex contention, and lock overhead from multi-threading cost more than the work itself. One thread, zero contention, pure throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Event Loop (epoll)
&lt;/h3&gt;

&lt;p&gt;Redis uses an &lt;strong&gt;event-driven, non-blocking I/O model&lt;/strong&gt; — a reactor pattern powered by &lt;code&gt;epoll&lt;/code&gt; on Linux. Here's what happens on every tick:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│           Redis Event Loop (ae.c)           │
└─────────────────────────────────────────────┘
epoll_wait()  ← blocks until ≥1 fd is ready
      │
      ▼
for each ready fd:
  ├─ read event?  → parse command → execute → write response
  ├─ write event? → flush output buffer
  └─ timer event? → run background jobs (TTL expiry, etc.)
repeat forever
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;No threads. No locks. No context switches. 10,000 connections — the same single loop handles them all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Structures — The Real Differentiator
&lt;/h3&gt;

&lt;p&gt;Redis doesn't store "strings." It stores &lt;strong&gt;typed objects&lt;/strong&gt;, each with an encoding chosen at runtime for memory efficiency:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Internal Encoding&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;String&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SDS (Simple Dynamic String)&lt;/td&gt;
&lt;td&gt;Cache values, counters, session tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;List&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;QuickList (linked list of ListPack nodes)&lt;/td&gt;
&lt;td&gt;Message queues, activity feeds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ListPack → Hashtable (auto-promoted)&lt;/td&gt;
&lt;td&gt;User profiles, object storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Set&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ListPack → Hashtable&lt;/td&gt;
&lt;td&gt;Tags, unique visitors, membership&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sorted Set&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Skip List + Hash Map&lt;/td&gt;
&lt;td&gt;Leaderboards, rate limiters, trending topics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Sorted Set is the most interesting. It maintains a &lt;strong&gt;skip list&lt;/strong&gt; for ordered range queries and a &lt;strong&gt;hash map&lt;/strong&gt; for O(1) score lookups — two data structures kept in sync on every write, giving you O(1) score reads and O(log n) inserts, deletes, and range queries.&lt;/p&gt;
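&lt;p&gt;The dual-structure idea can be sketched in a few lines. This is a toy analog, not Redis's implementation: a sorted Python list (via &lt;code&gt;bisect&lt;/code&gt;) stands in for the skip list, and the class and method names are illustrative.&lt;/p&gt;

```python
import bisect

class MiniSortedSet:
    """Toy analog of a Redis Sorted Set: an ordered list (stand-in for the
    skip list) plus a dict (the hash map), kept in sync on every write."""

    def __init__(self):
        self._ordered = []   # list of (score, member), kept sorted
        self._scores = {}    # member -&gt; score, for O(1) score lookups

    def zadd(self, member, score):
        old = self._scores.get(member)
        if old is not None:
            self._ordered.remove((old, member))  # drop stale entry first
        bisect.insort(self._ordered, (score, member))
        self._scores[member] = score

    def zscore(self, member):        # O(1), via the hash-map side
        return self._scores.get(member)

    def zrange(self, start, stop):   # ordered range, via the sorted side
        return [m for _, m in self._ordered[start:stop + 1]]

lb = MiniSortedSet()
lb.zadd("alice", 300); lb.zadd("bob", 150); lb.zadd("carol", 220)
lb.zadd("bob", 500)   # updating a score keeps both structures in sync
```

&lt;p&gt;A leaderboard is exactly this: &lt;code&gt;zscore&lt;/code&gt; answers "what is my score?" instantly, &lt;code&gt;zrange&lt;/code&gt; answers "who are the top N?" from the ordered side.&lt;/p&gt;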

&lt;h3&gt;
  
  
  TTL Mechanics — Two-Phase Expiry
&lt;/h3&gt;

&lt;p&gt;When you call &lt;code&gt;EXPIRE key 60&lt;/code&gt;, Redis doesn't set a timer. It stores the expiry timestamp in a separate hash. Actual deletion happens in two ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lazy Expiry:&lt;/strong&gt; On every read, Redis checks if the key is expired before returning it. Expired? Delete it, return nil. Zero background overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active Expiry:&lt;/strong&gt; Every 100ms, Redis samples 20 random keys from the expiry hash. If more than 25% are expired, it samples again — loops until the expired ratio drops below 25% or the time budget runs out. This caps CPU usage while still reclaiming memory proactively.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ON every READ command:
  if key exists in expires_dict:
    if now() &amp;gt; expires_dict[key]:
      delete(key)
      return NIL

EVERY 100ms (activeExpireCycle):
  sample 20 keys from expires_dict
  delete all expired ones
  if expired_count / 20 &amp;gt; 0.25:
    repeat  // keep going until clean enough
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
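&lt;p&gt;The active-expiry loop can be simulated without Redis. A minimal sketch, assuming a round counter as a stand-in for the real time budget (function and parameter names are illustrative):&lt;/p&gt;

```python
import random

random.seed(0)  # deterministic sampling for the demo

def active_expire_cycle(expires, now, sample_size=20, threshold=0.25, max_rounds=16):
    """Toy model of activeExpireCycle: repeatedly sample keys with a TTL and
    delete expired ones, stopping once the expired ratio drops below the
    threshold or the round budget (stand-in for the time budget) runs out."""
    rounds = 0
    while expires and rounds < max_rounds:
        sample = random.sample(list(expires), min(sample_size, len(expires)))
        expired = [k for k in sample if expires[k] <= now]
        for k in expired:
            del expires[k]           # reclaim memory proactively
        rounds += 1
        if len(expired) / len(sample) <= threshold:
            break                    # keyspace is "clean enough"
    return rounds

# 1000 keys, 90% already expired at t=100: the cycle keeps looping
expires = {f"k{i}": (50 if i % 10 else 200) for i in range(1000)}
rounds = active_expire_cycle(expires, now=100)
```

&lt;p&gt;With the keyspace mostly expired, the loop exhausts its budget rather than running forever — exactly the CPU cap the paragraph above describes.&lt;/p&gt;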




&lt;h2&gt;
  
  
  3. Persistence: RDB vs AOF — How Redis Survives a Crash
&lt;/h2&gt;

&lt;p&gt;Redis is in-memory. If the process dies, the data dies with it — unless you've configured persistence. Redis gives you two mechanisms. They solve different problems. Most production systems use both.&lt;/p&gt;

&lt;h3&gt;
  
  
  RDB — The Snapshot
&lt;/h3&gt;

&lt;p&gt;RDB (Redis Database) takes a &lt;strong&gt;point-in-time snapshot&lt;/strong&gt; of your entire dataset and writes it to disk as a compact binary file. Think of it as a photograph of your memory at a moment in time.&lt;/p&gt;

&lt;p&gt;The magic: Redis uses &lt;code&gt;fork()&lt;/code&gt;. The parent process keeps serving requests. The child process inherits the memory, writes the snapshot, exits. The OS handles copy-on-write — pages only get duplicated if the parent modifies them. Zero downtime, low overhead.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RDB Snapshot Flow

Parent Process                    Child Process
──────────────                    ─────────────
serving requests                  (forked)
      │                                │
user writes page A  ──CoW──→   page A copied for child
      │                                │
continues serving              writes RDB to dump.rdb.tmp
      │                                │
      │                          exits cleanly
      │
atomic rename: dump.rdb.tmp → dump.rdb
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Trigger it manually with &lt;code&gt;BGSAVE&lt;/code&gt;, or configure automatic snapshots: &lt;code&gt;save 900 1&lt;/code&gt; means snapshot if ≥1 key changed in 900 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDB Pros:&lt;/strong&gt; Compact binary format, fast to load on restart, minimal performance impact, perfect for backups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDB Cons:&lt;/strong&gt; You lose all data since the last snapshot (could be minutes). Not suitable when you need near-zero data loss.&lt;/p&gt;

&lt;h3&gt;
  
  
  AOF — The Append-Only Log
&lt;/h3&gt;

&lt;p&gt;AOF (Append-Only File) logs every write command as it happens. On crash, Redis replays the log to reconstruct state — same concept as PostgreSQL's WAL.&lt;/p&gt;

&lt;p&gt;Three &lt;strong&gt;fsync policies&lt;/strong&gt; control the durability vs performance tradeoff:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Policy&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Data Loss Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;always&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fsync after every command&lt;/td&gt;
&lt;td&gt;Zero — slowest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;everysec&lt;/code&gt; &lt;em&gt;(default)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;fsync once per second&lt;/td&gt;
&lt;td&gt;At most 1 second&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;no&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OS decides when to fsync&lt;/td&gt;
&lt;td&gt;Up to OS buffer size&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;AOF Rewrite:&lt;/strong&gt; AOF files grow forever. Redis periodically rewrites the AOF — replacing the log of &lt;code&gt;SET x 1, INCR x, INCR x, INCR x&lt;/code&gt; with just &lt;code&gt;SET x 4&lt;/code&gt;. Done in the background via fork. File size collapses, replay time shrinks.&lt;/p&gt;
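&lt;p&gt;The compaction idea is easy to demonstrate. A minimal sketch that replays a command log into final state and emits the equivalent minimal log — real Redis does this in a forked child against the live dataset, and handles far more command types:&lt;/p&gt;

```python
def rewrite_aof(commands):
    """Replay a log of SET/INCR commands into final state, then emit the
    smallest log that reproduces that state (one SET per key)."""
    state = {}
    for cmd, key, *args in commands:
        if cmd == "SET":
            state[key] = int(args[0])
        elif cmd == "INCR":
            state[key] = state.get(key, 0) + 1
    return [("SET", key, value) for key, value in state.items()]

# The example from the text: four entries collapse into one
log = [("SET", "x", 1), ("INCR", "x"), ("INCR", "x"), ("INCR", "x")]
compact = rewrite_aof(log)
```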

&lt;h3&gt;
  
  
  The Hybrid: RDB + AOF
&lt;/h3&gt;

&lt;p&gt;Production best practice: &lt;strong&gt;enable both&lt;/strong&gt;. RDB gives you fast restarts and clean backups. AOF gives you durability between snapshots. Redis on restart prefers AOF (more complete), falls back to RDB if AOF is missing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; Tolerating minutes of data loss? → RDB only. Storing primary data that can't be replayed? → AOF with &lt;code&gt;everysec&lt;/code&gt;. Running something financial on Redis? → AOF with &lt;code&gt;always&lt;/code&gt;. And always test your recovery path. A backup you've never restored is just a file.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  4. Replication &amp;amp; Sentinel — Redis Gets Serious About Availability
&lt;/h2&gt;

&lt;p&gt;A single Redis node is a single point of failure. If it goes down, every cache miss hits your database simultaneously — the thundering herd. Replication is how Redis spreads read load and survives node failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Replication Works
&lt;/h3&gt;

&lt;p&gt;Redis uses &lt;strong&gt;asynchronous leader-follower replication&lt;/strong&gt;. One primary accepts writes. One or more replicas mirror the primary's data and serve reads.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ┌──────────────┐
 │   Primary    │  ← ALL writes go here
 └──────┬───────┘
        │  replication stream (async)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;┌────────┴────────┐&lt;br&gt;
   ▼                 ▼&lt;/p&gt;

&lt;p&gt;┌────────────┐    ┌────────────┐&lt;br&gt;
│  Replica 1 │    │  Replica 2 │&lt;br&gt;
│ (read-only)│    │ (read-only)│&lt;br&gt;
└────────────┘    └────────────┘&lt;br&gt;
Client reads  → any replica&lt;br&gt;
Client writes → primary only&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initial Sync:&lt;/strong&gt; Primary runs &lt;code&gt;BGSAVE&lt;/code&gt;, sends the RDB snapshot, then streams all commands that happened during the snapshot. Replica loads RDB, applies the delta — fully caught up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partial Resync:&lt;/strong&gt; If a replica briefly disconnects, it sends its replication offset on reconnect. The primary checks its &lt;strong&gt;replication backlog&lt;/strong&gt; (a circular buffer of recent commands). If the offset is still in the buffer, only the missed commands are replayed — no full RDB needed.&lt;/p&gt;
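&lt;p&gt;The offset-versus-backlog decision can be sketched with a bounded buffer. A toy model (class and method names are illustrative, and the real backlog is a byte buffer, not a command list):&lt;/p&gt;

```python
from collections import deque

class ReplicationBacklog:
    """Toy model of the primary's replication backlog: a bounded circular
    buffer of recent commands, indexed by a monotonically growing offset."""

    def __init__(self, capacity=4):
        self.buf = deque(maxlen=capacity)  # old entries fall off the front
        self.next_offset = 0

    def append(self, command):
        self.buf.append((self.next_offset, command))
        self.next_offset += 1

    def resync(self, replica_offset):
        """Partial resync if the replica's offset is still buffered,
        otherwise signal that a full RDB sync is required."""
        held = {off for off, _ in self.buf}
        if replica_offset in held or replica_offset == self.next_offset:
            return [cmd for off, cmd in self.buf if off >= replica_offset]
        return None  # offset aged out of the buffer: full sync needed

backlog = ReplicationBacklog(capacity=4)
for i in range(6):
    backlog.append(f"SET k{i} {i}")   # offsets 0..5; 0 and 1 have fallen off

delta = backlog.resync(4)   # replica missed offsets 4 and 5: partial resync
stale = backlog.resync(1)   # offset 1 no longer buffered: needs full sync
```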

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The catch:&lt;/strong&gt; Replication is asynchronous. Writes acknowledged by the primary may not yet be on replicas. If the primary crashes before replication completes, that data is gone. For critical data, pair with AOF &lt;code&gt;always&lt;/code&gt; or use &lt;code&gt;WAIT&lt;/code&gt; to force synchronous acknowledgement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Sentinel — Automated Failover
&lt;/h3&gt;

&lt;p&gt;Sentinel monitors your Redis topology and promotes a replica to primary when the leader fails.&lt;/p&gt;

&lt;p&gt;Failure scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Primary goes silent&lt;/li&gt;
&lt;li&gt;Sentinel marks it SDOWN (subjectively down)&lt;/li&gt;
&lt;li&gt;Quorum of Sentinels agree → ODOWN (objectively down)&lt;/li&gt;
&lt;li&gt;Election: one Sentinel leads the failover&lt;/li&gt;
&lt;li&gt;Best replica promoted to primary&lt;/li&gt;
&lt;li&gt;Other replicas repoint to new primary&lt;/li&gt;
&lt;li&gt;Clients get new primary address via Sentinel API&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Run at least 3 Sentinel instances (odd number for quorum). Your client talks to Sentinel first — asks "who is the current primary?" — then connects. Libraries like &lt;code&gt;ioredis&lt;/code&gt; and &lt;code&gt;redis-py&lt;/code&gt; handle this transparently.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sentinel is not a silver bullet.&lt;/strong&gt; Failover takes 30–60 seconds by default. Design your application to handle degraded writes gracefully — queue them, circuit-break, or serve stale reads. Never assume failover is instant.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. Redis Cluster &amp;amp; Sharding — Going Horizontal
&lt;/h2&gt;

&lt;p&gt;Sentinel solves availability. It doesn't solve capacity. If your dataset is 500GB, no single machine runs it in RAM. That's the problem Redis Cluster solves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hash Slots — The Foundation of Sharding
&lt;/h3&gt;

&lt;p&gt;Redis Cluster divides the keyspace into &lt;strong&gt;16,384 hash slots&lt;/strong&gt;. Every key maps to a slot: &lt;code&gt;slot = CRC16(key) % 16384&lt;/code&gt;. Slots are distributed across nodes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Node A (Primary) ── Slots 0–5460       + Node A’ (Replica)
Node B (Primary) ── Slots 5461–10922   + Node B’ (Replica)
Node C (Primary) ── Slots 10923–16383  + Node C’ (Replica)

Key "user:1234" → CRC16 % 16384 = 8976 → Node B
Key "user:5678" → CRC16 % 16384 = 1204 → Node A
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h3&gt;
  
  
  Request Routing — MOVED &amp;amp; ASK
&lt;/h3&gt;

&lt;p&gt;Cluster-aware clients cache the slot→node mapping and route directly. If the map is stale, the node replies with a &lt;code&gt;MOVED&lt;/code&gt; error. During live resharding, a transitional &lt;code&gt;ASK&lt;/code&gt; redirect is used.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Multi-Key Trap
&lt;/h3&gt;

&lt;p&gt;In Cluster mode, all keys in a single command must map to the same slot. &lt;code&gt;MGET user:1 user:2&lt;/code&gt; fails if those keys land on different nodes.&lt;/p&gt;

&lt;p&gt;Solution: &lt;strong&gt;hash tags&lt;/strong&gt;. &lt;code&gt;{user}.1&lt;/code&gt; and &lt;code&gt;{user}.2&lt;/code&gt; both hash on &lt;code&gt;user&lt;/code&gt; — guaranteed same slot.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Without hash tags — might fail in cluster
MGET user:1 user:2

# With hash tags — guaranteed same slot
MGET {user}.1 {user}.2
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
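&lt;p&gt;Why hash tags work is mechanical: Redis Cluster hashes with CRC16 (the XMODEM variant) and, if a key contains a non-empty &lt;code&gt;{...}&lt;/code&gt; tag, hashes only the tag. A sketch of the slot computation — a simplified version of the rule, not a drop-in for the server's implementation:&lt;/p&gt;

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XMODEM): poly 0x1021, init 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """If the key contains a non-empty {...} hash tag, hash only the tag —
    that is what pins related keys to the same slot."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:   # ignore empty tags like "{}"
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

# Plain keys usually scatter; hash-tagged keys are guaranteed to co-locate.
same = key_slot("{user}.1") == key_slot("{user}.2")
```

&lt;p&gt;Both tagged keys reduce to hashing the string &lt;code&gt;user&lt;/code&gt;, so they cannot land on different nodes — which is exactly what multi-key commands require.&lt;/p&gt;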

&lt;p&gt;This is not optional trivia. It’s a design constraint you’ll hit on day one of cluster migration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster vs Sentinel
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Sentinel&lt;/th&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use when&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dataset fits in one node’s RAM&lt;/td&gt;
&lt;td&gt;Dataset exceeds single-node RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single primary&lt;/td&gt;
&lt;td&gt;Horizontal across nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-key commands&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works freely&lt;/td&gt;
&lt;td&gt;Requires hash tags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ops complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  6. Pub/Sub &amp;amp; Streams — Redis as a Message Bus
&lt;/h2&gt;

&lt;p&gt;Redis can act as a lightweight message broker — with two very different mechanisms: Pub/Sub and Streams. They solve different problems. Use the wrong one and you’ll regret it at 2am.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pub/Sub — Fire and Forget
&lt;/h3&gt;

&lt;p&gt;Publishers send to channels, subscribers receive. No history. No persistence. If a subscriber is offline when a message is published, that message is gone forever.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Publisher
PUBLISH notifications "user:42 completed checkout"

# Subscriber
SUBSCRIBE notifications

# Pattern subscribe
PSUBSCRIBE notifications:*
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Use Pub/Sub for:&lt;/strong&gt; real-time events where loss is acceptable — typing indicators, presence updates, cache invalidation signals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never use Pub/Sub for:&lt;/strong&gt; anything requiring guaranteed delivery. If a subscriber is offline, the message is gone. Full stop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The invisible problem:&lt;/strong&gt; Pub/Sub is a telephone call, not a voicemail. If no one picks up, nothing is recorded.&lt;/p&gt;

&lt;h3&gt;
  
  
  Redis Streams — Persistent, Ordered Message Log
&lt;/h3&gt;

&lt;p&gt;Streams are a durable, append-only log introduced in Redis 5.0 — conceptually similar to Kafka topics but living inside Redis. Messages persist until explicitly deleted.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Producer
XADD orders * user_id 42 item "laptop" amount 1299

# Consumer (blocking)
XREAD COUNT 10 BLOCK 0 STREAMS orders $

# Consumer Group — each message delivered to one worker
XGROUP CREATE orders order-processors $ MKSTREAM
XREADGROUP GROUP order-processors worker-1 COUNT 5 STREAMS orders &amp;gt;

# Acknowledge — removes from Pending Entry List
XACK orders order-processors 1686123456789-0
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Pending Entry List (PEL)&lt;/strong&gt; is the critical piece: every delivered-but-not-acknowledged message sits here. If a worker crashes, &lt;code&gt;XCLAIM&lt;/code&gt; reassigns its pending messages to another worker. This gives you at-least-once delivery semantics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Streams vs Pub/Sub
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Pub/Sub&lt;/th&gt;
&lt;th&gt;Streams&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Persistence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delivery guarantee&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;At-least-once (PEL + XACK)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consumer groups&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Replay&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Impossible&lt;/td&gt;
&lt;td&gt;From any point&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ephemeral signals&lt;/td&gt;
&lt;td&gt;Reliable event workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
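&lt;p&gt;The PEL mechanics — deliver, track, &lt;code&gt;XACK&lt;/code&gt;, &lt;code&gt;XCLAIM&lt;/code&gt; — can be simulated in a few lines. A toy model, not the redis-py API; the class and method names are illustrative, and &lt;code&gt;xreadgroup&lt;/code&gt; here implements only the "new messages" (&lt;code&gt;&amp;gt;&lt;/code&gt;) semantics:&lt;/p&gt;

```python
import itertools

class MiniConsumerGroup:
    """Toy model of a Stream consumer group's Pending Entry List (PEL):
    delivered-but-unacked entries are tracked per worker until acked,
    and a crashed worker's entries can be claimed by another worker."""

    def __init__(self):
        self._ids = itertools.count()
        self.stream = {}    # id -> message (the durable log)
        self.pel = {}       # id -> worker currently responsible

    def xadd(self, message):
        mid = next(self._ids)
        self.stream[mid] = message
        return mid

    def xreadgroup(self, worker):
        """Deliver the next never-delivered entry to this worker."""
        for mid in self.stream:
            if mid not in self.pel:
                self.pel[mid] = worker
                return mid, self.stream[mid]
        return None

    def xack(self, mid):
        self.pel.pop(mid, None)      # processing confirmed: drop from PEL

    def xclaim(self, mid, new_worker):
        self.pel[mid] = new_worker   # reassign a dead worker's entry

g = MiniConsumerGroup()
a = g.xadd("order 1"); b = g.xadd("order 2")
g.xreadgroup("worker-1"); g.xreadgroup("worker-2")
g.xack(a)                  # worker-1 finished its entry
g.xclaim(b, "worker-1")    # worker-2 crashed; its entry is reassigned
```

&lt;p&gt;Nothing leaves the PEL without an explicit ack — that is the whole at-least-once guarantee in one invariant.&lt;/p&gt;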

&lt;p&gt;&lt;strong&gt;Streams vs Kafka:&lt;/strong&gt; Under 100K events/sec and already running Redis? Streams probably suffice. Above that? Kafka earns its complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. The Verdict: When Redis Wins and When It Doesn’t
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Redis for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session storage and auth tokens&lt;/li&gt;
&lt;li&gt;Rate limiting (&lt;code&gt;INCR&lt;/code&gt; + &lt;code&gt;EXPIRE&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Leaderboards (Sorted Sets)&lt;/li&gt;
&lt;li&gt;Distributed locks (&lt;code&gt;SET NX PX&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Real-time feed aggregation (fan-out on write)&lt;/li&gt;
&lt;li&gt;Job queues (Lists or Streams)&lt;/li&gt;
&lt;li&gt;Caching hot database rows&lt;/li&gt;
&lt;li&gt;Cache invalidation across nodes (Pub/Sub)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don’t use Redis for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary data store for large datasets (RAM is expensive)&lt;/li&gt;
&lt;li&gt;Complex queries and JOINs&lt;/li&gt;
&lt;li&gt;Full-text search → use Elasticsearch&lt;/li&gt;
&lt;li&gt;Heavy analytics → use ClickHouse or BigQuery&lt;/li&gt;
&lt;li&gt;High-volume audit logs → use Kafka&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;antirez built Redis to solve one problem: fast, structured, in-memory operations. Fifteen years later, it still solves exactly that — and everything built on top of it is just clever applications of the same primitives he wrote in C over a weekend in Sicily.&lt;/p&gt;

&lt;p&gt;The lesson isn’t “use Redis.” The lesson is: know what your database is optimized for, and stop asking it to do everything else.&lt;/p&gt;
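&lt;p&gt;One pattern from the "use Redis for" list — rate limiting with &lt;code&gt;INCR&lt;/code&gt; + &lt;code&gt;EXPIRE&lt;/code&gt; — can be sketched without a server. A fixed-window limiter simulated with a local dict (in real Redis, run the two commands atomically, e.g. in a Lua script or a pipeline); class and key names are illustrative:&lt;/p&gt;

```python
import time

class FixedWindowLimiter:
    """Fixed-window rate limiting: one counter per (user, window) key,
    incremented per request — the local-dict analog of INCR on a key
    that EXPIREs when the window ends."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}   # (user, window index) -> request count

    def allow(self, user, now=None):
        now = time.time() if now is None else now
        key = (user, int(now // self.window))      # window-scoped key
        self.counters[key] = self.counters.get(key, 0) + 1   # INCR
        return self.counters[key] <= self.limit

limiter = FixedWindowLimiter(limit=3, window_seconds=60)
results = [limiter.allow("user:42", now=1000) for _ in range(5)]
```

&lt;p&gt;Requests 4 and 5 in the same window are rejected; a new window starts with a fresh counter, just as the Redis key would after expiring.&lt;/p&gt;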

&lt;p&gt;If this was useful — share it. If you disagree — the comments exist for a reason.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>database</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Designing a Stock Exchange / Trading System at Scale A System Design Deep Dive — Question by Question</title>
      <dc:creator>shubham pandey (Connoisseur)</dc:creator>
      <pubDate>Sat, 21 Mar 2026 16:34:52 +0000</pubDate>
      <link>https://forem.com/shubham_pandeyconnoisse/designing-a-stock-exchange-trading-system-at-scale-a-system-design-deep-dive-question-by-4be9</link>
      <guid>https://forem.com/shubham_pandeyconnoisse/designing-a-stock-exchange-trading-system-at-scale-a-system-design-deep-dive-question-by-4be9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A stock exchange is one of the most demanding and unforgiving systems in software engineering. A single millisecond of downtime means millions of dollars lost. A single duplicate transaction means regulatory shutdown. A single race condition on a balance means catastrophic financial loss. This post walks through every challenge question by question, including wrong turns and how to navigate out of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 1: The Order Book Data Structure
&lt;/h2&gt;

&lt;p&gt;Interview Question: A stock exchange must match buyers and sellers in strict price and time priority in microseconds. What data structure holds all pending buy and sell orders and efficiently finds the best match when a new order arrives?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Store orders in a simple array and scan for best match.&lt;/p&gt;

&lt;p&gt;Why It Fails: Finding the best matching order requires scanning the entire array — O(n) time. At 10 million orders per second this is completely unacceptable.&lt;/p&gt;

&lt;p&gt;Navigation: The key insight is to bucket orders by price level. Within each price level maintain a queue for time priority. Use a hashmap for O(1) price level lookup. But a hashmap alone cannot efficiently find the next best price without scanning all keys. You need a data structure that always gives you minimum sell price and maximum buy price instantly.&lt;/p&gt;

&lt;p&gt;Solution: Order Book using Balanced BST plus Hashmap plus Queue per price level.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Buy Orders (Bids):
  $152 → [Order3, Order7]
  $151 → [Order1, Order9]
  $150 → [Order2, Order4]

Sell Orders (Asks):
  $153 → [Order1, Order5]
  $154 → [Order2, Order8]
  $155 → [Order4, Order6]

BST.max() → best bid price — O(log n)
BST.min() → best ask price — O(log n)
Queue per price level → FIFO time priority — O(1) enqueue and dequeue
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Full Order Book operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insert new order — add to queue at price level, insert price into BST — O(log n)&lt;/li&gt;
&lt;li&gt;Find best match — BST.min() for sells, BST.max() for buys — O(log n)&lt;/li&gt;
&lt;li&gt;Remove matched order — dequeue from front of price queue — O(1)&lt;/li&gt;
&lt;li&gt;Remove empty price level — delete from BST — O(log n)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is called Price-Time Priority matching — best price wins, earliest order wins at same price.&lt;/p&gt;

&lt;p&gt;Key Insight: The Order Book is three data structures working together — BST for price level ordering, Hashmap for O(1) price lookup, and Queue for time priority within each price level.&lt;/p&gt;
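&lt;p&gt;A minimal sketch of these operations in Python, using a heap as a stand-in for the BST and ignoring quantities and partial fills — a toy to show price-time priority, not a production matching engine:&lt;/p&gt;

```python
import heapq
from collections import deque

class OrderBook:
    """Price-time priority: a heap per side orders the price levels
    (stand-in for the BST), a dict gives O(1) level lookup, and a
    deque per level preserves FIFO time priority."""

    def __init__(self):
        self.bid_heap, self.bids = [], {}   # max-heap via negated prices
        self.ask_heap, self.asks = [], {}   # min-heap

    def add(self, side, price, order_id):
        heap, levels = (self.bid_heap, self.bids) if side == "buy" else (self.ask_heap, self.asks)
        if price not in levels:
            levels[price] = deque()
            heapq.heappush(heap, -price if side == "buy" else price)
        levels[price].append(order_id)      # FIFO within the level

    def best_bid(self):
        return -self.bid_heap[0] if self.bid_heap else None

    def best_ask(self):
        return self.ask_heap[0] if self.ask_heap else None

    def match(self):
        """Execute one trade if the best bid crosses the best ask."""
        if not (self.bid_heap and self.ask_heap):
            return None
        bid, ask = -self.bid_heap[0], self.ask_heap[0]
        if bid < ask:
            return None
        buy_id = self.bids[bid].popleft()   # earliest order at best bid
        sell_id = self.asks[ask].popleft()
        for heap, levels, p in ((self.bid_heap, self.bids, bid),
                                (self.ask_heap, self.asks, ask)):
            if not levels[p]:               # drop the now-empty price level
                del levels[p]
                heapq.heappop(heap)
        return buy_id, sell_id

book = OrderBook()
book.add("buy", 152, "B1"); book.add("buy", 152, "B2"); book.add("buy", 151, "B3")
book.add("sell", 152, "S1"); book.add("sell", 154, "S2")
trade = book.match()   # best bid 152 meets best ask 152; B1 wins on time
```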




&lt;h2&gt;
  
  
  Challenge 2: Fault Tolerance Without Sacrificing Microsecond Latency
&lt;/h2&gt;

&lt;p&gt;Interview Question: The Order Book lives in memory on one machine. A crash loses every pending order worth billions of dollars. But distributing across machines adds milliseconds of latency destroying microsecond matching. How do you make a single machine Order Book fault tolerant?&lt;/p&gt;

&lt;p&gt;Wrong Approach 1: Periodic snapshots to disk every few seconds.&lt;/p&gt;

&lt;p&gt;Why It Fails: A crash 4.9 seconds after the last 5 second snapshot loses 4.9 seconds of orders. Any gap is unacceptable in financial systems.&lt;/p&gt;

&lt;p&gt;Wrong Approach 2: Async write to disk for every order.&lt;/p&gt;

&lt;p&gt;Why It Fails: Async writes are fast but data between the write and crash is still lost. Even a tiny gap is catastrophic.&lt;/p&gt;

&lt;p&gt;Navigation: Disk writes are slow because of mechanical latency. What if persistent storage operated at RAM speed? And what if replication happened over a network so fast it was equivalent to local memory access?&lt;/p&gt;

&lt;p&gt;Solution: NVM RAM plus RDMA replication plus Write Ahead Log.&lt;/p&gt;

&lt;p&gt;NVM RAM — Non Volatile Memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writes at RAM speed — nanoseconds not milliseconds&lt;/li&gt;
&lt;li&gt;Data survives power loss like a disk&lt;/li&gt;
&lt;li&gt;Combines speed of RAM with durability of disk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RDMA Replication — Remote Direct Memory Access:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write to standby machine RAM over ultra fast data center network&lt;/li&gt;
&lt;li&gt;Takes roughly 1 microsecond — equivalent to local memory access&lt;/li&gt;
&lt;li&gt;Standby machine has identical Order Book state at all times&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Write Ahead Log — WAL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every order written to NVM RAM log before being applied to Order Book&lt;/li&gt;
&lt;li&gt;On crash — restore last snapshot, replay WAL log, recover every single order&lt;/li&gt;
&lt;li&gt;Zero data loss guaranteed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full fault tolerance flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New order arrives&lt;/li&gt;
&lt;li&gt;Written to NVM RAM WAL simultaneously with matching engine processing&lt;/li&gt;
&lt;li&gt;RDMA replication to standby machine in 1 microsecond&lt;/li&gt;
&lt;li&gt;Matching engine never blocked — all persistence happens at memory speed&lt;/li&gt;
&lt;li&gt;Machine crashes — standby takes over instantly — replays any missed WAL entries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: NVM RAM for local persistence and RDMA replication for standby failover achieves zero data loss with microsecond failover without ever blocking the matching engine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 3: Crash Safe Write Ahead Log
&lt;/h2&gt;

&lt;p&gt;Interview Question: What happens if the system crashes while writing the log entry itself? A partial log entry could corrupt recovery entirely.&lt;/p&gt;

&lt;p&gt;Navigation: The WAL itself needs crash safety. A log entry must be either fully written and valid or not written at all. Partial entries must be instantly detectable and discardable on recovery.&lt;/p&gt;

&lt;p&gt;Solution: Checksum plus UUID idempotency.&lt;/p&gt;

&lt;p&gt;Checksum for entry validity:&lt;br&gt;
Every log entry includes a checksum computed from all of its fields. On recovery, recompute the checksum from the entry fields. If it matches, the entry was fully written and is safe to replay. If it does not match, the entry was partially written and must be discarded.&lt;/p&gt;

&lt;p&gt;UUID for idempotent replay:&lt;br&gt;
Every log entry has a unique UUID. Before executing any step, check whether that UUID has already been processed. If it has, skip it — no duplicate execution. If not, execute it and mark it complete in the log.&lt;/p&gt;

&lt;p&gt;Combined crash safety covers every failure scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Crash before log write completes — checksum mismatch — entry discarded — no action taken&lt;/li&gt;
&lt;li&gt;Crash after log written but before execution — both steps PENDING — replay safely from beginning&lt;/li&gt;
&lt;li&gt;Crash mid execution — UUID check — completed steps skipped — pending steps executed&lt;/li&gt;
&lt;li&gt;Zero data loss and zero duplicates regardless of crash timing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Checksum detects corrupt log entries. UUID prevents duplicate execution on replay. Together they make the WAL completely crash safe at every possible failure point.&lt;/p&gt;
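&lt;p&gt;Both mechanisms fit in a short sketch. This is an illustration of the pattern, not any particular exchange's WAL format; CRC32 stands in for whatever checksum the real system uses, and the pipe-delimited framing is an assumption of this example:&lt;/p&gt;

```python
import json
import uuid
import zlib

def make_entry(payload):
    """A WAL entry: payload + UUID, framed with a checksum over the body."""
    entry = {"id": str(uuid.uuid4()), "payload": payload}
    body = json.dumps(entry, sort_keys=True).encode()
    return body + b"|" + str(zlib.crc32(body)).encode()

def recover(log_lines, processed_ids):
    """Replay a WAL: discard entries whose checksum doesn't match (torn
    writes), skip entries whose UUID was already processed (idempotency)."""
    replayable = []
    for line in log_lines:
        body, _, stored = line.rpartition(b"|")
        if not body or str(zlib.crc32(body)).encode() != stored:
            continue                      # partial write — ignore
        entry = json.loads(body)
        if entry["id"] in processed_ids:
            continue                      # already executed — skip
        replayable.append(entry["payload"])
    return replayable

good = make_entry({"op": "debit", "amount": 100})
torn = good[:-3]                          # crash mid-write: truncated entry
done = make_entry({"op": "credit", "amount": 100})
done_id = json.loads(done.rpartition(b"|")[0])["id"]

to_replay = recover([torn, good, done], processed_ids={done_id})
```

&lt;p&gt;Recovery replays exactly one entry: the torn one fails its checksum, the already-processed one is skipped by UUID.&lt;/p&gt;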




&lt;h2&gt;
  
  
  Challenge 4: Stop Loss Cascade and Circuit Breakers
&lt;/h2&gt;

&lt;p&gt;Interview Question: Millions of stop loss orders all trigger simultaneously when a price drops sharply. Each converts to a market order instantly flooding the matching engine. This is what caused the Flash Crash of 2010 when the Dow Jones dropped 1000 points in minutes. How do you handle this?&lt;/p&gt;

&lt;p&gt;Solution Part 1 — Separate Stop Loss Order Book:&lt;/p&gt;

&lt;p&gt;Maintain a dedicated Stop Loss Order Book alongside the main Order Book using the same price bucketing structure. Each bucket contains orders that trigger at that price level. When price drops all orders at triggered price levels are converted to market orders simultaneously and queued for the matching engine.&lt;/p&gt;

&lt;p&gt;Solution Part 2 — Circuit Breakers — Lower Circuit and Upper Circuit:&lt;/p&gt;

&lt;p&gt;When price moves too fast the exchange temporarily halts trading to prevent cascade.&lt;/p&gt;

&lt;p&gt;Indian market NSE/BSE circuit breaker levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Index drops 10 percent — trading halts 45 minutes&lt;/li&gt;
&lt;li&gt;Index drops 15 percent — trading halts 1 hour 45 minutes&lt;/li&gt;
&lt;li&gt;Index drops 20 percent — trading halts rest of day&lt;/li&gt;
&lt;li&gt;Individual stocks have 5, 10, or 20 percent circuit limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Price update arrives&lt;/li&gt;
&lt;li&gt;Circuit Breaker Service checks current price versus opening price&lt;/li&gt;
&lt;li&gt;Move exceeds circuit limit — set stock status to HALTED in Redis instantly&lt;/li&gt;
&lt;li&gt;Matching Engine checks Redis status before processing every single order&lt;/li&gt;
&lt;li&gt;Status HALTED — new orders rejected and stored in pending queue with original time priority preserved&lt;/li&gt;
&lt;li&gt;Timer expires — status set back to ACTIVE — pending queue resumes processing in order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stop loss cascade with circuit breakers in action:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apple drops to $144 — 2 million stop losses trigger simultaneously&lt;/li&gt;
&lt;li&gt;First batch hits matching engine&lt;/li&gt;
&lt;li&gt;Price drops 10 percent from opening — circuit breaker trips&lt;/li&gt;
&lt;li&gt;Redis status set to HALTED&lt;/li&gt;
&lt;li&gt;Remaining 1.9 million stop losses held in pending queue&lt;/li&gt;
&lt;li&gt;Trading halts 45 minutes — market stabilizes&lt;/li&gt;
&lt;li&gt;Trading resumes — orders process in controlled manner with original time priority preserved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: A separate Stop Loss Order Book handles trigger detection efficiently. Circuit breakers act as the emergency brake preventing cascade failures from destroying market stability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 5: Atomic Trade Settlement
&lt;/h2&gt;

&lt;p&gt;Interview Question: Every trade must settle atomically — buyer gets shares and seller gets money together or neither happens. A crash between the two steps leaves one party with nothing. How do you guarantee atomicity?&lt;/p&gt;

&lt;p&gt;Solution: Two Phase Commit with Write Ahead Log and UUID idempotency.&lt;/p&gt;

&lt;p&gt;Phase 1 — Write intent to WAL before executing anything:&lt;br&gt;
Both settlement steps are written to the WAL as PENDING with a single UUID before any execution begins. This is the commit point — if the system crashes before this write nothing has happened and nothing needs to be undone.&lt;/p&gt;

&lt;p&gt;Phase 2 — Execute steps and mark complete one by one:&lt;br&gt;
Execute step 1, mark it COMPLETED in WAL. Execute step 2, mark it COMPLETED in WAL. Transaction fully settled.&lt;/p&gt;

&lt;p&gt;Crash recovery at any point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Crash before WAL write — checksum mismatch — entry discarded — no action taken&lt;/li&gt;
&lt;li&gt;Crash after WAL written — both steps PENDING — replay from beginning safely&lt;/li&gt;
&lt;li&gt;Crash after step 1 — WAL shows step 1 COMPLETED step 2 PENDING — UUID check skips step 1 — executes step 2 only&lt;/li&gt;
&lt;li&gt;Crash after step 2 — WAL shows both COMPLETED — UUID check skips both — transaction already settled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note on T+2 settlement: Most markets settle trades 2 business days after execution. The WAL and two phase commit pattern applies identically at settlement time — the same crash safety guarantees apply whether settlement happens in microseconds or 2 days later.&lt;/p&gt;

&lt;p&gt;Key Insight: Write intent before acting, mark completion after each step, use UUID to prevent duplicate execution. This three part pattern makes any multi-step financial operation completely crash safe.&lt;/p&gt;
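&lt;p&gt;The write-intent / mark-complete pattern can be shown end to end in a few lines. A sketch under stated assumptions: the WAL is a dict, the settlement steps are plain functions, and all names are illustrative:&lt;/p&gt;

```python
def settle(wal, trade_id, steps):
    """Two-phase settlement sketch: log intent as PENDING before anything
    runs, mark each step COMPLETED after it runs, and skip completed steps
    on replay so a crash-recovery rerun never executes anything twice."""
    if trade_id not in wal:                       # phase 1: write intent
        wal[trade_id] = {name: "PENDING" for name, _ in steps}
    for name, action in steps:                    # phase 2: execute + mark
        if wal[trade_id][name] == "COMPLETED":
            continue                              # idempotent replay
        action()
        wal[trade_id][name] = "COMPLETED"

ledger = {"buyer_shares": 0, "seller_cash": 0}

def transfer_shares():
    ledger["buyer_shares"] += 100

def transfer_cash():
    ledger["seller_cash"] += 15000

steps = [("transfer_shares", transfer_shares), ("transfer_cash", transfer_cash)]
wal = {}
settle(wal, "trade-1", steps)   # normal run: both steps execute
settle(wal, "trade-1", steps)   # crash-recovery replay: nothing doubles
```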




&lt;h2&gt;
  
  
  Challenge 6: Counterparty Risk and Pre-Trade Fund Locking
&lt;/h2&gt;

&lt;p&gt;Interview Question: Settlement happens T+2 days after trade execution. What if the buyer does not have enough money or the seller does not own the shares they sold? How does the exchange protect against counterparty risk?&lt;/p&gt;

&lt;p&gt;Solution: Pre-trade risk checks with immediate fund and share locking.&lt;/p&gt;

&lt;p&gt;Before any order enters the Order Book:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check available balance — does buyer have sufficient funds?&lt;/li&gt;
&lt;li&gt;Lock required funds immediately — frozen for this specific order&lt;/li&gt;
&lt;li&gt;Check share ownership — does seller actually own the shares?&lt;/li&gt;
&lt;li&gt;Lock shares immediately — cannot be sold in any other order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Account state during pending trade:&lt;br&gt;
Total Balance 15 million dollars&lt;br&gt;
Locked Amount 10 million dollars — frozen for pending order&lt;br&gt;
Available Balance 5 million dollars — available for new orders only&lt;/p&gt;

&lt;p&gt;Lock lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Order placed — funds locked immediately&lt;/li&gt;
&lt;li&gt;Order cancelled — locked funds released immediately&lt;/li&gt;
&lt;li&gt;Order partially filled — the corresponding portion of the locked funds released proportionally&lt;/li&gt;
&lt;li&gt;Order fully filled — locked funds transferred to counterparty at T+2&lt;/li&gt;
&lt;/ul&gt;
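&lt;p&gt;The lock lifecycle above, with the balance figures from the account state example, can be sketched as a toy account class (all names are illustrative assumptions, not a real exchange API):&lt;/p&gt;

```python
class TraderAccount:
    """Toy sketch of pre-trade fund locking: available = total - locked."""

    def __init__(self, total):
        self.total = total
        self.locked = 0

    @property
    def available(self):
        return self.total - self.locked

    def lock(self, amount):
        if amount > self.available:
            return False         # pre-trade risk check fails: order rejected
        self.locked += amount    # frozen for this specific order
        return True

    def release(self, amount):   # order cancelled or partially filled
        self.locked -= amount

    def settle(self, amount):    # order fully filled: funds transfer at T+2
        self.locked -= amount
        self.total -= amount


acct = TraderAccount(15_000_000)
assert acct.lock(10_000_000)        # order placed: 10 million frozen
assert acct.available == 5_000_000  # only 5 million usable for new orders
assert not acct.lock(6_000_000)     # over-committing is rejected up front
acct.settle(10_000_000)             # fill settles: locked funds leave the account
print(acct.total, acct.locked)      # 5000000 0
```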

&lt;p&gt;This is called Margin Requirements in financial terminology. Exchanges also require traders to maintain a minimum margin balance as additional protection against large adverse price moves during the T+2 window.&lt;/p&gt;

&lt;p&gt;Key Insight: Pre-trade risk checks with immediate fund locking eliminate counterparty risk entirely. By the time a trade executes all required funds and shares are already reserved and cannot be used elsewhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 7: Balance Storage — ACID vs Speed
&lt;/h2&gt;

&lt;p&gt;Interview Question: Trader balance checks happen before every order — thousands per second. You need extremely fast reads and strongly consistent writes. Eventual consistency is not acceptable. What storage solution do you use?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Store all balance data in Redis only.&lt;/p&gt;

&lt;p&gt;Why It Fails: Redis in cluster mode is eventually consistent. Two simultaneous orders can both read the same available balance before either locks funds — allowing a trader to commit more funds than they have. This race condition on financial balance is catastrophic.&lt;/p&gt;

&lt;p&gt;Why Eventual Consistency Is Unacceptable Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Twitter feed showing slightly stale tweets — acceptable, nobody loses money&lt;/li&gt;
&lt;li&gt;Uber showing driver location 100 meters off — acceptable, minor inconvenience&lt;/li&gt;
&lt;li&gt;Trader balance showing stale available funds — catastrophic, exchange loses millions instantly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solution: Relational Database for ACID guarantees plus Redis cache for read speed.&lt;/p&gt;

&lt;p&gt;Write path — fund locking via relational database with row level locking:&lt;br&gt;
Begin transaction. Select balance for trader with row level lock. If available balance is greater than or equal to order amount then update locked amount and reduce available balance and commit. Otherwise rollback and reject order. End transaction.&lt;/p&gt;

&lt;p&gt;Two simultaneous orders handled safely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Order 1 acquires row level lock — locks 10 million dollars&lt;/li&gt;
&lt;li&gt;Order 2 waits for row level lock to be released&lt;/li&gt;
&lt;li&gt;Order 1 commits — lock released — balance updated to reflect locking&lt;/li&gt;
&lt;li&gt;Order 2 reads updated balance — zero available — rejected cleanly&lt;/li&gt;
&lt;li&gt;No double spending possible under any timing scenario&lt;/li&gt;
&lt;/ul&gt;
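&lt;p&gt;The row level lock behavior can be simulated with a per-row mutex standing in for the database's SELECT ... FOR UPDATE. This is a minimal sketch (not a real DB driver): the mutex serializes the read-check-update, so two simultaneous orders can never both see the same available balance.&lt;/p&gt;

```python
import threading


class BalanceStore:
    """Toy model of row level locking: the mutex plays the role of the DB row lock."""

    def __init__(self, available):
        self.available = available
        self._row_lock = threading.Lock()

    def try_lock_funds(self, amount):
        with self._row_lock:          # second order blocks here, like Order 2 waiting
            if self.available < amount:
                return False          # rollback: order rejected cleanly
            self.available -= amount  # update inside the "transaction"
            return True               # commit


store = BalanceStore(10_000_000)
results = []
threads = [
    threading.Thread(target=lambda: results.append(store.try_lock_funds(10_000_000)))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [False, True] — exactly one order locks the funds
```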

&lt;p&gt;Read path — balance check via Redis cache:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Available balance cached in Redis — O(1) read for every order check&lt;/li&gt;
&lt;li&gt;Balance updated in relational DB — Redis cache invalidated immediately&lt;/li&gt;
&lt;li&gt;Next read — cache miss — reload from DB — cache refreshed&lt;/li&gt;
&lt;li&gt;Stale cache never used for actual locking — only the relational DB performs locking&lt;/li&gt;
&lt;/ul&gt;
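&lt;p&gt;The read path is the classic cache-aside pattern. A minimal sketch (a dict stands in for both the relational DB and Redis; names are illustrative): reads go through the cache, writes commit to the DB and invalidate the cache, and locking only ever happens against the DB.&lt;/p&gt;

```python
class BalanceCache:
    """Cache-aside sketch: DB is the source of truth, cache is for read speed."""

    def __init__(self, db):
        self.db = db       # stands in for the relational database
        self.cache = {}    # stands in for Redis

    def read(self, trader_id):
        if trader_id not in self.cache:                  # cache miss
            self.cache[trader_id] = self.db[trader_id]   # reload from DB, refresh cache
        return self.cache[trader_id]

    def write(self, trader_id, balance):
        self.db[trader_id] = balance      # committed inside a DB transaction
        self.cache.pop(trader_id, None)   # invalidate immediately


db = {"trader_1": 15_000_000}
balances = BalanceCache(db)
assert balances.read("trader_1") == 15_000_000  # miss, then cached
balances.write("trader_1", 5_000_000)           # lock/update happens in the DB
print(balances.read("trader_1"))                # 5000000 — fresh after invalidation
```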

&lt;p&gt;Key Insight: Relational database provides ACID guarantees preventing race conditions on financial balances. Redis cache provides read speed for high throughput balance checks. Never use eventual consistency for financial data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Architecture Summary
&lt;/h2&gt;

&lt;p&gt;Order Book data structure — BST plus Hashmap plus Queue per price level&lt;br&gt;
Fault tolerance — NVM RAM plus RDMA replication plus Write Ahead Log&lt;br&gt;
WAL crash safety — Checksum for entry validity plus UUID for idempotent replay&lt;br&gt;
Stop loss handling — Separate Stop Loss Order Book with price bucketing&lt;br&gt;
Cascade prevention — Circuit breakers with Redis status and pending queue&lt;br&gt;
Trade settlement — Two Phase Commit with WAL replay&lt;br&gt;
Counterparty risk — Pre-trade risk checks with immediate fund locking&lt;br&gt;
Balance storage — Relational DB for ACID plus Redis cache for read speed&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;A stock exchange is where computer science meets finance at the highest possible stakes. Every design decision has a dollar value attached to it. Microsecond latency, zero data loss, atomic settlements, and race condition free balance management are not nice to have — they are regulatory requirements.&lt;/p&gt;

&lt;p&gt;The recurring theme throughout this design is that financial systems demand absolute guarantees at every layer. Where other systems tolerate eventual consistency and minor data loss, a stock exchange tolerates neither. Write Ahead Logging, ACID transactions, checksums, UUID idempotency, and circuit breakers are not over-engineering — they are the minimum bar.&lt;/p&gt;

&lt;p&gt;Happy building. 🚀&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>architecture</category>
      <category>interview</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Designing Uber / Ride Sharing at Scale: deep dive</title>
      <dc:creator>shubham pandey (Connoisseur)</dc:creator>
      <pubDate>Thu, 19 Mar 2026 03:14:00 +0000</pubDate>
      <link>https://forem.com/shubham_pandeyconnoisse/designing-uber-ride-sharing-at-scale-deep-dive-5b62</link>
      <guid>https://forem.com/shubham_pandeyconnoisse/designing-uber-ride-sharing-at-scale-deep-dive-5b62</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Uber seems simple on the surface — request a ride, a driver shows up. But underneath it is one of the most complex real time distributed systems ever built. Real time location tracking, intelligent driver matching, dynamic surge pricing, and accurate fare calculation all happening simultaneously at massive scale. This post walks through every challenge question by question, including wrong turns and how to navigate out of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 1: Real Time Driver Location Tracking
&lt;/h2&gt;

&lt;p&gt;Interview Question: Drivers are constantly moving and their location changes every few seconds. How do you design a system that collects and stores driver locations in real time — and how frequently should a driver app send its location to your servers?&lt;/p&gt;

&lt;p&gt;Navigation: The tradeoff is between location accuracy and server load. Sending every 100ms is too frequent and wastes battery and bandwidth. Sending every 30 seconds is too infrequent for a real time map. The middle ground is adaptive frequency based on trip phase.&lt;/p&gt;

&lt;p&gt;Solution: Kafka stream for location events with adaptive GPS frequency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Driver online no trip — GPS every 5 seconds&lt;/li&gt;
&lt;li&gt;Driver approaching rider — GPS every 2 to 3 seconds&lt;/li&gt;
&lt;li&gt;Trip in progress — GPS every 1 to 2 seconds&lt;/li&gt;
&lt;li&gt;Driver stationary at red light — reduce to every 5 seconds automatically&lt;/li&gt;
&lt;li&gt;Driver moving fast on highway — increase to every 1 second automatically&lt;/li&gt;
&lt;/ul&gt;
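&lt;p&gt;The adaptive frequency rules above can be collapsed into one small function. The thresholds and intervals here are assumptions for illustration, not Uber's real values:&lt;/p&gt;

```python
def gps_interval_seconds(phase, speed_kmh):
    """Map trip phase and current speed to a GPS update interval (illustrative)."""
    if phase == "trip_in_progress":
        base = 1.5
    elif phase == "approaching_rider":
        base = 2.5
    else:                    # driver online, no trip
        base = 5.0
    if speed_kmh < 2:        # stationary at a red light: back off automatically
        return max(base, 5.0)
    if speed_kmh > 80:       # moving fast on a highway: tighten the interval
        return 1.0
    return base


print(gps_interval_seconds("trip_in_progress", 0))    # 5.0 — stationary
print(gps_interval_seconds("trip_in_progress", 100))  # 1.0 — highway
print(gps_interval_seconds("online_no_trip", 30))     # 5.0 — idle cruising
```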

&lt;p&gt;Driver app emits location events to Kafka. Location Service consumes from Kafka and updates driver position in storage layer.&lt;/p&gt;

&lt;p&gt;Key Insight: Adaptive GPS frequency based on trip phase and speed balances accuracy with battery life and server load. One size does not fit all phases of a trip.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 2: Storing and Querying Driver Locations
&lt;/h2&gt;

&lt;p&gt;Interview Question: 5 million active drivers send location updates every 2 seconds. That is 2.5 million location writes per second. You only ever need the most recent location — old locations are immediately irrelevant. Does a traditional database make sense here?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Store driver locations in a traditional database like PostgreSQL or DynamoDB.&lt;/p&gt;

&lt;p&gt;Why It Fails: Traditional databases are optimized for persistent storage with complex queries. Driver locations are ephemeral — they change every 2 seconds and old values have zero value. 2.5 million writes per second on a traditional database is extremely heavy and unnecessary for data that expires immediately.&lt;/p&gt;

&lt;p&gt;Navigation: You need extremely fast reads and writes for data that does not need to persist permanently. Redis is the perfect fit — in memory, sub millisecond reads and writes.&lt;/p&gt;

&lt;p&gt;Solution: Redis GEO commands for location storage and proximity search.&lt;/p&gt;

&lt;p&gt;Driver location update:&lt;br&gt;
GEOADD drivers longitude latitude driverID&lt;/p&gt;

&lt;p&gt;Find all drivers within 2km of rider:&lt;br&gt;
GEORADIUS drivers rider_longitude rider_latitude 2 km&lt;/p&gt;

&lt;p&gt;Under the hood Redis GEO uses a Sorted Set — it interleaves latitude and longitude into a 52 bit geohash integer that becomes the member's score. This enables O(log n) proximity queries across millions of driver locations.&lt;/p&gt;

&lt;p&gt;Full location pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Driver app sends location every 2 seconds&lt;/li&gt;
&lt;li&gt;Kafka receives location event&lt;/li&gt;
&lt;li&gt;Location Service consumes from Kafka&lt;/li&gt;
&lt;li&gt;Updates Redis GEO with latest driver position&lt;/li&gt;
&lt;li&gt;Old position automatically overwritten — no cleanup needed&lt;/li&gt;
&lt;li&gt;Rider opens app — GEORADIUS query returns nearby drivers instantly&lt;/li&gt;
&lt;/ul&gt;
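&lt;p&gt;The semantics of GEOADD and GEORADIUS can be sketched in pure Python without a Redis server — a dict holds the latest position per driver (so a new update just overwrites the old one), and a haversine distance check plays the role of the radius query. The coordinates below are made-up test points:&lt;/p&gt;

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


drivers = {}  # latest position per driver; updates overwrite, no cleanup needed


def geoadd(driver_id, lat, lon):
    drivers[driver_id] = (lat, lon)


def georadius(lat, lon, radius_km):
    return sorted(d for d, (dlat, dlon) in drivers.items()
                  if haversine_km(lat, lon, dlat, dlon) <= radius_km)


geoadd("driver_1", 12.9716, 77.5946)   # at the rider's location
geoadd("driver_2", 12.9800, 77.6000)   # roughly 1.1 km away
geoadd("driver_3", 13.0500, 77.5946)   # roughly 8.7 km away
print(georadius(12.9716, 77.5946, 2))  # ['driver_1', 'driver_2']
```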

&lt;p&gt;Key Insight: Redis GEO is purpose built for real time location storage and proximity queries. It replaces an entire geospatial database with two commands — GEOADD and GEORADIUS.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 3: Pushing Driver Locations to Rider App
&lt;/h2&gt;

&lt;p&gt;Interview Question: A rider has the app open and sees drivers moving on the map in real time. Those locations update every 2 seconds. How does the rider app get continuous location updates — polling or push?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Polling since Redis reads are fast and have no downside.&lt;/p&gt;

&lt;p&gt;Why It Fails: Even with fast Redis reads, 100 million riders polling every 2 seconds generates 50 million HTTP requests per second. Each request has HTTP overhead — headers, connection setup, authentication. Most responses are nearly identical since drivers barely move in 2 seconds. The network cost is enormous even if Redis responds instantly.&lt;/p&gt;

&lt;p&gt;Navigation: The smarter approach is only pushing updates when a driver actually moves significantly — delta updates. This requires a persistent connection rather than repeated polling.&lt;/p&gt;

&lt;p&gt;Solution: Hybrid approach based on trip phase.&lt;/p&gt;

&lt;p&gt;Phase 1 browsing before booking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple polling every 5 seconds&lt;/li&gt;
&lt;li&gt;Rider just needs approximate driver positions&lt;/li&gt;
&lt;li&gt;Slight staleness is acceptable&lt;/li&gt;
&lt;li&gt;No persistent connection needed for 100 million casual browsers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Phase 2 driver matched and approaching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WebSocket connection opened at moment of booking confirmation&lt;/li&gt;
&lt;li&gt;Server pushes driver location updates in real time&lt;/li&gt;
&lt;li&gt;Rider tracks their specific driver with precision&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Phase 3 trip in progress:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WebSocket connection maintained&lt;/li&gt;
&lt;li&gt;Highly accurate real time location updates&lt;/li&gt;
&lt;li&gt;Both rider and driver tracked simultaneously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Do not over-engineer connections until they are actually needed. Polling is acceptable for casual browsing. WebSocket is reserved for the moment precision and real time tracking genuinely matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 4: Intelligent Driver Matching
&lt;/h2&gt;

&lt;p&gt;Interview Question: Rider confirms booking. There are 500 available drivers within 2km. Uber needs to pick the optimal driver considering distance, rating, availability, and acceptance rate. The naive approach queries 500 drivers from Redis then does 500 individual database lookups for each driver's metadata. At 1 million ride requests per day that is 500 million database queries at peak. What is wrong with this?&lt;/p&gt;

&lt;p&gt;Real World Observation: Uber actually sends an alert to the closest driver first and, if they do not accept, moves on to the next nearest driver.&lt;/p&gt;

&lt;p&gt;Why Pure Sequential Is Too Slow: Driver 1 has 15 seconds to accept. Driver 1 ignores it — 15 seconds wasted. Driver 2 also ignores — another 15 seconds wasted. Rider has been waiting 30 seconds with no driver assigned. In a city with low driver availability this cascading sequential approach means riders wait minutes just for assignment.&lt;/p&gt;

&lt;p&gt;Why Notifying All 500 Is Also Wrong: Multiple drivers accept simultaneously causing race conditions. 497 drivers receive notifications for nothing — wasted alerts and poor driver experience.&lt;/p&gt;

&lt;p&gt;Solution: Batch notifications with distributed Redis lock.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query Redis GEORADIUS — get 500 nearby drivers sorted by distance&lt;/li&gt;
&lt;li&gt;Send notification to first batch of 10 closest drivers simultaneously&lt;/li&gt;
&lt;li&gt;First driver to accept acquires distributed lock on that ride request using Redis SETNX&lt;/li&gt;
&lt;li&gt;SETNX is atomic — if two drivers accept simultaneously only one gets the lock&lt;/li&gt;
&lt;li&gt;Lock acquired — ride assigned — all other notifications cancelled&lt;/li&gt;
&lt;li&gt;Nobody accepts in 15 seconds — send to next batch of 20 drivers expanding radius&lt;/li&gt;
&lt;li&gt;Repeat until driver found&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Redis SETNX for atomic locking:&lt;br&gt;
SETNX rideID driverID — returns 1 if lock acquired, 0 if already taken&lt;/p&gt;

&lt;p&gt;TTL on the lock — 15 seconds matching driver acceptance window:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Server crashes after lock acquired — lock automatically expires after 15 seconds&lt;/li&gt;
&lt;li&gt;Next batch gets notified — fresh lock available&lt;/li&gt;
&lt;li&gt;No stuck locks, no riders waiting forever&lt;/li&gt;
&lt;/ul&gt;
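&lt;p&gt;The SETNX-plus-TTL behavior can be modeled in memory: the first accept wins, a simultaneous accept loses, and a crashed holder's lock simply expires so the next batch can claim the ride. This is an illustrative sketch, not the redis-py client:&lt;/p&gt;

```python
import time


class RideLock:
    """Toy SET NX + TTL: one winner per ride, with automatic expiry."""

    def __init__(self):
        self._locks = {}  # ride_id -> (driver_id, expiry via monotonic clock)

    def try_acquire(self, ride_id, driver_id, ttl_seconds=15):
        now = time.monotonic()
        holder = self._locks.get(ride_id)
        if holder is not None and holder[1] > now:
            return False                               # SETNX returned 0: taken
        self._locks[ride_id] = (driver_id, now + ttl_seconds)
        return True                                    # SETNX returned 1: ride assigned


locks = RideLock()
assert locks.try_acquire("ride_42", "driver_A")      # first accept wins the ride
assert not locks.try_acquire("ride_42", "driver_B")  # simultaneous accept loses

expiring = RideLock()
assert expiring.try_acquire("ride_7", "driver_A", ttl_seconds=0)  # holder 'crashes'
assert expiring.try_acquire("ride_7", "driver_B")    # expired lock is reclaimable
```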

&lt;p&gt;Key Insight: Batch notifications with Redis atomic locking balances speed with fairness. TTL prevents deadlocks from server crashes — the same pattern that saved us in WhatsApp and Twitter designs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 5: Surge Pricing
&lt;/h2&gt;

&lt;p&gt;Interview Question: Uber's surge pricing multiplies fares during peak demand. It is calculated based on supply and demand ratio per geographic zone in real time. How do you compute this ratio across thousands of zones simultaneously as riders request and drivers come online?&lt;/p&gt;

&lt;p&gt;Solution: Kafka event streaming with Redis Sorted Set sliding window per zone.&lt;/p&gt;

&lt;p&gt;Every zone has two Redis Sorted Sets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;zone5:requests — ride requests with Unix timestamp as score&lt;/li&gt;
&lt;li&gt;zone5:drivers — available drivers with Unix timestamp as score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When rider requests ride in Zone 5:&lt;br&gt;
ZADD zone5:requests current_timestamp unique_request_id&lt;/p&gt;

&lt;p&gt;When driver comes online in Zone 5:&lt;br&gt;
ZADD zone5:drivers current_timestamp driver_id&lt;/p&gt;

&lt;p&gt;Every minute Surge Pricing Service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ZREMRANGEBYSCORE zone5:requests 0 ten_minutes_ago_timestamp — removes stale requests&lt;/li&gt;
&lt;li&gt;ZREMRANGEBYSCORE zone5:drivers 0 ten_minutes_ago_timestamp — removes stale drivers&lt;/li&gt;
&lt;li&gt;ZCARD zone5:requests — count of active requests in last 10 minutes&lt;/li&gt;
&lt;li&gt;ZCARD zone5:drivers — count of available drivers in last 10 minutes&lt;/li&gt;
&lt;li&gt;Surge ratio = requests divided by drivers&lt;/li&gt;
&lt;li&gt;Updates Redis with surge multiplier for Zone 5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rider requests ride — reads surge multiplier from Redis instantly. All computation happens asynchronously via Kafka — never blocks the main ride request flow.&lt;/p&gt;
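&lt;p&gt;The per-zone sliding window can be sketched with sorted timestamp lists standing in for the two Sorted Sets — trimming old entries mirrors ZREMRANGEBYSCORE and counting mirrors ZCARD. Zone names and numbers below are illustrative:&lt;/p&gt;

```python
import bisect


class ZoneWindow:
    """Toy sliding window for one zone: timestamps play the role of sorted set scores."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.requests = []  # sorted ride-request timestamps
        self.drivers = []   # sorted driver-online timestamps

    def record(self, series, ts):
        bisect.insort(series, ts)  # ZADD with the timestamp as the score

    def surge_ratio(self, now):
        cutoff = now - self.window
        for series in (self.requests, self.drivers):
            del series[:bisect.bisect_left(series, cutoff)]  # ZREMRANGEBYSCORE 0 cutoff
        if not self.drivers:
            return None  # no supply in the window: caller falls back to a default
        return len(self.requests) / len(self.drivers)        # ZCARD / ZCARD


zone5 = ZoneWindow()
now = 10_000.0
for i in range(30):
    zone5.record(zone5.requests, now - i)  # 30 ride requests in the last 10 minutes
for i in range(10):
    zone5.record(zone5.drivers, now - i)   # 10 available drivers
zone5.record(zone5.drivers, now - 1200)    # stale driver event, outside the window
print(zone5.surge_ratio(now))              # 3.0 — demand is 3x supply
```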

&lt;p&gt;Key Insight: The sliding time window pattern using timestamp as score and ZREMRANGEBYSCORE for expiry — identical to Twitter trending topics — applies perfectly to surge pricing. Events older than 10 minutes automatically stop contributing to the surge calculation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 6: GPS Coordinate Collection Strategy
&lt;/h2&gt;

&lt;p&gt;Interview Question: A 30 minute trip at GPS updates every 2 seconds generates 900 coordinates. At 10 million trips per day that is 9 billion GPS coordinates daily. How do you store trip route data efficiently?&lt;/p&gt;

&lt;p&gt;Wrong Approach 1: Store only start and end coordinates.&lt;br&gt;
Why It Fails: Straight line between start and end ignores the actual route taken. Non optimal routes and detours are completely missed. Fare calculation is inaccurate.&lt;/p&gt;

&lt;p&gt;Wrong Approach 2: Store every single GPS coordinate.&lt;br&gt;
Why It Fails: 9 billion coordinates per day at 16 bytes each is 144GB of raw GPS data daily. After one year that is 52TB of mostly redundant data.&lt;/p&gt;

&lt;p&gt;Solution: Store coordinates at smart intervals with intelligent compression.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;During trip store coordinates every 2 seconds in Redis temporarily&lt;/li&gt;
&lt;li&gt;Map Matching Service processes coordinates in near real time&lt;/li&gt;
&lt;li&gt;Reconstructs clean route using 20 to 30 key waypoints instead of 900 raw points&lt;/li&gt;
&lt;li&gt;Clean reconstructed route stored permanently in database&lt;/li&gt;
&lt;li&gt;Raw GPS coordinates discarded after processing&lt;/li&gt;
&lt;/ul&gt;
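&lt;p&gt;One common way to reduce 900 raw points to a handful of key waypoints is line simplification such as Ramer-Douglas-Peucker. The sketch below is an assumption about the approach, not Uber's actual Map Matching Service — it keeps only points that deviate from the straight line by more than a tolerance:&lt;/p&gt;

```python
def perpendicular_distance(p, a, b):
    """Distance from point p to segment a-b (flat-plane approximation)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    cx, cy = ax + t * dx, ay + t * dy
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5


def simplify(points, tolerance):
    """Ramer-Douglas-Peucker: drop points within tolerance of the chord."""
    if len(points) < 3:
        return points
    dists = [perpendicular_distance(p, points[0], points[-1]) for p in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[i - 1] > tolerance:
        # Keep the farthest point and recurse on both halves.
        return simplify(points[:i + 1], tolerance)[:-1] + simplify(points[i:], tolerance)
    return [points[0], points[-1]]


# 900 raw GPS samples along an L-shaped street collapse to its 3 true corners.
raw = [(x / 100, 0.0) for x in range(451)] + [(4.5, y / 100) for y in range(1, 450)]
route = simplify(raw, tolerance=0.01)
print(len(raw), len(route))  # 900 3
```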

&lt;p&gt;Smart optimizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dead reckoning between GPS updates using phone accelerometer and gyroscope for smooth map animation&lt;/li&gt;
&lt;li&gt;Encoded Polyline Algorithm compresses coordinate sequences by up to 75 percent before sending to server&lt;/li&gt;
&lt;li&gt;Geofencing triggers immediate GPS update when driver enters airport or landmark zones regardless of interval&lt;/li&gt;
&lt;li&gt;Stationary detection reduces frequency automatically when driver is not moving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Storage lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raw GPS coordinates kept temporarily in Redis during trip&lt;/li&gt;
&lt;li&gt;Clean reconstructed route kept 90 days for dispute resolution&lt;/li&gt;
&lt;li&gt;Raw data deleted immediately after fare calculation&lt;/li&gt;
&lt;li&gt;Driver location history anonymized and aggregated for traffic data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Never store raw GPS permanently. Process immediately into clean reconstructed routes, discard raw data, and keep only the meaningful waypoints. Reduces storage by over 90 percent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 7: GPS Gap Filling and Map Matching
&lt;/h2&gt;

&lt;p&gt;Interview Question: Driver enters a tunnel and GPS drops for 90 seconds. You have a coordinate at minute 3 and the next at minute 4:30. The straight line between them cuts through buildings. How do you reconstruct the actual route taken?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Always assume the longest route for worst case scenario.&lt;/p&gt;

&lt;p&gt;Why It Fails: Driver takes the optimal shortest route through the tunnel but gets charged for the longest possible route. Rider is overcharged for a route the driver never took. Unfair to both parties and destroys user trust.&lt;/p&gt;

&lt;p&gt;Navigation: Always assume the most probable route — not the longest or shortest. Roads are not random. Drivers can only travel on known roads. Even with GPS gaps there are only a finite number of possible routes between two points.&lt;/p&gt;

&lt;p&gt;Solution: Map matching with Viterbi algorithm and Google Maps fallback.&lt;/p&gt;

&lt;p&gt;Step 1 — Road Network Graph:&lt;br&gt;
Uber maintains an internal graph of every city's road network. Every intersection is a node. Every road segment is an edge with metadata — speed limit, road type, historical average speed, traffic patterns.&lt;/p&gt;

&lt;p&gt;Step 2 — GPS Coordinate Snapping:&lt;br&gt;
Raw GPS has 5 to 15 meter error even without signal drops. Every coordinate gets snapped to the nearest road segment eliminating small inaccuracies.&lt;/p&gt;

&lt;p&gt;Step 3 — Viterbi Algorithm for Gap Filling:&lt;br&gt;
When GPS drops the Viterbi algorithm finds the most probable path through the road network graph between the last known point and next known point.&lt;/p&gt;

&lt;p&gt;It considers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All possible routes between the two known points&lt;/li&gt;
&lt;li&gt;Historical speed data on each road segment&lt;/li&gt;
&lt;li&gt;Time elapsed during the gap — 90 seconds at 50kmh means roughly 1.25km travelled&lt;/li&gt;
&lt;li&gt;Which roads are physically reachable in that time window&lt;/li&gt;
&lt;li&gt;Historical probability of drivers choosing each route from millions of past trips&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Possible routes A to B — Via Highway 65 percent probability, Via Main Street 25 percent, Via Side Streets 10 percent&lt;/li&gt;
&lt;li&gt;Viterbi selects Via Highway as most probable route&lt;/li&gt;
&lt;/ul&gt;
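&lt;p&gt;A heavily simplified version of this selection logic: filter candidate routes by whether they are physically reachable in the gap, then pick the one with the highest historical probability. This is a toy stand-in for the Viterbi step, with made-up route names and numbers:&lt;/p&gt;

```python
def most_probable_route(candidates, gap_seconds, speed_kmh_bounds=(20, 80)):
    """Keep routes whose implied speed is plausible, then take the highest prior."""
    lo, hi = speed_kmh_bounds
    feasible = []
    for route in candidates:
        implied_kmh = route["km"] / gap_seconds * 3600
        if lo <= implied_kmh <= hi:
            feasible.append(route)
    if not feasible:
        return None  # this is where a fallback routing API would be consulted
    return max(feasible, key=lambda r: r["prior"])


candidates = [
    {"name": "highway",      "km": 1.3, "prior": 0.65},
    {"name": "main_street",  "km": 1.6, "prior": 0.25},
    {"name": "side_streets", "km": 4.0, "prior": 0.10},  # implies 160 km/h: impossible
]
best = most_probable_route(candidates, gap_seconds=90)
print(best["name"])  # highway — feasible and historically most likely
```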

&lt;p&gt;Step 4 — Google Maps Fallback:&lt;br&gt;
When internal map matching fails due to sparse road data, new roads, or construction — fall back to Google Maps Distance Matrix API or Mapbox for route reconstruction.&lt;/p&gt;

&lt;p&gt;Why not always use Google Maps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Maps API costs 5 dollars per 1000 requests&lt;/li&gt;
&lt;li&gt;Uber does 10 million trips per day&lt;/li&gt;
&lt;li&gt;That is 50,000 dollars per day just for fare calculation&lt;/li&gt;
&lt;li&gt;Internal map matching costs a fraction of that&lt;/li&gt;
&lt;li&gt;Google Maps is only the fallback for edge cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 5 — Final Fare Calculation:&lt;br&gt;
Total distance equals sum of all road segments travelled. Fare equals base fare plus distance rate multiplied by total kilometers plus time rate multiplied by total minutes.&lt;/p&gt;

&lt;p&gt;Key Insight: Raw GPS is never fully trusted. Map matching snaps coordinates to known roads, fills gaps using historical probability via the Viterbi algorithm, and falls back to Google Maps only when internal matching fails. Most probable route — not longest, not shortest.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Architecture Summary
&lt;/h2&gt;

&lt;p&gt;Driver location tracking — Kafka stream with adaptive GPS frequency&lt;br&gt;
Location storage — Redis GEO with GEOADD and GEORADIUS&lt;br&gt;
Rider map updates Phase 1 — Polling every 5 seconds&lt;br&gt;
Rider map updates Phase 2 and 3 — WebSocket on booking confirmation&lt;br&gt;
Driver matching — Batch notifications with Redis SETNX lock and TTL&lt;br&gt;
Surge pricing — Kafka events with Redis Sorted Set sliding window per zone&lt;br&gt;
GPS coordinate storage — Temporary Redis then clean reconstructed route in database&lt;br&gt;
GPS gap filling — Map matching with Viterbi algorithm and Google Maps fallback&lt;br&gt;
Fare calculation — Reconstructed route distance plus time based pricing&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Uber is a masterclass in combining real time systems with intelligent data processing. Every feature that feels instant and accurate to the rider — live driver locations, fast matching, fair fares — is backed by a carefully orchestrated pipeline of Kafka streams, Redis data structures, and probabilistic algorithms working together seamlessly.&lt;/p&gt;

&lt;p&gt;The recurring theme across every challenge is that naive approaches collapse at scale and the right solution always involves pushing work to the right layer — Kafka for async processing, Redis for real time state, databases for durable history, and smart algorithms for filling in what sensors miss.&lt;/p&gt;

&lt;p&gt;Happy building. 🚀&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>interview</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Designing WhatsApp / Chat System at Scale Deep Dive — Question by Question</title>
      <dc:creator>shubham pandey (Connoisseur)</dc:creator>
      <pubDate>Wed, 11 Mar 2026 15:30:10 +0000</pubDate>
      <link>https://forem.com/shubham_pandeyconnoisse/designing-whatsapp-chat-system-at-scaledeep-dive-question-by-question-48j2</link>
      <guid>https://forem.com/shubham_pandeyconnoisse/designing-whatsapp-chat-system-at-scaledeep-dive-question-by-question-48j2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A chat application seems simple — send a message, receive a message. But at 2 billion users, WhatsApp hides some of the most complex distributed systems challenges in software engineering. This post walks through the real complexity challenge by challenge, including the wrong turns and how to navigate out of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 1: The Naive Message Delivery Approach
&lt;/h2&gt;

&lt;p&gt;Interview Question: Walk me through the basic flow of how you would get a message from one phone to another — and where does the first major challenge appear?&lt;/p&gt;

&lt;p&gt;Initial Approach: User A sends a message, it goes to the WhatsApp server, and the server uses push notifications to deliver it to User B.&lt;/p&gt;

&lt;p&gt;Why Push Notifications Alone Are Not Enough: Push notifications work perfectly when the app is closed — waking up the device and alerting the user. But when User B has WhatsApp actively open on their screen, push notifications add 100-500ms latency through FCM or SNS. In an active conversation that feels laggy and unnatural. WhatsApp delivers messages in under 100ms when both users are online.&lt;/p&gt;

&lt;p&gt;Navigation: The key realization is that there are two distinct scenarios — app open and app closed — and they need different delivery mechanisms.&lt;/p&gt;

&lt;p&gt;Solution: Hybrid delivery approach.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;App is open and active — use WebSocket persistent connection for instant sub 100ms delivery&lt;/li&gt;
&lt;li&gt;App is closed or in background — fall back to FCM or AWS SNS push notification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Push notifications and WebSockets solve different problems. WebSockets handle real time delivery for active users. Push notifications handle delivery for offline or background users. You need both.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 2: WebSocket Connections at Scale
&lt;/h2&gt;

&lt;p&gt;Interview Question: WhatsApp has 2 billion users. Even 10% active simultaneously means 200 million open WebSocket connections. A single server holds roughly 65,000 connections. How does Server 1 deliver a message to User B who is connected to Server 7?&lt;/p&gt;

&lt;p&gt;Initial Approach: Each server handles its own connections but has no knowledge of where other users are connected.&lt;/p&gt;

&lt;p&gt;Why It Fails: With thousands of WebSocket servers each holding a slice of connections, a message arriving at Server 1 has no way to reach User B on Server 7 without a routing mechanism.&lt;/p&gt;

&lt;p&gt;Navigation: You need a centralized lookup that any server can query to find where any user is currently connected.&lt;/p&gt;

&lt;p&gt;Solution: Redis lookup table mapping users to their WebSocket server.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User B connects to Server 7 — Redis stores UserB mapped to Server 7&lt;/li&gt;
&lt;li&gt;User A sends message — arrives at Server 1&lt;/li&gt;
&lt;li&gt;Server 1 queries Redis — finds User B on Server 7&lt;/li&gt;
&lt;li&gt;Server 1 forwards message to Server 7&lt;/li&gt;
&lt;li&gt;Server 7 delivers to User B via WebSocket instantly&lt;/li&gt;
&lt;/ul&gt;
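&lt;p&gt;The routing table is just a key-value map maintained on connect and disconnect. A minimal sketch with a dict standing in for Redis (class and method names are illustrative):&lt;/p&gt;

```python
class ConnectionRegistry:
    """Toy routing table: user -> WebSocket server, like SET/GET/DEL in Redis."""

    def __init__(self):
        self._table = {}

    def on_connect(self, user_id, server_id):
        self._table[user_id] = server_id   # SET user_b server_7

    def on_disconnect(self, user_id):
        self._table.pop(user_id, None)     # DEL user_b

    def route(self, user_id):
        return self._table.get(user_id)    # GET user_b, None if offline


registry = ConnectionRegistry()
registry.on_connect("user_b", "server_7")
print(registry.route("user_b"))   # server_7 — Server 1 forwards the message here
registry.on_disconnect("user_b")
print(registry.route("user_b"))   # None — fall back to push notification + DB retry
```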

&lt;p&gt;Key Insight: Redis acts as a real time routing table for WebSocket connections. Every connection and disconnection updates this table so any server can route to any user in O(1).&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 3: Message Durability and the ACK Pattern
&lt;/h2&gt;

&lt;p&gt;Interview Question: User B temporarily loses internet connection for 30 seconds while a message is being delivered. What happens to the message and how does User B get it when they reconnect?&lt;/p&gt;

&lt;p&gt;Navigation: The key insight is that delivery confirmation must be explicit — the server cannot assume a message was delivered just because it sent it. The client must acknowledge receipt.&lt;/p&gt;

&lt;p&gt;Solution: ACK (Acknowledgement) pattern with database persistence.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message delivered to User B — User B's app sends ACK back to server&lt;/li&gt;
&lt;li&gt;Server receives ACK — marks message as delivered — no further action needed&lt;/li&gt;
&lt;li&gt;No ACK received — server knows User B is offline&lt;/li&gt;
&lt;li&gt;Server updates Redis — UserB marked as offline&lt;/li&gt;
&lt;li&gt;Server stores message in database for retry&lt;/li&gt;
&lt;li&gt;User B reconnects — server checks database — delivers all pending messages&lt;/li&gt;
&lt;/ul&gt;
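&lt;p&gt;The ACK flow can be sketched as a pending-message store: a message stays pending in durable storage until the recipient's device acknowledges it, and anything still pending is redelivered on reconnect. The store and message IDs below are illustrative:&lt;/p&gt;

```python
class MessageStore:
    """Toy ACK pattern: delivery is never assumed, only confirmed by an ACK."""

    def __init__(self):
        self.pending = {}  # user_id -> {message_id: text}

    def deliver(self, user_id, message_id, text):
        # Persist first, then attempt the WebSocket push (not shown).
        self.pending.setdefault(user_id, {})[message_id] = text

    def ack(self, user_id, message_id):
        # ACK received: message confirmed on the device (double grey tick).
        self.pending.get(user_id, {}).pop(message_id, None)

    def on_reconnect(self, user_id):
        # No ACK ever arrived for these: redeliver every pending message.
        return list(self.pending.get(user_id, {}).values())


store = MessageStore()
store.deliver("user_b", "m1", "hello")
store.deliver("user_b", "m2", "are you there?")
store.ack("user_b", "m1")              # only m1 was acknowledged
print(store.on_reconnect("user_b"))    # ['are you there?'] — m2 is redelivered
```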

&lt;p&gt;This is exactly what WhatsApp's tick system means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single grey tick — message reached WhatsApp server&lt;/li&gt;
&lt;li&gt;Double grey tick — message delivered to User B's device and ACK received&lt;/li&gt;
&lt;li&gt;Double blue tick — User B has read the message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Explicit ACKs are the foundation of reliable message delivery. Never assume delivery succeeded without confirmation from the recipient.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 4: The Duplicate Message Problem
&lt;/h2&gt;

&lt;p&gt;Interview Question: Server delivers message to User B. User B sends ACK but the ACK gets lost in the network. Server never receives ACK, assumes delivery failed, and retries. User B now sees the same message twice. How do you prevent duplicate messages?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Add ACK from server to client so both sides confirm. This creates an infinite loop — ACK for the ACK for the ACK.&lt;/p&gt;

&lt;p&gt;Navigation: More ACKs do not solve duplicates. The solution is recognizing duplicates when they arrive rather than preventing retries entirely. Every message needs a globally unique identity so the receiver can detect and discard messages it has already seen.&lt;/p&gt;

&lt;p&gt;Wrong Approach 2: Hash the message content for unique identification.&lt;/p&gt;

&lt;p&gt;Why It Fails: If User A sends "Hello" twice, both messages produce identical hashes. The second legitimate message gets discarded as a duplicate.&lt;/p&gt;

&lt;p&gt;Wrong Approach 3: Use timestamp for unique identification.&lt;/p&gt;

&lt;p&gt;Why It Fails: Two messages sent in the same millisecond get identical timestamps. Clock skew between devices also causes ordering and uniqueness issues.&lt;/p&gt;

&lt;p&gt;Solution: UUID (Universally Unique Identifier) generated server side for every message.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Server generates UUID for each message&lt;/li&gt;
&lt;li&gt;UUID sent to User B along with message content&lt;/li&gt;
&lt;li&gt;User B's app stores all received UUIDs locally with 30 day TTL&lt;/li&gt;
&lt;li&gt;Duplicate arrives — app checks UUID — already seen — discard silently&lt;/li&gt;
&lt;li&gt;TTL matches message retry window — after 30 days message is either delivered or dropped&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On UUID collision probability — UUID is 128 bits with 340 undecillion possible values. You would need to generate 1 billion UUIDs per second for 100 years before expecting a single collision. Treat collision as practically impossible.&lt;/p&gt;

&lt;p&gt;Key Insight: Idempotency via UUID is the standard solution to duplicate delivery in distributed messaging. The receiver, not the sender, is responsible for deduplication.&lt;/p&gt;
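&lt;p&gt;A minimal Python sketch of the receiver-side dedup flow (hypothetical names; a plain dict stands in for the app's local UUID store and its TTL):&lt;/p&gt;

```python
import time
import uuid

class DedupStore:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}  # message UUID mapped to its expiry timestamp

    def accept(self, message_id, now=None):
        now = time.time() if now is None else now
        # Drop expired entries so the store mirrors the 30 day TTL.
        self.seen = {m: exp for m, exp in self.seen.items() if exp > now}
        if message_id in self.seen:
            return False              # duplicate retry: discard silently
        self.seen[message_id] = now + self.ttl
        return True                   # first delivery: show the message

store = DedupStore(ttl_seconds=30 * 24 * 3600)   # 30 day TTL
msg_id = str(uuid.uuid4())                       # server-generated UUID
assert store.accept(msg_id) is True              # first arrival is shown
assert store.accept(msg_id) is False             # lost-ACK retry is dropped
```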




&lt;h2&gt;
  
  
  Challenge 5: Group Messaging Fan-out
&lt;/h2&gt;

&lt;p&gt;Interview Question: User A sends a message in a group of 1024 members. Some are online on different servers, some are offline. How do you deliver one message to 1024 people simultaneously?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Fetch the WebSocket server location of each of the 1024 members from Redis and forward to each server synchronously while User A waits.&lt;/p&gt;

&lt;p&gt;Why It Fails: WhatsApp handles 1 billion group messages per day. With average group size of 200 members that is 200 billion Redis lookups and 200 billion server to server forwarding calls per day — all happening synchronously while users wait for send confirmation.&lt;/p&gt;

&lt;p&gt;Navigation: User A should never wait for 1024 individual deliveries to complete. This is the same async pattern used for Twitter hashtag processing — publish an event and let a background service handle the fan-out.&lt;/p&gt;

&lt;p&gt;Solution: Kafka async fan-out via dedicated Group Service.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User A sends message — server publishes a single event to Kafka instantly&lt;/li&gt;
&lt;li&gt;User A gets immediate send confirmation — never waits for 1024 deliveries&lt;/li&gt;
&lt;li&gt;Group Service consumes from Kafka&lt;/li&gt;
&lt;li&gt;Group Service fetches all 1024 member locations from Redis in one batch lookup&lt;/li&gt;
&lt;li&gt;Delivers to online members via their WebSocket servers&lt;/li&gt;
&lt;li&gt;Stores messages in database for offline members&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Kafka decouples message sending from message delivery. The sender gets instant confirmation while fan-out happens asynchronously in the background at whatever pace the system can handle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 6: Group Read Receipts
&lt;/h2&gt;

&lt;p&gt;Interview Question: WhatsApp shows blue double tick in groups only when all 1024 members have read the message. How do you track per member read status efficiently across millions of messages?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Store a simple read counter per message.&lt;/p&gt;

&lt;p&gt;Why It Fails: A counter tells you how many people have read but not who specifically has read. WhatsApp lets you tap a message to see exactly which members have read and which have not. A counter cannot provide this granularity.&lt;/p&gt;

&lt;p&gt;Solution: Redis Set per message using UUID as key.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Key is the message UUID&lt;/li&gt;
&lt;li&gt;Value is a Redis Set containing user IDs of members who have read&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mark as read — SADD messageUUID UserB — O(1)&lt;/li&gt;
&lt;li&gt;Check if specific member read — SISMEMBER messageUUID UserB — O(1) returns true or false&lt;/li&gt;
&lt;li&gt;Check if everyone read for blue tick — SCARD messageUUID equals 1024 — O(1)&lt;/li&gt;
&lt;/ul&gt;
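&lt;p&gt;A quick sketch of these operations, with a plain Python set standing in for the Redis Set (names hypothetical):&lt;/p&gt;

```python
GROUP_SIZE = 1024

read_by = {}  # message UUID mapped to the set of members who have read it

def mark_read(message_id, user_id):
    read_by.setdefault(message_id, set()).add(user_id)        # SADD, O(1)

def has_read(message_id, user_id):
    return user_id in read_by.get(message_id, set())          # SISMEMBER, O(1)

def all_read(message_id):
    return len(read_by.get(message_id, set())) == GROUP_SIZE  # SCARD check

mark_read("msg-1", "user-7")
assert has_read("msg-1", "user-7")
assert not has_read("msg-1", "user-8")
assert not all_read("msg-1")   # blue tick only once all 1024 members have read
```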

&lt;p&gt;Memory Management — Two layer cleanup strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eager deletion — all 1024 members have read — delete Redis Set immediately, no point keeping it&lt;/li&gt;
&lt;li&gt;TTL safety net — 30 day TTL catches messages that some members never read, orphaned entries automatically cleaned up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Redis Set gives O(1) membership checking and cardinality counting — perfect for tracking who has read a message. Eager deletion plus TTL prevents memory from growing unbounded.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 7: Online Status and the Stale Presence Problem
&lt;/h2&gt;

&lt;p&gt;Interview Question: 2 billion users opening and closing WhatsApp constantly. Every open means online, every close means last seen with timestamp. But what happens when a phone crashes or loses internet without explicitly sending an offline signal? The server still shows the user as online forever.&lt;/p&gt;

&lt;p&gt;This is called the stale presence problem.&lt;/p&gt;

&lt;p&gt;Wrong Approach: Store online or offline status in Redis and update on app open and close.&lt;/p&gt;

&lt;p&gt;Why It Fails: App crashes and network drops never trigger an explicit offline signal. Status gets stuck as online indefinitely.&lt;/p&gt;

&lt;p&gt;Navigation: If you cannot rely on an explicit offline signal, you need a mechanism where online status automatically expires unless actively refreshed. TTL on the Redis entry solves this — but only if something keeps refreshing it while the user is genuinely online.&lt;/p&gt;

&lt;p&gt;Solution: Heartbeat mechanism with short TTL.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User opens WhatsApp — Redis sets User A to online with 30 second TTL&lt;/li&gt;
&lt;li&gt;App sends heartbeat ping every 20 seconds — refreshes TTL&lt;/li&gt;
&lt;li&gt;User closes app normally — app sends explicit offline signal — Redis updates to last seen with timestamp&lt;/li&gt;
&lt;li&gt;App crashes or loses internet — heartbeats stop — TTL expires after 30 seconds — status automatically becomes offline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 20 and 30 second rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Heartbeat interval 20 seconds — always refreshes before TTL expires&lt;/li&gt;
&lt;li&gt;TTL 30 seconds — buffer for network hiccups but short enough to detect crashes quickly&lt;/li&gt;
&lt;/ul&gt;
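&lt;p&gt;The heartbeat-plus-TTL behavior can be sketched with explicit timestamps standing in for Redis key expiry (names hypothetical):&lt;/p&gt;

```python
TTL = 30  # seconds

presence = {}  # user id mapped to the timestamp when its TTL lapses

def heartbeat(user, now):
    presence[user] = now + TTL    # like SET user online EX 30 in Redis

def is_online(user, now):
    return presence.get(user, 0) > now   # expired entry reads as offline

heartbeat("userA", now=0)
assert is_online("userA", now=10)    # within TTL
heartbeat("userA", now=20)           # 20 second heartbeat refreshes the TTL
assert is_online("userA", now=40)
# App crashes at t=20: heartbeats stop, so the TTL lapses at t=50.
assert not is_online("userA", now=55)
```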

&lt;p&gt;Key Insight: Heartbeat plus TTL is the standard pattern for presence detection in distributed systems. Never rely on explicit disconnect signals alone — networks and devices are too unreliable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Architecture Summary
&lt;/h2&gt;

&lt;p&gt;Real time messaging — WebSocket persistent connections for active users&lt;br&gt;
Offline delivery — FCM and AWS SNS push notifications&lt;br&gt;
Message routing — Redis lookup table mapping users to WebSocket servers&lt;br&gt;
Message durability — Database persistence with explicit ACK pattern&lt;br&gt;
Duplicate prevention — Server generated UUID with 30 day TTL&lt;br&gt;
Group fan-out — Kafka async processing via dedicated Group Service&lt;br&gt;
Group read receipts — Redis Set per message with eager deletion and TTL&lt;br&gt;
Online presence — Redis heartbeat with 30 second TTL auto expiry&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;WhatsApp at 2 billion users is a masterclass in combining simple building blocks — WebSockets, Redis, Kafka, and a database — into a system that feels effortless to the end user. Every feature that seems trivial on the surface hides a distributed systems challenge underneath.&lt;/p&gt;

&lt;p&gt;The recurring theme throughout this design is that reliability requires explicit confirmation at every step. ACKs for delivery, UUIDs for deduplication, heartbeats for presence — nothing is assumed, everything is verified.&lt;/p&gt;

&lt;p&gt;Happy building. 🚀&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>interview</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Social Media Feed at Scale A System Design Deep Dive — Question by Question</title>
      <dc:creator>shubham pandey (Connoisseur)</dc:creator>
      <pubDate>Wed, 11 Mar 2026 10:19:50 +0000</pubDate>
      <link>https://forem.com/shubham_pandeyconnoisse/social-media-feed-at-scalea-system-design-deep-dive-question-by-question-140k</link>
      <guid>https://forem.com/shubham_pandeyconnoisse/social-media-feed-at-scalea-system-design-deep-dive-question-by-question-140k</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A social media feed seems simple on the surface — show the latest tweets from people you follow. But at 300 million users, it becomes one of the most challenging distributed systems problems in software engineering. This post walks through the real complexity challenge by challenge, including the wrong turns and how to navigate out of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 1: The Naive Feed Approach
&lt;/h2&gt;

&lt;p&gt;Interview Question: When millions of users open their feed simultaneously — how would you naively build it and where does that break down?&lt;/p&gt;

&lt;p&gt;Wrong Approach: For each user opening the app, fetch latest tweets from all 500 followed accounts iteratively using a for loop. Do this for all 300 million users opening the app.&lt;/p&gt;

&lt;p&gt;Why It Fails: Let us do the math. 300 million users multiplied by 500 followed accounts equals 150 billion database lookups just to render feeds simultaneously. That is before anyone even posts a tweet. A single database melts instantly under this pressure.&lt;/p&gt;

&lt;p&gt;Key Insight: The naive pull approach is unusable at scale. Reading feed at request time means the database pays the price for every single app open.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 2: Flipping the Approach — Fan-out on Write
&lt;/h2&gt;

&lt;p&gt;Interview Question: Instead of each user pulling tweets when they open the app, what if you did the work upfront at write time?&lt;/p&gt;

&lt;p&gt;Navigation: After understanding that 150 billion lookups is unworkable, the natural flip is — what if the work happens when someone tweets instead of when someone reads? Push the tweet to all followers at write time so that reading the feed becomes a simple instant lookup.&lt;/p&gt;

&lt;p&gt;Solution: When someone posts a tweet, the Feed Service pushes it to all followers' personal Redis queues immediately. When a user opens the app they read directly from their own Redis queue. Feed load becomes a single cache lookup instead of 150 billion queries.&lt;/p&gt;

&lt;p&gt;This is called Fan-out on Write. Twitter calls this per-user queue the Home Timeline Cache, stored in Redis.&lt;/p&gt;

&lt;p&gt;Key Insight: Pre-computing the feed at write time trades write complexity for extremely fast reads. Feed load becomes O(1) instead of O(followers).&lt;/p&gt;
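&lt;p&gt;A minimal fan-out-on-write sketch, with a bounded deque per user standing in for the personal Redis queue (names and the queue cap are assumptions):&lt;/p&gt;

```python
from collections import defaultdict, deque

FEED_SIZE = 800  # keep only the most recent entries per user (assumed cap)

followers = {"alice": ["bob", "carol"]}
timelines = defaultdict(lambda: deque(maxlen=FEED_SIZE))  # per-user queue

def post_tweet(author, tweet_id):
    # Write-time fan-out: push to every follower's personal queue.
    for follower in followers.get(author, []):
        timelines[follower].appendleft(tweet_id)

def load_feed(user):
    return list(timelines[user])   # feed load is a single lookup, no joins

post_tweet("alice", "t1")
post_tweet("alice", "t2")
assert load_feed("bob") == ["t2", "t1"]    # newest first
assert load_feed("carol") == ["t2", "t1"]
```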




&lt;h2&gt;
  
  
  Challenge 3: The Celebrity Problem
&lt;/h2&gt;

&lt;p&gt;Interview Question: Cristiano Ronaldo has 150 million followers. He posts a tweet. Your Feed Service must now push that one tweet to 150 million Redis queues simultaneously. What happens?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Iterate through all followers and push to each queue. This is just the naive approach in reverse — 150 million write operations for one tweet. At 50 tweets per day that is 7.5 billion Redis writes for Ronaldo alone.&lt;/p&gt;

&lt;p&gt;Navigation: The key realization is that normal users and celebrity users have fundamentally different fan-out costs. We need to treat them differently.&lt;/p&gt;

&lt;p&gt;Solution: Hybrid approach — users above a follower threshold (e.g. 100K followers) are treated as celebrities. Normal accounts use fan-out on write as before. Celebrity accounts skip the personal queue entirely and their tweets are fetched differently at read time.&lt;/p&gt;

&lt;p&gt;Key Insight: One size does not fit all. High follower accounts need a completely different write strategy to prevent write amplification explosion.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 4: The Pagination Cursor Problem
&lt;/h2&gt;

&lt;p&gt;Interview Question: With the hybrid approach, your feed merges two sources — personal Redis queue for normal friends and a separate fetch for celebrity tweets. How do you paginate across two independent sources?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Sort on the client side app. Send all tweets from both sources to the phone and let it sort.&lt;/p&gt;

&lt;p&gt;Why It Fails: Sending 1500 tweets to 300 million mobile devices simultaneously destroys network bandwidth. Users on slow connections have to download everything before seeing anything.&lt;/p&gt;

&lt;p&gt;Second Wrong Approach: Push celebrity tweets to personal Redis queue only when the user opens the app to keep everything in one place.&lt;/p&gt;

&lt;p&gt;Why It Fails: If 50 million followers open the app simultaneously after Ronaldo tweets, you now have 50 million write operations triggered at app open time instead of tweet time. The explosion just moved, it did not disappear.&lt;/p&gt;

&lt;p&gt;Navigation: The pagination cursor problem with two sources is real but it is the smaller of the two evils compared to write amplification. The right move is to solve the cursor problem rather than abandon the hybrid approach.&lt;/p&gt;

&lt;p&gt;Solution: Store the per-user cursor for each source independently. After each scroll request return two cursors to the app — one for personal queue position and one for celebrity cache position. Next scroll request resumes from exactly where each source left off.&lt;/p&gt;
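&lt;p&gt;A sketch of the two-cursor merge, assuming both sources are sorted newest first by timestamp (names hypothetical):&lt;/p&gt;

```python
def load_page(personal, celebrity, cursors, page_size=3):
    # Each source keeps its own offset; merge by timestamp, newest first,
    # and hand back both cursors so the next scroll resumes exactly there.
    p, c = cursors
    merged = []
    while len(merged) != page_size:
        take_personal = (
            p != len(personal)
            and (c == len(celebrity) or personal[p][0] > celebrity[c][0])
        )
        if take_personal:
            merged.append(personal[p])
            p += 1
        elif c != len(celebrity):
            merged.append(celebrity[c])
            c += 1
        else:
            break  # both sources exhausted
    return merged, (p, c)

personal = [(50, "p1"), (30, "p2")]   # (timestamp, tweet), newest first
celebrity = [(40, "c1"), (20, "c2")]
page1, cursors = load_page(personal, celebrity, (0, 0))
assert [t for _, t in page1] == ["p1", "c1", "p2"]
page2, _ = load_page(personal, celebrity, cursors)
assert [t for _, t in page2] == ["c2"]
```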

&lt;p&gt;Key Insight: Multi-source pagination requires independent cursors per source. The complexity is worth it compared to the alternative of catastrophic write amplification.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 5: Shared Celebrity Cache
&lt;/h2&gt;

&lt;p&gt;Interview Question: Does every one of the 50 million followers of Ronaldo actually need their own personal copy of his tweet?&lt;/p&gt;

&lt;p&gt;Navigation: The hint here was powerful — if all 50 million users are reading the exact same tweet, storing 50 million identical copies is wasteful. What if there was one shared copy everyone reads from?&lt;/p&gt;

&lt;p&gt;Solution: Celebrity tweets are stored once in a shared Redis cache. All followers read from the same single cache entry. 50 million users opening the app simultaneously all hit one shared cache — zero write amplification, one read source.&lt;/p&gt;

&lt;p&gt;Final hybrid architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal friends tweets pushed to personal Redis queue per user&lt;/li&gt;
&lt;li&gt;Celebrity tweets stored in shared Redis cache once&lt;/li&gt;
&lt;li&gt;At feed load server merges personal queue and relevant celebrity caches&lt;/li&gt;
&lt;li&gt;Follower to celebrity mapping stored in cache with DynamoDB as fallback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Shared cache for celebrity tweets eliminates write amplification entirely. One write serves 150 million readers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 6: Follower Mapping Storage
&lt;/h2&gt;

&lt;p&gt;Interview Question: How does the server know which celebrity caches to fetch when a user opens their feed? Where do you store the mapping of which celebrities each user follows?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Query DynamoDB on every app open.&lt;/p&gt;

&lt;p&gt;Why It Fails: DynamoDB adds unnecessary latency for data that rarely changes. You do not unfollow someone every minute.&lt;/p&gt;

&lt;p&gt;Navigation: Follower mapping is read on every single app open, changes very infrequently, and is the same data read repeatedly. This is a perfect cache use case.&lt;/p&gt;

&lt;p&gt;Solution: Store follower mapping in Redis cache. On cache miss fall back to DynamoDB and reload into cache. This is the cache-aside pattern — cache as the fast layer, DynamoDB as the source of truth underneath.&lt;/p&gt;

&lt;p&gt;Key Insight: Cache-aside pattern is ideal for data that is read frequently but updated rarely. Always have a persistent fallback for cache misses.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 7: Trending Topics — The Counting Problem
&lt;/h2&gt;

&lt;p&gt;Interview Question: Twitter shows trending hashtags from the last hour. At 6000 tweets per second with 3 hashtags each that is 18000 hashtag events per second. Running a COUNT query on your main database every few minutes — what is wrong with this?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Query the main tweet database every few minutes counting hashtag occurrences in the last hour.&lt;/p&gt;

&lt;p&gt;Why It Fails: An expensive aggregation query competes directly with 6000 writes per second on the same database. The database gets crushed under simultaneous heavy reads and writes.&lt;/p&gt;

&lt;p&gt;Second Approach: Store each hashtag in a separate database and increment its counter on every mention.&lt;/p&gt;

&lt;p&gt;Problem: 18000 counter increments per second on the same rows causes race conditions. Two requests read the same counter value simultaneously and both try to increment — one update gets lost. Adding locks solves correctness but serializes 18000 operations per second, destroying throughput.&lt;/p&gt;

&lt;p&gt;Navigation: The hint was — do you even need a database for counting? Counting does not need to be persistent. Trending from 6 months ago is useless. What if counting lived entirely in memory with a data structure built for atomic increments and automatic sorting?&lt;/p&gt;

&lt;p&gt;Solution: Redis Sorted Set. Each hashtag is a member, its mention count is the score. ZINCRBY atomically increments the score with no locking needed. ZREVRANGE returns top N hashtags instantly by score.&lt;/p&gt;
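&lt;p&gt;The counting pattern can be sketched with collections.Counter standing in for the sorted set: incrementing a count mirrors ZINCRBY and most_common mirrors ZREVRANGE (names hypothetical):&lt;/p&gt;

```python
from collections import Counter

trending = Counter()

def on_hashtag(tag):
    trending[tag] += 1   # like ZINCRBY trending 1 tag (atomic in Redis)

for tag in ["ai", "ai", "rust", "ai", "rust", "go"]:
    on_hashtag(tag)

# Top 2 hashtags by score, like ZREVRANGE trending 0 1 WITHSCORES.
assert trending.most_common(2) == [("ai", 3), ("rust", 2)]
```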

&lt;p&gt;Key Insight: Redis Sorted Set replaces a database entirely for counting and ranking. Atomic score increments eliminate race conditions without any locking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 8: The Sliding Time Window Problem
&lt;/h2&gt;

&lt;p&gt;Interview Question: Trending should reflect only the last 60 minutes. If you just keep incrementing scores forever, hashtags from yesterday pollute your trending list. How do you make scores reflect only recent mentions?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Give each hashtag a 24 hour TTL in Redis.&lt;/p&gt;

&lt;p&gt;Why It Fails: TTL deletes the entire key after 24 hours. It does not expire individual mentions within the window. A hashtag with 5 million mentions accumulated over 24 hours still dominates trending even if nobody mentioned it in the last 60 minutes.&lt;/p&gt;

&lt;p&gt;Navigation: Instead of expiring the whole hashtag, expire individual mentions. Each mention has a timestamp. Remove mentions older than 60 minutes from the count. Redis Sorted Set supports exactly this with score as timestamp.&lt;/p&gt;

&lt;p&gt;Solution: Two Redis Sorted Sets working together.&lt;/p&gt;

&lt;p&gt;Per hashtag sorted set tracks the time window:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Member = unique tweet ID&lt;/li&gt;
&lt;li&gt;Score = Unix timestamp of the mention&lt;/li&gt;
&lt;li&gt;ZREMRANGEBYSCORE removes mentions older than 60 minutes&lt;/li&gt;
&lt;li&gt;ZCOUNT returns exact mention count in the sliding window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Global trending sorted set tracks the ranking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Member = hashtag name&lt;/li&gt;
&lt;li&gt;Score = current mention count from the time window&lt;/li&gt;
&lt;li&gt;ZREVRANGE returns top 10 trending hashtags instantly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every new mention updates both structures keeping the sliding window accurate in near real time.&lt;/p&gt;
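&lt;p&gt;A sketch of the sliding window, with a list of timestamps per hashtag standing in for the per-hashtag sorted set (names hypothetical):&lt;/p&gt;

```python
WINDOW = 3600  # 60 minutes

mentions = {}  # hashtag mapped to its list of mention timestamps
ranking = {}   # global trending scores, one per hashtag

def record_mention(tag, now):
    mentions.setdefault(tag, []).append(now)
    prune(tag, now)

def prune(tag, now):
    cutoff = now - WINDOW
    # Drop mentions older than the window, like ZREMRANGEBYSCORE.
    mentions[tag] = [t for t in mentions[tag] if t > cutoff]
    # Surviving count, like ZCOUNT, feeds the global ranking set.
    ranking[tag] = len(mentions[tag])

record_mention("worldcup", now=0)
record_mention("worldcup", now=100)
assert ranking["worldcup"] == 2
prune("worldcup", now=4000)   # the first two mentions age out of the hour
assert ranking["worldcup"] == 0
```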

&lt;p&gt;Key Insight: Sliding window is achieved by using timestamp as score and ZREMRANGEBYSCORE to expire old mentions. Two sorted sets separate the concerns of time windowing and global ranking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 9: Async Processing with Kafka
&lt;/h2&gt;

&lt;p&gt;Interview Question: Updating Redis sorted sets for every hashtag at 18000 operations per second — should this happen synchronously while the user waits for their tweet to post?&lt;/p&gt;

&lt;p&gt;Navigation: The user posts a tweet and immediately gets a response. Hashtag counting is a background concern — the user should never wait for it. This calls for asynchronous processing with a message queue in between.&lt;/p&gt;

&lt;p&gt;Solution: Kafka sits between the tweet service and the hashtag counting service.&lt;/p&gt;

&lt;p&gt;Flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User posts tweet&lt;/li&gt;
&lt;li&gt;Tweet service saves tweet and publishes hashtag event to Kafka instantly&lt;/li&gt;
&lt;li&gt;User gets immediate response — they never wait for hashtag counting&lt;/li&gt;
&lt;li&gt;Hashtag consumer service reads from Kafka and updates Redis sorted sets&lt;/li&gt;
&lt;li&gt;If consumer goes down Kafka holds all events — nothing is lost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why Kafka over a database or Redis queue:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores events in order&lt;/li&gt;
&lt;li&gt;Multiple consumers can read independently&lt;/li&gt;
&lt;li&gt;Events survive consumer downtime&lt;/li&gt;
&lt;li&gt;Handles millions of events per second effortlessly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Kafka decouples the tweet service from hashtag processing entirely. The tweet service never blocks and hashtag counting scales independently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 10: Real Time Push Notifications
&lt;/h2&gt;

&lt;p&gt;Interview Question: Twitter notifies you within seconds when someone likes your tweet. 300 million phones are open right now. The naive polling approach — each phone asks the server every 5 seconds for new notifications — generates 60 million requests per second, most returning empty. How do you push notifications instantly without polling?&lt;/p&gt;

&lt;p&gt;Wrong Approach: Each phone polls the server every few seconds asking for new notifications.&lt;/p&gt;

&lt;p&gt;Why It Fails: 300 million phones polling every 5 seconds is 60 million requests per second of pure wasted load. The vast majority return empty responses.&lt;/p&gt;

&lt;p&gt;Navigation: Instead of phones asking the server, the server should tell the phones. This requires a persistent connection — the phone connects once and keeps that connection open so the server can push anytime.&lt;/p&gt;

&lt;p&gt;Solution: AWS SNS or Google FCM handles persistent connections to all mobile devices at scale. No need to reinvent this — cloud providers have already solved it.&lt;/p&gt;

&lt;p&gt;Notification flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User likes a tweet&lt;/li&gt;
&lt;li&gt;Kafka event published&lt;/li&gt;
&lt;li&gt;Notification service consumes from Kafka&lt;/li&gt;
&lt;li&gt;Notification service calls AWS SNS or Google FCM&lt;/li&gt;
&lt;li&gt;SNS or FCM pushes notification to phone instantly via persistent connection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key Insight: Never reinvent infrastructure that cloud providers have already solved at scale. AWS SNS and Google FCM handle billions of push notifications daily — use them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Architecture Summary
&lt;/h2&gt;

&lt;p&gt;Feed generation — Fan-out on write to personal Redis queue per user&lt;br&gt;
Celebrity tweets — Shared Redis cache read by all followers&lt;br&gt;
Follower mapping — Redis cache with DynamoDB fallback&lt;br&gt;
Feed merge — Server side merge of personal queue and celebrity cache&lt;br&gt;
Pagination — Independent cursors per source returned to client&lt;br&gt;
Trending computation — Kafka streaming to Redis Sorted Set&lt;br&gt;
Time window — Per hashtag sorted set with timestamp as score&lt;br&gt;
Global ranking — Single global trending sorted set updated in real time&lt;br&gt;
Async processing — Kafka decouples tweet service from hashtag service&lt;br&gt;
Push notifications — AWS SNS and Google FCM for instant mobile delivery&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Twitter's feed looks like a simple list of posts. Underneath it is a carefully orchestrated system of pre-computed caches, hybrid architectures, sliding time windows, async pipelines, and cloud push infrastructure — all working together to make everything feel instant.&lt;/p&gt;

&lt;p&gt;The most valuable lesson from this design is that wrong answers are not failures — they are navigation tools. Every wrong approach revealed exactly why the correct approach exists. That is how real system design thinking works.&lt;/p&gt;

&lt;p&gt;Happy building. 🚀&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>interview</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>A System Design Deep Dive — Question by Question</title>
      <dc:creator>shubham pandey (Connoisseur)</dc:creator>
      <pubDate>Wed, 11 Mar 2026 03:06:57 +0000</pubDate>
      <link>https://forem.com/shubham_pandeyconnoisse/a-system-design-deep-dive-question-by-question-11mh</link>
      <guid>https://forem.com/shubham_pandeyconnoisse/a-system-design-deep-dive-question-by-question-11mh</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A URL shortener seems deceptively simple — take a long URL, return a short one. But at scale, it hides some of the most fascinating distributed systems challenges in software engineering. This post walks through the real complexity, challenge by challenge.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 1: Scaling Under Heavy Traffic
&lt;/h2&gt;

&lt;p&gt;Interview Question: When millions of users are simultaneously shortening URLs and millions more are clicking short links — how do you ensure the system stays fast and doesn't become a bottleneck?&lt;/p&gt;

&lt;p&gt;The naive approach is a single server handling everything. The moment traffic spikes, you hit a wall. The fix is horizontal scaling — a load balancer distributes incoming requests across multiple application servers. But this raises an immediate follow-up: what about the database?&lt;/p&gt;

&lt;p&gt;Key Insight: Horizontal scaling solves app-layer pressure, but the database becomes the next bottleneck if left as a single instance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 2: The Read/Write Imbalance
&lt;/h2&gt;

&lt;p&gt;Interview Question: If all application servers point to one single database for both reads and writes — what happens under heavy read traffic? Redirects outnumber URL creation by roughly 100:1.&lt;/p&gt;

&lt;p&gt;A URL shortener is an extremely read-heavy system. For every person shortening a URL, roughly 100 people are clicking it. A single database will buckle under that read pressure. The solution is to treat reads and writes differently. Most reads are for the same popular URLs repeatedly — which is exactly what caching is built for.&lt;/p&gt;

&lt;p&gt;Key Insight: Reads and writes have fundamentally different patterns and must be architected independently. Caching is the most powerful lever for read-heavy systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 3: Cache Misses and the Cold Start Problem
&lt;/h2&gt;

&lt;p&gt;Interview Question: Your Redis cache is cold. You have 500 million unique short URLs — you can't cache all of them. What stays in cache, and what happens when a miss falls through to the database?&lt;/p&gt;

&lt;p&gt;Even with caching, misses happen. Every miss hits the database. The database needs to be horizontally scalable too — which is why NoSQL databases like Cassandra or DynamoDB are popular here. They are designed to scale out across many nodes, handling reads across distributed partitions.&lt;/p&gt;

&lt;p&gt;Key Insight: NoSQL provides horizontal scalability at the storage layer, acting as the safety net for cache misses at any scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 4: Choosing the Right Cache Eviction Strategy
&lt;/h2&gt;

&lt;p&gt;Interview Question: Your cache is full. A new URL needs space. Which entry do you evict — and does your algorithm reflect real-world URL access patterns?&lt;/p&gt;

&lt;p&gt;Each strategy alone falls short:&lt;/p&gt;

&lt;p&gt;FIFO — evicts the oldest entry, ignores popularity and recency entirely&lt;br&gt;
LFU — a viral URL from 3 months ago that is now dead stays in cache forever&lt;br&gt;
LRU — a URL accessed 1M times but not hit in 2 hours gets evicted over a rarely accessed recent one&lt;/p&gt;

&lt;p&gt;The optimal strategy combines both frequency and recency — evict the entry that is infrequently accessed AND hasn't been accessed recently. This is the principle behind W-TinyLFU, the admission policy used in production by the Caffeine cache library; Redis itself ships an approximated LFU eviction mode built on the same idea.&lt;/p&gt;

&lt;p&gt;Key Insight: W-TinyLFU (hybrid LFU + LRU) is the gold standard for cache eviction, combining frequency and recency for smarter decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 5: Unique ID Generation Across Distributed Nodes
&lt;/h2&gt;

&lt;p&gt;Interview Question: Multiple application servers generate short codes simultaneously. How do you ensure no two servers generate the same short code for different URLs?&lt;/p&gt;

&lt;p&gt;A central auto-increment counter seems obvious — but it becomes a single point of failure. Master-slave replication helps with availability, but async replication risks duplicate IDs being issued after a failover.&lt;/p&gt;

&lt;p&gt;Follow-up: Can you design the system so each node generates IDs independently without coordinating on every request?&lt;/p&gt;

&lt;p&gt;The elegant solution is Range-Based ID Allocation. The counter service hands each node a range (e.g., Node A gets 1–1000, Node B gets 1001–2000). Each node generates IDs independently from its range. When a node exhausts its range, it requests a new batch. Counter service is called infrequently — not in the hot path. If the counter service goes down briefly, nodes keep generating from their existing range. ID gaps don't matter — short codes are opaque to users anyway.&lt;/p&gt;
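&lt;p&gt;A minimal sketch of range-based allocation plus base-62 encoding of the numeric id (names, batch size, and alphabet are assumptions):&lt;/p&gt;

```python
import itertools
import string

BATCH = 1000
_next_range_start = itertools.count(1, BATCH)  # the coordinator's counter

def allocate_range():
    # Called only when a node exhausts its block, never in the hot path.
    start = next(_next_range_start)
    return iter(range(start, start + BATCH))

class Node:
    def __init__(self):
        self.ids = allocate_range()

    def next_id(self):
        try:
            return next(self.ids)
        except StopIteration:          # block exhausted: fetch a fresh range
            self.ids = allocate_range()
            return next(self.ids)

def to_short_code(n, alphabet=string.digits + string.ascii_letters):
    # Base-62 encode the numeric id into an opaque short code.
    code = ""
    while n:
        n, rem = divmod(n, 62)
        code = alphabet[rem] + code
    return code or alphabet[0]

a, b = Node(), Node()
assert a.next_id() == 1
assert b.next_id() == 1001   # disjoint ranges, no per-request coordination
assert to_short_code(125) == "21"
```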

&lt;p&gt;Key Insight: Range-based ID allocation decentralizes generation while maintaining global uniqueness — used by Twitter, Instagram, and many others at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 6: 301 vs 302 Redirects — A Business Decision
&lt;/h2&gt;

&lt;p&gt;Interview Question: 301 is a permanent redirect — browsers cache it, reducing server load. But what does 301 silently break for businesses using your service?&lt;/p&gt;

&lt;p&gt;Once a browser caches a 301, it never contacts your servers again for that URL. Analytics die — you cannot track clicks, geography, device type, or referrer. URL updating breaks — if a business wants to change the destination mid-campaign, users with cached 301s will never see the update. 302 ensures every click hits your servers first. Yes, there is a small overhead — but for a service where analytics and flexibility are the core value proposition, 302 is the only sensible choice.&lt;/p&gt;

&lt;p&gt;Key Insight: 302 preserves analytics and URL mutability — essential for businesses running campaigns. The slight latency cost is worth the business value.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 7: Malicious URL Protection
&lt;/h2&gt;

&lt;p&gt;Interview Question: A bad actor shortens a phishing URL. Millions of users click it. How do you protect users — and what about URLs that were clean when shortened but become malicious later?&lt;/p&gt;

&lt;p&gt;A purely reactive approach leaves a dangerous time window. The right approach is layered defense. At creation time, check against a 3rd party malicious URL database like Google Safe Browsing API before accepting the URL. Periodic re-scanning re-checks existing URLs regularly since clean URLs can turn malicious later. Reactive blocking via user reports and team verification acts as the final safety net.&lt;/p&gt;

&lt;p&gt;Key Insight: Defense in depth shrinks the harmful time window dramatically. No single layer is enough on its own.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenge 8: The Thundering Herd / Cache Stampede
&lt;/h2&gt;

&lt;p&gt;Interview Question: A celebrity tweets your short URL to 50 million followers simultaneously. The URL was just created — cache is cold. What happens to your database at that exact moment?&lt;/p&gt;

&lt;p&gt;This is not a cold start problem — it is a Cache Stampede. Cold start means cache is empty and traffic arrives gradually so the database warms up slowly. Cache stampede means cache is empty AND millions hit simultaneously in one instant — the database gets obliterated in one shot.&lt;/p&gt;

&lt;p&gt;Follow-up: How do you make only one request go to the database and make the rest wait for that result?&lt;/p&gt;

&lt;p&gt;The solution is Cache Locking. The first request misses cache, sets an IN_PROGRESS flag in Redis, then goes to the database. All subsequent requests see IN_PROGRESS and wait. The first request returns, populates the cache, removes the flag, and all waiting requests are served from cache instantly. One database hit instead of one million.&lt;/p&gt;

&lt;p&gt;Follow-up: What if the first request crashes after setting the flag but before populating the cache?&lt;/p&gt;

&lt;p&gt;Give the IN_PROGRESS flag a TTL equal to the expected database response time plus a small buffer — for example 600ms if your DB responds in 500ms. If the request crashes, the flag expires automatically. No manual cleanup, no deadlocks.&lt;/p&gt;

&lt;p&gt;Key Insight: Cache locking with TTL-based expiry prevents thundering herd without any risk of deadlock — a production pattern used at Facebook and Twitter scale.&lt;/p&gt;
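
&lt;p&gt;The whole flow fits in a short sketch. A plain dict stands in for Redis here, and the crucial caveat is in the comment: in production the lock claim must be a single atomic operation (Redis &lt;code&gt;SET key token NX PX ttl&lt;/code&gt;), not the two-step check-then-set a dict allows.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

cache = {}   # stand-in for Redis: code -&gt; long URL, or ("IN_PROGRESS", expiry)

def resolve(code, load_from_db, ttl=0.6):
    """Return the long URL for a short code, stampede-safely."""
    while True:
        entry = cache.get(code)
        if entry is None:
            # Miss: claim the lock with a TTL, then hit the database.
            cache[code] = ("IN_PROGRESS", time.monotonic() + ttl)
            value = load_from_db(code)
            cache[code] = value        # populate and release in one step
            return value
        if isinstance(entry, tuple):   # another request is loading
            _, expiry = entry
            if time.monotonic() &gt; expiry:
                del cache[code]        # loader crashed; the TTL frees the lock
                continue
            time.sleep(0.01)           # wait, then re-check the cache
            continue
        return entry                   # hit: served without touching the DB

db_hits = []
def fake_db(code):
    db_hits.append(code)
    return "https://example.com/long/" + code

print(resolve("abc", fake_db))   # first call: exactly one DB hit
print(resolve("abc", fake_db))   # second call: served from cache
print(len(db_hits))              # 1
&lt;/code&gt;&lt;/pre&gt;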




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;App layer scaling — Horizontal scaling via load balancer&lt;br&gt;
Database scaling — NoSQL with horizontal sharding&lt;br&gt;
Cache eviction — W-TinyLFU hybrid combining frequency and recency&lt;br&gt;
Unique ID generation — Range-based allocation across nodes&lt;br&gt;
Counter availability — Infrequent calls plus node breathing room&lt;br&gt;
Redirect strategy — 302 for analytics and URL flexibility&lt;br&gt;
Malicious URLs — Third-party scanning plus periodic recheck plus reactive blocking&lt;br&gt;
Cache stampede — Cache locking with TTL-based expiry&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;A URL shortener is one of the most deceptively deep system design problems. On the surface it is a key-value store. Underneath it forces you to confront horizontal scaling, caching theory, distributed ID generation, HTTP semantics, security, and concurrency all at once. The most important skill is not knowing the answers — it is questioning your own assumptions. Every solution reveals a new edge case. That iterative thinking is what separates good system design from great system design.&lt;/p&gt;

&lt;p&gt;Happy building.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>interview</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>The Game Theory of Corporate Growth: Why Being a 10x Engineer Gets You Nowhere</title>
      <dc:creator>shubham pandey (Connoisseur)</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:16:47 +0000</pubDate>
      <link>https://forem.com/shubham_pandeyconnoisse/the-game-theory-of-corporate-growth-why-being-a-10x-engineer-gets-you-nowhere-5apd</link>
      <guid>https://forem.com/shubham_pandeyconnoisse/the-game-theory-of-corporate-growth-why-being-a-10x-engineer-gets-you-nowhere-5apd</guid>
      <description>&lt;p&gt;&lt;em&gt;Why the brilliant SDE grinding at 2am keeps getting passed over — and what game theory tells us about it.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  You Think You’re Playing Chess. You’re Actually Playing Poker.
&lt;/h2&gt;

&lt;p&gt;Most engineers believe the promotion formula is simple:&lt;br&gt;
&lt;strong&gt;Ship fast. Write clean code. Close tickets. Get promoted.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s not. And the reason isn’t unfairness or a bad manager. The reason is &lt;strong&gt;game theory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You are not in a solo performance evaluation. You are in a &lt;strong&gt;multi-player strategic game&lt;/strong&gt; where your outcome depends not just on what you do — but on what everyone else does, and what they perceive you to be doing. &lt;/p&gt;

&lt;p&gt;The moment you understand this, everything changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Games Every SDE Is Playing Simultaneously
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Game 1: The Visibility Game
&lt;/h3&gt;

&lt;p&gt;Your output only has value if the right people know it exists. Your manager doesn’t see your code at 2am. They see what surfaces in standups, Slack, and demos. &lt;strong&gt;Information is asymmetric&lt;/strong&gt; — and the player who controls information flow has enormous power.&lt;/p&gt;

&lt;h3&gt;
  
  
  Game 2: The Coalition Game
&lt;/h3&gt;

&lt;p&gt;Promotions are not decided by one person. They emerge from a &lt;strong&gt;coalition of voices&lt;/strong&gt; — your manager, skip-level, peer reviewers, and cross-functional partners. The question that determines your career is not “what did you build?” It is &lt;strong&gt;“who is speaking for you when you’re not in the room?”&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Game 3: The Signaling Game
&lt;/h3&gt;

&lt;p&gt;Others cannot directly observe your competence. They read &lt;strong&gt;signals&lt;/strong&gt; — your confidence in design reviews, the problems you volunteer for, the language in your documents. Competence without signal is invisible competence. It counts for nothing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Game 4: The Reputation Game
&lt;/h3&gt;

&lt;p&gt;This is a &lt;strong&gt;repeated game&lt;/strong&gt;. Every interaction updates someone’s mental model of you. Miss one deadline loudly and it sticks for quarters. Nail ten things quietly and it fades by Friday. The compound interest of reputation is brutally asymmetric.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Pure Competence Is a Dominated Strategy
&lt;/h2&gt;

&lt;p&gt;In game theory, a &lt;strong&gt;dominated strategy&lt;/strong&gt; is one that always produces worse outcomes than an alternative — no matter what other players do. Relying purely on technical output is a dominated strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meet Aarav and Riya:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aarav&lt;/strong&gt; writes flawless code, resolves P0s at midnight, and never misses a deadline. He believes the work speaks for itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Riya&lt;/strong&gt; ships solid work, narrates her decisions in design reviews, builds relationships across teams, and makes her impact legible to leadership.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a pure meritocracy, Aarav wins. In the actual game — where promotions require coalition, signal, and narrative — &lt;strong&gt;Riya wins&lt;/strong&gt;. Not because she gamed the system, but because she played the real game while Aarav played an imaginary one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Nash Equilibrium Nobody Tells You About
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Nash Equilibrium&lt;/strong&gt; is a stable state where no player can improve their outcome by unilaterally changing strategy, given what everyone else is doing. &lt;/p&gt;

&lt;p&gt;In most companies, the equilibrium looks like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Strong technical output + visibility + internal allies = best stable outcome&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If everyone around you is playing the visibility game, the engineer who opts out unilaterally loses ground — even if their code is objectively better. This is the trap. Refusing to play visibility games feels principled, but in game theory, &lt;strong&gt;refusing to play is still a move.&lt;/strong&gt; And it’s a losing one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Competent SDEs Get Skipped for Early Promotion
&lt;/h2&gt;

&lt;p&gt;Here is the central paradox: the most technically skilled SDEs are often the worst at getting promoted early. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;They optimize the wrong variable:&lt;/strong&gt; They go deep on code quality and system design. These matter — but they are necessary, not sufficient.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They underinvest in repeated interactions:&lt;/strong&gt; Relationship-building is a repeated game. An engineer with 50 low-stakes positive interactions with a VP has more influence than one with a single brilliant architecture discussion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They mistake legibility for self-promotion:&lt;/strong&gt; Making work understandable to non-engineers is not bragging — it is &lt;strong&gt;translation&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They ignore the coalition structure:&lt;/strong&gt; Promotion committees don’t see your code. They see the narrative constructed during calibration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They play a one-shot game in a repeated environment:&lt;/strong&gt; They treat each quarter as independent, ignoring that reputation compounds over years.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Five Moves That Actually Drive Early Promotion
&lt;/h2&gt;

&lt;p&gt;Competence is the entry ticket. Here is what separates early promotions from everyone else:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Narrate your reasoning, not just your output.&lt;/strong&gt; Don’t just close the ticket. Write one paragraph on why you made the tradeoff you did. &lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Make other people’s wins possible.&lt;/strong&gt; Unblock colleagues publicly and attribute wins generously. This creates allies who advocate for you without being asked.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Own a problem, not a task.&lt;/strong&gt; Task-completers get rated "at level." Problem-owners get promoted above it. Volunteer for the ambiguous, messy things.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Manage upward with outcomes, not effort.&lt;/strong&gt; Give your manager ammunition. &lt;em&gt;"I reduced latency by 40ms, which unblocks the mobile team’s Q3 launch"&lt;/em&gt; is a promotion argument.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Build a personal board of directors.&lt;/strong&gt; Identify 3–5 senior people across functions who respect your work. Their affirmation in a calibration session is worth more than any single project.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Uncomfortable Conclusion
&lt;/h2&gt;

&lt;p&gt;The engineer who gets the early promotion is not always the best engineer. They are the engineer who understood the actual game being played — and played it well, while also being technically strong enough.&lt;/p&gt;

&lt;p&gt;This is not an argument to trade engineering excellence for office politics. It is an argument to &lt;strong&gt;stop treating excellence as sufficient when it is only necessary.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most dangerous belief in a software engineering career is that the work speaks for itself. &lt;strong&gt;It doesn’t. You have to speak for it.&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Evolution of Data: From Codd's Tables to the NoSQL Rebellion</title>
      <dc:creator>shubham pandey (Connoisseur)</dc:creator>
      <pubDate>Tue, 10 Mar 2026 04:54:52 +0000</pubDate>
      <link>https://forem.com/shubham_pandeyconnoisse/the-evolution-of-data-from-codds-tables-to-the-nosql-rebellion-gjp</link>
      <guid>https://forem.com/shubham_pandeyconnoisse/the-evolution-of-data-from-codds-tables-to-the-nosql-rebellion-gjp</guid>
      <description>&lt;p&gt;&lt;em&gt;A 4-minute history of how the internet broke the rules of data storage&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The World Before 1970: Organized Chaos
&lt;/h2&gt;

&lt;p&gt;Before the relational database existed, storing data was a deeply physical problem. Your application code had to know &lt;em&gt;exactly&lt;/em&gt; where data lived on disk — which sector, which byte offset. Move the data to new hardware and your entire application broke.&lt;/p&gt;

&lt;p&gt;Engineers called this &lt;strong&gt;Data Dependence&lt;/strong&gt;, and it made databases brittle, expensive to maintain, and nearly impossible to scale.&lt;/p&gt;

&lt;p&gt;Something had to change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2n8ehh1li5z4jzkjcws.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2n8ehh1li5z4jzkjcws.jpeg" alt=" " width="800" height="825"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Codd's Revolution: The Table Is Born
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx7kk6jc25419u41v5ng.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx7kk6jc25419u41v5ng.jpeg" alt=" " width="800" height="682"&gt;&lt;/a&gt;&lt;br&gt;
In 1970, IBM researcher &lt;strong&gt;Edgar F. Codd&lt;/strong&gt; published a paper that rewired how the industry thought about data. His idea was elegant: store everything in simple tables, link them with keys, and let a query language handle the rest. Developers would describe &lt;em&gt;what&lt;/em&gt; they wanted — not &lt;em&gt;where&lt;/em&gt; to find it.&lt;/p&gt;

&lt;p&gt;This gave birth to &lt;strong&gt;SQL&lt;/strong&gt; and the &lt;strong&gt;RDBMS&lt;/strong&gt;, backed by four ironclad guarantees known as ACID:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Atomicity&lt;/strong&gt; — A transaction completes fully or not at all. No half-written bank transfers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt; — Every write must obey the rules. No orphaned records.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation&lt;/strong&gt; — Concurrent users don't corrupt each other's data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durability&lt;/strong&gt; — Once saved, data survives crashes and power failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the 1980s, Oracle and IBM had turned this into the gold standard for banking, healthcare, and government. For twenty years, RDBMS was simply what a database &lt;em&gt;was&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Internet Breaks Everything
&lt;/h2&gt;

&lt;p&gt;Then a billion people came online simultaneously — and three walls appeared that RDBMS couldn't climb.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume.&lt;/strong&gt; Databases went from millions of records to trillions. Buying a bigger server worked until it didn't — and at the extreme end, no server was big enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Velocity.&lt;/strong&gt; A million users clicking "Like" at the same moment exposed a fatal flaw: RDBMS locks rows during writes to preserve accuracy. At internet scale, those locks became bottlenecks. Apps hung. Revenue evaporated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Variety.&lt;/strong&gt; Data got messy. JSON blobs, social graphs, user-generated content — none of it fit neatly into the rigid columns of a relational table.&lt;/p&gt;




&lt;h2&gt;
  
  
  The NoSQL Survival Move
&lt;/h2&gt;

&lt;p&gt;NoSQL wasn't invented in a lab. It was built in the trenches by companies literally outgrowing the planet's hardware.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2y23as0pmrogpu5fybm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2y23as0pmrogpu5fybm.jpeg" alt=" " width="800" height="479"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Google&lt;/strong&gt; needed to index the entire web, so they built &lt;strong&gt;Bigtable&lt;/strong&gt; — a wide-column store that spread data across thousands of commodity servers automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon&lt;/strong&gt; needed the "Add to Cart" button to work 100% of the time, even during server failures. Their &lt;strong&gt;Dynamo&lt;/strong&gt; paper introduced &lt;em&gt;eventual consistency&lt;/em&gt;: accept that two servers might briefly disagree, as long as the system never goes down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Facebook&lt;/strong&gt; needed to search hundreds of billions of messages instantly, so they open-sourced &lt;strong&gt;Cassandra&lt;/strong&gt; — a masterless, peer-to-peer database with no single point of failure.&lt;/p&gt;

&lt;p&gt;The trade-off was deliberate: sacrifice some of ACID's strict consistency guarantees in exchange for infinite horizontal scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Complexity Tax
&lt;/h2&gt;

&lt;p&gt;NoSQL solved scale. It introduced something harder to measure: &lt;strong&gt;complexity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Without schema enforcement, databases drifted into chaos as teams wrote inconsistently structured data over time. Without transactions, applications had to handle consistency logic themselves — complex retry loops, idempotency requirements, and 3am production incidents.&lt;/p&gt;

&lt;p&gt;DynamoDB's strict &lt;strong&gt;400KB item size limit&lt;/strong&gt; is a perfect example. Hit it with a large user profile and the naive fix — split it into multiple tables — defeats the whole point. The real solution is &lt;em&gt;vertical partitioning&lt;/em&gt;: split one fat record into multiple lean items under the same partition key, each accessed independently. It's faster, cheaper to query, and scales cleanly. But you have to know to do it. The complexity never disappears — it just moves from the database into the engineer's head.&lt;/p&gt;
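
&lt;p&gt;A sketch makes the layout concrete. The attribute names below are illustrative, and the &lt;code&gt;query&lt;/code&gt; function only mimics the shape of a DynamoDB Query (all items for one partition key, optionally narrowed by sort key); it is not the real SDK.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# One logical "user profile" split into lean items that share a
# partition key (pk) but differ by sort key (sk).
items = [
    {"pk": "USER#123", "sk": "PROFILE",  "name": "Asha", "plan": "pro"},
    {"pk": "USER#123", "sk": "SETTINGS", "theme": "dark", "locale": "en"},
    {"pk": "USER#123", "sk": "AVATAR",   "s3_key": "avatars/123.png"},
]

def query(pk, sk=None):
    """Mimics a DynamoDB Query: all items for a pk, or one exact sk."""
    hits = [i for i in items if i["pk"] == pk]
    if sk is not None:
        hits = [i for i in hits if i["sk"] == sk]
    return hits

# Reading just the settings touches one small item, never a fat
# 400KB-bound blob; the full profile is still a single Query on the pk.
print(len(query("USER#123")))                      # 3
print(query("USER#123", "SETTINGS")[0]["theme"])   # dark
&lt;/code&gt;&lt;/pre&gt;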




&lt;h2&gt;
  
  
  2026: The Convergence
&lt;/h2&gt;

&lt;p&gt;The war is over. Both sides won by becoming more like each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; now stores flexible JSON natively with full indexing, handles horizontal scaling through extensions like Citus, and covers most workloads that once required a NoSQL system. Meanwhile, &lt;strong&gt;DynamoDB and MongoDB&lt;/strong&gt; added ACID transactions — the very thing they abandoned to get fast.&lt;/p&gt;

&lt;p&gt;The modern approach is &lt;strong&gt;Polyglot Persistence&lt;/strong&gt;: use the right tool for each job within the same application.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Reach For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Financial data, billing, anything ACID-critical&lt;/td&gt;
&lt;td&gt;PostgreSQL, CockroachDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100M+ users, massive write volume&lt;/td&gt;
&lt;td&gt;DynamoDB, Cassandra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flexible or evolving data structures&lt;/td&gt;
&lt;td&gt;PostgreSQL JSONB, MongoDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sub-millisecond reads, caching&lt;/td&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Social graphs, recommendations&lt;/td&gt;
&lt;td&gt;Neo4j&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-text search&lt;/td&gt;
&lt;td&gt;Elasticsearch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unsure? Starting fresh?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;PostgreSQL. Always.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;We didn't invent NoSQL because Codd was wrong. We invented it because the internet introduced a &lt;em&gt;new physics of data&lt;/em&gt; — volumes and velocities his world didn't contain. RDBMS is the heavy-duty truck built for cargo. NoSQL is the racing car built for the track.&lt;/p&gt;

&lt;p&gt;In 2026, the smartest engineers don't pick a side. They know which vehicle the road demands.&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>data</category>
      <category>database</category>
      <category>sql</category>
    </item>
    <item>
      <title>Why Your Neighbor Screams “Goal!” Before You Do: A Deep Dive into System Strategy</title>
      <dc:creator>shubham pandey (Connoisseur)</dc:creator>
      <pubDate>Mon, 09 Mar 2026 00:25:08 +0000</pubDate>
      <link>https://forem.com/shubham_pandeyconnoisse/why-your-neighbor-screams-goal-before-you-do-a-deep-dive-into-system-strategy-2oah</link>
      <guid>https://forem.com/shubham_pandeyconnoisse/why-your-neighbor-screams-goal-before-you-do-a-deep-dive-into-system-strategy-2oah</guid>
      <description>&lt;p&gt;The Opening Scenario: More Than Just “Lag”&lt;br&gt;
It’s the 89th minute. The match is level. Fifty million people are watching the same striker bear down on goal. And then — your neighbor’s living room erupts. A primal roar rattles the shared wall. Five full seconds later, your phone buzzes: ⚽ GOAL!&lt;br&gt;
You already know. The surprise is dead. The moment is gone.&lt;br&gt;
Most people shrug and call it “lag.” Engineers nod and file it under “latency issues.” But both of those framings are too small. What just happened isn’t a technical glitch — it’s the visible collision of two irreconcilable information philosophies. Understanding the gap between them is one of the most clarifying exercises in systems design you’ll ever encounter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Part 1: The Emergency Broadcast Problem
&lt;/h3&gt;

&lt;p&gt;To make this concrete, let’s leave the stadium and visit a coastal town bracing for a category-four hurricane. City officials have a single, time-critical objective: warn every resident simultaneously. Two technologies sit on the table.&lt;/p&gt;

&lt;p&gt;Option A — The Physical Air-Raid Siren: A single mechanical horn mounted on a hillside. When triggered, a 130-decibel blast propagates outward at the speed of sound. Whether 10 people or 100,000 people live within range, the warning arrives at the same moment — within milliseconds of each other. It doesn’t know your name. It doesn’t know your address. It cannot personalize the message. It just broadcasts, and the physics of sound do the rest.&lt;/p&gt;

&lt;p&gt;Option B — The Automated Phone Tree: A sophisticated system that queries a resident database, dials each number individually, authenticates the call, and plays a personalized message — “Your street, Oak Avenue, is in Flood Zone B. Please evacuate to the high school on Elm Street.” It knows everything about you. It delivers exactly the right message to exactly the right person. And it will reach the last resident approximately 45 minutes after the first call goes out.&lt;/p&gt;

&lt;p&gt;The strategic conclusion is brutal: In a crisis where the first three minutes determine survival, a system optimized for personalization is, functionally, a system optimized for failure. No matter how good the message is, it doesn’t matter if the recipient is already underwater.&lt;/p&gt;

&lt;p&gt;This is the precise architectural tension behind your five-second spoiler. Your neighbor has the siren. Your smartphone has the phone tree. The siren wins — not because it’s superior technology, but because it’s solving a fundamentally different problem.&lt;/p&gt;

&lt;p&gt;The goal isn’t just to be fast. It’s to be first. And being first requires building for the peak moment, not the average case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Part 2: The Anatomy of “Spoilage”
&lt;/h3&gt;

&lt;p&gt;In live sports, information has a half-life. But unlike radioactive decay — gradual, probabilistic — the value of a goal notification experiences instantaneous, total collapse the moment an external source delivers the surprise. One second you’re holding anticipation. The next, the surprise is dead and the notification is worthless.&lt;/p&gt;

&lt;p&gt;To solve the problem, you must map every second of delay. There isn’t one culprit. There is a chain of them — the Pipeline of Spoilage.&lt;/p&gt;

&lt;p&gt;Stage 1 — The Physical Event (T+0ms)&lt;br&gt;
The ball crosses the line. At this moment, no computer system in the world has registered the goal. It exists only as atoms in motion. The clock starts here.&lt;/p&gt;

&lt;p&gt;Stage 2 — The Capture Tax (+40ms to 200ms)&lt;br&gt;
Stadium cameras running at 50–120 frames per second capture the event. The video is encoded, compressed using H.264 or H.265 codecs, and transmitted to the broadcast truck. Even before a single database is updated, you’re already 40–200 milliseconds behind physical reality.&lt;/p&gt;

&lt;p&gt;Stage 3 — The Verification Tax (+200ms to 3,000ms)&lt;br&gt;
Data providers like Opta, Stats Perform, or Genius Sports employ human “data scouts” who tag match events in real time, or increasingly use computer vision to detect goal-crossing events automatically. Either way, a confirmation step exists. The system must decide whether the ball actually crossed the line before sending an alert. In a routine goal, this is fast. In a VAR review, this is the step where entire minutes can disappear.&lt;/p&gt;

&lt;p&gt;Stage 4 — The Fan-Out Tax (+500ms to 5,000ms)&lt;br&gt;
The confirmed event must now reach 50 million subscribers. How a system architects this fan-out — centralized hub versus distributed edge nodes — is the single most consequential engineering decision in the entire stack. This is where the battle is won or lost.&lt;/p&gt;

&lt;p&gt;Stage 5 — The Last-Mile Delivery Tax (+50ms to 500ms)&lt;br&gt;
Your phone must be reached through a cellular tower, residential fiber, or public WiFi. If your app is in a “sleep” state, a wake signal must precede the actual data packet, adding another 200–400ms before the notification even begins rendering.&lt;/p&gt;

&lt;p&gt;Add up a reasonable combination of these taxes and you arrive at a 2–6 second gap from physical event to notification. This is not a bug report. It is a physics lesson.&lt;/p&gt;
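
&lt;p&gt;To see how the taxes stack, here is a back-of-the-envelope sum using mid-range figures from the stages above; the exact numbers are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Rough mid-range estimates (ms) for each stage of the pipeline.
taxes = {
    "capture": 120,        # encode plus backhaul to the broadcast truck
    "verification": 1500,  # human scout or computer-vision confirmation
    "fan_out": 2000,       # reaching tens of millions of subscribers
    "last_mile": 300,      # tower/WiFi plus app wake-up
}
total_ms = sum(taxes.values())
print(total_ms / 1000, "seconds")   # 3.92 seconds, inside the 2-6s window
&lt;/code&gt;&lt;/pre&gt;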

&lt;h3&gt;
  
  
  Part 3: The TV Paradox — Why 1970s Technology Beats Your Smartphone
&lt;/h3&gt;

&lt;p&gt;Here is the fact that causes the most cognitive dissonance among engineers: your neighbor’s television — a technology conceptually unchanged since the 1970s — consistently delivers live sports faster than a modern smartphone backed by cloud infrastructure worth billions of dollars.&lt;/p&gt;

&lt;p&gt;The resolution to this paradox is that TV and the internet are not competing implementations of the same idea. They are solving different problems using different physics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Television: “The River”&lt;/strong&gt;&lt;br&gt;
Traditional broadcasting pushes a single, continuous bitstream into the air via RF signal or down a coaxial cable. Whether one person or one hundred million people are watching, the signal propagates to all of them simultaneously. The receiver is a passive tap on a flowing river of data.&lt;/p&gt;

&lt;p&gt;The system has no idea you exist. It doesn’t know your name, your location, or your subscription status. It doesn’t care. It broadcasts, and you tune in. This indifference to the individual is not a limitation — it is the feature. Synchronicity at massive scale costs nothing additional when you’re broadcasting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Internet: “The Highway”&lt;/strong&gt;&lt;br&gt;
Your smartphone establishes a unique, encrypted, stateful connection between your device and a specific server. Every packet is addressed to your IP. The system must find you, route to you, verify your session token, and deliver your specific payload.&lt;/p&gt;

&lt;p&gt;When 50 million people want that same payload simultaneously — each requiring their own addressed delivery, their own session verification, their own routing path — you create what engineers call a Thundering Herd: a simultaneous stampede that clogs every highway at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hidden Advantage Nobody Talks About&lt;/strong&gt;&lt;br&gt;
There is a further advantage in the TV signal chain that rarely surfaces in these discussions: hardware signal decoding. A television or set-top box decodes video using dedicated silicon — Application-Specific Integrated Circuits (ASICs) running at near-zero latency. A streaming app on a smartphone is a software process competing for CPU cycles with the operating system, background tasks, push notification handlers, and dozens of other apps. The decode pipeline alone can introduce 500ms–2,000ms of additional buffer.&lt;/p&gt;

&lt;p&gt;Some premium streaming services have reduced their end-to-end latency to approximately 3–5 seconds using technologies like CMAF (Common Media Application Format) with low-latency HLS chunks. But “low-latency streaming” in this context means reducing from 30–45 seconds of buffer to 3–5 seconds — still nowhere near the sub-500ms a well-engineered WebSocket push notification achieves.&lt;/p&gt;

&lt;p&gt;This is why the alert and the video stream are separate problems requiring entirely separate solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Part 4: The Strategy — Architecting a Push-Only CDN
&lt;/h3&gt;

&lt;p&gt;To compete with broadcast television, we must stop treating the internet as a request-response system and start treating it as a real-time pipe. This requires a CDN architecture designed specifically for volatile, time-critical events — not for caching static assets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A. Moving the “Brain” to the Edge&lt;/strong&gt;&lt;br&gt;
The classical CDN model uses edge servers to cache and serve files. For real-time event delivery, we go further: we move the fan-out logic itself to the edge.&lt;/p&gt;

&lt;p&gt;Instead of a central hub in one data center attempting to push to 50 million users — a process that would take seconds and buckle under the load — we distribute the work. A single high-priority event payload goes out to 500 regional edge nodes distributed globally. Each edge node maintains persistent WebSocket connections with users in its geographic vicinity. When the node receives the event, it fans out locally: the London node alerts London users, the São Paulo node alerts São Paulo users, in parallel.&lt;/p&gt;

&lt;p&gt;We have converted one global, slow, sequential task into hundreds of small, parallel, fast tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B. Sharding by Interest&lt;/strong&gt;&lt;br&gt;
Even at the edge level, you cannot run a for loop over millions of connections. The solution is interest-based sharding: partitioning subscribers by a logical grouping — Team ID, League ID, Match ID — and pre-assigning dedicated worker processes to each shard.&lt;/p&gt;

&lt;p&gt;When a goal is scored by Arsenal, the system doesn’t wake up 50 million connections. It triggers the dedicated worker cluster already assigned to the Arsenal interest shard. Users who follow Liverpool, Barcelona, or Bayern Munich are completely unaffected. The event triggers only the exact set of processes needed.&lt;/p&gt;

&lt;p&gt;This turns one massive, blocking task into thousands of tiny, independent, parallel tasks — each fast enough to complete in milliseconds.&lt;/p&gt;
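
&lt;p&gt;The routing idea can be sketched in a few lines. Shard names and the callback shape are illustrative; in production each entry would be a persistent WebSocket connection on an edge node.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import defaultdict

shards = defaultdict(list)   # "team:ARS" -&gt; subscriber callbacks

def subscribe(team, callback):
    shards["team:" + team].append(callback)

def publish(team, event):
    """Fan out to ONE shard; every other shard stays asleep."""
    delivered = 0
    for push in shards["team:" + team]:
        push(event)          # in production: a WebSocket send
        delivered += 1
    return delivered

inbox = []
subscribe("ARS", inbox.append)
subscribe("ARS", inbox.append)
subscribe("LIV", inbox.append)

n = publish("ARS", "GOAL! Arsenal 1-0")
print(n, len(inbox))   # 2 2  -- only the Arsenal shard was woken
&lt;/code&gt;&lt;/pre&gt;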

&lt;p&gt;C. Pre-Warming the “Last Mile”&lt;br&gt;
One of the most underappreciated optimizations targets the sleep-state problem. When a mobile operating system puts your app’s network radio to sleep to preserve battery, a wake-up signal must precede the actual data, adding hundreds of milliseconds at the worst possible moment.&lt;br&gt;
The solution is predictive pre-warming: using match state data to anticipate high-probability goal moments. When the tracking system detects the ball has entered the attacking “final third” of the pitch, it sends a silent, low-priority signal to wake the app’s radio — before the goal happens. By the time the ball hits the net and the confirmation fires, the app is already awake and the “hot path” is open.&lt;br&gt;
This is a case where knowing the context of an event allows you to reduce the delivery cost of that event before it occurs.&lt;/p&gt;
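&lt;p&gt;The trigger logic is essentially a small, stateful filter over ball-position events. The zone names and the dictionary payload below are assumptions for illustration:&lt;/p&gt;

```python
# Hypothetical pitch zones that signal a high goal probability.
PREWARM_ZONES = {"final_third", "penalty_box"}

def on_ball_position(zone, sent_signals):
    """Emit a silent wake-up push the first time the ball enters a
    high-probability zone, so device radios are awake before any goal."""
    if zone in PREWARM_ZONES and zone not in sent_signals:
        sent_signals.add(zone)
        return {"type": "silent_prewarm", "zone": zone}  # low-priority push
    return None  # quiet zone, or this zone already triggered a wake-up

sent = set()
signals = [on_ball_position(z, sent)
           for z in ["midfield", "final_third", "final_third", "penalty_box"]]
```

&lt;p&gt;Deduplicating per zone matters: repeatedly waking radios would burn the battery budget the OS was trying to protect in the first place.&lt;/p&gt;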

&lt;h3&gt;
  
  
  Part 5: The Strategist’s Dilemma — Speed vs. Truth
&lt;/h3&gt;

&lt;p&gt;Every architect who works seriously on this problem eventually hits a wall: Do I want to be First, or do I want to be Right?&lt;br&gt;
These are not the same thing, and the system cannot always guarantee both simultaneously.&lt;br&gt;
The Ghost Goal Scenario&lt;br&gt;
A striker smashes the ball into the net. The ball-tracking system confirms the ball crossed the line. The fan-out fires. Fifty million notifications are delivered. Three seconds later, the linesman’s flag is raised — offside. The goal is disallowed.&lt;br&gt;
You have just told 50 million people something that is no longer true.&lt;br&gt;
The instinct of a careful developer is to wait for full referee confirmation before sending any notification, ensuring data integrity. The instinct of a senior strategist is more nuanced, and more uncomfortable:&lt;br&gt;
Send it now.&lt;br&gt;
Being “First but temporarily Wrong” is a fixable condition — you send a correction, you add a VAR pending state to the notification. The correction arrives within 60–90 seconds. The user experience is imperfect but recoverable.&lt;br&gt;
Being “Correct but Second” is an unrecoverable condition. Your app is irrelevant. The user has already gotten the information from a faster source and their trust in your platform as a live companion has been permanently degraded.&lt;br&gt;
This is a conscious, deliberate architectural choice: we choose Availability over Strict Consistency. We accept temporary incorrectness as a trade-off for guaranteed speed. This is not laziness. It is strategy.&lt;/p&gt;
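&lt;p&gt;The send-now, correct-later flow is two separate pushes rather than one gated push. The field names and status strings below are illustrative, not a real API:&lt;/p&gt;

```python
def notify_goal(push, event):
    # Send immediately, flagged as provisional: first, possibly wrong.
    push({"type": "goal", "team": event["team"], "status": "VAR_PENDING"})

def notify_ruling(push, event, allowed):
    # The correction is a follow-up, not a prerequisite for sending.
    status = "CONFIRMED" if allowed else "DISALLOWED_OFFSIDE"
    push({"type": "goal_update", "team": event["team"], "status": status})

delivered = []
event = {"team": "ARS"}
notify_goal(delivered.append, event)           # fires at ball-crossing time
notify_ruling(delivered.append, event, False)  # fires ~60-90s later
```

&lt;p&gt;Note that the first push carries its own uncertainty marker, so the user is never actually lied to — they are told the truth as it stands at that moment.&lt;/p&gt;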

&lt;h3&gt;
  
  
  Technical Appendix: The Engineer’s Toolkit
&lt;/h3&gt;

&lt;p&gt;For those who want to look under the hood, here is the specific technology stack required to win the Neighbor Race — and the reasoning behind each choice.&lt;br&gt;
UDP / QUIC (HTTP/3) — Ditching the Handshake&lt;br&gt;
Traditional TCP requires a three-way handshake (plus further round trips when TLS is layered on top) before any application data can flow. In a world where the goal is already old news by the time a retransmission is requested, this is unacceptable overhead.&lt;br&gt;
QUIC (the transport layer underlying HTTP/3) operates over UDP and introduces two critical advantages: 0-RTT connection resumption (if you’ve connected before, the next connection can begin sending data immediately without a handshake) and stream multiplexing without head-of-line blocking (a lost packet doesn’t stall unrelated data streams).&lt;br&gt;
For a lost goal notification packet: don’t retransmit. Move to the next event. The goal is already old.&lt;br&gt;
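&lt;/p&gt;

&lt;p&gt;QUIC itself needs a third-party library, but the fire-and-forget idea can be shown in miniature with plain UDP sockets from the standard library: no handshake, no acknowledgment, no retransmission.&lt;/p&gt;

```python
import socket

# A receiver standing in for a client, bound to an ephemeral local port.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(2)
addr = rx.getsockname()

# Fire-and-forget: sendto() returns immediately with no delivery
# guarantee. If this datagram is lost, we simply move on to the
# next event rather than retransmitting stale news.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"GOAL:ARS", addr)

data, _ = rx.recvfrom(1024)
tx.close()
rx.close()
```

&lt;p&gt;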
WebSockets — Keeping the Path Warm&lt;br&gt;
A conventional HTTP request is opened, fulfilled, and closed. For every new event, a new connection must be established — TLS handshake, session verification, routing — all overhead that burns milliseconds you can’t afford.&lt;br&gt;
WebSockets maintain a persistent, bidirectional, full-duplex connection between the client and the server. The “hot path” is always open. When a goal is scored, the event travels down a pipe that’s already warm, already authenticated, already routed. You skip every connection establishment cost at the exact moment when those costs matter most.&lt;br&gt;
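&lt;/p&gt;

&lt;p&gt;The saving is easy to see with a toy latency model. All the millisecond figures below are illustrative assumptions, not measurements:&lt;/p&gt;

```python
# Illustrative, made-up per-step costs in milliseconds.
TLS_HANDSHAKE_MS = 80   # TCP + TLS setup for a new connection
AUTH_ROUTE_MS = 20      # session verification and routing
SEND_MS = 5             # pushing the payload itself

def http_per_event(n_events):
    # Every event pays the full connection establishment cost.
    return n_events * (TLS_HANDSHAKE_MS + AUTH_ROUTE_MS + SEND_MS)

def websocket(n_events):
    # Setup is paid once; each event rides the already-warm pipe.
    return TLS_HANDSHAKE_MS + AUTH_ROUTE_MS + n_events * SEND_MS

saved = http_per_event(10) - websocket(10)
```

&lt;p&gt;Under these toy numbers, ten events cost 1050ms of cumulative overhead over per-request HTTP versus 150ms over one persistent socket — and the gap widens with every additional event.&lt;/p&gt;

&lt;p&gt;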
Edge Workers (WebAssembly) — Zero Distance Between Decision and Delivery&lt;br&gt;
Running fan-out logic in a central data center means every notification must travel the physical distance from that data center to each user. A user in Jakarta receiving a notification from a server in Virginia adds 150–200ms of raw propagation delay, before any processing overhead.&lt;br&gt;
Edge Workers — Cloudflare Workers, AWS Lambda@Edge, Fastly Compute@Edge — execute fan-out logic in data centers that are physically close to end users. The decision (“broadcast this event”) and the delivery (“push to these WebSockets”) happen within the same facility. The propagation distance collapses to near-zero.&lt;br&gt;
Pre-Warming — Anticipatory State Management&lt;br&gt;
As described in the strategy section: use match-state context to pre-position the system before the event occurs. When the ball enters the final third, the edge worker issues a silent priority signal. When the ball enters the penalty box, connection pools are expanded. When the shot is detected, the fan-out queue is pre-staged.&lt;br&gt;
By the time confirmation arrives, the system is not reacting. It is completing a sequence it already began.&lt;/p&gt;

&lt;p&gt;The Stack at a Glance&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Key Advantage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;QUIC / HTTP/3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transport layer&lt;/td&gt;
&lt;td&gt;0-RTT resumption, no head-of-line blocking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WebSockets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Persistent delivery channel&lt;/td&gt;
&lt;td&gt;No per-event connection overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edge Workers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributed fan-out compute&lt;/td&gt;
&lt;td&gt;Eliminates propagation delay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interest Sharding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Subscriber partitioning&lt;/td&gt;
&lt;td&gt;Converts O(n) to O(shard size)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pre-Warming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Radio state management&lt;/td&gt;
&lt;td&gt;Eliminates last-mile wake-up delay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CMAF / Low-Lat HLS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Video stream delivery&lt;/td&gt;
&lt;td&gt;Reduces video buffering delay (but not alert latency)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Final Thoughts: Designing for the Physics of Information
&lt;/h3&gt;

&lt;p&gt;When you move from writing features to defining strategy, you stop asking “How do I implement this?” and start asking “What are the physical constraints of this problem?”&lt;br&gt;
The five-second spoiler gap is not a bug waiting to be fixed in the next sprint. It is the inevitable consequence of using a personalized, unicast network to solve a broadcast problem — and then failing to compensate architecturally for that mismatch.&lt;br&gt;
The engineers who close the gap don’t do it by writing faster code. They do it by redesigning the shape of the problem: distributing the fan-out, moving the brain to the edge, sharding by interest, and treating confirmation as a follow-up rather than a prerequisite.&lt;br&gt;
The peak moment — that split second when fifty million people hold their breath — is the most honest stress test an architecture will ever face. You cannot fake your way through it with clever caching. You have to build for it, deliberately and in advance.&lt;br&gt;
That is what separates a developer who ships features from an architect who designs systems.&lt;/p&gt;

&lt;p&gt;Found this useful? Share it with a developer who has ever said “it’s just a latency issue.” It’s never just a latency issue.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>networking</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
