<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sachin Tolay</title>
    <description>The latest articles on Forem by Sachin Tolay (@sachin_tolay_052a7e539e57).</description>
    <link>https://forem.com/sachin_tolay_052a7e539e57</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3254526%2F7febbaf7-bdf2-45d4-a54b-f3a0a99a42b8.jpeg</url>
      <title>Forem: Sachin Tolay</title>
      <link>https://forem.com/sachin_tolay_052a7e539e57</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sachin_tolay_052a7e539e57"/>
    <language>en</language>
    <item>
      <title>Blocking vs Non-blocking vs Asynchronous I/O</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Wed, 30 Jul 2025 04:57:36 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/blocking-vs-non-blocking-vs-asynchronous-io-3ei4</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/blocking-vs-non-blocking-vs-asynchronous-io-3ei4</guid>
      <description>&lt;p&gt;When a program performs I/O → like reading from a file or socket → two key questions arise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the program stop and wait for the data, or continue running? (&lt;strong&gt;Blocking vs Non-blocking IO&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;Does the program keep checking for the result, or get notified when it’s done? (&lt;strong&gt;Synchronous vs Asynchronous IO&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are orthogonal concepts, meaning they can be mixed in different combinations. Each combination comes with trade-offs in &lt;strong&gt;performance, complexity, and responsiveness&lt;/strong&gt;. In this article, we'll break down these models to understand how I/O really works under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blocking vs Non-blocking I/O Intuition
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Question to ask&lt;/strong&gt; → After placing the coffee order, do you stand there waiting until it’s ready, or do you walk away and do other things in the meantime?&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Blocking:&lt;/strong&gt; You stand at the coffee counter and wait until your coffee is ready before leaving.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-blocking:&lt;/strong&gt; You place your order and then walk away; if the coffee isn’t ready yet, you don’t wait → you might come back later to check.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Synchronous vs Asynchronous I/O Intuition
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Question to ask&lt;/strong&gt; → After placing the coffee order, do you keep checking if it’s ready, or do they notify you when it’s done?&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Synchronous:&lt;/strong&gt; You keep walking back to the counter every few minutes to ask, “Is my coffee ready yet?”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronous:&lt;/strong&gt; You leave and go about your day → when your coffee is ready, they text you so you know to come back.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; - It’s important to understand that &lt;strong&gt;asynchronous&lt;/strong&gt; and &lt;strong&gt;non-blocking&lt;/strong&gt; are related but different concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-blocking I/O is about &lt;strong&gt;waiting vs not waiting&lt;/strong&gt; → whether the program waits (blocks) for the operation to complete or the call returns immediately if data isn’t ready.&lt;/li&gt;
&lt;li&gt;Asynchronous I/O is about &lt;strong&gt;who drives the control flow&lt;/strong&gt; → whether the program itself keeps checking (synchronous) or the system notifies the program when the operation completes (asynchronous).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Blocking I/O Implementation - Using read()
&lt;/h2&gt;

&lt;p&gt;The call waits until completion before returning, as shown in the diagram below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Blocking read
int n = read(fd, buffer, size); // Blocks until data is ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2un1x1gj51shm2czc6ss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2un1x1gj51shm2czc6ss.png" alt="Blocking read()" width="769" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Non-blocking I/O Synchronous Implementation 1 – Using read()
&lt;/h2&gt;

&lt;p&gt;Your program calls &lt;strong&gt;read()&lt;/strong&gt; again and again in a loop. If there’s no data, &lt;strong&gt;read()&lt;/strong&gt; returns -1 with errno set to EAGAIN. This wastes CPU cycles, because each check issues a read() system call. But between checks, your program can do other things (&lt;strong&gt;non-blocking&lt;/strong&gt;). Your program drives the control flow → it decides when to check for data availability (&lt;strong&gt;synchronous&lt;/strong&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fcntl(fd, F_SETFL, O_NONBLOCK);  // Set fd to non-blocking

while (1) {
    int n = read(fd, buffer, sizeof(buffer));
    if (n &amp;gt; 0) {
        // Got data
        handle_data(buffer, n);
        break;
    } else if (n == -1 &amp;amp;&amp;amp; (errno == EAGAIN || errno == EWOULDBLOCK)) {
        // No data, do something else
        do_other_work();
    } else {
        // Some other error
        break;
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Non-blocking I/O Synchronous Implementation 2 – Using select()/poll()
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;select()&lt;/strong&gt; is used to multiplex a set of file descriptors → allowing your program to wait efficiently until &lt;strong&gt;any one&lt;/strong&gt; of them is ready for I/O. Unlike repeatedly calling non-blocking read(), which issues a system call each time and wastes CPU cycles when no data is available, select() makes a single blocking system call that &lt;strong&gt;sleeps&lt;/strong&gt; (consuming no CPU) until at least one descriptor is ready. After select() returns, checking which descriptors are ready using &lt;strong&gt;FD_ISSET&lt;/strong&gt; is a fast &lt;strong&gt;user-space&lt;/strong&gt; operation that doesn’t incur extra system calls, making the whole process much more efficient.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while (1) {
    fd_set fds;
    FD_ZERO(&amp;amp;fds);
    FD_SET(fd1, &amp;amp;fds);
    FD_SET(fd2, &amp;amp;fds);
    int max_fd = (fd1 &amp;gt; fd2) ? fd1 : fd2;

    // Block until one of the FDs is ready to read
    if (select(max_fd + 1, &amp;amp;fds, NULL, NULL, NULL) &amp;gt; 0) {
        if (FD_ISSET(fd1, &amp;amp;fds)) {
            // fd1 has data
            read(fd1, buffer1, sizeof(buffer1));
        }
        if (FD_ISSET(fd2, &amp;amp;fds)) {
            // fd2 has data
            read(fd2, buffer2, sizeof(buffer2));
        }
    }
    // You can also perform other logic here
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary So Far
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frimjz3jgh7dx8aq4doso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frimjz3jgh7dx8aq4doso.png" alt="Summary So Far" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Non-blocking Asynchronous I/O Implementation — Using OS-Provided Async APIs (e.g., io_uring, Windows IOCP, Linux AIO)
&lt;/h2&gt;

&lt;p&gt;In asynchronous I/O, the program initiates the I/O operation and &lt;strong&gt;does not&lt;/strong&gt; check or wait for the result. Instead, the OS &lt;strong&gt;notifies&lt;/strong&gt; the program (via callbacks, signals, or event queues) when the operation completes, handing control back to the program only when data is ready. This allows maximum concurrency and responsiveness, as your program never blocks or polls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Pseudocode: illustrative async API (real interfaces include io_uring, POSIX AIO, IOCP)
// Initiate async read operation
async_read(fd, buffer, size, callback_function);

// Meanwhile, do other work here

// callback_function is called by OS when data is ready
void callback_function(int result, char* buffer) {
    if (result &amp;gt; 0) {
        handle_data(buffer, result);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6lidowy77vabsmkm3jd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6lidowy77vabsmkm3jd.png" alt="Async IO" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Summary
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosx1z6k8u914d3q6t380.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosx1z6k8u914d3q6t380.png" alt="Final Summary" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>tutorial</category>
      <category>learning</category>
      <category>linux</category>
    </item>
    <item>
      <title>Traditional IO vs mmap vs Direct IO: How Disk Access Really Works</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Sun, 20 Jul 2025 19:37:49 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/traditional-io-vs-mmap-vs-direct-io-how-disk-access-really-works-1h4l</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/traditional-io-vs-mmap-vs-direct-io-how-disk-access-really-works-1h4l</guid>
      <description>&lt;p&gt;In our earlier deep dive into &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/understanding-direct-memory-access-dma-how-data-moves-efficiently-between-storage-and-memory-13nn"&gt;Direct Memory Access (DMA)&lt;/a&gt;, we explored how data can bypass the CPU to move efficiently between storage and memory.&lt;br&gt;
In this article, we will break down and compare three major approaches to disk access:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traditional (Buffered) I/O&lt;/li&gt;
&lt;li&gt;Memory-Mapped Files&lt;/li&gt;
&lt;li&gt;Direct I/O&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Traditional IO (Buffered IO)
&lt;/h2&gt;

&lt;p&gt;When you run something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int fd = open("data.txt", O_RDONLY);
read(fd, buf, 4096); // read 4096 bytes from fd into buf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lc9dvo84sb816n5gyrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lc9dvo84sb816n5gyrh.png" alt="Traditional IO" width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Page Cache Lookup&lt;/strong&gt; → The OS first checks its &lt;strong&gt;page cache&lt;/strong&gt;, a large shared memory pool used to avoid redundant disk access. This cache holds recently accessed file data from &lt;strong&gt;all&lt;/strong&gt; processes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read-Ahead&lt;/strong&gt; → If the OS needs to fetch data from disk, it doesn’t just fetch the 4 KB block you asked for. It reads ahead, often 32 KB or more, anticipating sequential access patterns. We will come back to this read-ahead behavior later in the article as a drawback of Traditional IO (and mmap too).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Double Copy&lt;/strong&gt; → Data is first loaded into the OS-managed page cache via DMA, then copied again into your application buffer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Call Overhead&lt;/strong&gt; → Every read() triggers a system call, which is costly → especially during sequential reads when the data is already in the cache.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Memory-Mapped Files (mmap)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;mmap&lt;/strong&gt; offers a powerful way to access files: instead of copying file data into a user buffer via read(), the OS maps the file directly into the process’s virtual memory space.&lt;/p&gt;

&lt;p&gt;When you do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int fd = open("data.txt", O_RDONLY);
char *mapped = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
char c = mapped[0]; // Triggers a page fault on first access
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1: Setting up the Mapping
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkwgr44bmfis0etb5ww0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkwgr44bmfis0etb5ww0.png" alt="mmap system call" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re telling the OS → Map this file into my memory space. I’ll access it like memory, not like a file.&lt;/li&gt;
&lt;li&gt;No data is fetched from disk yet; the pages are simply marked as &lt;strong&gt;not present&lt;/strong&gt;, so any access triggers a &lt;strong&gt;page fault&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: First Access Per Page → Page Not in Cache → Major Page Fault
&lt;/h3&gt;

&lt;p&gt;When the app accesses a file-backed page for the first time and it’s &lt;strong&gt;not&lt;/strong&gt; in the page cache, a &lt;strong&gt;major page fault&lt;/strong&gt; occurs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvyybzqwbmqq4qrs48jl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvyybzqwbmqq4qrs48jl.png" alt="First Access, Page Not In Cache" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This happens only for the &lt;strong&gt;first&lt;/strong&gt; access to &lt;strong&gt;each&lt;/strong&gt; page.&lt;/li&gt;
&lt;li&gt;Page fault is handled transparently → the app just sees a memory access.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: First Access → Page Cached But Not Mapped → Minor Page Fault
&lt;/h3&gt;

&lt;p&gt;If another process or earlier access already loaded the page into the &lt;strong&gt;page cache&lt;/strong&gt;, but this process hasn’t mapped it yet, a &lt;strong&gt;minor page fault&lt;/strong&gt; occurs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1wlmyji7ljohfvd7l4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1wlmyji7ljohfvd7l4g.png" alt="First Access, Page In Cache" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Subsequent Access → Page Already Mapped and Cached → No Fault
&lt;/h3&gt;

&lt;p&gt;If the page is &lt;strong&gt;already&lt;/strong&gt; mapped in the page table and the corresponding data is &lt;strong&gt;cached&lt;/strong&gt; in RAM, then the CPU can directly read it through virtual memory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nr1jh5y6q6v9e2n0k8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nr1jh5y6q6v9e2n0k8x.png" alt="Page Mapped and Cached" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Direct I/O
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Direct I/O&lt;/strong&gt; transfers data directly between the storage device and the application buffer, bypassing the OS page cache entirely. This avoids the double copy of data, reducing CPU overhead and preventing page cache pollution due to read-aheads, but requires the application to carefully manage aligned buffers.&lt;/p&gt;

&lt;p&gt;When you do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int fd = open("data.txt", O_RDONLY | O_DIRECT);
void* buf;
posix_memalign(&amp;amp;buf, 4096, 4096);  // Allocate 4KB aligned buffer
read(fd, buf, 4096);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The buffer needs to start at a memory address that’s a multiple of 4 KB (or another fixed size). This is called &lt;strong&gt;alignment&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If the buffer isn’t aligned properly, the read operation will usually fail or the system might fall back to traditional I/O.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fem4dl96fa84ir2xd15y5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fem4dl96fa84ir2xd15y5.png" alt="Direct IO" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The application buffer is mapped in virtual memory as usual.&lt;/li&gt;
&lt;li&gt;The OS validates access and instructs the DMA controller to transfer data &lt;strong&gt;directly&lt;/strong&gt; to the buffer’s physical memory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page cache is bypassed completely.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxrzx8oy2rf47f75r1ev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxrzx8oy2rf47f75r1ev.png" alt="Summary And Comparison" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>learning</category>
      <category>programming</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Understanding Direct Memory Access (DMA): How Data Moves Efficiently Between Storage and Memory</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Sun, 13 Jul 2025 00:50:13 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/understanding-direct-memory-access-dma-how-data-moves-efficiently-between-storage-and-memory-13nn</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/understanding-direct-memory-access-dma-how-data-moves-efficiently-between-storage-and-memory-13nn</guid>
      <description>&lt;p&gt;Transferring data between &lt;strong&gt;Storage&lt;/strong&gt; and &lt;strong&gt;Memory&lt;/strong&gt; can slow down a computer if the CPU has to manage every step. &lt;strong&gt;Direct Memory Access (DMA)&lt;/strong&gt; is a mechanism that lets a dedicated controller take over this job, freeing up the CPU and making data movement faster and more efficient.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explain what DMA is, why it’s important, and how it enables efficient data movement in computer systems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Terms like MMU and memory controller are mentioned here in a simplified way. For a deeper understanding of these components and their roles, please refer to my detailed articles on &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/memory-access-demystified-how-virtual-memory-caches-and-dram-impact-performance-5ec6"&gt;Virtual Memory&lt;/a&gt; and &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/understanding-dram-internals-how-channels-banks-and-dram-access-patterns-impact-performance-3fef"&gt;Memory Controllers&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DMA?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvmuxd5ccr3yzxei6fnd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvmuxd5ccr3yzxei6fnd.png" alt="Programmed I/O" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without DMA, the CPU is responsible for every step of the transfer, reading and writing data one byte or word at a time. This method is known as &lt;strong&gt;Programmed I/O&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As shown in the diagram, the CPU stays busy throughout the entire process, slowing transfers and preventing it from handling other tasks. By offloading this work to the DMA controller, the system frees up the CPU and achieves faster, more efficient data movement.&lt;/p&gt;

&lt;h2&gt;
  
  
  How DMA Works
&lt;/h2&gt;

&lt;p&gt;DMA uses a special hardware controller that works on its own, without needing the CPU. During data transfers, it takes control of the memory bus to move data directly between devices and RAM.&lt;/p&gt;

&lt;p&gt;The process involves three key phases:&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup Phase
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdp8paw10wznoq6f66orx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdp8paw10wznoq6f66orx.png" alt="DMA setup phase" width="800" height="275"&gt;&lt;/a&gt;&lt;br&gt;
The CPU configures the DMA controller with three critical pieces of information:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Source Address&lt;/strong&gt; → where the data is coming from.

&lt;ul&gt;
&lt;li&gt;For &lt;strong&gt;read()&lt;/strong&gt; operations, the source is typically the storage device (like a disk or SSD).&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;write()&lt;/strong&gt; operations, the source is memory, where the program’s data is prepared.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination Address&lt;/strong&gt; → where the data should go.

&lt;ul&gt;
&lt;li&gt;For &lt;strong&gt;read()&lt;/strong&gt;, the destination is memory (so programs can use the data).&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;write()&lt;/strong&gt;, the destination is the storage device.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer Size&lt;/strong&gt; → how many bytes to move.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The CPU programs these details into the DMA controller and then steps aside.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transfer Phase
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab54fje56skvtfr5jjpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab54fje56skvtfr5jjpg.png" alt="DMA transfer phase" width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
The DMA controller handles the data transfer without CPU involvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Completion Phase
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzxoq5e1ds26wtm8a5fq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzxoq5e1ds26wtm8a5fq.png" alt="DMA completion phase" width="800" height="351"&gt;&lt;/a&gt;&lt;br&gt;
The DMA controller sends an interrupt to notify the CPU when the transfer completes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overall Read Flow — read() system call
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvvv3r4a1cm5sa4dtitn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvvv3r4a1cm5sa4dtitn.png" alt="Overall read flow" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The application issues a read request via a system call.&lt;/li&gt;
&lt;li&gt;The CPU uses the MMU to translate the virtual address of the page cache, which resides in the kernel’s memory space.&lt;/li&gt;
&lt;li&gt;The kernel checks the page cache in memory to see if the data is already available.&lt;/li&gt;
&lt;li&gt;If the data is cached, it is immediately returned to the application.&lt;/li&gt;
&lt;li&gt;If not cached, the CPU configures the DMA controller to transfer data from the storage device.&lt;/li&gt;
&lt;li&gt;The DMA commands the storage device to send data.&lt;/li&gt;
&lt;li&gt;The DMA writes the incoming data into memory through the memory controller.&lt;/li&gt;
&lt;li&gt;When the transfer completes, the DMA interrupts the CPU.&lt;/li&gt;
&lt;li&gt;The CPU then returns the data to the application.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Overall Write Flow — write() system call
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2e1u967qq3pv26vgltx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2e1u967qq3pv26vgltx.png" alt="Overall write flow" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The application sends data to be written via a system call.&lt;/li&gt;
&lt;li&gt;The CPU uses the MMU to translate the virtual address of the page cache, which lies in the kernel’s memory space.&lt;/li&gt;
&lt;li&gt;The data is first buffered in the page cache in memory by the kernel.&lt;/li&gt;
&lt;li&gt;The CPU sets up the DMA controller to move the buffered data from memory to the storage device.&lt;/li&gt;
&lt;li&gt;The DMA reads the buffered data from memory through the memory controller.&lt;/li&gt;
&lt;li&gt;The DMA streams the data to the storage device.&lt;/li&gt;
&lt;li&gt;After the write finishes, the DMA sends an interrupt to the CPU.&lt;/li&gt;
&lt;li&gt;The CPU acknowledges the write completion back to the application.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If you found this explanation helpful and want to stay updated with more clear, detailed guides, follow me! I regularly share deep dives and easy-to-understand articles on a variety of tech topics.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>softwareengineering</category>
      <category>architecture</category>
      <category>dma</category>
    </item>
    <item>
      <title>How HDDs and SSDs Store Data: The Block Storage Model</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Wed, 09 Jul 2025 22:19:20 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/how-hdds-and-ssds-store-data-the-block-storage-model-4m9i</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/how-hdds-and-ssds-store-data-the-block-storage-model-4m9i</guid>
      <description>&lt;p&gt;When you open a file in your program, it seems like you can read or change any byte you want. But in reality, your storage device doesn’t work with single bytes. Instead, &lt;strong&gt;HDDs&lt;/strong&gt; and &lt;strong&gt;SSDs&lt;/strong&gt; read and write data in larger chunks, called &lt;strong&gt;blocks&lt;/strong&gt; or &lt;strong&gt;pages&lt;/strong&gt;, which are usually a few KBs in size.&lt;/p&gt;

&lt;p&gt;This gap between what software wants (&lt;strong&gt;small, random access&lt;/strong&gt;) and how storage hardware works (&lt;strong&gt;large, fixed-size chunks&lt;/strong&gt;) is one of the most important challenges in computer systems.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The two fundamental models of data access → &lt;strong&gt;block-addressable&lt;/strong&gt; and &lt;strong&gt;byte-addressable&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Why storage is not &lt;strong&gt;byte-addressable&lt;/strong&gt; like RAM.&lt;/li&gt;
&lt;li&gt;How HDDs and SSDs store and access data.&lt;/li&gt;
&lt;li&gt;How the block model shapes performance and design.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Two Fundamental Models: Byte-Addressable vs. Block-Addressable
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Byte-Addressable Model
&lt;/h3&gt;

&lt;p&gt;This is how &lt;strong&gt;RAM (DRAM)&lt;/strong&gt; works.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory is organized as a sequence of individual bytes, each with its own address.&lt;/li&gt;
&lt;li&gt;The CPU can read or write any &lt;strong&gt;single byte&lt;/strong&gt; directly and instantly.&lt;/li&gt;
&lt;li&gt;Latency is extremely low (nanoseconds), making random access cheap.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;char value = buffer[100];   // Read exactly 1 byte
buffer[200] = 'A';          // Write exactly 1 byte
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This fine-grained access makes it possible for RAM to support rich data structures like linked lists, trees, and pointer-chasing algorithms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Block-Addressable Model
&lt;/h3&gt;

&lt;p&gt;This is how &lt;strong&gt;storage devices&lt;/strong&gt; (HDDs and SSDs) work.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage is divided into fixed-size chunks called &lt;strong&gt;blocks&lt;/strong&gt; (in HDDs) or &lt;strong&gt;pages&lt;/strong&gt; (in SSDs).&lt;/li&gt;
&lt;li&gt;Typical block/page size: &lt;strong&gt;4 KB&lt;/strong&gt; or larger.&lt;/li&gt;
&lt;li&gt;You cannot read or write a single byte on its own.&lt;/li&gt;
&lt;li&gt;Even if you want just 1 byte, the device must read or write the &lt;strong&gt;entire block&lt;/strong&gt; containing it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reads and writes operate on these blocks or pages as the atomic unit.&lt;/p&gt;
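
&lt;p&gt;The cost of block addressing is easy to see with a little arithmetic. A minimal sketch (assuming a 4 KB block size):&lt;/p&gt;

```python
BLOCK_SIZE = 4096  # assumed 4 KB block/page size

def block_for_offset(offset: int) -> int:
    """Index of the block that contains the given byte offset."""
    return offset // BLOCK_SIZE

def blocks_for_range(offset: int, length: int) -> int:
    """Number of whole blocks the device must touch to serve a read."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return last - first + 1

# Reading just 1 byte at offset 5000 still costs one full 4 KB block:
print(blocks_for_range(5000, 1))     # 1
# An 8 KB read that straddles block boundaries touches 3 blocks:
print(blocks_for_range(2048, 8192))  # 3
```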

&lt;h3&gt;
  
  
  Why RAM and Storage Use Different Access Models
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;RAM (Byte-Addressable Memory):&lt;/strong&gt; RAM is like having a mini-fridge in your bedroom. You can grab exactly what you need (a water bottle, or even a single sip) whenever you want → instantly and with no extra effort. This lets the CPU quickly access tiny pieces of data (like single bytes) whenever it needs them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage (Block-Addressable Devices like HDDs/SSDs):&lt;/strong&gt; Storage is like going all the way to the kitchen. You wouldn’t walk there just to pick up one water bottle → it’s too slow and inefficient. Instead, you grab the water bottle plus some snacks or other items you might need soon.&lt;/p&gt;

&lt;p&gt;This means when your program requests data, the storage device reads or writes a whole block (the water bottle + snacks) at once, because making multiple trips for tiny data would be too slow and wear out the hardware faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  How HDDs Store And Access Data
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wpmk2guiqy4tfvm3a44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wpmk2guiqy4tfvm3a44.png" alt="HDD" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HDDs are &lt;strong&gt;electromechanical&lt;/strong&gt; devices that store data magnetically on spinning disks called &lt;strong&gt;platters&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Platters spin at thousands of RPM (e.g., 5,400, 7,200, 10,000, or 15,000 RPM).&lt;/li&gt;
&lt;li&gt;Each platter has concentric &lt;strong&gt;tracks&lt;/strong&gt; divided into &lt;strong&gt;sectors&lt;/strong&gt; (typically 512 bytes or 4 KB).&lt;/li&gt;
&lt;li&gt;Each surface of a platter has its own &lt;strong&gt;read/write head&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;All heads are mounted on a single &lt;strong&gt;actuator arm&lt;/strong&gt; that moves them in unison across the platters.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How Read Works
&lt;/h3&gt;

&lt;p&gt;When you read data from a hard drive, three steps are involved:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Seek Time: Moving the Arm&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The actuator arm moves the read/write heads to the correct track (cylinder).&lt;/li&gt;
&lt;li&gt;Typical time: &lt;strong&gt;4–10 ms&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Think of it like → Moving a record player needle to the right groove.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotational Latency: Waiting for the Right Sector&lt;/strong&gt;
Once the head is on the correct track, the disk must rotate so the exact sector spins under the head.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxo486xocofwkgt5sluz.png" alt="RPM vs Latency" width="671" height="285"&gt;

&lt;ul&gt;
&lt;li&gt;Think of it like → Waiting for the spinning wheel to bring your slice around to you.&lt;/li&gt;
&lt;li&gt;Typically half a rotation's worth of time on average (e.g., ~4.2 ms at 7,200 RPM).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Transfer&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Once aligned, data is read sector by sector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential reads&lt;/strong&gt; are much faster since the head stays on track.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
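
&lt;p&gt;The three steps above dominate random-access cost, and the first two can be estimated directly from a drive's specs. A quick sketch (the seek time and RPM values are illustrative):&lt;/p&gt;

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """On average the disk waits half a rotation for the target sector."""
    ms_per_rotation = 60_000 / rpm
    return ms_per_rotation / 2

def access_time_ms(seek_ms: float, rpm: int) -> float:
    """Seek + average rotational latency (transfer time ignored)."""
    return seek_ms + avg_rotational_latency_ms(rpm)

print(round(avg_rotational_latency_ms(7200), 2))  # ~4.17 ms
print(round(access_time_ms(8.0, 7200), 2))        # ~12.17 ms per random access
```

At roughly 12 ms per random access, a 7,200 RPM drive tops out at under 100 random reads per second → which is why sequential layout matters so much on HDDs.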

&lt;h3&gt;
  
  
  How Write Works
&lt;/h3&gt;

&lt;p&gt;Writing to an HDD follows the same physical steps as reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seek&lt;/strong&gt; to the correct track.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wait for rotation&lt;/strong&gt; to align the sector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer&lt;/strong&gt; data sector by sector.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Handling Changes/Edits in Files
&lt;/h4&gt;

&lt;p&gt;When you edit a file on an HDD, the operating system has to figure out where to put the new or changed data on the disk.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If it’s a small change&lt;/strong&gt; → like fixing a typo or tweaking a line → the OS can often just overwrite the existing spot directly. It’s quick and easy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you add more content&lt;/strong&gt; → like inserting whole paragraphs or lots of new data → the old space might not be big enough anymore. The OS then has to find free sectors somewhere else on the disk and write the new data there.&lt;/li&gt;
&lt;li&gt;After writing, the OS keeps track of where all the pieces of the file are so it can read them in the right order later.&lt;/li&gt;
&lt;li&gt;Over time, as you keep editing and adding, parts of the file can end up scattered in many places on the disk. This is called &lt;strong&gt;fragmentation&lt;/strong&gt;. It means the read/write head has to jump around more, which slows things down.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To keep everything fast and tidy, operating systems use techniques like &lt;strong&gt;buffering&lt;/strong&gt;, &lt;strong&gt;batching&lt;/strong&gt;, and &lt;strong&gt;defragmentation&lt;/strong&gt;. These help organize writes better and reduce unnecessary movement.&lt;/p&gt;

&lt;h2&gt;
  
  
  How SSDs Store and Access Data
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb22qfw5535qpx5zp3qqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb22qfw5535qpx5zp3qqy.png" alt="SSD" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
SSDs have no moving parts and store data in flash memory cells arranged into &lt;strong&gt;pages&lt;/strong&gt; and &lt;strong&gt;blocks&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pages are typically &lt;strong&gt;4–16 KB&lt;/strong&gt; each.&lt;/li&gt;
&lt;li&gt;Blocks contain many pages (e.g., &lt;strong&gt;256 pages&lt;/strong&gt; per block).&lt;/li&gt;
&lt;li&gt;All management is handled by the &lt;strong&gt;SSD controller&lt;/strong&gt; and its &lt;strong&gt;Flash Translation Layer (FTL)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How Read Works
&lt;/h3&gt;

&lt;p&gt;Reading data from an SSD is &lt;strong&gt;simple&lt;/strong&gt; and &lt;strong&gt;fast&lt;/strong&gt;, thanks to the lack of mechanical parts.&lt;/p&gt;

&lt;h4&gt;
  
  
  Page-Level Reads
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;SSDs store data in &lt;strong&gt;pages&lt;/strong&gt; (typically &lt;strong&gt;4–16 KB&lt;/strong&gt; each).&lt;/li&gt;
&lt;li&gt;Reads happen at the page level → you can’t read less than a page.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Example → Imagine your notebook.&lt;br&gt;
If you want a single note, you have to look at the whole page.&lt;br&gt;
You can’t magically see just one word without opening the page.&lt;br&gt;
Similarly, SSDs always read entire pages, even if your program only wants a few bytes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  No Moving Parts
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Unlike HDDs, SSDs have no mechanical parts like spinning platters or moving heads.&lt;/li&gt;
&lt;li&gt;There’s no seek time or rotational delay.&lt;/li&gt;
&lt;li&gt;Reads are extremely fast → typically &lt;strong&gt;tens to hundreds of microseconds&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Example → It’s like opening a notebook instantly to the right page, with no need to flip through slowly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Flash Translation Layer (FTL)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The SSD controller uses a &lt;strong&gt;Flash Translation Layer (FTL)&lt;/strong&gt; to keep track of where data is physically stored in NAND flash.&lt;/li&gt;
&lt;li&gt;When you request data, the FTL quickly maps your logical request to the right physical page and retrieves it.&lt;/li&gt;
&lt;li&gt;This mapping is completely invisible to the operating system and user.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Example → Think of having an index at the front of your notebook that tells you exactly which page to turn to for each topic.&lt;/p&gt;
&lt;/blockquote&gt;
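
&lt;p&gt;The FTL's job can be pictured as a lookup table from logical to physical pages. A toy sketch (real FTLs are far more sophisticated firmware structures; this dict is purely illustrative):&lt;/p&gt;

```python
class FTL:
    """Toy Flash Translation Layer: logical page -> physical page."""

    def __init__(self):
        self.mapping = {}    # logical page number -> physical page number
        self.next_free = 0   # naive free-page allocator

    def write(self, logical_page: int) -> int:
        """Out-of-place write: data always lands on a fresh physical page."""
        physical = self.next_free
        self.next_free += 1
        self.mapping[logical_page] = physical  # old physical page is now invalid
        return physical

    def read(self, logical_page: int) -> int:
        """Translate a logical page to wherever it currently lives."""
        return self.mapping[logical_page]

ftl = FTL()
ftl.write(10)        # first write goes to physical page 0
ftl.write(10)        # rewrite: new physical page, mapping updated
print(ftl.read(10))  # 1 — the second physical page, not the original
```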

&lt;h3&gt;
  
  
  How Write Works
&lt;/h3&gt;

&lt;p&gt;When writing data to an SSD, the process is a bit more complex than reading:&lt;/p&gt;

&lt;h4&gt;
  
  
  Out-of-Place Writes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Flash memory &lt;strong&gt;can’t overwrite&lt;/strong&gt; existing data in place.&lt;/li&gt;
&lt;li&gt;New data is always written to a &lt;strong&gt;free page&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The old page is marked &lt;strong&gt;invalid&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Example → Think of a notebook with 256 pages (like a flash block).&lt;br&gt;
If there are blank pages left, you can write on them immediately.&lt;br&gt;
This is how SSDs handle new data → they just use the next available free page without any extra work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Erase Before Write Requirement
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Once all pages in a block have been written (even if many are now invalid), they can’t just be overwritten.&lt;/li&gt;
&lt;li&gt;Flash requires erasing the entire block at once to make its pages writable again.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Example → Imagine you wrote with a pen in that notebook.&lt;br&gt;
If you want to change what’s on a page, you can’t just erase a single line.&lt;br&gt;
You’d have to rip out the entire sheet to get a fresh, blank page.&lt;br&gt;
Similarly, SSDs must erase the whole block to reuse its pages.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Garbage Collection
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The SSD’s controller periodically cleans up space by copying valid pages elsewhere and erasing blocks.&lt;/li&gt;
&lt;li&gt;This process consolidates free space and makes new pages available for writing.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Example:&lt;br&gt;
Out of 256 pages in a block, maybe 200 are invalid (old data you don’t need).&lt;br&gt;
56 pages are still valid.&lt;br&gt;
The SSD will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy the 56 valid pages to another clean block.&lt;/li&gt;
&lt;li&gt;Erase the original block completely. Now all 256 pages are blank and ready for new writes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
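
&lt;p&gt;That garbage-collection example can be sketched as a small simulation (page contents below are placeholders):&lt;/p&gt;

```python
PAGES_PER_BLOCK = 256  # per the example above

def garbage_collect(block: list, clean_block: list) -> int:
    """Copy valid pages into a clean block, then erase the original.
    Each entry is either page data or None (invalid). Returns pages moved."""
    valid = [p for p in block if p is not None]
    clean_block[:len(valid)] = valid       # relocate the live data
    block[:] = [None] * PAGES_PER_BLOCK    # erase: every page blank again
    return len(valid)

# 200 invalid pages + 56 still-valid pages, as in the example:
block = [None] * 200 + [f"data{i}" for i in range(56)]
clean = [None] * PAGES_PER_BLOCK
moved = garbage_collect(block, clean)
print(moved)              # 56 valid pages copied out
print(block.count(None))  # 256 — the whole block is erased and reusable
```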

&lt;h2&gt;
  
  
  How the Block Model Shapes Performance and Design
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzfjac9opt8f2q466i1h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzfjac9opt8f2q466i1h.png" alt="Summary" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Block-based storage prefers &lt;strong&gt;bigger&lt;/strong&gt;, &lt;strong&gt;aligned&lt;/strong&gt;, &lt;strong&gt;sequential&lt;/strong&gt; operations. Software is designed to take advantage of this by &lt;strong&gt;buffering&lt;/strong&gt;, &lt;strong&gt;batching&lt;/strong&gt;, and &lt;strong&gt;organizing&lt;/strong&gt; data to reduce costly small writes.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding OAuth 2.0 and OpenID Connect: A Step-by-Step Guide</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Thu, 03 Jul 2025 20:14:05 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/understanding-oauth-20-and-openid-connect-a-step-by-step-guide-4nf4</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/understanding-oauth-20-and-openid-connect-a-step-by-step-guide-4nf4</guid>
      <description>&lt;p&gt;If you’ve ever clicked “&lt;strong&gt;Sign in with Google&lt;/strong&gt;” or “&lt;strong&gt;Connect with Facebook&lt;/strong&gt;” on a website or app, you’ve interacted with technologies called &lt;strong&gt;OAuth&lt;/strong&gt; and &lt;strong&gt;OpenID Connect (OIDC)&lt;/strong&gt;. These two standards form the backbone of secure authentication and authorization on the web today.&lt;/p&gt;

&lt;p&gt;This guide will walk you through everything you need to know, in simple language and step-by-step explanations. By the end, you’ll have a solid grasp of both protocols, their roles, and why they matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentication and Authorization
&lt;/h2&gt;

&lt;p&gt;Before diving into OAuth and OIDC, it’s important to understand two key concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt; → process of verifying who you are. For example, logging in with your username and password proves your identity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorization&lt;/strong&gt; → process of determining what you’re allowed to do. For instance, once logged in, determining if you have permission to access certain files or perform specific actions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Limitations of Password-Based Logins (and Why OAuth/OIDC Exist)
&lt;/h2&gt;

&lt;p&gt;Most websites and apps still rely on the traditional method of logging in with a username and password. While simple, this approach has several serious drawbacks:&lt;/p&gt;

&lt;h3&gt;
  
  
  User Experience Challenges
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Remembering multiple passwords is difficult.&lt;/li&gt;
&lt;li&gt;Users face friction creating and managing separate accounts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security Threats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Phishing attacks trick users into handing over passwords.&lt;/li&gt;
&lt;li&gt;Weak or reused passwords put accounts at risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Storage and Management Burdens
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Storing passwords securely is challenging.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Single Sign-On and Federated Login to the Rescue
&lt;/h2&gt;

&lt;p&gt;As we’ve seen, managing multiple usernames and passwords is both inconvenient and insecure. &lt;strong&gt;Single Sign-On (SSO)&lt;/strong&gt; reduces password fatigue by letting you log in once to access multiple apps within the same organization.&lt;/p&gt;

&lt;p&gt;However, SSO typically works only inside one organization or domain. &lt;strong&gt;Federated Login&lt;/strong&gt; solves this by letting you use trusted providers like Google or Facebook to sign in across different websites. This builds cross-domain trust without needing separate passwords for every service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OAuth 2.0&lt;/strong&gt; provides the framework for handling secure access to user data without sharing passwords. &lt;strong&gt;OpenID Connect (OIDC)&lt;/strong&gt; builds on OAuth to add login and identity, making federated login possible across websites.&lt;/p&gt;

&lt;h2&gt;
  
  
  OAuth 2.0 Intuition
&lt;/h2&gt;

&lt;p&gt;Think of OAuth like getting a signed access pass from City Hall so you can pick up someone else’s property from a secure warehouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1q0o1h16jlru07st58v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1q0o1h16jlru07st58v.png" alt="OAuth Example" width="800" height="869"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your friend (&lt;strong&gt;Resource Owner&lt;/strong&gt;) wants You (&lt;strong&gt;Client&lt;/strong&gt;) to pick up their box from the Warehouse (&lt;strong&gt;Resource Server&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Your friend goes to City Hall (&lt;strong&gt;Authorization Server&lt;/strong&gt;). At City Hall:

&lt;ul&gt;
&lt;li&gt;They present their ID for identity verification (e.g., &lt;strong&gt;username/password&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;They explicitly state → “&lt;em&gt;I want to authorize this person to collect my box&lt;/em&gt;”.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;City Hall reviews and approves the request. They issue your friend an &lt;strong&gt;Authorization Letter&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Your friend gives You the authorization letter from City Hall, saying → “&lt;em&gt;Take this back to City Hall to get your official access pass&lt;/em&gt;”.&lt;/li&gt;
&lt;li&gt;You take the authorization letter back to City Hall:

&lt;ul&gt;
&lt;li&gt;City Hall verifies the authorization letter by confirming that it’s legitimate and hasn’t expired.&lt;/li&gt;
&lt;li&gt;City Hall then issues you a &lt;strong&gt;signed access pass&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;You take the &lt;strong&gt;signed access pass&lt;/strong&gt; to the Warehouse (&lt;strong&gt;Resource Server&lt;/strong&gt;):

&lt;ul&gt;
&lt;li&gt;The Warehouse examines the pass, verifies City Hall’s signature to ensure it’s authentic.&lt;/li&gt;
&lt;li&gt;It checks who authorized it (your friend), whom it’s allowed for (you), what it allows (picking up the box), and expiry date.&lt;/li&gt;
&lt;li&gt;Once all details are validated, the Warehouse gives you the box on behalf of your friend.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  OAuth 2.0: The Authorization Framework
&lt;/h2&gt;

&lt;p&gt;When a client app needs to act on a user’s behalf → like accessing their files, calendar, or other resources → it needs a way to request permission securely. OAuth 2.0 provides a standardized framework for this &lt;strong&gt;authorization&lt;/strong&gt;, allowing users to grant limited access to the apps without sharing their passwords with them.&lt;/p&gt;

&lt;p&gt;OAuth 2.0 focuses purely on &lt;strong&gt;what&lt;/strong&gt; an app is allowed to do, not &lt;strong&gt;who&lt;/strong&gt; the user is. That’s where &lt;strong&gt;OpenID Connect (OIDC)&lt;/strong&gt; comes in, adding &lt;strong&gt;authentication&lt;/strong&gt; on top of OAuth 2.0 to verify user identity.&lt;/p&gt;

&lt;p&gt;Because OIDC is built on OAuth 2.0, understanding OAuth first makes everything else clearer. That’s why we’ll begin by exploring OAuth 2.0 in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  OAuth 2.0 Flow Types For Different Use Cases
&lt;/h3&gt;

&lt;p&gt;OAuth 2.0 defines different flows to handle different scenarios securely. In each scenario, the actors and their capabilities can vary, which is why there are different flows tailored to specific use cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexfhe4u8ewa25l4i0qbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexfhe4u8ewa25l4i0qbu.png" alt="OAuth Flow Types" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, we’ll focus on the &lt;strong&gt;Authorization Code flow (non-PKCE)&lt;/strong&gt; because it demonstrates the core OAuth concepts most clearly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authorization Code Flow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Four Key Players
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Resource Owner (Your Friend)&lt;/strong&gt;: The user who owns the protected resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client (You)&lt;/strong&gt;: The application requesting access on the resource owner’s behalf.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorization Server (City Hall)&lt;/strong&gt;: Authenticates the resource owner and the client, and issues signed access tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Server (Warehouse)&lt;/strong&gt;: Hosts the protected resources and verifies access tokens before granting access.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Client Registration And End-To-End Flow
&lt;/h3&gt;

&lt;p&gt;Before the OAuth flow can even start, the &lt;strong&gt;Client app&lt;/strong&gt; must be registered with the &lt;strong&gt;Authorization Server&lt;/strong&gt; (like Google, Microsoft, etc.). Think of this as the client obtaining an official ID that proves who it is whenever it requests authorization.&lt;/p&gt;

&lt;p&gt;During registration, the Authorization Server issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Client ID&lt;/strong&gt;: A public identifier for your app (like a username for the app).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client Secret&lt;/strong&gt;: A private key for server-side authentication (like a password for the app that must be kept secret).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redirect URI&lt;/strong&gt;: The specific URL on the client app where the Authorization Server will send the user after authorization (like telling the Authorization Server, “After login, send them back here”).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As explained in the example above, the Authorization Code flow has two key phases:&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 1: Resource Owner Authentication &amp;amp; Consent
&lt;/h4&gt;

&lt;p&gt;The resource owner authenticates with the authorization server and grants permission to the client app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v1gdsvzu60ci1650ivm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v1gdsvzu60ci1650ivm.png" alt="Authorization Code" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 2: Client Authentication &amp;amp; Token Exchange
&lt;/h4&gt;

&lt;p&gt;The client app authenticates itself and exchanges the &lt;strong&gt;authorization code&lt;/strong&gt; for the &lt;strong&gt;access token&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2yf2bqjbsoue4lizkuxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2yf2bqjbsoue4lizkuxw.png" alt="Access Token" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;
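
&lt;p&gt;Phase 2 boils down to a form-encoded POST to the authorization server's token endpoint, using the authorization_code grant defined by OAuth 2.0 (RFC 6749). A sketch of the request body (all concrete values below are hypothetical placeholders):&lt;/p&gt;

```python
from urllib.parse import urlencode

def build_token_request(code: str, client_id: str,
                        client_secret: str, redirect_uri: str) -> dict:
    """Form fields the client POSTs to the token endpoint to swap
    the authorization code for an access token."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,     # must match the registered URI
        "client_id": client_id,
        "client_secret": client_secret,   # authenticates the client itself
    }

body = build_token_request("SplxlOBeZQQYbYS6WxSbIA",
                           "my-client-id", "my-client-secret",
                           "https://client.example.com/callback")
print(urlencode(body))  # the application/x-www-form-urlencoded POST body
```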

&lt;p&gt;The &lt;strong&gt;authorization code&lt;/strong&gt; issued by the Authorization Server simply represents the &lt;strong&gt;authorization&lt;/strong&gt; granted by the resource owner to the client app.&lt;/p&gt;

&lt;p&gt;But there’s a key problem: &lt;em&gt;The authorization code doesn’t tell the client who the resource owner actually is&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In other words, OAuth alone doesn’t help the client app verify that the person using it is really the resource owner who granted the permission.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does this matter?&lt;/strong&gt; → Because the client is acting on behalf of the user and it needs to be sure that the person interacting with it is indeed the same resource owner who gave that authorization. Otherwise, the client might present or act on sensitive user data for the wrong person. &lt;strong&gt;This is the gap that OpenID Connect (OIDC) solves&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenID Connect (OIDC): Adding Identity to OAuth
&lt;/h2&gt;

&lt;p&gt;OIDC is a simple identity layer built on top of OAuth 2.0. It extends the OAuth flow by returning an &lt;strong&gt;ID Token&lt;/strong&gt; alongside the &lt;strong&gt;Access Token&lt;/strong&gt;. This ID Token contains information about the authenticated user. In other words:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth 2.0 → Authorization (what can the app do?)&lt;/li&gt;
&lt;li&gt;OpenID Connect → Authentication + Authorization (who is the caller user, and what can the app do on their behalf?)&lt;/li&gt;
&lt;/ul&gt;
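
&lt;p&gt;An ID Token is a JWT (header.payload.signature, each base64url-encoded) whose payload carries identity claims. A minimal sketch of reading those claims; signature and claim validation, which a real client must always perform, are omitted here:&lt;/p&gt;

```python
import base64, json

def decode_id_token_claims(id_token: str) -> dict:
    """Decode the payload segment of a JWT. Does NOT verify the signature."""
    payload_b64 = id_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)   # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a toy (unsigned) token just to exercise the decoder:
claims = {"iss": "https://accounts.example.com", "sub": "user-123",
          "aud": "my-client-id", "exp": 1719999999}
payload = base64.urlsafe_b64encode(
    json.dumps(claims).encode()).decode().rstrip("=")
toy_token = f"eyJhbGciOiJub25lIn0.{payload}."
print(decode_id_token_claims(toy_token)["sub"])   # user-123
```

The &lt;code&gt;sub&lt;/code&gt; claim is the stable user identifier that closes the gap described above: the client now knows &lt;em&gt;who&lt;/em&gt; authorized it.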

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyadzisrjhidq2pr96z0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyadzisrjhidq2pr96z0i.png" alt="End to End OIDC" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;In the next article(s), we’ll dive into &lt;strong&gt;how access tokens are actually implemented&lt;/strong&gt; in real-world systems.&lt;/p&gt;

&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>identity</category>
      <category>oauth</category>
      <category>iam</category>
    </item>
    <item>
      <title>Core Attributes of Distributed Systems: Reliability, Availability, Scalability, and More</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Mon, 30 Jun 2025 19:41:28 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/core-attributes-of-distributed-systems-reliability-availability-scalability-and-more-23p6</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/core-attributes-of-distributed-systems-reliability-availability-scalability-and-more-23p6</guid>
      <description>&lt;p&gt;Whether you’re building a simple web app or a large distributed system, users don’t just expect it to work → they want it to be fast, always available, secure, and to run smoothly without unexpected interruptions.&lt;/p&gt;

&lt;p&gt;These expectations are captured in what we call &lt;strong&gt;system quality attributes&lt;/strong&gt; or &lt;strong&gt;non-functional requirements&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore the most critical attributes that any serious system should aim to deliver, especially in distributed environments. We’ll cover why each attribute matters for the users, how to measure it, and how to achieve it both proactively and reactively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reliability
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Definition: Reliability is the ability of a system to operate correctly and continuously over time, delivering accurate results without unexpected interruptions or failures.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It Matters
&lt;/h3&gt;

&lt;p&gt;Users rely on your system to behave predictably. If your banking app transfers money to the wrong account or your flight booking app glitches, it erodes customer trust instantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Measure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mean Time Between Failures (MTBF)&lt;/strong&gt;: Average time the system runs before failing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate&lt;/strong&gt;: Frequency of incorrect results (e.g., data corruption or logic bugs).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Proactive Techniques (making the system reliable in advance)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fault prevention (Stop mistakes before they happen)&lt;/strong&gt; → Write clean code, perform code reviews, use static analysis tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault removal (Find and fix mistakes early)&lt;/strong&gt;: Use automated testing, debugging, and formal verification.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reactive Techniques (handling faults when they occur)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fault tolerance (Keep working despite faults)&lt;/strong&gt; → Use retries, replication/redundancy, graceful degradation, and error correction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault detection (Spot problems quickly)&lt;/strong&gt; → Monitor logs, set up alerts, use health checks and diagnostics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault recovery (Fix issues promptly)&lt;/strong&gt; → Restart services, failover to backups, roll back to safe states.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Availability
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Definition: Availability is the ability of a system to be up and responsive when needed, ensuring users can access it at any time. It focuses on being ready to serve, not on whether the response is correct (which is covered by reliability).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It Matters
&lt;/h3&gt;

&lt;p&gt;If your system crashes or is down during peak hours, users will leave. For mission-critical systems like trading, even seconds of downtime can be disastrous.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Measure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uptime percentage&lt;/strong&gt; → e.g., 99.9% uptime = ~8.7 hours of downtime/year.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mean Time to Recovery (MTTR)&lt;/strong&gt;: How fast you recover from failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High availability (HA)&lt;/strong&gt; typically refers to uptime of 99.9% or more, achieved through redundancy and failover strategies.&lt;/li&gt;
&lt;/ul&gt;
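
&lt;p&gt;These uptime figures translate directly into a yearly downtime budget, and steady-state availability can also be estimated from MTBF and MTTR. A quick sketch:&lt;/p&gt;

```python
HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_hours_per_year(uptime_pct: float) -> float:
    """Maximum downtime a given uptime percentage allows per year."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(round(downtime_hours_per_year(99.9), 2))   # ~8.76 h/year ("three nines")
print(round(downtime_hours_per_year(99.99), 2))  # ~0.88 h/year
print(round(availability(1000, 1), 4))           # 0.999 → failing once every
                                                 # 1,000 h with 1 h recovery
```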

&lt;h3&gt;
  
  
  Proactive Techniques
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capacity planning&lt;/strong&gt;: Predict demand and provision enough resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redundant infrastructure&lt;/strong&gt;: Extra hardware or cloud zones ready to take over.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reactive Techniques
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failover mechanisms&lt;/strong&gt;: Automatically switch to backup nodes or servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-healing&lt;/strong&gt;: Restart crashed services or containers automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scalability
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Definition: Scalability is the ability of a system to handle more users or more data by adding more resources, without significantly slowing down or crashing.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It Matters
&lt;/h3&gt;

&lt;p&gt;What works smoothly for 10 users might completely break when 10,000 people show up. If your product becomes popular, you want it to grow without falling apart.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Measure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt; → How many requests per second your system can handle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency under load&lt;/strong&gt; → How fast your system responds when many users are active at once.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Proactive Techniques (preparing for growth in advance)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Design for scalability (Build with growth in mind)&lt;/strong&gt; → Use stateless designs, modular components, and databases that can be partitioned or scaled out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity planning (Plan ahead for future load)&lt;/strong&gt; → Estimate how much traffic or data you’ll have later and make sure your system can handle it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reactive Techniques (handling growth when it happens)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-scaling (Add resources on the fly)&lt;/strong&gt; → Automatically spin up more servers when traffic spikes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing (Distribute work evenly)&lt;/strong&gt; → Spread incoming requests across multiple servers so no single one gets overloaded.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Maintainability
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Definition: Maintainability is the ability of a system to be easily changed, updated, fixed, or improved over time without introducing new problems.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It Matters
&lt;/h3&gt;

&lt;p&gt;Requirements always change. Bugs appear. New features need to be added. If your system is messy or overly complex, even small changes become risky and time-consuming. A maintainable system is easy to understand, modify, and operate day to day, letting teams respond quickly and confidently to new needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Measure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mean Time to Modify (MTTM)&lt;/strong&gt; → How long it takes to make a change or add a new feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code churn&lt;/strong&gt; → How frequently the code is updated or changed, which can indicate areas that are difficult to maintain or keep stable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Proactive Techniques (making the system easier to change in advance)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modular design (Break it into manageable parts)&lt;/strong&gt; → Structure your system as small, independent components that are easier to understand, test, and replace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity (Avoid unnecessary complexity)&lt;/strong&gt; → Keep designs and code clear and straightforward to reduce errors and make it easier for new developers to pick up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear documentation and standards (Help everyone stay aligned)&lt;/strong&gt; → Write understandable docs and follow consistent coding styles so others can safely make changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operability considerations (Design for smooth running in production)&lt;/strong&gt; → Build clear configuration, easy deployment processes, and good monitoring hooks to simplify day-to-day management.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reactive Techniques (improving it over time)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Refactoring (Clean up continuously)&lt;/strong&gt; → Regularly improve the structure of code without changing its behavior to keep it healthy and easy to work with.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated regression tests (Prevent breaking existing features)&lt;/strong&gt; → Run tests that ensure changes don’t accidentally introduce new bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental improvements (Make small, safe changes)&lt;/strong&gt; → Tackle technical debt gradually without big risky rewrites.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Definition: Security is the ability of a system to protect itself from unauthorized access, misuse, or attacks.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It Matters
&lt;/h3&gt;

&lt;p&gt;A single security breach can damage your reputation, leak sensitive data, or cause big financial losses. Attackers don’t wait for you to be ready → you have to plan ahead.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Measure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time to detect and respond&lt;/strong&gt; → How quickly you can find and fix security issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Number of vulnerabilities over time&lt;/strong&gt; → Track how many security flaws are open and how quickly they’re closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance scores&lt;/strong&gt; → Certifications like SOC2 or ISO 27001 that show your security practices meet industry standards.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Proactive Techniques (protecting the system in advance)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Threat modeling (Think like an attacker)&lt;/strong&gt; → Identify and fix weak points before someone exploits them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure defaults (Build security in by default)&lt;/strong&gt; → Use encryption, strong passwords, and access controls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security scans (Catch issues early)&lt;/strong&gt; → Run automated tools to find known vulnerabilities in your code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reactive Techniques (responding when something goes wrong)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Intrusion detection (Spot attacks fast)&lt;/strong&gt; → Use systems that alert you to suspicious activity in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident response (Limit the damage)&lt;/strong&gt; → Apply security patches quickly and have a plan to contain and fix breaches.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>database</category>
      <category>scalability</category>
      <category>availability</category>
    </item>
    <item>
      <title>Memory Models Explained: How Threads Really See Memory</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Sat, 28 Jun 2025 16:37:18 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/memory-models-explained-how-threads-really-see-memory-174l</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/memory-models-explained-how-threads-really-see-memory-174l</guid>
      <description>&lt;p&gt;Modern processors and compilers aggressively reorder instructions to improve performance → a behavior we explored in detail in my previous article: &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/instruction-reordering-your-code-doesnt-always-run-in-the-order-you-wrote-it-bb8"&gt;Instruction Reordering: Your Code Doesn’t Always Run in the Order You Wrote It&lt;/a&gt;.&lt;br&gt;
To write correct concurrent code or to understand why it breaks, we need to explore &lt;strong&gt;memory models&lt;/strong&gt;: the formal rules that define how threads see and interact with memory operations.&lt;br&gt;
This article explains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What memory models are and how they work&lt;/li&gt;
&lt;li&gt;The main types of memory models, including Sequential Consistency, Total Store Order (TSO), and relaxed/weak models&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  What Is a Memory Model?
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;memory model&lt;/strong&gt; is a contract between your program, the compiler, and the CPU that defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which memory operations → &lt;strong&gt;loads (reads)&lt;/strong&gt; and &lt;strong&gt;stores (writes)&lt;/strong&gt; → can be reordered&lt;/li&gt;
&lt;li&gt;When the effects of a write become visible to other threads&lt;/li&gt;
&lt;li&gt;How multiple threads observe reads and writes performed by others&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without a memory model, there’s no way to reason about multithreaded programs → each thread could see operations in any order, leading to unpredictable behavior.&lt;/p&gt;
&lt;h2&gt;
  
  
  Sequential Consistency: The Intuitive Model
&lt;/h2&gt;

&lt;p&gt;The simplest memory model is &lt;strong&gt;Sequential Consistency&lt;/strong&gt;, defined by Leslie Lamport as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order issued by that processor.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It defines a system where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All threads see memory operations (reads and writes) in the &lt;strong&gt;same global order&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Each thread sees its own operations occur in the same order as written in the program.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the execution behaves as if there is a single &lt;strong&gt;shared&lt;/strong&gt; timeline, and all operations from all threads are placed on that timeline in a way that respects each thread’s &lt;strong&gt;original&lt;/strong&gt; instruction order.&lt;/p&gt;

&lt;p&gt;This model is easy for programmers to reason about because it matches what we typically expect: operations happen one after another, and everyone sees the same thing.&lt;/p&gt;

&lt;p&gt;However, enforcing this strict order requires coordination between cores and often prevents performance optimizations like instruction reordering, store buffering, and speculative execution. That’s why modern hardware typically implements &lt;strong&gt;weaker memory models&lt;/strong&gt; that are more relaxed, but harder to reason about.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Consider two threads sharing variables &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; initialized to 0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Thread 1
x = 1;
r1 = y;

// Thread 2
y = 1;
r2 = x;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under sequential consistency, the result where both &lt;code&gt;r1 == 0&lt;/code&gt; and &lt;code&gt;r2 == 0&lt;/code&gt; is impossible: at least one thread must see the other’s write.&lt;/p&gt;

&lt;h2&gt;
  
  
  Total Store Order (TSO): Strong but Practical
&lt;/h2&gt;

&lt;p&gt;The x86 architecture (used in Intel and AMD CPUs) follows the &lt;strong&gt;Total Store Order (TSO)&lt;/strong&gt; memory model. It’s stronger and easier to reason about than many weak/relaxed models, while still allowing one key optimization to improve performance. Here’s how it works:&lt;/p&gt;

&lt;h3&gt;
  
  
  Stores (writes) happen in order
&lt;/h3&gt;

&lt;p&gt;For example, if Thread 1 executes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = 1;  
y = 2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any other thread that observes these values will always see &lt;code&gt;x = 1&lt;/code&gt; before &lt;code&gt;y = 2&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loads (reads) happen in order
&lt;/h3&gt;

&lt;p&gt;This means if your code says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r1 = x;  
r2 = y;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the CPU will load &lt;code&gt;x&lt;/code&gt; before &lt;code&gt;y&lt;/code&gt;, just as written.&lt;/p&gt;

&lt;h3&gt;
  
  
  But a later load can be reordered before an earlier store
&lt;/h3&gt;

&lt;p&gt;A later load can be executed before an earlier store, as long as they access &lt;strong&gt;different&lt;/strong&gt; variables. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = 1;     // Store to x  
y = 2;     // Store to y  
r1 = z;    // Load from z (z is not accessed above)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even though &lt;code&gt;x = 1&lt;/code&gt; and &lt;code&gt;y = 2&lt;/code&gt; come first, the CPU might delay committing those stores while performing the load from &lt;code&gt;z&lt;/code&gt; early. As a result, &lt;code&gt;r1&lt;/code&gt; might see an outdated value of &lt;code&gt;z&lt;/code&gt;, and other threads might not yet observe the updated values of &lt;code&gt;x&lt;/code&gt; or &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Relaxed/Weak: Performance First, Predictability Later
&lt;/h2&gt;

&lt;p&gt;Modern architectures such as ARM and POWER implement &lt;strong&gt;relaxed memory models&lt;/strong&gt;. These models give the CPU more freedom to reorder instructions for maximum performance, but they also make it harder for programmers to reason about how memory behaves in concurrent programs.&lt;/p&gt;

&lt;p&gt;Unlike &lt;strong&gt;TSO&lt;/strong&gt; (which only lets a later load move ahead of an earlier store), relaxed models allow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores to be reordered with other stores&lt;/li&gt;
&lt;li&gt;Loads to be reordered with other loads&lt;/li&gt;
&lt;li&gt;Stores and loads to be reordered with each other, in both directions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means almost any combination of reordering is allowed → unless the programmer uses explicit &lt;strong&gt;memory barriers&lt;/strong&gt; or &lt;strong&gt;synchronization&lt;/strong&gt; instructions to enforce ordering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;In a relaxed model, this code in Thread 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = 1;
y = 2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Might be observed by another thread as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;y = 2&lt;/code&gt; happening before &lt;code&gt;x = 1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Or only one of the stores being visible&lt;/li&gt;
&lt;li&gt;Or even both stores being delayed entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similarly, two loads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r1 = x;
r2 = y;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;May execute in reverse order internally, and &lt;code&gt;r1&lt;/code&gt; might see an older value while &lt;code&gt;r2&lt;/code&gt; sees a newer one → depending on what the hardware decides.&lt;/p&gt;

&lt;p&gt;Programmers can no longer assume that memory behaves “as written.” Writing correct concurrent code now depends on understanding language-level memory models, atomic operations, and memory fences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8g0ybrjex3md4ttsrmcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8g0ybrjex3md4ttsrmcd.png" alt="Memory Models Summary" width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next article, we’ll explore how &lt;strong&gt;synchronization mechanisms&lt;/strong&gt; (like locks, atomics, and memory barriers) help us write correct concurrent code → even under relaxed memory models → and dive into how they work under the hood.&lt;/p&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>memory</category>
      <category>programming</category>
      <category>performance</category>
      <category>multithreading</category>
    </item>
    <item>
      <title>Instruction Reordering: Your Code Doesn’t Always Run in the Order You Wrote It</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Thu, 26 Jun 2025 18:00:12 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/instruction-reordering-your-code-doesnt-always-run-in-the-order-you-wrote-it-bb8</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/instruction-reordering-your-code-doesnt-always-run-in-the-order-you-wrote-it-bb8</guid>
      <description>&lt;p&gt;When writing code, you naturally expect instructions to run one after the other in the exact order they appear. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = 1;
y = 2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’d expect &lt;code&gt;x = 1&lt;/code&gt; to complete before &lt;code&gt;y = 2&lt;/code&gt; starts.&lt;/p&gt;

&lt;p&gt;But in reality, modern CPUs and compilers don’t always execute instructions in the exact sequence you wrote them. Instead, they &lt;strong&gt;reorder instructions internally&lt;/strong&gt; to improve performance. While this might sound risky, it’s a core optimization that enables today’s processors to run billions of instructions per second.&lt;/p&gt;

&lt;p&gt;To fully appreciate why these reorderings happen, it helps to first understand the parallel execution techniques CPUs use that I have explained in detail here: &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/superscalar-vs-simd-vs-multicore-understanding-modern-cpu-parallelism-3jao"&gt;Superscalar vs SIMD vs Multicore: Understanding Modern CPU Parallelism&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This article explains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What instruction reordering is.&lt;/li&gt;
&lt;li&gt;Why CPUs and compilers perform it.&lt;/li&gt;
&lt;li&gt;How it affects multithreaded programs.&lt;/li&gt;
&lt;li&gt;And why understanding it is critical for writing correct concurrent code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is Instruction Reordering?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Instruction reordering&lt;/strong&gt; means the order in which instructions are executed can differ from the order they appear in your source code. There are two main types of reordering:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Compiler Reordering&lt;/strong&gt; — The compiler rearranges instructions as part of the code generation process to produce faster machine code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU Reordering (Out-of-Order Execution)&lt;/strong&gt; — CPUs execute instructions out of their original order internally to better utilize available execution units and reduce pipeline stalls.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both types of reordering are done &lt;strong&gt;transparently&lt;/strong&gt; to the programmer in &lt;strong&gt;single-threaded&lt;/strong&gt; programs, so your code behaves as expected. However, when multiple threads interact via shared memory, these reorderings can cause subtle and hard-to-debug bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Do CPUs and Compilers Reorder Instructions?
&lt;/h2&gt;

&lt;p&gt;Both CPUs and compilers reorder instructions primarily to &lt;strong&gt;improve performance&lt;/strong&gt; by making better use of hardware resources and minimizing delays.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improving CPU Utilization
&lt;/h3&gt;

&lt;p&gt;Modern CPUs have multiple execution units per core (such as ALUs, FPUs, and load/store units) that can operate in parallel. To keep these units busy, the CPU issues and executes multiple independent instructions simultaneously, even if they appear sequentially in your code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = b + c; // Instruction 1
x = y + z;  // Instruction 2 (independent of Instruction 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the CPU can execute both instructions at the same time in different execution units, rather than waiting to finish &lt;strong&gt;Instruction 1&lt;/strong&gt; before starting &lt;strong&gt;Instruction 2&lt;/strong&gt;. This parallelism boosts throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hiding Memory Latency
&lt;/h3&gt;

&lt;p&gt;Memory access can be slow compared to CPU speeds. When an instruction needs data from memory, the CPU doesn’t just wait idly → it reorders instructions to execute other independent instructions that are ready to run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = 1; // Instruction 1
y = slowLoad(); // Instruction 2 (memory access, slower)
z = 2;          // Instruction 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While &lt;strong&gt;Instruction 2&lt;/strong&gt; waits for the memory load, the CPU can execute &lt;strong&gt;Instruction 3&lt;/strong&gt; immediately, avoiding pipeline stalls and improving efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compiler Optimizations
&lt;/h3&gt;

&lt;p&gt;Compilers reorder instructions during code optimization to produce faster, more efficient machine code. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reordering independent instructions to improve scheduling.&lt;/li&gt;
&lt;li&gt;Moving calculations that don’t change out of loops.&lt;/li&gt;
&lt;li&gt;Eliminating repeated computations by reusing previously computed values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider the following code snippet inside a loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (int i = 0; i &amp;lt; 1000; i++) {
  int a = 5 * 2; // same calculation every iteration
  int b = a + i;
  int c = 5 * 2; // repeated calculation
  array[i] = b + c;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After optimization, the generated code could look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int a = 5 * 2; // computed once before the loop
for (int i = 0; i &amp;lt; 1000; i++) {
  int b = a + i;
  int c = a; // reuse computed value
  array[i] = b + c;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Instruction Reordering Matters in Multithreaded Programs
&lt;/h2&gt;

&lt;p&gt;When multiple threads access shared memory without proper synchronization, instruction reordering can lead to unexpected behaviors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Shared variables
int data = 0;
int flag = 0;

// Thread 1
data = 42; // Step 1
flag = 1; // Step 2

// Thread 2
if (flag == 1) { // Step 3
  print(data); // Step 4
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’d expect that if &lt;strong&gt;Thread 2&lt;/strong&gt; sees &lt;code&gt;flag == 1&lt;/code&gt;, it should also see &lt;code&gt;data = 42&lt;/code&gt;. But if the compiler or CPU reorders &lt;code&gt;flag = 1&lt;/code&gt; before &lt;code&gt;data = 42&lt;/code&gt;, &lt;strong&gt;Thread 2&lt;/strong&gt; might read &lt;code&gt;flag == 1&lt;/code&gt; but &lt;code&gt;data = 0&lt;/code&gt;. This kind of subtle bug is caused by instruction reordering combined with visibility issues in multithreaded memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  How CPUs &amp;amp; Compilers Avoid Breaking Single-Threaded Programs
&lt;/h2&gt;

&lt;p&gt;Even though CPUs and compilers reorder instructions to run faster, they make sure your program still behaves as if the instructions ran exactly in the order you wrote them → when you’re running a &lt;strong&gt;single thread&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;They do this by carefully tracking &lt;strong&gt;data dependencies&lt;/strong&gt; between instructions. For example, if one instruction needs the result of another, it won’t be moved before that instruction.&lt;/p&gt;

&lt;p&gt;CPUs also use special hardware mechanisms, like &lt;strong&gt;reorder buffers&lt;/strong&gt;, to keep track of the original program order and only commit results in that order. So, even if instructions execute out of order internally, the program’s visible behavior stays consistent.&lt;/p&gt;

&lt;p&gt;This means you don’t have to worry about your code acting strangely due to instruction reordering in normal, &lt;strong&gt;single-threaded&lt;/strong&gt; programs.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Write Correct Multithreaded Programs Despite Reordering
&lt;/h2&gt;

&lt;p&gt;As explained before, instruction reordering can lead to subtle and unpredictable bugs in multithreaded programs. One thread might observe memory updates from another in an unexpected order, breaking the intended logic of your program.&lt;/p&gt;

&lt;p&gt;Because CPUs and compilers apply many different optimizations based on context, hardware, and surrounding code, the specific ordering of operations can vary in ways you might not expect. This makes reasoning about shared memory behavior tricky without proper safeguards.&lt;/p&gt;

&lt;p&gt;To write correct multithreaded code, you need to use &lt;strong&gt;synchronization tools&lt;/strong&gt; that control visibility and ordering between threads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Locks (mutexes)&lt;/strong&gt;: Prevent simultaneous access to shared data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic operations&lt;/strong&gt;: Ensure safe, indivisible updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory barriers (fences)&lt;/strong&gt;: Stop certain reorderings from happening across critical instruction boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools help establish &lt;em&gt;happens-before relationships&lt;/em&gt; → ensuring that the operations in one thread become visible to another in a predictable and controlled manner. To apply these tools correctly, it’s important to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What memory models are, and how they define visibility and ordering guarantees&lt;/li&gt;
&lt;li&gt;How different synchronization mechanisms enforce those guarantees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These advanced topics will be explored in upcoming articles, as they are essential for writing safe and efficient concurrent systems.&lt;/p&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>tutorial</category>
      <category>coding</category>
      <category>memory</category>
    </item>
    <item>
      <title>Superscalar vs SIMD vs Multicore: Understanding Modern CPU Parallelism</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Wed, 25 Jun 2025 05:46:13 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/superscalar-vs-simd-vs-multicore-understanding-modern-cpu-parallelism-3jao</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/superscalar-vs-simd-vs-multicore-understanding-modern-cpu-parallelism-3jao</guid>
      <description>&lt;p&gt;For many years, improving CPU performance meant increasing clock speed → allowing more cycles per second. But today, we’ve reached practical limits in how fast we can push frequency due to power, heat, and physical constraints.&lt;/p&gt;

&lt;p&gt;As a result, modern CPU design focuses less on running faster and more on &lt;strong&gt;doing more per cycle&lt;/strong&gt;. To achieve this, processors use three key architectural techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Superscalar execution&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SIMD (Single Instruction, Multiple Data)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multicore parallelism&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these allow a CPU to complete multiple operations in a single clock cycle → making better use of each tick without increasing the clock rate itself.&lt;/p&gt;

&lt;p&gt;Before diving into these techniques, it’s important to understand &lt;strong&gt;CPU pipelining&lt;/strong&gt;, the foundation of all modern CPU execution, which is covered in a separate article — &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/cpu-pipelining-how-modern-processors-execute-instructions-faster-3i71"&gt;CPU Pipelining: How Modern Processors Execute Instructions Faster&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Superscalar: Executing Multiple Instructions Per Cycle
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88lao0kogep0bfyqx8b9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88lao0kogep0bfyqx8b9.png" alt="Multiple Decode units, ALU etc" width="800" height="877"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A superscalar processor can issue and execute multiple instructions within a single clock cycle. This is achieved by replicating execution units (such as ALUs, FPUs, and load/store units), as illustrated in the diagram above, and by incorporating scheduling logic that performs several key functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyzing Dependencies Between Instructions&lt;/li&gt;
&lt;li&gt;Scheduling Independent Instructions Across Execution Units&lt;/li&gt;
&lt;li&gt;Register Renaming to Eliminate False Dependencies&lt;/li&gt;
&lt;li&gt;Reordering Instructions to Hide Stalls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach exploits &lt;strong&gt;instruction-level parallelism (ILP)&lt;/strong&gt; → the presence of independent instructions within a single thread that can be executed simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Superscalar Scheduling in Action
&lt;/h3&gt;

&lt;p&gt;Consider the following simple code snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int a = x + y; // Instruction 1
int b = m * n; // Instruction 2
a = p + q;     // Instruction 3 (reuses 'a')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s how a &lt;strong&gt;2-way superscalar CPU&lt;/strong&gt; handles this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instruction 1&lt;/strong&gt; and &lt;strong&gt;Instruction 2&lt;/strong&gt; are &lt;strong&gt;independent&lt;/strong&gt; and can be issued &lt;strong&gt;in parallel&lt;/strong&gt;, assuming two ALUs are available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instruction 3&lt;/strong&gt; writes to &lt;code&gt;a&lt;/code&gt; again. Although it doesn’t depend on &lt;strong&gt;Instruction 1&lt;/strong&gt;, the reuse of the variable name &lt;code&gt;a&lt;/code&gt; could create a &lt;strong&gt;false write-after-write dependency&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;To resolve this, the CPU uses &lt;strong&gt;register renaming&lt;/strong&gt; to map each version of &lt;code&gt;a&lt;/code&gt; to a &lt;strong&gt;different&lt;/strong&gt; physical register:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;a (from x + y)&lt;/code&gt; → Register &lt;code&gt;P1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;b&lt;/code&gt; → Register &lt;code&gt;P2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;a (from p + q)&lt;/code&gt; → Register &lt;code&gt;P3&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;This allows &lt;strong&gt;Instruction 1&lt;/strong&gt; and &lt;strong&gt;Instruction 3&lt;/strong&gt; to be issued &lt;strong&gt;out of order or in parallel&lt;/strong&gt;, without waiting on one another.&lt;/li&gt;
&lt;li&gt;If, for example, &lt;code&gt;x + y&lt;/code&gt; causes a stall (e.g., due to a cache miss), the CPU can &lt;strong&gt;reorder&lt;/strong&gt; execution and run &lt;strong&gt;Instruction 2&lt;/strong&gt; or &lt;strong&gt;Instruction 3&lt;/strong&gt; first → keeping the pipeline active.&lt;/li&gt;

&lt;/ul&gt;
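&lt;p&gt;The renaming step above can be sketched in a few lines of Python. This is an illustrative model only → real CPUs do this in hardware with a physical register file and a free list, but the core idea (every architectural write gets a fresh physical register, so reusing &lt;code&gt;a&lt;/code&gt; creates no false dependency) is the same:&lt;/p&gt;

```python
# Register-renaming sketch (illustrative, not how hardware implements it).
# Each instruction is (destination, list_of_sources).
def rename(instructions):
    mapping = {}        # architectural name -&amp;gt; current physical register
    next_phys = 1
    renamed = []
    for dest, srcs in instructions:
        # Sources are read through the current mapping first.
        phys_srcs = [mapping.get(s, s) for s in srcs]
        preg = "P" + str(next_phys)
        next_phys += 1
        mapping[dest] = preg    # every write gets a fresh physical register
        renamed.append((preg, phys_srcs))
    return renamed

program = [
    ("a", ["x", "y"]),   # Instruction 1: a = x + y
    ("b", ["m", "n"]),   # Instruction 2: b = m * n
    ("a", ["p", "q"]),   # Instruction 3: a = p + q  (reuses 'a')
]
for preg, srcs in rename(program):
    print(preg, srcs)
# Both writes to 'a' now target different physical registers (P1 and P3),
# so the write-after-write dependency disappears.
```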

&lt;h2&gt;
  
  
  SIMD: Applying One Instruction to Multiple Data Elements
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjls8p5ksq8dpvfryiogc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjls8p5ksq8dpvfryiogc.png" alt="SIMD" width="800" height="589"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SIMD&lt;/strong&gt; (Single Instruction, Multiple Data) allows a single instruction to operate on multiple values at once. This is ideal for vector math, graphics, or matrix processing, where the same operation repeats across arrays. This exploits &lt;strong&gt;data-level parallelism (DLP)&lt;/strong&gt; → applying the same instruction to many data points.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Pseudo-vectorized addition using SIMD
float a[4] = {1.0, 2.0, 3.0, 4.0};
float b[4] = {10.0, 20.0, 30.0, 40.0};
float c[4];
for (int i = 0; i &amp;lt; 4; i++) {
  c[i] = a[i] + b[i];
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single SIMD instruction can perform all four of these additions at once.&lt;/p&gt;
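&lt;p&gt;You can get a feel for this "one instruction, many lanes" idea in plain Python using the SWAR trick (SIMD within a register): pack four narrow values into one wide integer, and a single addition updates every lane. This is only a sketch → it assumes no lane ever overflows its 8 bits, and real SIMD hardware handles lane boundaries for you:&lt;/p&gt;

```python
# SWAR sketch: four 8-bit lanes packed into one wide integer.
# Assumes no lane overflows past 255 (real SIMD hardware enforces lanes).
def pack(lanes):
    # Lane i occupies byte i of the wide integer.
    return sum(x * 256 ** i for i, x in enumerate(lanes))

def unpack(v, n=4):
    return [(v // 256 ** i) % 256 for i in range(n)]

a = pack([1, 2, 3, 4])
b = pack([10, 20, 30, 40])
c = a + b                 # ONE addition produces four element-wise sums
print(unpack(c))          # [11, 22, 33, 44]
```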

&lt;h2&gt;
  
  
  Multicore: Running Multiple Threads in Parallel
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;multicore processor&lt;/strong&gt; has multiple independent cores, each capable of executing its own thread. Threads may come from the same program (multithreading) or different programs (multiprocessing). This exploits &lt;strong&gt;thread-level parallelism (TLP)&lt;/strong&gt; → running independent streams of instructions in parallel.&lt;/p&gt;
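&lt;p&gt;As a small sketch of thread-level parallelism, the standard library lets you fan independent chunks of work out to a pool of workers. (Caveat: CPython’s interpreter lock means pure-Python bytecode does not run truly in parallel on threads → for CPU-bound work you would typically use processes instead; the structure is what this sketch illustrates.)&lt;/p&gt;

```python
# TLP sketch: independent chunks of work handed to a pool of workers.
# Note: in CPython, use ProcessPoolExecutor for real CPU parallelism;
# ThreadPoolExecutor is used here to keep the example self-contained.
from concurrent.futures import ThreadPoolExecutor

def work(chunk):
    # Each worker computes a partial sum of squares independently.
    return sum(x * x for x in chunk)

chunks = [range(0, 1000), range(1000, 2000), range(2000, 3000), range(3000, 4000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(work, chunks))
print(sum(partials))   # same result as the sequential computation
```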

&lt;h2&gt;
  
  
  All Three Combined: Parallelism at Every Level
&lt;/h2&gt;

&lt;p&gt;Modern CPUs combine &lt;strong&gt;superscalar, SIMD, and multicore&lt;/strong&gt; techniques to maximize throughput per cycle. This allows multiple threads to run across cores, with each core executing multiple instructions per cycle, and each instruction operating on multiple data values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;A CPU with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 cores (&lt;strong&gt;multicore&lt;/strong&gt;),&lt;/li&gt;
&lt;li&gt;each capable of issuing 4 instructions per cycle (&lt;strong&gt;superscalar&lt;/strong&gt;),&lt;/li&gt;
&lt;li&gt;and supporting 256-bit SIMD (processing 8 floats at once)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;can potentially perform: &lt;strong&gt;4 cores × 4 instructions × 8 data elements = 128 operations per cycle&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjenulchd9yr4nyi0acp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjenulchd9yr4nyi0acp3.png" alt="Summary" width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>cpu</category>
      <category>systemdesign</category>
      <category>tutorial</category>
      <category>superscalar</category>
    </item>
    <item>
      <title>CPU Pipelining: How Modern Processors Execute Instructions Faster</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Sun, 22 Jun 2025 16:29:26 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/cpu-pipelining-how-modern-processors-execute-instructions-faster-3i71</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/cpu-pipelining-how-modern-processors-execute-instructions-faster-3i71</guid>
      <description>&lt;p&gt;The key to modern processors’ speed lies in their ability to execute many instructions in parallel, and the foundation for that is a technique called &lt;strong&gt;pipelining&lt;/strong&gt;. Though introduced decades ago, pipelining remains central to how today’s CPUs achieve high performance, powering even the most advanced architectures.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore how pipelining works, how it improves CPU performance, and the common bottlenecks that can limit its efficiency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note — I have already written in-depth articles covering the memory hierarchy → including &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/understanding-cpu-cache-organization-and-structure-4o1o"&gt;cache&lt;/a&gt;, &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/memory-access-demystified-how-virtual-memory-caches-and-dram-impact-performance-5ec6"&gt;virtual memory&lt;/a&gt;, and &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/understanding-dram-internals-how-channels-banks-and-dram-access-patterns-impact-performance-3fef"&gt;DRAM&lt;/a&gt;, so this article will not dive deeply into memory accesses.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Way to Understand CPU Pipelining
&lt;/h2&gt;

&lt;p&gt;Imagine you work at a burger joint. You have to make 3 burgers, and each one needs to go through these 3 steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grill the patty (1 min)&lt;/li&gt;
&lt;li&gt;Assemble the burger (1 min)&lt;/li&gt;
&lt;li&gt;Wrap it (1 min)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Without Pipelining: One Worker Does Everything
&lt;/h3&gt;

&lt;p&gt;Imagine you have &lt;strong&gt;one worker&lt;/strong&gt; who knows how to do all three tasks: grilling, assembling, and wrapping. They make each burger from start to finish before moving on to the next one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minute 1–3: Burger 1&lt;/li&gt;
&lt;li&gt;Minute 4–6: Burger 2&lt;/li&gt;
&lt;li&gt;Minute 7–9: Burger 3&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total time for 3 burgers = 9 minutes&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The worker is skilled, but because they do everything alone, they can only work on one burger at a time. No overlap.&lt;/p&gt;

&lt;h3&gt;
  
  
  With Pipelining (Assembly Line): Each specialized worker does only their task
&lt;/h3&gt;

&lt;p&gt;Here, you have &lt;strong&gt;3 workers&lt;/strong&gt;, and each one is &lt;strong&gt;specialized&lt;/strong&gt; → they only know how to do their specific task.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minute 1: Worker 1 grills Burger 1&lt;/li&gt;
&lt;li&gt;Minute 2: Worker 1 grills Burger 2, Worker 2 assembles Burger 1&lt;/li&gt;
&lt;li&gt;Minute 3: Worker 1 grills Burger 3, Worker 2 assembles Burger 2, Worker 3 wraps Burger 1&lt;/li&gt;
&lt;li&gt;Minute 4: Worker 2 assembles Burger 3, Worker 3 wraps Burger 2&lt;/li&gt;
&lt;li&gt;Minute 5: Worker 3 wraps Burger 3&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total time for 3 burgers = 5 minutes.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the pipeline is full (minute 3 onwards), one burger finishes every minute.&lt;/p&gt;
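&lt;p&gt;The two schedules above can be captured in a tiny timing model. This sketch assumes every stage takes exactly one time unit (minute or cycle) and the pipeline never stalls:&lt;/p&gt;

```python
# Idealized timing model: 1 time unit per stage, no stalls.
def sequential_time(n_items, n_stages):
    # One worker does every stage of each item before starting the next.
    return n_items * n_stages

def pipelined_time(n_items, n_stages):
    # The pipeline fills in n_stages units, then finishes one item per unit.
    return n_stages + (n_items - 1)

print(sequential_time(3, 3))   # 9  → the single-worker burger joint
print(pipelined_time(3, 3))    # 5  → the assembly line
```

The same formula shows why pipelining shines at scale: for 1000 burgers (or instructions), the pipelined total is 1002 time units versus 3000 sequentially → approaching one completion per unit.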

&lt;h3&gt;
  
  
  How This Relates to CPUs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each burger = one CPU instruction&lt;/li&gt;
&lt;li&gt;Each step = a CPU pipeline stage (explained in next section)&lt;/li&gt;
&lt;li&gt;Each worker = a specialized hardware unit in the CPU&lt;/li&gt;
&lt;li&gt;Without pipelining: everything runs one at a time, in order&lt;/li&gt;
&lt;li&gt;With pipelining: stages overlap, and the CPU finishes one instruction per cycle (after the pipeline fills)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  CPU Pipeline: Stages and Specialized Units
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl07jaxw7h6m9qpogyuq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl07jaxw7h6m9qpogyuq4.png" alt="CPU pipelining" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As explained in the previous section, a CPU pipeline works like an assembly line, where each instruction moves through a series of &lt;strong&gt;stages&lt;/strong&gt;. Each &lt;strong&gt;stage&lt;/strong&gt; is handled by a dedicated hardware unit, optimized for just that task. The table below maps each stage to its corresponding function and hardware unit. Compare each row with the matching element in the diagram above.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fob9792ue48zeebdggdf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fob9792ue48zeebdggdf1.png" alt="Stages of CPU pipelines" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A CPU Pipeline Example
&lt;/h3&gt;

&lt;p&gt;Let’s walk through 3 simple CPU instructions and how they move through the pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I1: R1 = MEM[0x1000] ; Load value at memory[0x1000] into R1
I2: R2 = MEM[0x1004] ; Load value at memory[0x1004] into R2
I3: R3 = R1 + R2 ; Add R1 and R2, store result in R3

Let’s assume:
memory[0x1000] = 10
memory[0x1004] = 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following table summarizes how each instruction progresses through the pipeline stages over multiple cycles:&lt;/p&gt;

&lt;h3&gt;
  
  
  Cycle-by-Cycle View
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjca56csh4w5x3hplb99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjca56csh4w5x3hplb99.png" alt="Cycle-by-cycle view" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Instruction Details
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrlx3ws9c6jwhfawlri5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrlx3ws9c6jwhfawlri5.png" alt="Instruction Details" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottlenecks in CPU Pipelining
&lt;/h2&gt;

&lt;p&gt;While CPU pipelining speeds up processing by working on multiple instructions at once, it faces several challenges that can slow things down. These bottlenecks limit how efficiently the pipeline runs:&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Hazards
&lt;/h3&gt;

&lt;p&gt;When an instruction needs the result of a previous instruction that isn’t ready yet, the pipeline must pause to avoid errors. In the walkthrough above, &lt;strong&gt;instruction I3 stalls&lt;/strong&gt; in the decode stage because it depends on the results of I1 and I2, which aren’t ready yet. This stall is a classic data hazard.&lt;/p&gt;
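&lt;p&gt;The check a hazard-detection unit performs boils down to: does a later instruction read a register that an earlier, still-in-flight instruction writes? A minimal sketch, using the I1/I3 instructions from the walkthrough (the tuple encoding is an assumption of this sketch, not a real ISA format):&lt;/p&gt;

```python
# Minimal read-after-write (RAW) hazard check.
# An instruction is modeled as (destination_register, source_registers).
def raw_hazard(producer, consumer):
    dest, _ = producer
    _, sources = consumer
    return dest in sources

i1 = ("R1", ("MEM",))        # I1: R1 = MEM[0x1000]
i3 = ("R3", ("R1", "R2"))    # I3: R3 = R1 + R2

print(raw_hazard(i1, i3))    # True → I3 must stall or receive forwarded data
```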

&lt;p&gt;&lt;strong&gt;Solutions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stalling&lt;/strong&gt; the pipeline until data is ready.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data forwarding&lt;/strong&gt; to pass results directly between pipeline stages, bypassing stages like WB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compiler optimizations&lt;/strong&gt; like reordering instructions to avoid dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-order execution&lt;/strong&gt; so the CPU can run independent instructions while waiting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register renaming&lt;/strong&gt; to avoid false dependencies between instructions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Control Hazards (Branching)
&lt;/h3&gt;

&lt;p&gt;Sometimes, the CPU comes across a decision point in the program, such as an &lt;strong&gt;if-else statement&lt;/strong&gt; or a loop. At this moment, the CPU needs to figure out which set of instructions to run next. However, it often cannot know the correct path immediately because the condition it’s checking hasn’t been fully evaluated yet. This uncertainty causes the pipeline to &lt;strong&gt;pause or clear instructions&lt;/strong&gt; that were loaded based on a guess, which slows down processing. This delay is called a &lt;strong&gt;branch penalty&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Branch prediction&lt;/strong&gt; to guess the most likely path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speculative execution&lt;/strong&gt; to continue down a guessed path and discard it if wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delayed branching&lt;/strong&gt; (used in some architectures) to rearrange instructions after a branch.&lt;/li&gt;
&lt;/ul&gt;
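&lt;p&gt;A classic textbook branch predictor is the two-bit saturating counter: two mispredictions in a row are needed to flip the prediction, so a single anomalous outcome (like the final iteration of a loop) doesn’t immediately retrain it. A sketch → real predictors track many branches plus history bits, but the counter logic is the essence:&lt;/p&gt;

```python
# Two-bit saturating-counter branch predictor (textbook scheme, simplified:
# real predictors keep a table of counters indexed by branch address).
class TwoBitPredictor:
    def __init__(self):
        self.state = 0   # 0,1 → predict not-taken; 2,3 → predict taken

    def predict(self):
        return self.state in (2, 3)

    def update(self, taken):
        # Saturate at the ends so one off-pattern outcome doesn't flip us.
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

p = TwoBitPredictor()
outcomes = [True, True, True, False, True, True]   # a mostly-taken branch
mispredictions = 0
for actual in outcomes:
    if p.predict() != actual:
        mispredictions += 1
    p.update(actual)
print(mispredictions)   # 3 → two warm-up misses, then only the odd not-taken
```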

&lt;h3&gt;
  
  
  Structural Hazards
&lt;/h3&gt;

&lt;p&gt;Structural hazards happen when two or more instructions need to use the same specialized hardware resource at the same time, but the CPU has only one of that resource available.&lt;/p&gt;

&lt;p&gt;For example, if two instructions both want to use the Arithmetic Logic Unit (ALU) simultaneously, one instruction has to wait until the resource is free. This waiting slows down the pipeline because instructions can’t proceed in parallel as planned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More hardware units&lt;/strong&gt; (e.g., multiple ALUs or load/store units) per CPU core.&lt;/li&gt;
&lt;li&gt;Enhanced &lt;strong&gt;resource scheduling&lt;/strong&gt; to better manage shared hardware access.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pipeline Stalls (Bubbles)
&lt;/h3&gt;

&lt;p&gt;To resolve hazards or wait for data, the CPU sometimes inserts idle cycles where no instruction completes. For example, in the earlier pipeline walkthrough, instruction &lt;strong&gt;I3 has to stall&lt;/strong&gt; because it depends on the results of &lt;strong&gt;I1 and I2&lt;/strong&gt;, which aren’t ready yet. During this stall, the pipeline pauses at the decode stage, waiting for the needed data, which temporarily slows down the overall instruction flow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hazard detection units&lt;/strong&gt; to predict and manage stalls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-order execution&lt;/strong&gt; to keep the pipeline busy with other instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compiler scheduling&lt;/strong&gt; to rearrange instructions and minimize idle time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these bottlenecks and their solutions are complex topics on their own and deserve detailed explanations. They will be covered in separate articles for a deeper dive.&lt;/p&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>cpu</category>
      <category>tutorial</category>
      <category>performance</category>
    </item>
    <item>
      <title>Cache Coherence: How the MESI Protocol Keeps Multi-Core CPUs Consistent</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Thu, 19 Jun 2025 23:59:08 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/cache-coherence-how-the-mesi-protocol-keeps-multi-core-cpus-consistent-170j</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/cache-coherence-how-the-mesi-protocol-keeps-multi-core-cpus-consistent-170j</guid>
      <description>&lt;p&gt;Modern multi-core CPUs depend on caches to accelerate memory access and improve performance. However, when multiple cores cache the same memory address, maintaining a &lt;strong&gt;consistent view of memory&lt;/strong&gt; across all cores and main memory (known as &lt;strong&gt;cache coherence&lt;/strong&gt;) becomes a tricky problem.&lt;/p&gt;

&lt;p&gt;One of the most widely used solutions to this challenge is the &lt;strong&gt;MESI cache coherence protocol&lt;/strong&gt;. In this article, we’ll break down what cache coherence means, why it’s important, and how the MESI protocol ensures your multi-core CPU operates reliably and efficiently.&lt;/p&gt;

&lt;p&gt;If you’re interested in diving deeper into how caches are organized and structured, I have written a separate article covering that in detail → &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/understanding-cpu-cache-organization-and-structure-4o1o"&gt;Understanding CPU Cache Organization and Structure&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Cache Coherence And Why Does It Matter?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu716ehc0smo9iylepyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu716ehc0smo9iylepyy.png" alt="Memory Hierarchy" width="743" height="778"&gt;&lt;/a&gt;&lt;br&gt;
When multiple cores cache the &lt;strong&gt;same&lt;/strong&gt; memory address, and one of them &lt;strong&gt;updates it&lt;/strong&gt;, how do we make sure all other cores see the updated value? Example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Core 1&lt;/strong&gt; and &lt;strong&gt;Core 2&lt;/strong&gt; both cache the value at memory address &lt;strong&gt;X&lt;/strong&gt;, which initially holds &lt;strong&gt;10&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;At this point, both cores have their own local copies of &lt;strong&gt;X&lt;/strong&gt; (value: &lt;strong&gt;10&lt;/strong&gt;) in their private &lt;strong&gt;L1&lt;/strong&gt; caches.&lt;/li&gt;
&lt;li&gt;Now, &lt;strong&gt;Core 1&lt;/strong&gt; updates &lt;strong&gt;X&lt;/strong&gt; to &lt;strong&gt;20&lt;/strong&gt; in its own &lt;strong&gt;L1&lt;/strong&gt; cache.&lt;/li&gt;
&lt;li&gt;However, &lt;strong&gt;Core 2’s&lt;/strong&gt; cache still holds the old value → &lt;strong&gt;10&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Worse, &lt;strong&gt;main memory&lt;/strong&gt; also still has the outdated value: 10.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, if &lt;strong&gt;Core 2&lt;/strong&gt; tries to read &lt;strong&gt;X&lt;/strong&gt;, it will retrieve the outdated value (&lt;strong&gt;10&lt;/strong&gt;) from its own cache, unaware that &lt;strong&gt;Core 1&lt;/strong&gt; has already updated it to &lt;strong&gt;20&lt;/strong&gt;. This kind of mismatch can lead to incorrect or unpredictable application behavior. This mismatch is what &lt;strong&gt;cache coherence&lt;/strong&gt; aims to solve: &lt;em&gt;Ensuring that all cores (and main memory) have a consistent and up-to-date view of memory&lt;/em&gt;.&lt;/p&gt;
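&lt;p&gt;The five steps above can be reproduced with two dictionaries standing in for the private L1 caches (a deliberately naive model → it has no coherence protocol, which is exactly the problem):&lt;/p&gt;

```python
# Two private caches with NO coherence protocol → the staleness problem.
main_memory = {"X": 10}
core1_cache = dict(main_memory)   # Core 1 caches X = 10
core2_cache = dict(main_memory)   # Core 2 caches X = 10

core1_cache["X"] = 20             # Core 1 updates only its private copy

print(core1_cache["X"])   # 20 → the only up-to-date copy
print(core2_cache["X"])   # 10 → stale
print(main_memory["X"])   # 10 → stale
```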

&lt;h2&gt;
  
  
  How CPUs Handle Writes: Write-Through vs Write-Back
&lt;/h2&gt;

&lt;p&gt;Before we talk about how cache coherence is maintained, it’s important to understand how caches handle writes, because that directly impacts why coherence is even needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Write-Through Caching
&lt;/h3&gt;

&lt;p&gt;In this strategy, whenever a core writes to a cache line, the same update is immediately written to main memory as well. This keeps memory always up-to-date, making coherence simpler to maintain. But there’s a catch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every write results in a memory operation, which increases &lt;strong&gt;memory bandwidth usage&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;It introduces &lt;strong&gt;latency&lt;/strong&gt;, as writes must wait for memory.&lt;/li&gt;
&lt;li&gt;And most importantly, it defeats the purpose of having fast, local caches → which is to reduce the need to access slower main memory in the first place.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Write-Back Caching (Used in Most CPUs Today)
&lt;/h3&gt;

&lt;p&gt;In write-back caching, when a core updates a value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The change is made only in the core’s &lt;strong&gt;private&lt;/strong&gt; cache.&lt;/li&gt;
&lt;li&gt;The updated value is not written to &lt;strong&gt;main memory&lt;/strong&gt; immediately.&lt;/li&gt;
&lt;li&gt;Instead, the new value stays in the cache and is written back only when the cache line is evicted or needs to be shared.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s exactly where &lt;strong&gt;cache coherence protocols like MESI&lt;/strong&gt; are needed → to ensure all cores always see the correct, updated data.&lt;/p&gt;
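&lt;p&gt;A write-back cache can be sketched as a map of lines carrying a dirty bit: writes only mark the line dirty, and main memory is updated at eviction time. The class and method names here are illustrative, not from any real implementation:&lt;/p&gt;

```python
# Write-back cache sketch: writes set a dirty bit; main memory is updated
# only when the line is evicted (names and structure are illustrative).
class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                       # addr → [value, dirty]

    def read(self, addr):
        if addr in self.lines:
            return self.lines[addr][0]        # cache hit
        value = self.memory[addr]             # miss → fill from memory, clean
        self.lines[addr] = [value, False]
        return value

    def write(self, addr, value):
        self.lines[addr] = [value, True]      # dirty → memory is now stale

    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.memory[addr] = value         # write back only dirty lines

mem = {0x1000: 10}
cache = WriteBackCache(mem)
cache.write(0x1000, 20)
print(mem[0x1000])      # 10 → memory is stale until eviction
cache.evict(0x1000)
print(mem[0x1000])      # 20 → written back
```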

&lt;h2&gt;
  
  
  MESI Protocol
&lt;/h2&gt;

&lt;p&gt;MESI stands for the four states each cache line can have → &lt;strong&gt;Modified (M)&lt;/strong&gt;, &lt;strong&gt;Exclusive (E)&lt;/strong&gt;, &lt;strong&gt;Shared (S)&lt;/strong&gt; and &lt;strong&gt;Invalid (I)&lt;/strong&gt;. These states help the CPU know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which core has the most recent version of a piece of data,&lt;/li&gt;
&lt;li&gt;Whether that version is the same as what’s in main memory,&lt;/li&gt;
&lt;li&gt;What the CPU should do when a core tries to read or write that data.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Modified (M) → “I changed it, and no one else has it.”
&lt;/h3&gt;

&lt;p&gt;If a CPU core has a cache line in the &lt;strong&gt;Modified&lt;/strong&gt; state:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;That core is the &lt;strong&gt;only&lt;/strong&gt; one with the &lt;strong&gt;latest&lt;/strong&gt; version of the data.&lt;/li&gt;
&lt;li&gt;The data has been changed and no longer matches what’s in main memory.&lt;/li&gt;
&lt;li&gt;If another core needs the data, the CPU can either:

&lt;ul&gt;
&lt;li&gt;Send the updated data directly to the other core (cache-to-cache transfer), or&lt;/li&gt;
&lt;li&gt;Write it back to main memory if needed (e.g., on eviction).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Exclusive (E) → “I have the only clean copy.”
&lt;/h3&gt;

&lt;p&gt;If a CPU core has a cache line in the &lt;strong&gt;Exclusive&lt;/strong&gt; state:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;That core is the &lt;strong&gt;only&lt;/strong&gt; one with the &lt;strong&gt;latest&lt;/strong&gt; version of the data.&lt;/li&gt;
&lt;li&gt;The data &lt;strong&gt;matches&lt;/strong&gt; the main memory → it has not been modified yet.&lt;/li&gt;
&lt;li&gt;The core can:

&lt;ul&gt;
&lt;li&gt;Read the data freely.&lt;/li&gt;
&lt;li&gt;Write to it directly, which promotes the cache line to the &lt;strong&gt;Modified&lt;/strong&gt; state.&lt;/li&gt;
&lt;li&gt;No other core has a copy, so there’s &lt;strong&gt;no need for invalidation&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Shared (S) → “Others may have it too, and it’s read-only.”
&lt;/h3&gt;

&lt;p&gt;If a CPU core has a cache line in the &lt;strong&gt;Shared&lt;/strong&gt; state:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;One or more cores&lt;/strong&gt; may have a copy of the &lt;strong&gt;latest&lt;/strong&gt; version of the data in their caches; &lt;strong&gt;none&lt;/strong&gt; have modified it.&lt;/li&gt;
&lt;li&gt;The data in all those caches &lt;strong&gt;matches&lt;/strong&gt; the main memory → it’s the &lt;strong&gt;latest clean version&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The core can:

&lt;ul&gt;
&lt;li&gt;Read the data freely.&lt;/li&gt;
&lt;li&gt;Not write to it unless it first &lt;strong&gt;invalidates&lt;/strong&gt; all other copies and gains &lt;strong&gt;exclusive&lt;/strong&gt; access.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Invalid (I) → “My copy is stale or gone.”
&lt;/h3&gt;

&lt;p&gt;If a CPU core has a cache line in the &lt;strong&gt;Invalid&lt;/strong&gt; state:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The cache line is &lt;strong&gt;not valid&lt;/strong&gt; → the core cannot use it. It may have been:

&lt;ul&gt;
&lt;li&gt;Evicted due to limited cache space,&lt;/li&gt;
&lt;li&gt;Invalidated by another core’s write,&lt;/li&gt;
&lt;li&gt;Or never loaded at all.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The core:

&lt;ul&gt;
&lt;li&gt;Cannot read or write to this data.&lt;/li&gt;
&lt;li&gt;Must fetch a fresh copy from memory or another core’s cache to use it.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Any access will cause a &lt;strong&gt;cache miss&lt;/strong&gt; and trigger MESI protocol actions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44qjjq9aus2z8txbn0vh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44qjjq9aus2z8txbn0vh.png" alt="MESI State Summary" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;
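&lt;p&gt;The four states above can be sketched as a small state machine for a single cache line. This is a heavily simplified model → real MESI implementations add transient states and bus acknowledgements, but the transitions mirror the descriptions above:&lt;/p&gt;

```python
# Simplified MESI transitions for one cache line, from one core's view.
# (Illustrative only: real protocols add transient states and bus handshakes.)
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

def on_local_read(state, others_have_copy):
    # A miss fills the line: Exclusive if we are alone, Shared otherwise.
    if state == I:
        return S if others_have_copy else E
    return state

def on_local_write(state):
    # Writing requires ownership; a Shared line first invalidates other
    # copies (BusUpgr), then every path lands in Modified.
    return M

def on_remote_read(state):
    # Another core read the line: dirty or exclusive copies degrade to Shared.
    if state in (M, E):
        return S
    return state

def on_remote_write(state):
    # Another core gained ownership: our copy is stale.
    return I

line = on_local_read(I, others_have_copy=False)   # Exclusive
line = on_local_write(line)                       # Modified
line = on_remote_read(line)                       # Shared
line = on_remote_write(line)                      # Invalid
print(line)
```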

&lt;h2&gt;
  
  
  How Caches Communicate: Bus Snooping and Cache-to-Cache Transfer
&lt;/h2&gt;

&lt;p&gt;Caches communicate in two key ways: &lt;strong&gt;Bus Snooping&lt;/strong&gt; and &lt;strong&gt;Cache-to-Cache Transfers&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bus Snooping: A Subscription Mechanism
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mzyuhwrsx02u65b644r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mzyuhwrsx02u65b644r.png" alt="Bus Snooping" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bus snooping&lt;/strong&gt; is a hardware technique where each core monitors the &lt;strong&gt;shared&lt;/strong&gt; system bus to keep an eye on what other cores are doing with memory.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every time a core &lt;strong&gt;reads or writes&lt;/strong&gt; to a memory address, that action is &lt;strong&gt;broadcast&lt;/strong&gt; on the system bus.&lt;/li&gt;
&lt;li&gt;Other cores &lt;strong&gt;snoop (listen)&lt;/strong&gt; to the bus.&lt;/li&gt;
&lt;li&gt;If another core has a copy of the requested data, it can:

&lt;ul&gt;
&lt;li&gt;Respond with the most recent version (in Modified or Exclusive state).&lt;/li&gt;
&lt;li&gt;Invalidate or update its own cached copy if needed.&lt;/li&gt;
&lt;li&gt;Trigger a state change in its MESI cache line.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Cache-to-Cache Transfer: A Reply Mechanism
&lt;/h3&gt;

&lt;p&gt;When a core issues a memory read request, and another core already has the most recent copy of the requested data in its cache, it can respond directly → this is called a &lt;strong&gt;cache-to-cache transfer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of fetching the data from main memory, the owning core:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Snoops the request via the bus,&lt;/li&gt;
&lt;li&gt;Recognizes that it holds the latest copy, and&lt;/li&gt;
&lt;li&gt;Sends the data directly to the requesting core.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This not only satisfies the request more quickly, but it also saves the latency of accessing main memory.&lt;/p&gt;

&lt;p&gt;In this section, we just focused on how caches communicate over the system bus to stay coordinated. The specific actions caches take in various scenarios, will be covered in detail in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  MESI State Transitions
&lt;/h2&gt;

&lt;p&gt;We’ll break it down into five common scenarios that illustrate most of the transitions you’ll encounter. Let’s assume we have 3 cores → &lt;strong&gt;Core 1&lt;/strong&gt;, &lt;strong&gt;Core 2&lt;/strong&gt; and &lt;strong&gt;Core 3&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Read Scenario: Core 1 Reads a Line Not in Any Cache
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Core 1 issues a memory &lt;strong&gt;read request (BusRd)&lt;/strong&gt;, which is &lt;strong&gt;broadcast&lt;/strong&gt; on the system bus.&lt;/li&gt;
&lt;li&gt;Core 2 and Core 3 &lt;strong&gt;snoop&lt;/strong&gt; this request but do &lt;strong&gt;not&lt;/strong&gt; have the line cached. Since no other cache has the line, none respond to the broadcast.&lt;/li&gt;
&lt;li&gt;Core 1 then fetches the data from &lt;strong&gt;main memory&lt;/strong&gt; in &lt;strong&gt;Exclusive (E)&lt;/strong&gt; state, indicating this core is the sole owner.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  State Transitions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38w7l4iyl3c15r2h5yx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38w7l4iyl3c15r2h5yx2.png" alt="State Transitions" width="546" height="154"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Second Read Scenario: Core 2 Reads the Same Line
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Core 2 issues a &lt;strong&gt;read request (BusRd)&lt;/strong&gt;, which is &lt;strong&gt;broadcast&lt;/strong&gt; on the system bus.&lt;/li&gt;
&lt;li&gt;Core 1 snoops and sees it has the line in &lt;strong&gt;Exclusive (E)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 1 supplies the data directly to Core 2 via &lt;strong&gt;cache-to-cache transfer&lt;/strong&gt;, saving a slower memory access.&lt;/li&gt;
&lt;li&gt;Both Core 1 and Core 2 downgrade their cache lines to &lt;strong&gt;Shared (S)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 3 snoops but does not have the line, so doesn’t respond.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  State Transitions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc719l0cwyshdy6n6py2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc719l0cwyshdy6n6py2c.png" alt="State Transitions" width="550" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  First Write Scenario: Core 1 Writes to the Shared Line
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Core 1 issues a &lt;strong&gt;write intent request (BusUpgr)&lt;/strong&gt;, which is broadcast on the bus.&lt;/li&gt;
&lt;li&gt;Core 2 and Core 3 snoop the broadcast:

&lt;ul&gt;
&lt;li&gt;Core 2 invalidates its &lt;strong&gt;Shared (S)&lt;/strong&gt; copy → &lt;strong&gt;Invalid (I)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 3 does nothing (already Invalid).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Core 1 waits for invalidation acknowledgments.&lt;/li&gt;
&lt;li&gt;Core 1 upgrades its cache line directly to &lt;strong&gt;Modified (M)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 1 now has exclusive write access; Core 2 and Core 3 have invalid copies.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  State Transitions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w1t7wfbq4yr8mv5pnzm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w1t7wfbq4yr8mv5pnzm.png" alt="State Transitions" width="564" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Second Write Scenario: Core 3 Writes After Core 1’s Modified
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Core 3 issues a &lt;strong&gt;read-for-ownership request (BusRdX)&lt;/strong&gt;, which is broadcast on the bus.&lt;/li&gt;
&lt;li&gt;Core 1 snoops and sees it holds the line in &lt;strong&gt;Modified (M)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 1 supplies the updated data directly to Core 3 (&lt;strong&gt;cache-to-cache transfer&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Core 1 invalidates its &lt;strong&gt;Modified (M)&lt;/strong&gt; copy → &lt;strong&gt;Invalid (I)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 3 caches the line as &lt;strong&gt;Modified (M)&lt;/strong&gt; and performs the write.&lt;/li&gt;
&lt;li&gt;Core 2 remains Invalid.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  State Transitions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2to7992j900kkw4ojjg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2to7992j900kkw4ojjg.png" alt="State Transitions" width="578" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Concurrent Write Scenario: Core 1 and Core 2 Try to Write Simultaneously
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Both Core 1 and Core 2 issue &lt;strong&gt;write intent requests (BusRdX)&lt;/strong&gt; around the same time.&lt;/li&gt;
&lt;li&gt;The bus &lt;strong&gt;orders&lt;/strong&gt; the write intents, allowing only one core (say Core 1) to perform its MESI transition to &lt;strong&gt;Modified (M)&lt;/strong&gt;. Core 2 must &lt;strong&gt;wait&lt;/strong&gt; for its turn or &lt;strong&gt;retry&lt;/strong&gt; once the bus is available again.&lt;/li&gt;
&lt;li&gt;Core 1 proceeds to upgrade its line to &lt;strong&gt;Modified (M)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 2’s request is delayed or retried after Core 1’s invalidations.&lt;/li&gt;
&lt;li&gt;Other cores snoop and invalidate as needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every action, such as a read or write request on the bus, is &lt;strong&gt;completed fully and indivisibly&lt;/strong&gt; before another conflicting request starts. The bus &lt;strong&gt;serializes&lt;/strong&gt; these requests, ensuring no two cores simultaneously hold conflicting states for the same cache line. This atomicity guarantees &lt;strong&gt;data consistency and correctness&lt;/strong&gt; across all caches.&lt;/p&gt;
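&lt;p&gt;The scenarios above can be condensed into a small state machine. The sketch below is a simplified model of the MESI transition rules → it tracks a single cache line’s state in one core, and is purely illustrative, not a hardware design:&lt;/p&gt;

```c
/* Simplified MESI transition rules for one cache line.
   Illustrative model only, not a hardware design. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;
typedef enum { BUS_RD, BUS_RDX, BUS_UPGR } BusEvent;

/* A core reads: Exclusive if no other cache holds the line,
   Shared otherwise. Read hits keep the current state. */
MesiState on_local_read(MesiState s, int other_copies_exist) {
    if (s == INVALID)
        return other_copies_exist ? SHARED : EXCLUSIVE;
    return s;
}

/* A core writes: once BusUpgr/BusRdX has invalidated every
   other copy, the line becomes Modified. */
MesiState on_local_write(MesiState s) {
    (void)s;
    return MODIFIED;
}

/* A snooping core reacts to another core's bus request. */
MesiState on_snoop(MesiState s, BusEvent e) {
    if (s == INVALID)
        return INVALID;   /* no copy, nothing to do */
    if (e == BUS_RD)
        return SHARED;    /* supply data if M or E, then downgrade */
    return INVALID;       /* BUS_RDX or BUS_UPGR: invalidate */
}
```

&lt;p&gt;Replaying the four scenarios against these rules reproduces the transitions in the diagrams: I → E on a sole read, E → S when a second reader appears, S → M after invalidations, and M → I when another core takes ownership.&lt;/p&gt;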

&lt;h2&gt;
  
  
  Limitations of MESI
&lt;/h2&gt;

&lt;p&gt;While MESI effectively maintains cache coherence in many systems, it has some important limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;False Sharing&lt;/strong&gt; → MESI operates at &lt;strong&gt;cache line granularity&lt;/strong&gt;, not variable granularity. Even if two threads access &lt;strong&gt;different variables&lt;/strong&gt;, if those variables fall on the same cache line:

&lt;ul&gt;
&lt;li&gt;MESI treats them as &lt;strong&gt;shared data&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;This causes &lt;strong&gt;unnecessary invalidations&lt;/strong&gt;, even though no real data conflict exists.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability Issues&lt;/strong&gt; → MESI relies on &lt;strong&gt;bus snooping&lt;/strong&gt;, where all cores must snoop every memory transaction:

&lt;ul&gt;
&lt;li&gt;As the number of cores increases, the snooping traffic grows rapidly.&lt;/li&gt;
&lt;li&gt;More cores mean more invalidations, more broadcasts, and more bus congestion.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency on Writes&lt;/strong&gt; → To write to a cache line that’s shared, a core must broadcast a write intent, wait for other cores to invalidate their copies, then perform the write. This adds latency, especially when multiple cores frequently access the same data, or when contention is high.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Built-in Support for Synchronization&lt;/strong&gt; → MESI doesn’t handle higher-level synchronization (like locks or barriers). It only ensures &lt;strong&gt;data coherence&lt;/strong&gt;, not &lt;strong&gt;program correctness&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>cpu</category>
      <category>programming</category>
      <category>memory</category>
      <category>learning</category>
    </item>
    <item>
      <title>Understanding CPU Cache Organization and Structure</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Thu, 19 Jun 2025 15:58:25 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/understanding-cpu-cache-organization-and-structure-4o1o</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/understanding-cpu-cache-organization-and-structure-4o1o</guid>
      <description>&lt;p&gt;Software performance is deeply influenced by how efficiently memory is accessed. The story behind memory access latency is layered: it begins with CPU caches, traverses through virtual memory translation, and ultimately reaches physical DRAM. Each layer introduces its own overhead and optimization challenges.&lt;/p&gt;

&lt;p&gt;If you’re interested in learning more about how virtual memory and DRAM work, you may want to explore these related articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/sachin_tolay_052a7e539e57/understanding-dram-internals-how-channels-banks-and-dram-access-patterns-impact-performance-3fef"&gt;Understanding DRAM Internals: How Channels, Banks, and DRAM Access Patterns Impact Performance&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/sachin_tolay_052a7e539e57/memory-access-demystified-how-virtual-memory-caches-and-dram-impact-performance-5ec6"&gt;Memory Access Demystified: How Virtual Memory, Caches, and DRAM Impact Performance&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article focuses solely on &lt;strong&gt;CPU caches&lt;/strong&gt;, the second fastest layer in the memory hierarchy after CPU registers. We’ll dive into the structural design of CPU caches, how they manage data placement and lookup, and how this affects the speed of your code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Do We Need CPU Caches?
&lt;/h2&gt;

&lt;p&gt;CPU caches were introduced to bridge the vast speed gap between fast processors and much slower main memory. Without addressing this gap, CPUs would frequently stall, waiting hundreds of CPU cycles for data.&lt;/p&gt;

&lt;p&gt;By the late 1960s, this mismatch had become a major bottleneck. Computer architects considered three possible solutions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Make main memory faster&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Technologies like magnetic core and early &lt;strong&gt;DRAM&lt;/strong&gt; were too slow to match CPU speeds.&lt;/li&gt;
&lt;li&gt;Faster memory (&lt;strong&gt;SRAM&lt;/strong&gt;) was either too expensive or not scalable to large sizes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the CPU more efficiently&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Techniques like pipelining and instruction reordering were emerging.&lt;/li&gt;
&lt;li&gt;While they helped hide latency, they didn’t eliminate it, and the memory bottleneck still remained.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Introduce a small, fast buffer (cache)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SRAM&lt;/strong&gt; wasn’t affordable or scalable for large memory sizes, but it was ideal for implementing small, fast memory layers close to the CPU.&lt;/li&gt;
&lt;li&gt;Studies of real-world workloads → from operating systems and web servers to machine learning and databases → show that a &lt;strong&gt;small portion&lt;/strong&gt; of memory handles the &lt;strong&gt;majority&lt;/strong&gt; of accesses. This concentration of activity enables caches to be highly effective despite their limited size.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Though all three directions have seen continued development, caches offered the best trade-off between performance, cost, and complexity at the time and became a fundamental part of CPU architecture from the 1970s onward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Hierarchy: Where Do Caches Fit
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjsn5j0mwhxo1gi216wd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjsn5j0mwhxo1gi216wd.png" alt="Memory Hierarchy" width="800" height="418"&gt;&lt;/a&gt;&lt;br&gt;
Modern computer systems organize memory into a hierarchy to balance &lt;strong&gt;speed, capacity, and cost&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At the top are &lt;strong&gt;CPU registers&lt;/strong&gt; → tiny, ultra-fast storage tightly coupled with the processor.&lt;/li&gt;
&lt;li&gt;Just below the registers are &lt;strong&gt;CPU caches&lt;/strong&gt; → small, fast buffers that store recently or frequently used data. They are organized into levels:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L1 and L2 caches&lt;/strong&gt; are private to each CPU core.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L3 cache&lt;/strong&gt; is larger and &lt;strong&gt;shared&lt;/strong&gt; across all cores. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;When data is needed, the CPU checks these caches in order: &lt;strong&gt;L1 → L2 → L3&lt;/strong&gt;. If the data is not found (a &lt;strong&gt;cache miss&lt;/strong&gt;), it falls back to &lt;strong&gt;main memory (DRAM)&lt;/strong&gt;, which is significantly slower.&lt;/li&gt;
&lt;li&gt;Each level down the hierarchy offers &lt;strong&gt;more capacity&lt;/strong&gt; and &lt;strong&gt;lower cost per bit&lt;/strong&gt;, but at the cost of &lt;strong&gt;higher latency&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7t0x6kzilp1jin6wv98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7t0x6kzilp1jin6wv98.png" alt="Memory Hierarchy Latency" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Cache Organization And Structure
&lt;/h2&gt;

&lt;p&gt;Caches store data in fixed-size chunks called &lt;strong&gt;cache lines&lt;/strong&gt;, usually &lt;strong&gt;64 bytes&lt;/strong&gt; long. These lines are grouped into &lt;strong&gt;sets&lt;/strong&gt;, and how many cache lines each set can hold depends on the cache’s &lt;strong&gt;associativity&lt;/strong&gt; (also called the number of &lt;strong&gt;ways&lt;/strong&gt;). This design is similar to a hashmap: each set acts like a hash bucket, and the multiple ways within a set are like chained entries for resolving collisions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct-mapped (1-way associative)&lt;/strong&gt;: Each memory address maps to exactly one set and one specific cache line → so during a lookup, only that single line needs to be checked to determine if the memory address’s data is present.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdmla0auakku5qhi9a1q.png" alt="Direct-mapped" width="601" height="379"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;N-way set-associative (e.g., 2-way associative, 4-way associative)&lt;/strong&gt;: Each set holds multiple cache lines → for example, 4 lines in a 4-way associative cache. A memory address can be stored in any of these lines, so during a lookup, all lines in the set are checked to determine if the memory address’s data is present in the cache.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6le97mwyilv07g065ckk.png" alt="N-way set associative" width="682" height="727"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fully associative&lt;/strong&gt;: There are no set divisions → any memory address can be stored in any cache line. During a lookup, all cache lines must be checked to determine if the memory address’s data is present in the cache.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuajdahhoq5xd89p2sdgj.png" alt="Fully Associative" width="651" height="225"&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How a Memory Address Maps to Cache: A Hashmap Analogy
&lt;/h2&gt;

&lt;p&gt;CPU caches work a lot like hashmaps. Just as hashmaps use &lt;strong&gt;keys&lt;/strong&gt; to store and retrieve values efficiently, caches use &lt;strong&gt;memory addresses&lt;/strong&gt; to store and find data quickly. Let’s break it down.&lt;/p&gt;
&lt;h3&gt;
  
  
  Cache Lookup: Using Index, Tag, and Offset from a Physical Address
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foivvdxtrtnypqzovhgq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foivvdxtrtnypqzovhgq2.png" alt="Physical Address Breakdown" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Index → Finding the Right Set
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Hashmap analogy: this is like using a hash function to find the right bucket.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As mentioned earlier, the cache is divided into sets, and each set holds multiple cache lines (depending on the associativity). The &lt;strong&gt;index&lt;/strong&gt; bits of the memory address are used to select the corresponding set.&lt;/p&gt;

&lt;p&gt;For example → if your cache has 256 sets, 8 bits of the address would serve as the index (since 2⁸ = 256). This tells the CPU which set to search in → just like a hashmap uses a hash of the key to find a bucket.&lt;/p&gt;
&lt;h4&gt;
  
  
  Tag → Matching the Entry Inside the Set (Handling Collisions)
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Hashmap analogy: once you’re in the right bucket, compare the stored keys to resolve collisions.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because many memory addresses can share the same index (i.e., they map to the same set), the cache must resolve these &lt;strong&gt;collisions&lt;/strong&gt;. This is where the &lt;strong&gt;tag&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;As mentioned earlier, each set in an N-way set-associative cache contains &lt;strong&gt;N&lt;/strong&gt; cache lines. When the CPU accesses a memory address, it uses the &lt;strong&gt;index&lt;/strong&gt; to locate the set, and then compares the &lt;strong&gt;tag&lt;/strong&gt; from the incoming address against the tags stored in all &lt;strong&gt;N cache lines of that set&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a match is found → cache hit&lt;/li&gt;
&lt;li&gt;If no match → cache miss, and one of the existing lines may be evicted to make room&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Offset → Extracting the Right Byte
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Hashmap analogy: retrieving a specific part of the value associated with the key.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As mentioned earlier, each cache line typically holds a block of contiguous memory, such as &lt;strong&gt;64 bytes&lt;/strong&gt;. Once the correct cache line is identified through a &lt;strong&gt;cache hit&lt;/strong&gt; (via index and tag comparison), the CPU uses the &lt;strong&gt;offset&lt;/strong&gt; bits from the address to select the &lt;strong&gt;exact byte&lt;/strong&gt; within that cache line. For a 64-byte cache line, we need &lt;strong&gt;6 bits&lt;/strong&gt; for the offset → since 2⁶ = 64.&lt;/p&gt;

&lt;p&gt;Now, we have seen how the CPU uses the index, tag, and offset to find data in the cache on a cache hit. But what if none of the tags in the set match? This is a &lt;strong&gt;cache miss&lt;/strong&gt;, and the CPU must fetch the entire &lt;strong&gt;data block&lt;/strong&gt; (the size of a cache line) from slower main memory. Since a 64-bit data bus transfers 8 bytes (64 bits) per bus cycle, it takes roughly 8 bus transfers to move a 64-byte &lt;strong&gt;data block&lt;/strong&gt; into the &lt;strong&gt;cache line&lt;/strong&gt;.&lt;/p&gt;
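&lt;p&gt;The index/tag/offset split can be expressed directly in code. A small sketch for the illustrative geometry used in this article → 64-byte lines (6 offset bits) and 256 sets (8 index bits); real caches vary:&lt;/p&gt;

```c
/* Address decomposition for an illustrative cache geometry:
   64-byte lines (6 offset bits), 256 sets (8 index bits). */

unsigned long line_offset(unsigned long addr) {
    return addr % 64;            /* low 6 bits: byte within the line */
}

unsigned long set_index(unsigned long addr) {
    return (addr / 64) % 256;    /* next 8 bits: which set to search */
}

unsigned long line_tag(unsigned long addr) {
    return addr / (64 * 256);    /* remaining high bits: the tag */
}

unsigned long block_base(unsigned long addr) {
    return addr - (addr % 64);   /* aligned 64-byte block address */
}
```

&lt;p&gt;For address 0x1A3F this gives offset 0x3F, set index 104, tag 0, and an aligned block base of 0x1A00.&lt;/p&gt;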
&lt;h3&gt;
  
  
  Cache Update: What Happens on a Cache Miss
&lt;/h3&gt;

&lt;p&gt;When a cache miss occurs, the cache must be updated with the data block containing the requested memory address. Here’s how this process works step-by-step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fetching Data from Main Memory&lt;/strong&gt; → The CPU reads the entire data block (e.g., 64 bytes) starting at the &lt;strong&gt;aligned&lt;/strong&gt; memory address from main memory. For example, if the cache line size is 64 bytes and the CPU requests address &lt;strong&gt;0x1A3F&lt;/strong&gt;, the cache loads the whole 64-byte block starting at the aligned base address &lt;strong&gt;0x1A00&lt;/strong&gt; (since 64 bytes require 6 offset bits, the lower 6 bits are cleared).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locating a Slot in the Set&lt;/strong&gt; → The block is loaded into the cache set identified by the index bits. If the cache is N-way set associative, the data can be placed in any of the N cache lines within that set. If there’s a free line available, the new block is stored there directly. Otherwise, the cache must evict an existing line based on its replacement policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evicting an Existing Line if Needed&lt;/strong&gt; → If the set is full (all lines are occupied), the cache uses a replacement policy, like Least Recently Used (LRU), to decide which existing cache line to evict to make room for the new block.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Updating the Tag&lt;/strong&gt; → The tag field of the selected cache line is updated to reflect the new block’s address, enabling future hits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Populating the Cache Hierarchy&lt;/strong&gt; → A cache miss in L1 doesn’t just populate L1 → in typical (inclusive) hierarchies, the fetched block is inserted into &lt;strong&gt;all relevant levels of the cache hierarchy&lt;/strong&gt; (L1, L2, and L3).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resuming Access&lt;/strong&gt; → After the update, the CPU can access the requested byte in the cache line using the offset bits, just as it would on a cache hit.&lt;/li&gt;
&lt;/ol&gt;
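&lt;p&gt;The steps above can be sketched as a toy direct-mapped cache that models only the tag/valid bookkeeping → the data itself, the lower cache levels, and the replacement policy (trivial here, since each set holds a single line) are omitted, and the sizes are illustrative:&lt;/p&gt;

```c
/* Toy direct-mapped cache: 256 sets, one 64-byte line each.
   Models only tag/valid bookkeeping, not the data. */
enum { LINE = 64, NSETS = 256 };

struct CacheLine { int valid; unsigned long tag; };
struct CacheLine sets[NSETS];
long hits = 0, misses = 0;

/* Returns 1 on a hit, 0 on a miss (which fills the line). */
int cache_access(unsigned long addr) {
    unsigned long index = (addr / LINE) % NSETS;
    unsigned long tag = addr / (LINE * NSETS);
    if (sets[index].valid) {
        if (sets[index].tag == tag) {
            hits++;
            return 1;
        }
    }
    /* Miss: fetch the aligned block, evict any occupant by
       overwriting it, and update the tag for future hits. */
    sets[index].valid = 1;
    sets[index].tag = tag;
    misses++;
    return 0;
}
```

&lt;p&gt;Two addresses inside the same 64-byte block hit the same line, while two blocks that share an index but differ in tag evict each other → a conflict miss.&lt;/p&gt;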
&lt;h2&gt;
  
  
  Comparing Different Set Associative Types
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft49kuhg9dw61lyt3woo5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft49kuhg9dw61lyt3woo5.png" alt="Set Associativity Comparison" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Cache Replacement Policies
&lt;/h2&gt;

&lt;p&gt;During a cache miss, when a new memory block needs to be brought into the cache, and the set it maps to is already full, the cache must decide which existing block to evict. This decision is made using a replacement policy. The most common policies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LRU (Least Recently Used) → Evicts the block that hasn’t been used for the longest time.&lt;/li&gt;
&lt;li&gt;PLRU (Pseudo-LRU) →  An approximation of LRU designed for efficient hardware implementation, often used in modern CPUs to balance complexity and performance.&lt;/li&gt;
&lt;li&gt;FIFO (First-In, First-Out) → Evicts the block that has been in the cache the longest, regardless of usage.&lt;/li&gt;
&lt;li&gt;Random → Chooses a block to evict at random; simple to implement in hardware, though less predictable.&lt;/li&gt;
&lt;/ul&gt;
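&lt;p&gt;As an illustration, true LRU for a single 4-way set can be kept as an explicit recency list. The sketch below is purely didactic → hardware typically uses per-set status bits and approximations like PLRU rather than shifting a list on every access:&lt;/p&gt;

```c
/* True-LRU bookkeeping for one 4-way set, as a recency list:
   order[0] is the most recently used way, order[3] the least. */
enum { WAYS = 4 };
int order[WAYS] = { 0, 1, 2, 3 };

/* Called on every access that hits (or fills) a given way. */
void touch(int way) {
    int i, pos = 0;
    for (i = 0; i != WAYS; i++)
        if (order[i] == way)
            pos = i;                 /* find the way's position */
    for (i = pos; i != 0; i--)
        order[i] = order[i - 1];     /* shift the rest down */
    order[0] = way;                  /* move to the front */
}

/* On a miss with a full set, evict the least recently used way. */
int victim(void) {
    return order[WAYS - 1];
}
```

&lt;p&gt;On a hit, touch() refreshes the line’s recency; on a miss into a full set, victim() names the way to evict before the new block is installed.&lt;/p&gt;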
&lt;h2&gt;
  
  
  How Cache Structure Affects Performance
&lt;/h2&gt;

&lt;p&gt;The way caches are organized → including their line size, associativity, and replacement policies → directly influences how well your code performs. Caches are designed to exploit two key principles of memory access: &lt;strong&gt;Temporal Locality&lt;/strong&gt; and &lt;strong&gt;Spatial Locality&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Temporal Locality
&lt;/h3&gt;

&lt;p&gt;Programs tend to reuse data they accessed recently. If a value is used in quick succession, keeping it in the cache can significantly reduce memory latency.&lt;/p&gt;
&lt;h4&gt;
  
  
  Mechanisms Targeting Temporal Locality
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retaining recently accessed data&lt;/strong&gt;: Frequently used values → like loop counters, accumulator variables, or function arguments → are kept in the L1 cache, where access is fastest. As long as these values are reused quickly, they remain cached.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart replacement policies&lt;/strong&gt;: When a cache needs to evict data, it uses policies like LRU or PLRU to decide which data is least likely to be reused. This helps preserve recently accessed (and likely-to-be-reused) blocks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-level cache hierarchy&lt;/strong&gt;: If a block is evicted from L1 due to pressure, it may still exist in L2 or L3 → larger, slightly slower caches that act as a backup. This layered design improves hit rates across varying temporal reuse patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefetching with reuse prediction&lt;/strong&gt;: Modern CPUs can predict reuse based on past access patterns. If the processor or compiler detects that certain memory is accessed repeatedly, it can preemptively fetch or retain that data to improve hit rates.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Example → Local variables in a loop:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int sum = 0;
for (int i = 0; i &amp;lt; 1000; ++i) {
   sum += i;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The variables &lt;strong&gt;sum&lt;/strong&gt; and &lt;strong&gt;i&lt;/strong&gt; are updated on every iteration. They remain in registers or the L1 cache because of temporal locality.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example → Global counters or state flags:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int error_count = 0;
void log_error() {
  error_count++;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If log_error() is called frequently, the global variable error_count is repeatedly accessed and benefits from temporal locality.&lt;/p&gt;
&lt;h4&gt;
  
  
  Pitfalls That Break Temporal Locality
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Large working sets&lt;/strong&gt;: If your program uses more data than can fit in the cache, older data gets evicted before it can be reused.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Irregular access patterns&lt;/strong&gt;: Frequently jumping between unrelated memory regions (e.g., random linked list traversal) prevents the cache from predicting reuse.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Spatial Locality
&lt;/h3&gt;

&lt;p&gt;Programs also tend to access memory locations that are close together. That’s why caches load entire cache lines (e.g., 64 bytes). If you read one byte, there’s a good chance nearby bytes will be accessed soon → so pre-loading them pays off.&lt;/p&gt;
&lt;h4&gt;
  
  
  Mechanisms Targeting Spatial Locality
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cache lines store blocks of data&lt;/strong&gt;: When a CPU accesses one memory address, the cache loads an entire cache line from main memory. This block includes the requested byte and adjacent ones, anticipating that they’ll be accessed soon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Block-based memory transfer&lt;/strong&gt;: Since a single cache line holds 64 bytes, accessing nearby addresses (like a[i+1], a[i+2], etc.) doesn’t require additional memory accesses → they’re already in the same line or adjacent lines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware prefetching&lt;/strong&gt;: Modern CPUs automatically detect sequential access patterns (like array iteration) and start fetching the &lt;strong&gt;next cache lines ahead of time&lt;/strong&gt;, reducing wait time for upcoming memory accesses.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Example → Array iteration
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (int i = 0; i &amp;lt; N; ++i) {
  a[i] = a[i] * 2;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Elements a[0], a[1], a[2], … are stored contiguously in memory. Accessing them in order leverages spatial locality → once a cache line is loaded, the next few elements are likely already there.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example → Struct field access
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;struct Point { int x; int y; } p;
int sum = p.x + p.y;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Both x and y are stored next to each other in memory. Accessing them together benefits from spatial locality.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example → Matrix access (row-wise)
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (int i = 0; i &amp;lt; rows; ++i)
  for (int j = 0; j &amp;lt; cols; ++j)
    matrix[i][j] += 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If the matrix is stored in row-major order (as in C/C++), this access pattern walks through memory linearly, maximizing spatial locality.&lt;/p&gt;
&lt;h4&gt;
  
  
  Pitfalls That Break Spatial Locality
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Strided accesses with large gaps&lt;/strong&gt;: Accessing data with large steps skips over useful cache lines.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (int i = 0; i &amp;lt; N; i += 64) {
  process(a[i]);
}
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Each access lands on a new cache line, wasting preloaded data.&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Accessing only one field in a wide struct repeatedly&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;struct Big { int a; char padding[60]; int b; };
Big arr[100];
for (int i = 0; i &amp;lt; 100; ++i) {
  process(arr[i].a);
}
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Here, each cache line holds mostly unused data, leading to &lt;strong&gt;cache pollution&lt;/strong&gt;.&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unaligned data structures&lt;/strong&gt;: If structures span multiple cache lines due to misalignment, accessing even a single field can trigger multiple memory accesses.&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>memory</category>
      <category>ram</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
