<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sachin Tolay</title>
    <description>The latest articles on Forem by Sachin Tolay (@sachin_tolay_052a7e539e57).</description>
    <link>https://forem.com/sachin_tolay_052a7e539e57</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3254526%2F7febbaf7-bdf2-45d4-a54b-f3a0a99a42b8.jpeg</url>
      <title>Forem: Sachin Tolay</title>
      <link>https://forem.com/sachin_tolay_052a7e539e57</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sachin_tolay_052a7e539e57"/>
    <language>en</language>
    <item>
      <title>Blocking vs Non-blocking vs Asynchronous I/O</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Wed, 30 Jul 2025 04:57:36 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/blocking-vs-non-blocking-vs-asynchronous-io-3ei4</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/blocking-vs-non-blocking-vs-asynchronous-io-3ei4</guid>
      <description>&lt;p&gt;When a program performs I/O → like reading from a file or socket → two key questions arise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the program stop and wait for the data, or continue running? (&lt;strong&gt;Blocking vs Non-blocking IO&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;Does the program keep checking for the result, or get notified when it’s done? (&lt;strong&gt;Synchronous vs Asynchronous IO&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are orthogonal concepts, meaning they can be mixed in different combinations. Each combination comes with trade-offs in &lt;strong&gt;performance, complexity, and responsiveness&lt;/strong&gt;. In this article, we'll break down these models to understand how I/O really works under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blocking vs Non-blocking I/O Intuition
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Question to ask&lt;/strong&gt; → After placing the coffee order, do you stand there waiting until it’s ready, or do you walk away and do other things in the meantime?&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Blocking:&lt;/strong&gt; You stand at the coffee counter and wait until your coffee is ready before leaving.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-blocking:&lt;/strong&gt; You place your order and then walk away; if the coffee isn’t ready yet, you don’t wait → you might come back later to check.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Synchronous vs Asynchronous I/O Intuition
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Question to ask&lt;/strong&gt; → After placing the coffee order, do you keep checking if it’s ready, or do they notify you when it’s done?&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Synchronous:&lt;/strong&gt; You keep walking back to the counter every few minutes to ask, “Is my coffee ready yet?”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronous:&lt;/strong&gt; You leave and go about your day → when your coffee is ready, they text you so you know to come back.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; - It’s important to understand that &lt;strong&gt;asynchronous&lt;/strong&gt; and &lt;strong&gt;non-blocking&lt;/strong&gt; are related but different concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-blocking I/O is about &lt;strong&gt;waiting vs not waiting&lt;/strong&gt; → whether the program waits (blocks) for the operation to complete or the call returns immediately if data isn’t ready.&lt;/li&gt;
&lt;li&gt;Asynchronous I/O is about &lt;strong&gt;who drives the control flow&lt;/strong&gt; → whether the program itself keeps checking (synchronous) or the system notifies the program when the operation completes (asynchronous).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Blocking I/O Implementation - Using read()
&lt;/h2&gt;

&lt;p&gt;The call waits until completion before returning, as shown in the diagram below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Blocking read
int n = read(fd, buffer, size); // Blocks until data is ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2un1x1gj51shm2czc6ss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2un1x1gj51shm2czc6ss.png" alt="Blocking read()" width="769" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Non-blocking I/O Synchronous Implementation 1 – Using read()
&lt;/h2&gt;

&lt;p&gt;Your program calls &lt;strong&gt;read()&lt;/strong&gt; again and again in a loop. If there’s no data, &lt;strong&gt;read()&lt;/strong&gt; returns -1 with errno set to EAGAIN. This wastes CPU cycles, because each check issues a read() system call. But between checks, your program can do other things (&lt;strong&gt;non-blocking&lt;/strong&gt;). Your program drives the control flow → it decides when to check for data availability (&lt;strong&gt;synchronous&lt;/strong&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fcntl(fd, F_SETFL, O_NONBLOCK);  // Set fd to non-blocking

while (1) {
    int n = read(fd, buffer, sizeof(buffer));
    if (n &amp;gt; 0) {
        // Got data
        handle_data(buffer, n);
        break;
    } else if (n == -1 &amp;amp;&amp;amp; (errno == EAGAIN || errno == EWOULDBLOCK)) {
        // No data, do something else
        do_other_work();
    } else {
        // Some other error
        break;
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Non-blocking I/O Synchronous Implementation 2 – Using select()/poll()
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;select()&lt;/strong&gt; is used to multiplex a set of file descriptors → allowing your program to wait efficiently until &lt;strong&gt;any one&lt;/strong&gt; of them is ready for I/O. Unlike repeatedly calling non-blocking read(), which issues a system call each time and wastes CPU cycles when no data is available, select() makes a single blocking system call that &lt;strong&gt;sleeps&lt;/strong&gt; (consuming no CPU) until at least one descriptor is ready. After select() returns, checking which descriptors are ready using &lt;strong&gt;FD_ISSET&lt;/strong&gt; is a fast &lt;strong&gt;user-space&lt;/strong&gt; operation that doesn’t incur extra system calls, making the whole process much more efficient.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while (1) {
    fd_set fds;
    FD_ZERO(&amp;amp;fds);
    FD_SET(fd1, &amp;amp;fds);
    FD_SET(fd2, &amp;amp;fds);
    int max_fd = (fd1 &amp;gt; fd2) ? fd1 : fd2;

    // Block until one of the FDs is ready to read
    if (select(max_fd + 1, &amp;amp;fds, NULL, NULL, NULL) &amp;gt; 0) {
        if (FD_ISSET(fd1, &amp;amp;fds)) {
            // fd1 has data
            read(fd1, buffer1, sizeof(buffer1));
        }
        if (FD_ISSET(fd2, &amp;amp;fds)) {
            // fd2 has data
            read(fd2, buffer2, sizeof(buffer2));
        }
    }
    // You can also perform other logic here
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary So Far
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frimjz3jgh7dx8aq4doso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frimjz3jgh7dx8aq4doso.png" alt="Summary So Far" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Non-blocking Asynchronous I/O Implementation — Using OS-Provided Async APIs (e.g., io_uring, Windows IOCP, Linux AIO)
&lt;/h2&gt;

&lt;p&gt;In asynchronous I/O, the program initiates the I/O operation and &lt;strong&gt;does not&lt;/strong&gt; check or wait for the result. Instead, the OS &lt;strong&gt;notifies&lt;/strong&gt; the program (via callbacks, signals, or event queues) when the operation completes, handing control back to the program only when data is ready. This allows maximum concurrency and responsiveness, as your program never blocks or polls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Pseudocode: illustrative async API (real interfaces include io_uring, POSIX AIO, IOCP)
// Initiate async read operation
async_read(fd, buffer, size, callback_function);

// Meanwhile, do other work here

// callback_function is called by OS when data is ready
void callback_function(int result, char* buffer) {
    if (result &amp;gt; 0) {
        handle_data(buffer, result);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6lidowy77vabsmkm3jd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6lidowy77vabsmkm3jd.png" alt="Async IO" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Summary
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosx1z6k8u914d3q6t380.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosx1z6k8u914d3q6t380.png" alt="Final Summary" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>tutorial</category>
      <category>learning</category>
      <category>linux</category>
    </item>
    <item>
      <title>Traditional IO vs mmap vs Direct IO: How Disk Access Really Works</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Sun, 20 Jul 2025 19:37:49 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/traditional-io-vs-mmap-vs-direct-io-how-disk-access-really-works-1h4l</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/traditional-io-vs-mmap-vs-direct-io-how-disk-access-really-works-1h4l</guid>
      <description>&lt;p&gt;In our earlier deep dive into &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/understanding-direct-memory-access-dma-how-data-moves-efficiently-between-storage-and-memory-13nn"&gt;Direct Memory Access (DMA)&lt;/a&gt;, we explored how data can bypass the CPU to move efficiently between storage and memory.&lt;br&gt;
In this article, we will break down and compare three major approaches to disk access:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traditional (Buffered) I/O&lt;/li&gt;
&lt;li&gt;Memory-Mapped Files&lt;/li&gt;
&lt;li&gt;Direct I/O&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Traditional IO (Buffered IO)
&lt;/h2&gt;

&lt;p&gt;When you run something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int fd = open("data.txt", O_RDONLY);
read(fd, buf, 4096); // read 4096 bytes from fd into buf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lc9dvo84sb816n5gyrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lc9dvo84sb816n5gyrh.png" alt="Traditional IO" width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Page Cache Lookup&lt;/strong&gt; → The OS first checks its &lt;strong&gt;page cache&lt;/strong&gt;, a large shared memory pool used to avoid redundant disk access. This cache holds recently accessed file data from &lt;strong&gt;all&lt;/strong&gt; processes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read-Ahead&lt;/strong&gt; → If the OS needs to fetch data from disk, it doesn’t just fetch the 4 KB block you asked for. It reads ahead, often 32 KB or more, anticipating sequential access patterns. We will come back to this read-ahead behavior later in the article as a drawback of Traditional IO (and mmap too).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Double Copy&lt;/strong&gt; → Data is first loaded into the OS-managed page cache via DMA, then copied again into your application buffer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Call Overhead&lt;/strong&gt; → Every read() triggers a system call, which is costly → especially during sequential reads when the data is already in the cache.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Memory-Mapped Files (mmap)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;mmap&lt;/strong&gt; offers a powerful way to access files: instead of copying file data into a user buffer via read(), the OS maps the file directly into the process’s virtual memory space.&lt;/p&gt;

&lt;p&gt;When you do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int fd = open("data.txt", O_RDONLY);
char *mapped = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
char c = mapped[0]; // Triggers a page fault on first access
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1: Setting up the Mapping
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkwgr44bmfis0etb5ww0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkwgr44bmfis0etb5ww0.png" alt="mmap system call" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re telling the OS → Map this file into my memory space. I’ll access it like memory, not like a file.&lt;/li&gt;
&lt;li&gt;No data is fetched from disk yet; the pages are simply marked as &lt;strong&gt;not present&lt;/strong&gt;, so any access triggers a &lt;strong&gt;page fault&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: First Access Per Page → Page Not in Cache → Major Page Fault
&lt;/h3&gt;

&lt;p&gt;When the app accesses a file-backed page for the first time and it’s &lt;strong&gt;not&lt;/strong&gt; in the page cache, a &lt;strong&gt;major page fault&lt;/strong&gt; occurs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvyybzqwbmqq4qrs48jl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvyybzqwbmqq4qrs48jl.png" alt="First Access, Page Not In Cache" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This happens only for the &lt;strong&gt;first&lt;/strong&gt; access to &lt;strong&gt;each&lt;/strong&gt; page.&lt;/li&gt;
&lt;li&gt;Page fault is handled transparently → the app just sees a memory access.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: First Access → Page Cached But Not Mapped → Minor Page Fault
&lt;/h3&gt;

&lt;p&gt;If another process or earlier access already loaded the page into the &lt;strong&gt;page cache&lt;/strong&gt;, but this process hasn’t mapped it yet, a &lt;strong&gt;minor page fault&lt;/strong&gt; occurs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1wlmyji7ljohfvd7l4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1wlmyji7ljohfvd7l4g.png" alt="First Access, Page In Cache" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Subsequent Access → Page Already Mapped and Cached → No Fault
&lt;/h3&gt;

&lt;p&gt;If the page is &lt;strong&gt;already&lt;/strong&gt; mapped in the page table and the corresponding data is &lt;strong&gt;cached&lt;/strong&gt; in RAM, then the CPU can directly read it through virtual memory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nr1jh5y6q6v9e2n0k8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nr1jh5y6q6v9e2n0k8x.png" alt="Page Mapped and Cached" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Direct I/O
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Direct I/O&lt;/strong&gt; transfers data directly between the storage device and the application buffer, bypassing the OS page cache entirely. This avoids the double copy of data, reducing CPU overhead and preventing page cache pollution due to read-aheads, but requires the application to carefully manage aligned buffers.&lt;/p&gt;

&lt;p&gt;When you do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int fd = open("data.txt", O_RDONLY | O_DIRECT);
void* buf;
posix_memalign(&amp;amp;buf, 4096, 4096);  // Allocate 4KB aligned buffer
read(fd, buf, 4096);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The buffer needs to start at a memory address that’s a multiple of 4 KB (or another fixed size). This is called &lt;strong&gt;alignment&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If the buffer isn’t aligned properly, the read operation will usually fail or the system might fall back to traditional I/O.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fem4dl96fa84ir2xd15y5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fem4dl96fa84ir2xd15y5.png" alt="Direct IO" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The application buffer is mapped in virtual memory as usual.&lt;/li&gt;
&lt;li&gt;The OS validates access and instructs the DMA controller to transfer data &lt;strong&gt;directly&lt;/strong&gt; to the buffer’s physical memory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page cache is bypassed completely.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxrzx8oy2rf47f75r1ev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxrzx8oy2rf47f75r1ev.png" alt="Summary And Comparison" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>learning</category>
      <category>programming</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Understanding Direct Memory Access (DMA): How Data Moves Efficiently Between Storage and Memory</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Sun, 13 Jul 2025 00:50:13 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/understanding-direct-memory-access-dma-how-data-moves-efficiently-between-storage-and-memory-13nn</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/understanding-direct-memory-access-dma-how-data-moves-efficiently-between-storage-and-memory-13nn</guid>
      <description>&lt;p&gt;Transferring data between &lt;strong&gt;Storage&lt;/strong&gt; and &lt;strong&gt;Memory&lt;/strong&gt; can slow down a computer if the CPU has to manage every step. &lt;strong&gt;Direct Memory Access (DMA)&lt;/strong&gt; is a mechanism that lets a dedicated controller take over this job, freeing up the CPU and making data movement faster and more efficient.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explain what DMA is, why it’s important, and how it enables efficient data movement in computer systems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Terms like MMU and memory controller are mentioned here in a simplified way. For a deeper understanding of these components and their roles, please refer to my detailed articles on &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/memory-access-demystified-how-virtual-memory-caches-and-dram-impact-performance-5ec6"&gt;Virtual Memory&lt;/a&gt; and &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/understanding-dram-internals-how-channels-banks-and-dram-access-patterns-impact-performance-3fef"&gt;Memory Controllers&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DMA?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvmuxd5ccr3yzxei6fnd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvmuxd5ccr3yzxei6fnd.png" alt="Programmed I/O" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without DMA, the CPU is responsible for every step of the transfer, reading and writing data one byte or word at a time. This method is known as &lt;strong&gt;Programmed I/O&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As shown in the diagram, the CPU stays busy throughout the entire process, slowing transfers and preventing it from handling other tasks. By offloading this work to the DMA controller, the system frees up the CPU and achieves faster, more efficient data movement.&lt;/p&gt;

&lt;h2&gt;
  
  
  How DMA Works
&lt;/h2&gt;

&lt;p&gt;DMA uses a special hardware controller that works on its own, without needing the CPU. During data transfers, it takes control of the memory bus to move data directly between devices and RAM.&lt;/p&gt;

&lt;p&gt;The process involves three key phases:&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup Phase
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdp8paw10wznoq6f66orx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdp8paw10wznoq6f66orx.png" alt="DMA setup phase" width="800" height="275"&gt;&lt;/a&gt;&lt;br&gt;
The CPU configures the DMA controller with three critical pieces of information:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Source Address&lt;/strong&gt; → where the data is coming from.

&lt;ul&gt;
&lt;li&gt;For &lt;strong&gt;read()&lt;/strong&gt; operations, the source is typically the storage device (like a disk or SSD).&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;write()&lt;/strong&gt; operations, the source is memory, where the program’s data is prepared.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination Address&lt;/strong&gt; → where the data should go.

&lt;ul&gt;
&lt;li&gt;For &lt;strong&gt;read()&lt;/strong&gt;, the destination is memory (so programs can use the data).&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;write()&lt;/strong&gt;, the destination is the storage device.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer Size&lt;/strong&gt; → how many bytes to move.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The CPU programs these details into the DMA controller and then steps aside.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transfer Phase
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab54fje56skvtfr5jjpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab54fje56skvtfr5jjpg.png" alt="DMA transfer phase" width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
The DMA controller handles the data transfer without CPU involvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Completion Phase
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzxoq5e1ds26wtm8a5fq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzxoq5e1ds26wtm8a5fq.png" alt="DMA completion phase" width="800" height="351"&gt;&lt;/a&gt;&lt;br&gt;
The DMA controller sends an interrupt to notify the CPU when the transfer completes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overall Read Flow — read() system call
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvvv3r4a1cm5sa4dtitn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvvv3r4a1cm5sa4dtitn.png" alt="Overall read flow" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The application issues a read request via a system call.&lt;/li&gt;
&lt;li&gt;The CPU uses the MMU to translate the virtual address of the page cache, which resides in the kernel’s memory space.&lt;/li&gt;
&lt;li&gt;The kernel checks the page cache in memory to see if the data is already available.&lt;/li&gt;
&lt;li&gt;If the data is cached, it is immediately returned to the application.&lt;/li&gt;
&lt;li&gt;If not cached, the CPU configures the DMA controller to transfer data from the storage device.&lt;/li&gt;
&lt;li&gt;The DMA commands the storage device to send data.&lt;/li&gt;
&lt;li&gt;The DMA writes the incoming data into memory through the memory controller.&lt;/li&gt;
&lt;li&gt;When the transfer completes, the DMA interrupts the CPU.&lt;/li&gt;
&lt;li&gt;The CPU then returns the data to the application.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Overall Write Flow — write() system call
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2e1u967qq3pv26vgltx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2e1u967qq3pv26vgltx.png" alt="Overall write flow" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The application sends data to be written via a system call.&lt;/li&gt;
&lt;li&gt;The CPU uses the MMU to translate the virtual address of the page cache, which lies in the kernel’s memory space.&lt;/li&gt;
&lt;li&gt;The data is first buffered in the page cache in memory by the kernel.&lt;/li&gt;
&lt;li&gt;The CPU sets up the DMA controller to move the buffered data from memory to the storage device.&lt;/li&gt;
&lt;li&gt;The DMA reads the buffered data from memory through the memory controller.&lt;/li&gt;
&lt;li&gt;The DMA streams the data to the storage device.&lt;/li&gt;
&lt;li&gt;After the write finishes, the DMA sends an interrupt to the CPU.&lt;/li&gt;
&lt;li&gt;The CPU acknowledges the write completion back to the application.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If you found this explanation helpful and want to stay updated with more clear, detailed guides, follow me! I regularly share deep dives and easy-to-understand articles on a variety of tech topics.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>softwareengineering</category>
      <category>architecture</category>
      <category>dma</category>
    </item>
    <item>
      <title>How HDDs and SSDs Store Data: The Block Storage Model</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Wed, 09 Jul 2025 22:19:20 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/how-hdds-and-ssds-store-data-the-block-storage-model-4m9i</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/how-hdds-and-ssds-store-data-the-block-storage-model-4m9i</guid>
      <description>&lt;p&gt;When you open a file in your program, it seems like you can read or change any byte you want. But in reality, your storage device doesn’t work with single bytes. Instead, &lt;strong&gt;HDDs&lt;/strong&gt; and &lt;strong&gt;SSDs&lt;/strong&gt; read and write data in larger chunks, called &lt;strong&gt;blocks&lt;/strong&gt; or &lt;strong&gt;pages&lt;/strong&gt;, which are usually a few KBs in size.&lt;/p&gt;

&lt;p&gt;This gap between what software wants (&lt;strong&gt;small, random access&lt;/strong&gt;) and how storage hardware works (&lt;strong&gt;large, fixed-size chunks&lt;/strong&gt;) is one of the most important challenges in computer systems.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The two fundamental models of data access → &lt;strong&gt;block-addressable&lt;/strong&gt; and &lt;strong&gt;byte-addressable&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Why storage is not &lt;strong&gt;byte-addressable&lt;/strong&gt; like RAM.&lt;/li&gt;
&lt;li&gt;How HDDs and SSDs store and access data.&lt;/li&gt;
&lt;li&gt;How the block model shapes performance and design.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Two Fundamental Models: Byte-Addressable vs. Block-Addressable
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Byte-Addressable Model
&lt;/h3&gt;

&lt;p&gt;This is how &lt;strong&gt;RAM (DRAM)&lt;/strong&gt; works.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory is organized as a sequence of individual bytes, each with its own address.&lt;/li&gt;
&lt;li&gt;The CPU can read or write any &lt;strong&gt;single byte&lt;/strong&gt; directly and instantly.&lt;/li&gt;
&lt;li&gt;Latency is extremely low (nanoseconds), making random access cheap.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;char value = buffer[100];   // Read exactly 1 byte
buffer[200] = 'A';          // Write exactly 1 byte
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This fine-grained access makes it possible for RAM to support rich data structures like linked lists, trees, and pointer-chasing algorithms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Block-Addressable Model
&lt;/h3&gt;

&lt;p&gt;This is how &lt;strong&gt;storage devices&lt;/strong&gt; (HDDs and SSDs) work.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage is divided into fixed-size chunks called &lt;strong&gt;blocks&lt;/strong&gt; (in HDDs) or &lt;strong&gt;pages&lt;/strong&gt; (in SSDs).&lt;/li&gt;
&lt;li&gt;Typical block/page size: &lt;strong&gt;4 KB&lt;/strong&gt; or larger.&lt;/li&gt;
&lt;li&gt;You cannot read or write a single byte on its own.&lt;/li&gt;
&lt;li&gt;Even if you want just 1 byte, the device must read or write the &lt;strong&gt;entire block&lt;/strong&gt; containing it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reads and writes operate on these blocks or pages as the atomic unit.&lt;/p&gt;
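
&lt;p&gt;The cost of block addressing is easy to see with a little arithmetic. A minimal sketch (assuming a 4 KB block size):&lt;/p&gt;

```python
BLOCK_SIZE = 4096  # assumed 4 KB block/page size

def block_for_offset(offset: int) -> int:
    """Index of the block that contains the given byte offset."""
    return offset // BLOCK_SIZE

def blocks_for_range(offset: int, length: int) -> int:
    """Number of whole blocks the device must touch to serve a read."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return last - first + 1

# Reading just 1 byte at offset 5000 still costs one full 4 KB block:
print(blocks_for_range(5000, 1))     # 1
# An 8 KB read that straddles block boundaries touches 3 blocks:
print(blocks_for_range(2048, 8192))  # 3
```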

&lt;h3&gt;
  
  
  Why RAM and Storage Use Different Access Models
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;RAM (Byte-Addressable Memory):&lt;/strong&gt; RAM is like having a mini-fridge in your bedroom. You can grab exactly what you need (a water bottle, or even a single sip) whenever you want → instantly and with no extra effort. This lets the CPU quickly access tiny pieces of data (like single bytes) whenever it needs them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage (Block-Addressable Devices like HDDs/SSDs):&lt;/strong&gt; Storage is like going all the way to the kitchen. You wouldn’t walk there just to pick up one water bottle → it’s too slow and inefficient. Instead, you grab the water bottle plus some snacks or other items you might need soon.&lt;/p&gt;

&lt;p&gt;This means when your program requests data, the storage device reads or writes a whole block (the water bottle + snacks) at once, because making multiple trips for tiny data would be too slow and wear out the hardware faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  How HDDs Store And Access Data
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wpmk2guiqy4tfvm3a44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wpmk2guiqy4tfvm3a44.png" alt="HDD" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HDDs are &lt;strong&gt;electromechanical&lt;/strong&gt; devices that store data magnetically on spinning disks called &lt;strong&gt;platters&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Platters spin at thousands of RPM (e.g., 5,400, 7,200, 10,000, or 15,000 RPM).&lt;/li&gt;
&lt;li&gt;Each platter has concentric &lt;strong&gt;tracks&lt;/strong&gt; divided into &lt;strong&gt;sectors&lt;/strong&gt; (typically 512 bytes or 4 KB).&lt;/li&gt;
&lt;li&gt;Each surface of a platter has its own &lt;strong&gt;read/write head&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;All heads are mounted on a single &lt;strong&gt;actuator arm&lt;/strong&gt; that moves them in unison across the platters.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How Read Works
&lt;/h3&gt;

&lt;p&gt;When you read data from a hard drive, three steps are involved:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Seek Time: Moving the Arm&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The actuator arm moves the read/write heads to the correct track (cylinder).&lt;/li&gt;
&lt;li&gt;Typical time: &lt;strong&gt;4–10 ms&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Think of it like → Moving a record player needle to the right groove.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotational Latency: Waiting for the Right Sector&lt;/strong&gt;
Once the head is on the correct track, the disk must rotate so the exact sector spins under the head.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxo486xocofwkgt5sluz.png" alt="RPM vs Latency" width="671" height="285"&gt;

&lt;ul&gt;
&lt;li&gt;Think of it like → Waiting for the spinning wheel to bring your slice around to you.&lt;/li&gt;
&lt;li&gt;Typically half a rotation's worth of time on average (e.g., ~4.2 ms at 7,200 RPM).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Transfer&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Once aligned, data is read sector by sector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential reads&lt;/strong&gt; are much faster since the head stays on track.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
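
&lt;p&gt;The three steps above dominate random-access cost, and the first two can be estimated directly from a drive's specs. A quick sketch (the seek time and RPM values are illustrative):&lt;/p&gt;

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """On average the disk waits half a rotation for the target sector."""
    ms_per_rotation = 60_000 / rpm
    return ms_per_rotation / 2

def access_time_ms(seek_ms: float, rpm: int) -> float:
    """Seek + average rotational latency (transfer time ignored)."""
    return seek_ms + avg_rotational_latency_ms(rpm)

print(round(avg_rotational_latency_ms(7200), 2))  # ~4.17 ms
print(round(access_time_ms(8.0, 7200), 2))        # ~12.17 ms per random access
```

At roughly 12 ms per random access, a 7,200 RPM drive tops out at under 100 random reads per second → which is why sequential layout matters so much on HDDs.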

&lt;h3&gt;
  
  
  How Write Works
&lt;/h3&gt;

&lt;p&gt;Writing to an HDD follows the same physical steps as reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seek&lt;/strong&gt; to the correct track.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wait for rotation&lt;/strong&gt; to align the sector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer&lt;/strong&gt; data sector by sector.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Handling Changes/Edits in Files
&lt;/h4&gt;

&lt;p&gt;When you edit a file on an HDD, the operating system has to figure out where to put the new or changed data on the disk.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If it’s a small change&lt;/strong&gt; → like fixing a typo or tweaking a line → the OS can often just overwrite the existing spot directly. It’s quick and easy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you add more content&lt;/strong&gt; → like inserting whole paragraphs or lots of new data → the old space might not be big enough anymore. The OS then has to find free sectors somewhere else on the disk and write the new data there.&lt;/li&gt;
&lt;li&gt;After writing, the OS keeps track of where all the pieces of the file are so it can read them in the right order later.&lt;/li&gt;
&lt;li&gt;Over time, as you keep editing and adding, parts of the file can end up scattered in many places on the disk. This is called &lt;strong&gt;fragmentation&lt;/strong&gt;. It means the read/write head has to jump around more, which slows things down.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To keep everything fast and tidy, operating systems use techniques like &lt;strong&gt;buffering&lt;/strong&gt;, &lt;strong&gt;batching&lt;/strong&gt;, and &lt;strong&gt;defragmentation&lt;/strong&gt;. These help organize writes better and reduce unnecessary movement.&lt;/p&gt;

&lt;h2&gt;
  
  
  How SSDs Store and Access Data
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb22qfw5535qpx5zp3qqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb22qfw5535qpx5zp3qqy.png" alt="SSD" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
SSDs have no moving parts and store data in flash memory cells arranged into &lt;strong&gt;pages&lt;/strong&gt; and &lt;strong&gt;blocks&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pages are typically &lt;strong&gt;4–16 KB&lt;/strong&gt; each.&lt;/li&gt;
&lt;li&gt;Blocks contain many pages (e.g., &lt;strong&gt;256 pages&lt;/strong&gt; per block).&lt;/li&gt;
&lt;li&gt;All management is handled by the &lt;strong&gt;SSD controller&lt;/strong&gt; and its &lt;strong&gt;Flash Translation Layer (FTL)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How Read Works
&lt;/h3&gt;

&lt;p&gt;Reading data from an SSD is &lt;strong&gt;simple&lt;/strong&gt; and &lt;strong&gt;fast&lt;/strong&gt;, thanks to the lack of mechanical parts.&lt;/p&gt;

&lt;h4&gt;
  
  
  Page-Level Reads
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;SSDs store data in &lt;strong&gt;pages&lt;/strong&gt; (typically &lt;strong&gt;4–16 KB&lt;/strong&gt; each).&lt;/li&gt;
&lt;li&gt;Reads happen at the page level → you can’t read less than a page.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Example → Imagine your notebook.&lt;br&gt;
If you want a single note, you have to look at the whole page.&lt;br&gt;
You can’t magically see just one word without opening the page.&lt;br&gt;
Similarly, SSDs always read entire pages, even if your program only wants a few bytes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  No Moving Parts
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Unlike HDDs, SSDs have no mechanical parts like spinning platters or moving heads.&lt;/li&gt;
&lt;li&gt;There’s no seek time or rotational delay.&lt;/li&gt;
&lt;li&gt;Reads are extremely fast → typically &lt;strong&gt;tens to hundreds of microseconds&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Example → It’s like opening a notebook instantly to the right page, with no need to flip through slowly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Flash Translation Layer (FTL)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The SSD controller uses a &lt;strong&gt;Flash Translation Layer (FTL)&lt;/strong&gt; to keep track of where data is physically stored in NAND flash.&lt;/li&gt;
&lt;li&gt;When you request data, the FTL quickly maps your logical request to the right physical page and retrieves it.&lt;/li&gt;
&lt;li&gt;This mapping is completely invisible to the operating system and user.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Example → Think of having an index at the front of your notebook that tells you exactly which page to turn to for each topic.&lt;/p&gt;
&lt;/blockquote&gt;
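
&lt;p&gt;The FTL's job can be pictured as a lookup table from logical to physical pages. A toy sketch (real FTLs are far more sophisticated firmware structures; this dict is purely illustrative):&lt;/p&gt;

```python
class FTL:
    """Toy Flash Translation Layer: logical page -> physical page."""

    def __init__(self):
        self.mapping = {}    # logical page number -> physical page number
        self.next_free = 0   # naive free-page allocator

    def write(self, logical_page: int) -> int:
        """Out-of-place write: data always lands on a fresh physical page."""
        physical = self.next_free
        self.next_free += 1
        self.mapping[logical_page] = physical  # old physical page is now invalid
        return physical

    def read(self, logical_page: int) -> int:
        """Translate a logical page to wherever it currently lives."""
        return self.mapping[logical_page]

ftl = FTL()
ftl.write(10)        # first write goes to physical page 0
ftl.write(10)        # rewrite: new physical page, mapping updated
print(ftl.read(10))  # 1 — the second physical page, not the original
```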

&lt;h3&gt;
  
  
  How Write Works
&lt;/h3&gt;

&lt;p&gt;When writing data to an SSD, the process is a bit more complex than reading:&lt;/p&gt;

&lt;h4&gt;
  
  
  Out-of-Place Writes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Flash memory &lt;strong&gt;can’t overwrite&lt;/strong&gt; existing data in place.&lt;/li&gt;
&lt;li&gt;New data is always written to a &lt;strong&gt;free page&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The old page is marked &lt;strong&gt;invalid&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Example → Think of a notebook with 256 pages (like a flash block).&lt;br&gt;
If there are blank pages left, you can write on them immediately.&lt;br&gt;
This is how SSDs handle new data → they just use the next available free page without any extra work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Erase Before Write Requirement
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Once all pages in a block have been written (even if many are now invalid), they can’t just be overwritten.&lt;/li&gt;
&lt;li&gt;Flash requires erasing the entire block at once to make its pages writable again.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Example → Imagine you wrote with a pen in that notebook.&lt;br&gt;
If you want to change what’s on a page, you can’t just erase a single line.&lt;br&gt;
You’d have to rip out the entire sheet to get a fresh, blank page.&lt;br&gt;
Similarly, SSDs must erase the whole block to reuse its pages.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Garbage Collection
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The SSD’s controller periodically cleans up space by copying valid pages elsewhere and erasing blocks.&lt;/li&gt;
&lt;li&gt;This process consolidates free space and makes new pages available for writing.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Example:&lt;br&gt;
Out of 256 pages in a block, maybe 200 are invalid (old data you don’t need).&lt;br&gt;
56 pages are still valid.&lt;br&gt;
The SSD will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy the 56 valid pages to another clean block.&lt;/li&gt;
&lt;li&gt;Erase the original block completely. Now all 256 pages are blank and ready for new writes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
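
&lt;p&gt;That garbage-collection example can be sketched as a small simulation (page contents below are placeholders):&lt;/p&gt;

```python
PAGES_PER_BLOCK = 256  # per the example above

def garbage_collect(block: list, clean_block: list) -> int:
    """Copy valid pages into a clean block, then erase the original.
    Each entry is either page data or None (invalid). Returns pages moved."""
    valid = [p for p in block if p is not None]
    clean_block[:len(valid)] = valid       # relocate the live data
    block[:] = [None] * PAGES_PER_BLOCK    # erase: every page blank again
    return len(valid)

# 200 invalid pages + 56 still-valid pages, as in the example:
block = [None] * 200 + [f"data{i}" for i in range(56)]
clean = [None] * PAGES_PER_BLOCK
moved = garbage_collect(block, clean)
print(moved)              # 56 valid pages copied out
print(block.count(None))  # 256 — the whole block is erased and reusable
```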

&lt;h2&gt;
  
  
  How the Block Model Shapes Performance and Design
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzfjac9opt8f2q466i1h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzfjac9opt8f2q466i1h.png" alt="Summary" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Block-based storage prefers &lt;strong&gt;bigger&lt;/strong&gt;, &lt;strong&gt;aligned&lt;/strong&gt;, &lt;strong&gt;sequential&lt;/strong&gt; operations. Software is designed to take advantage of this by &lt;strong&gt;buffering&lt;/strong&gt;, &lt;strong&gt;batching&lt;/strong&gt;, and &lt;strong&gt;organizing&lt;/strong&gt; data to reduce costly small writes.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding OAuth 2.0 and OpenID Connect: A Step-by-Step Guide</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Thu, 03 Jul 2025 20:14:05 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/understanding-oauth-20-and-openid-connect-a-step-by-step-guide-4nf4</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/understanding-oauth-20-and-openid-connect-a-step-by-step-guide-4nf4</guid>
      <description>&lt;p&gt;If you’ve ever clicked “&lt;strong&gt;Sign in with Google&lt;/strong&gt;” or “&lt;strong&gt;Connect with Facebook&lt;/strong&gt;” on a website or app, you’ve interacted with technologies called &lt;strong&gt;OAuth&lt;/strong&gt; and &lt;strong&gt;OpenID Connect (OIDC)&lt;/strong&gt;. These two standards form the backbone of secure authentication and authorization on the web today.&lt;/p&gt;

&lt;p&gt;This guide will walk you through everything you need to know, in simple language and step-by-step explanations. By the end, you’ll have a solid grasp of both protocols, their roles, and why they matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentication and Authorization
&lt;/h2&gt;

&lt;p&gt;Before diving into OAuth and OIDC, it’s important to understand two key concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt; → process of verifying who you are. For example, logging in with your username and password proves your identity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorization&lt;/strong&gt; → process of determining what you’re allowed to do. For instance, once logged in, determining if you have permission to access certain files or perform specific actions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Limitations of Password-Based Logins (and Why OAuth/OIDC Exist)
&lt;/h2&gt;

&lt;p&gt;Most websites and apps still rely on the traditional method of logging in with a username and password. While simple, this approach has several serious drawbacks:&lt;/p&gt;

&lt;h3&gt;
  
  
  User Experience Challenges
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Remembering multiple passwords is difficult.&lt;/li&gt;
&lt;li&gt;Users face friction creating and managing separate accounts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security Threats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Phishing attacks trick users into handing over passwords.&lt;/li&gt;
&lt;li&gt;Weak or reused passwords put accounts at risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Storage and Management Burdens
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Storing passwords securely is challenging.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Single Sign-On and Federated Login to the Rescue
&lt;/h2&gt;

&lt;p&gt;As we’ve seen, managing multiple usernames and passwords is both inconvenient and insecure. &lt;strong&gt;Single Sign-On (SSO)&lt;/strong&gt; reduces password fatigue by letting you log in once to access multiple apps within the same organization.&lt;/p&gt;

&lt;p&gt;However, SSO typically works only inside one organization or domain. &lt;strong&gt;Federated Login&lt;/strong&gt; solves this by letting you use trusted providers like Google or Facebook to sign in across different websites. This builds cross-domain trust without needing separate passwords for every service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OAuth 2.0&lt;/strong&gt; provides the framework for handling secure access to user data without sharing passwords. &lt;strong&gt;OpenID Connect (OIDC)&lt;/strong&gt; builds on OAuth to add login and identity, making federated login possible across websites.&lt;/p&gt;

&lt;h2&gt;
  
  
  OAuth 2.0 Intuition
&lt;/h2&gt;

&lt;p&gt;Think of OAuth like getting a signed access pass from City Hall so you can pick up someone else’s property from a secure warehouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1q0o1h16jlru07st58v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1q0o1h16jlru07st58v.png" alt="OAuth Example" width="800" height="869"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your friend (&lt;strong&gt;Resource Owner&lt;/strong&gt;) wants You (&lt;strong&gt;Client&lt;/strong&gt;) to pick up their box from the Warehouse (&lt;strong&gt;Resource Server&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Your friend goes to City Hall (&lt;strong&gt;Authorization Server&lt;/strong&gt;). At City Hall:

&lt;ul&gt;
&lt;li&gt;They present their ID for identity verification (e.g., &lt;strong&gt;username/password&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;They explicitly state → “&lt;em&gt;I want to authorize this person to collect my box&lt;/em&gt;”.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;City Hall reviews and approves the request. They issue your friend an &lt;strong&gt;Authorization Letter&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Your friend gives You the authorization letter from City Hall, saying → “&lt;em&gt;Take this back to City Hall to get your official access pass&lt;/em&gt;”.&lt;/li&gt;
&lt;li&gt;You take the authorization letter back to City Hall:

&lt;ul&gt;
&lt;li&gt;City Hall verifies the authorization letter by confirming that it’s legitimate and hasn’t expired.&lt;/li&gt;
&lt;li&gt;City Hall then issues you a &lt;strong&gt;signed access pass&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;You take the &lt;strong&gt;signed access pass&lt;/strong&gt; to the Warehouse (&lt;strong&gt;Resource Server&lt;/strong&gt;):

&lt;ul&gt;
&lt;li&gt;The Warehouse examines the pass, verifies City Hall’s signature to ensure it’s authentic.&lt;/li&gt;
&lt;li&gt;It checks who authorized it (your friend), whom it’s allowed for (you), what it allows (picking up the box), and expiry date.&lt;/li&gt;
&lt;li&gt;Once all details are validated, the Warehouse gives you the box on behalf of your friend.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  OAuth 2.0: The Authorization Framework
&lt;/h2&gt;

&lt;p&gt;When a client app needs to act on a user’s behalf → like accessing their files, calendar, or other resources → it needs a way to request permission securely. OAuth 2.0 provides a standardized framework for this &lt;strong&gt;authorization&lt;/strong&gt;, allowing users to grant limited access to the apps without sharing their passwords with them.&lt;/p&gt;

&lt;p&gt;OAuth 2.0 focuses purely on &lt;strong&gt;what&lt;/strong&gt; an app is allowed to do, not &lt;strong&gt;who&lt;/strong&gt; the user is. That’s where &lt;strong&gt;OpenID Connect (OIDC)&lt;/strong&gt; comes in, adding &lt;strong&gt;authentication&lt;/strong&gt; on top of OAuth 2.0 to verify user identity.&lt;/p&gt;

&lt;p&gt;Because OIDC is built on OAuth 2.0, understanding OAuth first makes everything else clearer. That’s why we’ll begin by exploring OAuth 2.0 in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  OAuth 2.0 Flow Types For Different Use Cases
&lt;/h3&gt;

&lt;p&gt;OAuth 2.0 defines different flows to handle different scenarios securely. In each scenario, the actors and their capabilities can vary, which is why there are different flows tailored to specific use cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexfhe4u8ewa25l4i0qbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexfhe4u8ewa25l4i0qbu.png" alt="OAuth Flow Types" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, we’ll focus on the &lt;strong&gt;Authorization Code flow (non-PKCE)&lt;/strong&gt; because it demonstrates the core OAuth concepts most clearly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authorization Code Flow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Four Key Players
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Resource Owner (Your Friend)&lt;/strong&gt;: The user who owns the protected resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client (You)&lt;/strong&gt;: The application requesting access on the resource owner’s behalf.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorization Server (City Hall)&lt;/strong&gt;: Authenticates the resource owner and the client, and issues signed access tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Server (Warehouse)&lt;/strong&gt;: Hosts the protected resources and verifies access tokens before granting access.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Client Registration And End-To-End Flow
&lt;/h3&gt;

&lt;p&gt;Before the OAuth flow can even start, the &lt;strong&gt;Client app&lt;/strong&gt; must be registered with the &lt;strong&gt;Authorization Server&lt;/strong&gt; (like Google, Microsoft, etc.). Think of this as the client obtaining an official ID that proves who it is whenever it requests authorization.&lt;/p&gt;

&lt;p&gt;During registration, the Authorization Server issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Client ID&lt;/strong&gt;: A public identifier for your app (like a username for the app).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client Secret&lt;/strong&gt;: A private key for server-side authentication (like a password for the app that must be kept secret).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redirect URI&lt;/strong&gt;: The specific URL on the client app where the Authorization Server will send the user after authorization (like telling the Authorization Server, “After login, send them back here”).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As explained in the example above, the Authorization Code flow has two key phases:&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 1: Resource Owner Authentication &amp;amp; Consent
&lt;/h4&gt;

&lt;p&gt;The resource owner authenticates with the authorization server and grants permission to the client app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v1gdsvzu60ci1650ivm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v1gdsvzu60ci1650ivm.png" alt="Authorization Code" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 2: Client Authentication &amp;amp; Token Exchange
&lt;/h4&gt;

&lt;p&gt;The client app authenticates itself and exchanges the &lt;strong&gt;authorization code&lt;/strong&gt; for the &lt;strong&gt;access token&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2yf2bqjbsoue4lizkuxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2yf2bqjbsoue4lizkuxw.png" alt="Access Token" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;
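
&lt;p&gt;Phase 2 boils down to a form-encoded POST to the authorization server's token endpoint, using the authorization_code grant defined by OAuth 2.0 (RFC 6749). A sketch of the request body (all concrete values below are hypothetical placeholders):&lt;/p&gt;

```python
from urllib.parse import urlencode

def build_token_request(code: str, client_id: str,
                        client_secret: str, redirect_uri: str) -> dict:
    """Form fields the client POSTs to the token endpoint to swap
    the authorization code for an access token."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,     # must match the registered URI
        "client_id": client_id,
        "client_secret": client_secret,   # authenticates the client itself
    }

body = build_token_request("SplxlOBeZQQYbYS6WxSbIA",
                           "my-client-id", "my-client-secret",
                           "https://client.example.com/callback")
print(urlencode(body))  # the application/x-www-form-urlencoded POST body
```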

&lt;p&gt;The &lt;strong&gt;authorization code&lt;/strong&gt; issued by the Authorization Server simply represents the &lt;strong&gt;authorization&lt;/strong&gt; granted by the resource owner to the client app.&lt;/p&gt;

&lt;p&gt;But there’s a key problem: &lt;em&gt;The authorization code doesn’t tell the client who the resource owner actually is&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In other words, OAuth alone doesn’t help the client app verify that the person using it is really the resource owner who granted the permission.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does this matter?&lt;/strong&gt; → Because the client is acting on behalf of the user and it needs to be sure that the person interacting with it is indeed the same resource owner who gave that authorization. Otherwise, the client might present or act on sensitive user data for the wrong person. &lt;strong&gt;This is the gap that OpenID Connect (OIDC) solves&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenID Connect (OIDC): Adding Identity to OAuth
&lt;/h2&gt;

&lt;p&gt;OIDC is a simple identity layer built on top of OAuth 2.0. It extends the OAuth flow by returning an &lt;strong&gt;ID Token&lt;/strong&gt; alongside the &lt;strong&gt;Access Token&lt;/strong&gt;. This ID Token contains information about the authenticated user. In other words:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth 2.0 → Authorization (what can the app do?)&lt;/li&gt;
&lt;li&gt;OpenID Connect → Authentication + Authorization (who is the caller user, and what can the app do on their behalf?)&lt;/li&gt;
&lt;/ul&gt;
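
&lt;p&gt;An ID Token is a JWT (header.payload.signature, each base64url-encoded) whose payload carries identity claims. A minimal sketch of reading those claims; signature and claim validation, which a real client must always perform, are omitted here:&lt;/p&gt;

```python
import base64, json

def decode_id_token_claims(id_token: str) -> dict:
    """Decode the payload segment of a JWT. Does NOT verify the signature."""
    payload_b64 = id_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)   # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a toy (unsigned) token just to exercise the decoder:
claims = {"iss": "https://accounts.example.com", "sub": "user-123",
          "aud": "my-client-id", "exp": 1719999999}
payload = base64.urlsafe_b64encode(
    json.dumps(claims).encode()).decode().rstrip("=")
toy_token = f"eyJhbGciOiJub25lIn0.{payload}."
print(decode_id_token_claims(toy_token)["sub"])   # user-123
```

The &lt;code&gt;sub&lt;/code&gt; claim is the stable user identifier that closes the gap described above: the client now knows &lt;em&gt;who&lt;/em&gt; authorized it.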

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyadzisrjhidq2pr96z0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyadzisrjhidq2pr96z0i.png" alt="End to End OIDC" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;In the next article(s), we’ll dive into &lt;strong&gt;how access tokens are actually implemented&lt;/strong&gt; in real-world systems.&lt;/p&gt;

&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>identity</category>
      <category>oauth</category>
      <category>iam</category>
    </item>
    <item>
      <title>Core Attributes of Distributed Systems: Reliability, Availability, Scalability, and More</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Mon, 30 Jun 2025 19:41:28 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/core-attributes-of-distributed-systems-reliability-availability-scalability-and-more-23p6</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/core-attributes-of-distributed-systems-reliability-availability-scalability-and-more-23p6</guid>
      <description>&lt;p&gt;Whether you’re building a simple web app or a large distributed system, users don’t just expect it to work → they want it to be fast, always available, secure, and to run smoothly without unexpected interruptions.&lt;/p&gt;

&lt;p&gt;These expectations are captured in what we call &lt;strong&gt;system quality attributes&lt;/strong&gt; or &lt;strong&gt;non-functional requirements&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore the most critical attributes that any serious system should aim to deliver, especially in distributed environments. We’ll cover why each attribute matters for the users, how to measure it, and how to achieve it both proactively and reactively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reliability
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Definition: Reliability is the ability of a system to operate correctly and continuously over time, delivering accurate results without unexpected interruptions or failures.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It Matters
&lt;/h3&gt;

&lt;p&gt;Users rely on your system to behave predictably. If your banking app transfers money to the wrong account or your flight booking app glitches, it erodes customer trust instantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Measure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mean Time Between Failures (MTBF)&lt;/strong&gt;: Average time the system runs before failing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate&lt;/strong&gt;: Frequency of incorrect results (e.g., data corruption or logic bugs).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Proactive Techniques (making the system reliable in advance)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fault prevention (Stop mistakes before they happen)&lt;/strong&gt; → Write clean code, perform code reviews, use static analysis tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault removal (Find and fix mistakes early)&lt;/strong&gt;: Use automated testing, debugging, and formal verification.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reactive Techniques (handling faults when they occur)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fault tolerance (Keep working despite faults)&lt;/strong&gt; → Use retries, replication/redundancy, graceful degradation, and error correction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault detection (Spot problems quickly)&lt;/strong&gt; → Monitor logs, set up alerts, use health checks and diagnostics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault recovery (Fix issues promptly)&lt;/strong&gt; → Restart services, failover to backups, roll back to safe states.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Availability
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Definition: Availability is the ability of a system to be up and responsive when needed, ensuring users can access it at any time. It focuses on being ready to serve, not on whether the response is correct (which is covered by reliability).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It Matters
&lt;/h3&gt;

&lt;p&gt;If your system crashes or is down during peak hours, users will leave. For mission-critical systems like trading, even seconds of downtime can be disastrous.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Measure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uptime percentage&lt;/strong&gt; → e.g., 99.9% uptime = ~8.7 hours of downtime/year.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mean Time to Recovery (MTTR)&lt;/strong&gt;: How fast you recover from failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High availability (HA)&lt;/strong&gt; typically refers to uptime of 99.9% or more, achieved through redundancy and failover strategies.&lt;/li&gt;
&lt;/ul&gt;
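
&lt;p&gt;These uptime figures translate directly into a yearly downtime budget, and steady-state availability can also be estimated from MTBF and MTTR. A quick sketch:&lt;/p&gt;

```python
HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_hours_per_year(uptime_pct: float) -> float:
    """Maximum downtime a given uptime percentage allows per year."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(round(downtime_hours_per_year(99.9), 2))   # ~8.76 h/year ("three nines")
print(round(downtime_hours_per_year(99.99), 2))  # ~0.88 h/year
print(round(availability(1000, 1), 4))           # 0.999 → failing once every
                                                 # 1,000 h with 1 h recovery
```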

&lt;h3&gt;
  
  
  Proactive Techniques
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capacity planning&lt;/strong&gt;: Predict demand and provision enough resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redundant infrastructure&lt;/strong&gt;: Extra hardware or cloud zones ready to take over.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reactive Techniques
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failover mechanisms&lt;/strong&gt;: Automatically switch to backup nodes or servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-healing&lt;/strong&gt;: Restart crashed services or containers automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scalability
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Definition: Scalability is the ability of a system to handle more users or more data by adding more resources, without significantly slowing down or crashing.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It Matters
&lt;/h3&gt;

&lt;p&gt;What works smoothly for 10 users might completely break when 10,000 people show up. If your product becomes popular, you want it to grow without falling apart.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Measure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt; → How many requests per second your system can handle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency under load&lt;/strong&gt; → How fast your system responds when many users are active at once.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Proactive Techniques (preparing for growth in advance)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Design for scalability (Build with growth in mind)&lt;/strong&gt; → Use stateless designs, modular components, and databases that can be partitioned or scaled out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity planning (Plan ahead for future load)&lt;/strong&gt; → Estimate how much traffic or data you’ll have later and make sure your system can handle it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reactive Techniques (handling growth when it happens)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-scaling (Add resources on the fly)&lt;/strong&gt; → Automatically spin up more servers when traffic spikes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing (Distribute work evenly)&lt;/strong&gt; → Spread incoming requests across multiple servers so no single one gets overloaded.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Maintainability
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Definition: Maintainability is the ability of a system to be easily changed, updated, fixed, or improved over time without introducing new problems.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It Matters
&lt;/h3&gt;

&lt;p&gt;Requirements always change. Bugs appear. New features need to be added. If your system is messy or overly complex, even small changes become risky and time-consuming. A maintainable system is easy to understand, modify, and operate day to day, letting teams respond quickly and confidently to new needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Measure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mean Time to Modify (MTTM)&lt;/strong&gt; → How long it takes to make a change or add a new feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code churn&lt;/strong&gt; → How frequently the code is updated or changed, which can indicate areas that are difficult to maintain or keep stable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Proactive Techniques (making the system easier to change in advance)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modular design (Break it into manageable parts)&lt;/strong&gt; → Structure your system as small, independent components that are easier to understand, test, and replace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity (Avoid unnecessary complexity)&lt;/strong&gt; → Keep designs and code clear and straightforward to reduce errors and make it easier for new developers to pick up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear documentation and standards (Help everyone stay aligned)&lt;/strong&gt; → Write understandable docs and follow consistent coding styles so others can safely make changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operability considerations (Design for smooth running in production)&lt;/strong&gt; → Build clear configuration, easy deployment processes, and good monitoring hooks to simplify day-to-day management.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reactive Techniques (improving it over time)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Refactoring (Clean up continuously)&lt;/strong&gt; → Regularly improve the structure of code without changing its behavior to keep it healthy and easy to work with.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated regression tests (Prevent breaking existing features)&lt;/strong&gt; → Run tests that ensure changes don’t accidentally introduce new bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental improvements (Make small, safe changes)&lt;/strong&gt; → Tackle technical debt gradually without big risky rewrites.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Definition: Security is the ability of a system to protect itself from unauthorized access, misuse, or attacks.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It Matters
&lt;/h3&gt;

&lt;p&gt;A single security breach can damage your reputation, leak sensitive data, or cause big financial losses. Attackers don’t wait for you to be ready → you have to plan ahead.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Measure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time to detect and respond&lt;/strong&gt; → How quickly you can find and fix security issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Number of vulnerabilities over time&lt;/strong&gt; → Track how many security flaws are open and how quickly they’re closed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance scores&lt;/strong&gt; → Certifications like SOC2 or ISO 27001 that show your security practices meet industry standards.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Proactive Techniques (protecting the system in advance)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Threat modeling (Think like an attacker)&lt;/strong&gt; → Identify and fix weak points before someone exploits them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure defaults (Build security in by default)&lt;/strong&gt; → Use encryption, strong passwords, and access controls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security scans (Catch issues early)&lt;/strong&gt; → Run automated tools to find known vulnerabilities in your code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reactive Techniques (responding when something goes wrong)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Intrusion detection (Spot attacks fast)&lt;/strong&gt; → Use systems that alert you to suspicious activity in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident response (Limit the damage)&lt;/strong&gt; → Apply security patches quickly and have a plan to contain and fix breaches.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>database</category>
      <category>scalability</category>
      <category>availability</category>
    </item>
    <item>
      <title>Memory Models Explained: How Threads Really See Memory</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Sat, 28 Jun 2025 16:37:18 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/memory-models-explained-how-threads-really-see-memory-174l</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/memory-models-explained-how-threads-really-see-memory-174l</guid>
      <description>&lt;p&gt;Modern processors and compilers aggressively reorder instructions to improve performance → a behavior we explored in detail in my previous article: &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/instruction-reordering-your-code-doesnt-always-run-in-the-order-you-wrote-it-bb8"&gt;Instruction Reordering: Your Code Doesn’t Always Run in the Order You Wrote It&lt;/a&gt;.&lt;br&gt;
To write correct concurrent code or to understand why it breaks, we need to explore &lt;strong&gt;memory models&lt;/strong&gt;: the formal rules that define how threads see and interact with memory operations.&lt;br&gt;
This article explains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What memory models are and how they work&lt;/li&gt;
&lt;li&gt;The main types of memory models, including Sequential Consistency, Total Store Order (TSO), and relaxed/weak models&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  What Is a Memory Model?
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;memory model&lt;/strong&gt; is a contract between your program, the compiler, and the CPU that defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which memory operations → &lt;strong&gt;loads (reads)&lt;/strong&gt; and &lt;strong&gt;stores (writes)&lt;/strong&gt; → can be reordered&lt;/li&gt;
&lt;li&gt;When the effects of a write become visible to other threads&lt;/li&gt;
&lt;li&gt;How multiple threads observe reads and writes performed by others&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without a memory model, there’s no way to reason about multithreaded programs → each thread could see operations in any order, leading to unpredictable behavior.&lt;/p&gt;
&lt;h2&gt;
  
  
  Sequential Consistency: The Intuitive Model
&lt;/h2&gt;

&lt;p&gt;The simplest memory model is &lt;strong&gt;Sequential Consistency&lt;/strong&gt;, defined by Leslie Lamport as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order issued by that processor.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It defines a system where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All threads see memory operations (reads and writes) in the &lt;strong&gt;same global order&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Each thread sees its own operations occur in the same order as written in the program.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the execution behaves as if there is a single &lt;strong&gt;shared&lt;/strong&gt; timeline, and all operations from all threads are placed on that timeline in a way that respects each thread’s &lt;strong&gt;original&lt;/strong&gt; instruction order.&lt;/p&gt;

&lt;p&gt;This model is easy for programmers to reason about because it matches what we typically expect: operations happen one after another, and everyone sees the same thing.&lt;/p&gt;

&lt;p&gt;However, enforcing this strict order requires coordination between cores and often prevents performance optimizations like instruction reordering, store buffering, and speculative execution. That’s why modern hardware typically implements &lt;strong&gt;weaker memory models&lt;/strong&gt; that are more relaxed, but harder to reason about.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Consider two threads sharing variables &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; initialized to 0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Thread 1
x = 1;
r1 = y;

// Thread 2
y = 1;
r2 = x;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under sequential consistency, the result where both &lt;code&gt;r1 == 0&lt;/code&gt; and &lt;code&gt;r2 == 0&lt;/code&gt; is impossible: at least one thread must see the other’s write.&lt;/p&gt;

&lt;h2&gt;
  
  
  Total Store Order (TSO): Strong but Practical
&lt;/h2&gt;

&lt;p&gt;The x86 architecture (used in Intel and AMD CPUs) follows the &lt;strong&gt;Total Store Order (TSO)&lt;/strong&gt; memory model. It’s stronger and easier to reason about than many weak/relaxed models, while still allowing one key optimization to improve performance. Here’s how it works:&lt;/p&gt;

&lt;h3&gt;
  
  
  Stores (writes) happen in order
&lt;/h3&gt;

&lt;p&gt;For example, if Thread 1 executes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = 1;  
y = 2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any other thread that observes these values will always see &lt;code&gt;x = 1&lt;/code&gt; before &lt;code&gt;y = 2&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loads (reads) happen in order
&lt;/h3&gt;

&lt;p&gt;This means if your code says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r1 = x;  
r2 = y;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the CPU will load &lt;code&gt;x&lt;/code&gt; before &lt;code&gt;y&lt;/code&gt;, just as written.&lt;/p&gt;

&lt;h3&gt;
  
  
  But a later load can be reordered before an earlier store
&lt;/h3&gt;

&lt;p&gt;A later load can be executed before an earlier store, as long as they access &lt;strong&gt;different&lt;/strong&gt; variables. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = 1;     // Store to x  
y = 2;     // Store to y  
r1 = z;    // Load from z (z is not accessed above)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even though &lt;code&gt;x = 1&lt;/code&gt; and &lt;code&gt;y = 2&lt;/code&gt; come first, the CPU might delay committing those stores while performing the load from &lt;code&gt;z&lt;/code&gt; early. As a result, &lt;code&gt;r1&lt;/code&gt; might see an outdated value of &lt;code&gt;z&lt;/code&gt;, and other threads might not yet observe the updated values of &lt;code&gt;x&lt;/code&gt; or &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Relaxed/Weak: Performance First, Predictability Later
&lt;/h2&gt;

&lt;p&gt;Modern architectures such as ARM and POWER implement &lt;strong&gt;relaxed memory models&lt;/strong&gt;. These models give the CPU more freedom to reorder instructions for maximum performance, but they also make it harder for programmers to reason about how memory behaves in concurrent programs.&lt;/p&gt;

&lt;p&gt;Unlike &lt;strong&gt;TSO&lt;/strong&gt; (which only lets a later load move ahead of an earlier store), relaxed models allow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores to be reordered with other stores&lt;/li&gt;
&lt;li&gt;Loads to be reordered with other loads&lt;/li&gt;
&lt;li&gt;Stores and loads to be reordered with each other, in both directions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means almost any combination of reordering is allowed → unless the programmer uses explicit &lt;strong&gt;memory barriers&lt;/strong&gt; or &lt;strong&gt;synchronization&lt;/strong&gt; instructions to enforce ordering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;In a relaxed model, this code in Thread 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = 1;
y = 2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Might be observed by another thread as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;y = 2&lt;/code&gt; happening before &lt;code&gt;x = 1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Or only one of the stores being visible&lt;/li&gt;
&lt;li&gt;Or even both stores being delayed entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similarly, two loads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r1 = x;
r2 = y;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;May execute in reverse order internally, and &lt;code&gt;r1&lt;/code&gt; might see an older value while &lt;code&gt;r2&lt;/code&gt; sees a newer one → depending on what the hardware decides.&lt;/p&gt;

&lt;p&gt;Programmers can no longer assume that memory behaves “as written.” Writing correct concurrent code now depends on understanding language-level memory models, atomic operations, and memory fences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8g0ybrjex3md4ttsrmcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8g0ybrjex3md4ttsrmcd.png" alt="Memory Models Summary" width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next article, we’ll explore how &lt;strong&gt;synchronization mechanisms&lt;/strong&gt; (like locks, atomics, and memory barriers) help us write correct concurrent code → even under relaxed memory models → and dive into how they work under the hood.&lt;/p&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>memory</category>
      <category>programming</category>
      <category>performance</category>
      <category>multithreading</category>
    </item>
    <item>
      <title>Instruction Reordering: Your Code Doesn’t Always Run in the Order You Wrote It</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Thu, 26 Jun 2025 18:00:12 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/instruction-reordering-your-code-doesnt-always-run-in-the-order-you-wrote-it-bb8</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/instruction-reordering-your-code-doesnt-always-run-in-the-order-you-wrote-it-bb8</guid>
      <description>&lt;p&gt;When writing code, you naturally expect instructions to run one after the other in the exact order they appear. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = 1;
y = 2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’d expect &lt;code&gt;x = 1&lt;/code&gt; to complete before &lt;code&gt;y = 2&lt;/code&gt; starts.&lt;/p&gt;

&lt;p&gt;But in reality, modern CPUs and compilers don’t always execute instructions in the exact sequence you wrote them. Instead, they &lt;strong&gt;reorder instructions internally&lt;/strong&gt; to improve performance. While this might sound risky, it’s a core optimization that enables today’s processors to run billions of instructions per second.&lt;/p&gt;

&lt;p&gt;To fully appreciate why these reorderings happen, it helps to first understand the parallel execution techniques CPUs use that I have explained in detail here: &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/superscalar-vs-simd-vs-multicore-understanding-modern-cpu-parallelism-3jao"&gt;Superscalar vs SIMD vs Multicore: Understanding Modern CPU Parallelism&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This article explains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What instruction reordering is.&lt;/li&gt;
&lt;li&gt;Why CPUs and compilers perform it.&lt;/li&gt;
&lt;li&gt;How it affects multithreaded programs.&lt;/li&gt;
&lt;li&gt;And why understanding it is critical for writing correct concurrent code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is Instruction Reordering?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Instruction reordering&lt;/strong&gt; means the order in which instructions are executed can differ from the order they appear in your source code. There are two main types of reordering:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Compiler Reordering&lt;/strong&gt; — The compiler rearranges instructions as part of the code generation process to produce faster machine code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU Reordering (Out-of-Order Execution)&lt;/strong&gt; — CPUs execute instructions out of their original order internally to better utilize available execution units and reduce pipeline stalls.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both types of reordering are done &lt;strong&gt;transparently&lt;/strong&gt; to the programmer in &lt;strong&gt;single-threaded&lt;/strong&gt; programs, so your code behaves as expected. However, when multiple threads interact via shared memory, these reorderings can cause subtle and hard-to-debug bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Do CPUs and Compilers Reorder Instructions?
&lt;/h2&gt;

&lt;p&gt;Both CPUs and compilers reorder instructions primarily to &lt;strong&gt;improve performance&lt;/strong&gt; by making better use of hardware resources and minimizing delays.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improving CPU Utilization
&lt;/h3&gt;

&lt;p&gt;Modern CPUs have multiple execution units per core (such as ALUs, FPUs, and load/store units) that can operate in parallel. To keep these units busy, the CPU issues and executes multiple independent instructions simultaneously, even if they appear sequentially in your code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = b + c; // Instruction 1
x = y + z;  // Instruction 2 (independent of Instruction 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the CPU can execute both instructions at the same time in different execution units, rather than waiting to finish &lt;strong&gt;Instruction 1&lt;/strong&gt; before starting &lt;strong&gt;Instruction 2&lt;/strong&gt;. This parallelism boosts throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hiding Memory Latency
&lt;/h3&gt;

&lt;p&gt;Memory access can be slow compared to CPU speeds. When an instruction needs data from memory, the CPU doesn’t just wait idly → it reorders instructions to execute other independent instructions that are ready to run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = 1; // Instruction 1
y = slowLoad(); // Instruction 2 (memory access, slower)
z = 2;          // Instruction 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While &lt;strong&gt;Instruction 2&lt;/strong&gt; waits for the memory load, the CPU can execute &lt;strong&gt;Instruction 3&lt;/strong&gt; immediately, avoiding pipeline stalls and improving efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compiler Optimizations
&lt;/h3&gt;

&lt;p&gt;Compilers reorder instructions during code optimization to produce faster, more efficient machine code. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reordering independent instructions to improve scheduling.&lt;/li&gt;
&lt;li&gt;Moving calculations that don’t change out of loops.&lt;/li&gt;
&lt;li&gt;Eliminating repeated computations by reusing previously computed values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider the following code snippet inside a loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (int i = 0; i &amp;lt; 1000; i++) {
  int a = 5 * 2; // same calculation every iteration
  int b = a + i;
  int c = 5 * 2; // repeated calculation
  array[i] = b + c;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After optimization, the generated code could look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int a = 5 * 2; // computed once before the loop
for (int i = 0; i &amp;lt; 1000; i++) {
  int b = a + i;
  int c = a; // reuse computed value
  array[i] = b + c;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Instruction Reordering Matters in Multithreaded Programs
&lt;/h2&gt;

&lt;p&gt;When multiple threads access shared memory without proper synchronization, instruction reordering can lead to unexpected behaviors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Shared variables
int data = 0;
int flag = 0;

// Thread 1
data = 42; // Step 1
flag = 1; // Step 2

// Thread 2
if (flag == 1) { // Step 3
  print(data); // Step 4
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’d expect that if &lt;strong&gt;Thread 2&lt;/strong&gt; sees &lt;code&gt;flag == 1&lt;/code&gt;, it should also see &lt;code&gt;data = 42&lt;/code&gt;. But if the compiler or CPU reorders &lt;code&gt;flag = 1&lt;/code&gt; before &lt;code&gt;data = 42&lt;/code&gt;, &lt;strong&gt;Thread 2&lt;/strong&gt; might read &lt;code&gt;flag == 1&lt;/code&gt; but &lt;code&gt;data = 0&lt;/code&gt;. This kind of subtle bug is caused by instruction reordering combined with visibility issues in multithreaded memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  How CPUs &amp;amp; Compilers Avoid Breaking Single-Threaded Programs
&lt;/h2&gt;

&lt;p&gt;Even though CPUs and compilers reorder instructions to run faster, they make sure your program still behaves as if the instructions ran exactly in the order you wrote them → when you’re running a &lt;strong&gt;single thread&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;They do this by carefully tracking &lt;strong&gt;data dependencies&lt;/strong&gt; between instructions. For example, if one instruction needs the result of another, it won’t be moved before that instruction.&lt;/p&gt;

&lt;p&gt;CPUs also use special hardware mechanisms, like &lt;strong&gt;reorder buffers&lt;/strong&gt;, to keep track of the original program order and only commit results in that order. So, even if instructions execute out of order internally, the program’s visible behavior stays consistent.&lt;/p&gt;

&lt;p&gt;This means you don’t have to worry about your code acting strangely due to instruction reordering in normal, &lt;strong&gt;single-threaded&lt;/strong&gt; programs.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Write Correct Multithreaded Programs Despite Reordering
&lt;/h2&gt;

&lt;p&gt;As explained before, instruction reordering can lead to subtle and unpredictable bugs in multithreaded programs. One thread might observe memory updates from another in an unexpected order, breaking the intended logic of your program.&lt;/p&gt;

&lt;p&gt;Because CPUs and compilers apply many different optimizations based on context, hardware, and surrounding code, the specific ordering of operations can vary in ways you might not expect. This makes reasoning about shared memory behavior tricky without proper safeguards.&lt;/p&gt;

&lt;p&gt;To write correct multithreaded code, you need to use &lt;strong&gt;synchronization tools&lt;/strong&gt; that control visibility and ordering between threads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Locks (mutexes)&lt;/strong&gt;: Prevent simultaneous access to shared data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic operations&lt;/strong&gt;: Ensure safe, indivisible updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory barriers (fences)&lt;/strong&gt;: Stop certain reorderings from happening across critical instruction boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools help establish &lt;em&gt;happens-before relationships&lt;/em&gt; → ensuring that the operations in one thread become visible to another in a predictable and controlled manner. To apply these tools correctly, it’s important to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What memory models are, and how they define visibility and ordering guarantees&lt;/li&gt;
&lt;li&gt;How different synchronization mechanisms enforce those guarantees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These advanced topics will be explored in upcoming articles, as they are essential for writing safe and efficient concurrent systems.&lt;/p&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>tutorial</category>
      <category>coding</category>
      <category>memory</category>
    </item>
    <item>
      <title>Superscalar vs SIMD vs Multicore: Understanding Modern CPU Parallelism</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Wed, 25 Jun 2025 05:46:13 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/superscalar-vs-simd-vs-multicore-understanding-modern-cpu-parallelism-3jao</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/superscalar-vs-simd-vs-multicore-understanding-modern-cpu-parallelism-3jao</guid>
      <description>&lt;p&gt;For many years, improving CPU performance meant increasing clock speed → allowing more cycles per second. But today, we’ve reached practical limits in how fast we can push frequency due to power, heat, and physical constraints.&lt;/p&gt;

&lt;p&gt;As a result, modern CPU design focuses less on running faster and more on &lt;strong&gt;doing more per cycle&lt;/strong&gt;. To achieve this, processors use three key architectural techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Superscalar execution&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SIMD (Single Instruction, Multiple Data)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multicore parallelism&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these allow a CPU to complete multiple operations in a single clock cycle → making better use of each tick without increasing the clock rate itself.&lt;/p&gt;

&lt;p&gt;Before diving into these techniques, it’s important to understand &lt;strong&gt;CPU pipelining&lt;/strong&gt;, the foundation of all modern CPU execution, which is covered in a separate article — &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/cpu-pipelining-how-modern-processors-execute-instructions-faster-3i71"&gt;CPU Pipelining: How Modern Processors Execute Instructions Faster&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Superscalar: Executing Multiple Instructions Per Cycle
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88lao0kogep0bfyqx8b9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88lao0kogep0bfyqx8b9.png" alt="Multiple Decode units, ALU etc" width="800" height="877"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A superscalar processor can issue and execute multiple instructions within a single clock cycle. This is achieved by replicating execution units (such as ALUs, FPUs, and load/store units), as illustrated in the diagram above, and by incorporating scheduling logic that performs several key functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyzing Dependencies Between Instructions&lt;/li&gt;
&lt;li&gt;Scheduling Independent Instructions Across Execution Units&lt;/li&gt;
&lt;li&gt;Register Renaming to Eliminate False Dependencies&lt;/li&gt;
&lt;li&gt;Reordering Instructions to Hide Stalls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach exploits &lt;strong&gt;instruction-level parallelism (ILP)&lt;/strong&gt; → the presence of independent instructions within a single thread that can be executed simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Superscalar Scheduling in Action
&lt;/h3&gt;

&lt;p&gt;Consider the following simple code snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int a = x + y; // Instruction 1
int b = m * n; // Instruction 2
a = p + q;     // Instruction 3 (reuses 'a')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s how a &lt;strong&gt;2-way superscalar CPU&lt;/strong&gt; handles this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instruction 1&lt;/strong&gt; and &lt;strong&gt;Instruction 2&lt;/strong&gt; are &lt;strong&gt;independent&lt;/strong&gt; and can be issued &lt;strong&gt;in parallel&lt;/strong&gt;, assuming two ALUs are available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instruction 3&lt;/strong&gt; writes to &lt;code&gt;a&lt;/code&gt; again. Although it doesn’t depend on &lt;strong&gt;Instruction 1&lt;/strong&gt;, the reuse of the variable name &lt;code&gt;a&lt;/code&gt; could create a &lt;strong&gt;false write-after-write dependency&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;To resolve this, the CPU uses &lt;strong&gt;register renaming&lt;/strong&gt; to map each version of &lt;code&gt;a&lt;/code&gt; to a &lt;strong&gt;different&lt;/strong&gt; physical register:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;a (from x + y)&lt;/code&gt; → Register &lt;code&gt;P1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;b&lt;/code&gt; → Register &lt;code&gt;P2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;a (from p + q)&lt;/code&gt; → Register &lt;code&gt;P3&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;This allows &lt;strong&gt;Instruction 1&lt;/strong&gt; and &lt;strong&gt;Instruction 3&lt;/strong&gt; to be issued &lt;strong&gt;out of order or in parallel&lt;/strong&gt;, without waiting on one another.&lt;/li&gt;
&lt;li&gt;If, for example, &lt;code&gt;x + y&lt;/code&gt; causes a stall (e.g., due to a cache miss), the CPU can &lt;strong&gt;reorder&lt;/strong&gt; execution and run &lt;strong&gt;Instruction 2&lt;/strong&gt; or &lt;strong&gt;Instruction 3&lt;/strong&gt; first → keeping the pipeline active.&lt;/li&gt;

&lt;/ul&gt;
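&lt;p&gt;The renaming step above can be sketched in a few lines of Python. This is an illustrative model only → real CPUs do this in hardware with a physical register file and a free list, but the core idea (every architectural write gets a fresh physical register, so reusing &lt;code&gt;a&lt;/code&gt; creates no false dependency) is the same:&lt;/p&gt;

```python
# Register-renaming sketch (illustrative, not how hardware implements it).
# Each instruction is (destination, list_of_sources).
def rename(instructions):
    mapping = {}        # architectural name -&amp;gt; current physical register
    next_phys = 1
    renamed = []
    for dest, srcs in instructions:
        # Sources are read through the current mapping first.
        phys_srcs = [mapping.get(s, s) for s in srcs]
        preg = "P" + str(next_phys)
        next_phys += 1
        mapping[dest] = preg    # every write gets a fresh physical register
        renamed.append((preg, phys_srcs))
    return renamed

program = [
    ("a", ["x", "y"]),   # Instruction 1: a = x + y
    ("b", ["m", "n"]),   # Instruction 2: b = m * n
    ("a", ["p", "q"]),   # Instruction 3: a = p + q  (reuses 'a')
]
for preg, srcs in rename(program):
    print(preg, srcs)
# Both writes to 'a' now target different physical registers (P1 and P3),
# so the write-after-write dependency disappears.
```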

&lt;h2&gt;
  
  
  SIMD: Applying One Instruction to Multiple Data Elements
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjls8p5ksq8dpvfryiogc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjls8p5ksq8dpvfryiogc.png" alt="SIMD" width="800" height="589"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SIMD&lt;/strong&gt; (Single Instruction, Multiple Data) allows a single instruction to operate on multiple values at once. This is ideal for vector math, graphics, or matrix processing, where the same operation repeats across arrays. This exploits &lt;strong&gt;data-level parallelism (DLP)&lt;/strong&gt; → applying the same instruction to many data points.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Pseudo-vectorized addition using SIMD
float a[4] = {1.0, 2.0, 3.0, 4.0};
float b[4] = {10.0, 20.0, 30.0, 40.0};
float c[4];
for (int i = 0; i &amp;lt; 4; i++) {
  c[i] = a[i] + b[i];
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single SIMD instruction can perform all four of these additions at once.&lt;/p&gt;
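&lt;p&gt;You can get a feel for this "one instruction, many lanes" idea in plain Python using the SWAR trick (SIMD within a register): pack four narrow values into one wide integer, and a single addition updates every lane. This is only a sketch → it assumes no lane ever overflows its 8 bits, and real SIMD hardware handles lane boundaries for you:&lt;/p&gt;

```python
# SWAR sketch: four 8-bit lanes packed into one wide integer.
# Assumes no lane overflows past 255 (real SIMD hardware enforces lanes).
def pack(lanes):
    # Lane i occupies byte i of the wide integer.
    return sum(x * 256 ** i for i, x in enumerate(lanes))

def unpack(v, n=4):
    return [(v // 256 ** i) % 256 for i in range(n)]

a = pack([1, 2, 3, 4])
b = pack([10, 20, 30, 40])
c = a + b                 # ONE addition produces four element-wise sums
print(unpack(c))          # [11, 22, 33, 44]
```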

&lt;h2&gt;
  
  
  Multicore: Running Multiple Threads in Parallel
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;multicore processor&lt;/strong&gt; has multiple independent cores, each capable of executing its own thread. Threads may come from the same program (multithreading) or different programs (multiprocessing). This exploits &lt;strong&gt;thread-level parallelism (TLP)&lt;/strong&gt; → running independent streams of instructions in parallel.&lt;/p&gt;
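&lt;p&gt;As a small sketch of thread-level parallelism, the standard library lets you fan independent chunks of work out to a pool of workers. (Caveat: CPython’s interpreter lock means pure-Python bytecode does not run truly in parallel on threads → for CPU-bound work you would typically use processes instead; the structure is what this sketch illustrates.)&lt;/p&gt;

```python
# TLP sketch: independent chunks of work handed to a pool of workers.
# Note: in CPython, use ProcessPoolExecutor for real CPU parallelism;
# ThreadPoolExecutor is used here to keep the example self-contained.
from concurrent.futures import ThreadPoolExecutor

def work(chunk):
    # Each worker computes a partial sum of squares independently.
    return sum(x * x for x in chunk)

chunks = [range(0, 1000), range(1000, 2000), range(2000, 3000), range(3000, 4000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(work, chunks))
print(sum(partials))   # same result as the sequential computation
```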

&lt;h2&gt;
  
  
  All Three Combined: Parallelism at Every Level
&lt;/h2&gt;

&lt;p&gt;Modern CPUs combine &lt;strong&gt;superscalar, SIMD, and multicore&lt;/strong&gt; techniques to maximize throughput per cycle. This allows multiple threads to run across cores, with each core executing multiple instructions per cycle, and each instruction operating on multiple data values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;A CPU with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 cores (&lt;strong&gt;multicore&lt;/strong&gt;),&lt;/li&gt;
&lt;li&gt;each capable of issuing 4 instructions per cycle (&lt;strong&gt;superscalar&lt;/strong&gt;),&lt;/li&gt;
&lt;li&gt;and supporting 256-bit SIMD (processing 8 floats at once)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;can potentially perform: &lt;strong&gt;4 cores × 4 instructions × 8 data elements = 128 operations per cycle&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjenulchd9yr4nyi0acp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjenulchd9yr4nyi0acp3.png" alt="Summary" width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>cpu</category>
      <category>systemdesign</category>
      <category>tutorial</category>
      <category>superscalar</category>
    </item>
    <item>
      <title>CPU Pipelining: How Modern Processors Execute Instructions Faster</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Sun, 22 Jun 2025 16:29:26 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/cpu-pipelining-how-modern-processors-execute-instructions-faster-3i71</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/cpu-pipelining-how-modern-processors-execute-instructions-faster-3i71</guid>
      <description>&lt;p&gt;The key to modern processors’ speed lies in their ability to execute many instructions in parallel, and the foundation for that is a technique called &lt;strong&gt;pipelining&lt;/strong&gt;. Though introduced decades ago, pipelining remains central to how today’s CPUs achieve high performance, powering even the most advanced architectures.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore how pipelining works, how it improves CPU performance, and the common bottlenecks that can limit its efficiency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note — I have already written in-depth articles covering the memory hierarchy → including &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/understanding-cpu-cache-organization-and-structure-4o1o"&gt;cache&lt;/a&gt;, &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/memory-access-demystified-how-virtual-memory-caches-and-dram-impact-performance-5ec6"&gt;virtual memory&lt;/a&gt;, and &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/understanding-dram-internals-how-channels-banks-and-dram-access-patterns-impact-performance-3fef"&gt;DRAM&lt;/a&gt;, so this article will not dive deeply into memory accesses.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Way to Understand CPU Pipelining
&lt;/h2&gt;

&lt;p&gt;Imagine you work at a burger joint. You have to make 3 burgers, and each one needs to go through these 3 steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grill the patty (1 min)&lt;/li&gt;
&lt;li&gt;Assemble the burger (1 min)&lt;/li&gt;
&lt;li&gt;Wrap it (1 min)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Without Pipelining: One Worker Does Everything
&lt;/h3&gt;

&lt;p&gt;Imagine you have &lt;strong&gt;one worker&lt;/strong&gt; who knows how to do all three tasks: grilling, assembling, and wrapping. They make each burger from start to finish before moving on to the next one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minute 1–3: Burger 1&lt;/li&gt;
&lt;li&gt;Minute 4–6: Burger 2&lt;/li&gt;
&lt;li&gt;Minute 7–9: Burger 3&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total time for 3 burgers = 9 minutes&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The worker is skilled, but because they do everything alone, they can only work on one burger at a time. No overlap.&lt;/p&gt;

&lt;h3&gt;
  
  
  With Pipelining (Assembly Line): Each specialized worker does only their task
&lt;/h3&gt;

&lt;p&gt;Here, you have &lt;strong&gt;3 workers&lt;/strong&gt;, and each one is &lt;strong&gt;specialized&lt;/strong&gt; → they only know how to do their specific task.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minute 1: Worker 1 grills Burger 1&lt;/li&gt;
&lt;li&gt;Minute 2: Worker 1 grills Burger 2, Worker 2 assembles Burger 1&lt;/li&gt;
&lt;li&gt;Minute 3: Worker 1 grills Burger 3, Worker 2 assembles Burger 2, Worker 3 wraps Burger 1&lt;/li&gt;
&lt;li&gt;Minute 4: Worker 2 assembles Burger 3, Worker 3 wraps Burger 2&lt;/li&gt;
&lt;li&gt;Minute 5: Worker 3 wraps Burger 3&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total time for 3 burgers = 5 minutes.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the pipeline is full (minute 3 onwards), one burger finishes every minute.&lt;/p&gt;
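&lt;p&gt;The two schedules above can be captured in a tiny timing model. This sketch assumes every stage takes exactly one time unit (minute or cycle) and the pipeline never stalls:&lt;/p&gt;

```python
# Idealized timing model: 1 time unit per stage, no stalls.
def sequential_time(n_items, n_stages):
    # One worker does every stage of each item before starting the next.
    return n_items * n_stages

def pipelined_time(n_items, n_stages):
    # The pipeline fills in n_stages units, then finishes one item per unit.
    return n_stages + (n_items - 1)

print(sequential_time(3, 3))   # 9  → the single-worker burger joint
print(pipelined_time(3, 3))    # 5  → the assembly line
```

The same formula shows why pipelining shines at scale: for 1000 burgers (or instructions), the pipelined total is 1002 time units versus 3000 sequentially → approaching one completion per unit.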

&lt;h3&gt;
  
  
  How This Relates to CPUs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each burger = one CPU instruction&lt;/li&gt;
&lt;li&gt;Each step = a CPU pipeline stage (explained in next section)&lt;/li&gt;
&lt;li&gt;Each worker = a specialized hardware unit in the CPU&lt;/li&gt;
&lt;li&gt;Without pipelining: everything runs one at a time, in order&lt;/li&gt;
&lt;li&gt;With pipelining: stages overlap, and the CPU finishes one instruction per cycle (after the pipeline fills)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  CPU Pipeline: Stages and Specialized Units
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl07jaxw7h6m9qpogyuq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl07jaxw7h6m9qpogyuq4.png" alt="CPU pipelining" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As explained in the previous section, a CPU pipeline works like an assembly line, where each instruction moves through a series of &lt;strong&gt;stages&lt;/strong&gt;. Each &lt;strong&gt;stage&lt;/strong&gt; is handled by a dedicated hardware unit, optimized for just that task. The table below maps each stage to its corresponding function and hardware unit. Compare each row with the matching element in the diagram above.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fob9792ue48zeebdggdf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fob9792ue48zeebdggdf1.png" alt="Stages of CPU pipelines" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A CPU Pipeline Example
&lt;/h3&gt;

&lt;p&gt;Let’s walk through 3 simple CPU instructions and how they move through the pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I1: R1 = MEM[0x1000] ; Load value at memory[0x1000] into R1
I2: R2 = MEM[0x1004] ; Load value at memory[0x1004] into R2
I3: R3 = R1 + R2 ; Add R1 and R2, store result in R3

Let’s assume:
memory[0x1000] = 10
memory[0x1004] = 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following table summarizes how each instruction progresses through the pipeline stages over multiple cycles:&lt;/p&gt;

&lt;h3&gt;
  
  
  Cycle-by-Cycle View
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjca56csh4w5x3hplb99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjca56csh4w5x3hplb99.png" alt="Cycle-by-cycle view" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Instruction Details
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrlx3ws9c6jwhfawlri5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrlx3ws9c6jwhfawlri5.png" alt="Instruction Details" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottlenecks in CPU Pipelining
&lt;/h2&gt;

&lt;p&gt;While CPU pipelining speeds up processing by working on multiple instructions at once, it faces several challenges that can slow things down. These bottlenecks limit how efficiently the pipeline runs:&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Hazards
&lt;/h3&gt;

&lt;p&gt;When an instruction needs the result of a previous instruction that isn’t ready yet, the pipeline must pause to avoid errors. In the walkthrough above, &lt;strong&gt;instruction I3 stalls&lt;/strong&gt; in the decode stage because it depends on the results of I1 and I2, which aren’t ready yet. This stall is a classic data hazard.&lt;/p&gt;
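&lt;p&gt;The check a hazard-detection unit performs boils down to: does a later instruction read a register that an earlier, still-in-flight instruction writes? A minimal sketch, using the I1/I3 instructions from the walkthrough (the tuple encoding is an assumption of this sketch, not a real ISA format):&lt;/p&gt;

```python
# Minimal read-after-write (RAW) hazard check.
# An instruction is modeled as (destination_register, source_registers).
def raw_hazard(producer, consumer):
    dest, _ = producer
    _, sources = consumer
    return dest in sources

i1 = ("R1", ("MEM",))        # I1: R1 = MEM[0x1000]
i3 = ("R3", ("R1", "R2"))    # I3: R3 = R1 + R2

print(raw_hazard(i1, i3))    # True → I3 must stall or receive forwarded data
```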

&lt;p&gt;&lt;strong&gt;Solutions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stalling&lt;/strong&gt; the pipeline until data is ready.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data forwarding&lt;/strong&gt; to pass results directly between pipeline stages, bypassing stages like WB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compiler optimizations&lt;/strong&gt; like reordering instructions to avoid dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-order execution&lt;/strong&gt; so the CPU can run independent instructions while waiting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register renaming&lt;/strong&gt; to avoid false dependencies between instructions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Control Hazards (Branching)
&lt;/h3&gt;

&lt;p&gt;Sometimes, the CPU comes across a decision point in the program, such as an &lt;strong&gt;if-else statement&lt;/strong&gt; or a loop. At this moment, the CPU needs to figure out which set of instructions to run next. However, it often cannot know the correct path immediately because the condition it’s checking hasn’t been fully evaluated yet. This uncertainty causes the pipeline to &lt;strong&gt;pause or clear instructions&lt;/strong&gt; that were loaded based on a guess, which slows down processing. This delay is called a &lt;strong&gt;branch penalty&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Branch prediction&lt;/strong&gt; to guess the most likely path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speculative execution&lt;/strong&gt; to continue down a guessed path and discard it if wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delayed branching&lt;/strong&gt; (used in some architectures) to rearrange instructions after a branch.&lt;/li&gt;
&lt;/ul&gt;
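&lt;p&gt;A classic textbook branch predictor is the two-bit saturating counter: two mispredictions in a row are needed to flip the prediction, so a single anomalous outcome (like the final iteration of a loop) doesn’t immediately retrain it. A sketch → real predictors track many branches plus history bits, but the counter logic is the essence:&lt;/p&gt;

```python
# Two-bit saturating-counter branch predictor (textbook scheme, simplified:
# real predictors keep a table of counters indexed by branch address).
class TwoBitPredictor:
    def __init__(self):
        self.state = 0   # 0,1 → predict not-taken; 2,3 → predict taken

    def predict(self):
        return self.state in (2, 3)

    def update(self, taken):
        # Saturate at the ends so one off-pattern outcome doesn't flip us.
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

p = TwoBitPredictor()
outcomes = [True, True, True, False, True, True]   # a mostly-taken branch
mispredictions = 0
for actual in outcomes:
    if p.predict() != actual:
        mispredictions += 1
    p.update(actual)
print(mispredictions)   # 3 → two warm-up misses, then only the odd not-taken
```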

&lt;h3&gt;
  
  
  Structural Hazards
&lt;/h3&gt;

&lt;p&gt;Structural hazards happen when two or more instructions need to use the same specialized hardware resource at the same time, but the CPU has only one of that resource available.&lt;/p&gt;

&lt;p&gt;For example, if two instructions both want to use the Arithmetic Logic Unit (ALU) simultaneously, one instruction has to wait until the resource is free. This waiting slows down the pipeline because instructions can’t proceed in parallel as planned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More hardware units&lt;/strong&gt; (e.g., multiple ALUs or load/store units) per CPU core.&lt;/li&gt;
&lt;li&gt;Enhanced &lt;strong&gt;resource scheduling&lt;/strong&gt; to better manage shared hardware access.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pipeline Stalls (Bubbles)
&lt;/h3&gt;

&lt;p&gt;To resolve hazards or wait for data, the CPU sometimes inserts idle cycles where no instruction completes. For example, in the earlier pipeline walkthrough, instruction &lt;strong&gt;I3 has to stall&lt;/strong&gt; because it depends on the results of &lt;strong&gt;I1 and I2&lt;/strong&gt;, which aren’t ready yet. During this stall, the pipeline pauses at the decode stage, waiting for the needed data, which temporarily slows down the overall instruction flow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hazard detection units&lt;/strong&gt; to predict and manage stalls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-order execution&lt;/strong&gt; to keep the pipeline busy with other instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compiler scheduling&lt;/strong&gt; to rearrange instructions and minimize idle time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these bottlenecks and their solutions are complex topics on their own and deserve detailed explanations. They will be covered in separate articles for a deeper dive.&lt;/p&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>cpu</category>
      <category>tutorial</category>
      <category>performance</category>
    </item>
    <item>
      <title>Cache Coherence: How the MESI Protocol Keeps Multi-Core CPUs Consistent</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Thu, 19 Jun 2025 23:59:08 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/cache-coherence-how-the-mesi-protocol-keeps-multi-core-cpus-consistent-170j</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/cache-coherence-how-the-mesi-protocol-keeps-multi-core-cpus-consistent-170j</guid>
      <description>&lt;p&gt;Modern multi-core CPUs depend on caches to accelerate memory access and improve performance. However, when multiple cores cache the same memory address, maintaining a &lt;strong&gt;consistent view of memory&lt;/strong&gt; across all cores and main memory (known as &lt;strong&gt;cache coherence&lt;/strong&gt;) becomes a tricky problem.&lt;/p&gt;

&lt;p&gt;One of the most widely used solutions to this challenge is the &lt;strong&gt;MESI cache coherence protocol&lt;/strong&gt;. In this article, we’ll break down what cache coherence means, why it’s important, and how the MESI protocol ensures your multi-core CPU operates reliably and efficiently.&lt;/p&gt;

&lt;p&gt;If you’re interested in diving deeper into how caches are organized and structured, I have written a separate article covering that in detail → &lt;a href="https://dev.to/sachin_tolay_052a7e539e57/understanding-cpu-cache-organization-and-structure-4o1o"&gt;Understanding CPU Cache Organization and Structure&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Cache Coherence And Why Does It Matter?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu716ehc0smo9iylepyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu716ehc0smo9iylepyy.png" alt="Memory Hierarchy" width="743" height="778"&gt;&lt;/a&gt;&lt;br&gt;
When multiple cores cache the &lt;strong&gt;same&lt;/strong&gt; memory address, and one of them &lt;strong&gt;updates it&lt;/strong&gt;, how do we make sure all other cores see the updated value? Example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Core 1&lt;/strong&gt; and &lt;strong&gt;Core 2&lt;/strong&gt; both cache the value at memory address &lt;strong&gt;X&lt;/strong&gt;, which initially holds &lt;strong&gt;10&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;At this point, both cores have their own local copies of &lt;strong&gt;X&lt;/strong&gt; (value: &lt;strong&gt;10&lt;/strong&gt;) in their private &lt;strong&gt;L1&lt;/strong&gt; caches.&lt;/li&gt;
&lt;li&gt;Now, &lt;strong&gt;Core 1&lt;/strong&gt; updates &lt;strong&gt;X&lt;/strong&gt; to &lt;strong&gt;20&lt;/strong&gt; in its own &lt;strong&gt;L1&lt;/strong&gt; cache.&lt;/li&gt;
&lt;li&gt;However, &lt;strong&gt;Core 2’s&lt;/strong&gt; cache still holds the old value → &lt;strong&gt;10&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Worse, &lt;strong&gt;main memory&lt;/strong&gt; also still has the outdated value: 10.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, if &lt;strong&gt;Core 2&lt;/strong&gt; tries to read &lt;strong&gt;X&lt;/strong&gt;, it will retrieve the outdated value (&lt;strong&gt;10&lt;/strong&gt;) from its own cache, unaware that &lt;strong&gt;Core 1&lt;/strong&gt; has already updated it to &lt;strong&gt;20&lt;/strong&gt;. This kind of mismatch can lead to incorrect or unpredictable application behavior. This mismatch is what &lt;strong&gt;cache coherence&lt;/strong&gt; aims to solve: &lt;em&gt;Ensuring that all cores (and main memory) have a consistent and up-to-date view of memory&lt;/em&gt;.&lt;/p&gt;
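&lt;p&gt;The five steps above can be reproduced with two dictionaries standing in for the private L1 caches (a deliberately naive model → it has no coherence protocol, which is exactly the problem):&lt;/p&gt;

```python
# Two private caches with NO coherence protocol → the staleness problem.
main_memory = {"X": 10}
core1_cache = dict(main_memory)   # Core 1 caches X = 10
core2_cache = dict(main_memory)   # Core 2 caches X = 10

core1_cache["X"] = 20             # Core 1 updates only its private copy

print(core1_cache["X"])   # 20 → the only up-to-date copy
print(core2_cache["X"])   # 10 → stale
print(main_memory["X"])   # 10 → stale
```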

&lt;h2&gt;
  
  
  How CPUs Handle Writes: Write-Through vs Write-Back
&lt;/h2&gt;

&lt;p&gt;Before we talk about how cache coherence is maintained, it’s important to understand how caches handle writes, because that directly impacts why coherence is even needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Write-Through Caching
&lt;/h3&gt;

&lt;p&gt;In this strategy, whenever a core writes to a cache line, the same update is immediately written to main memory as well. This keeps memory always up-to-date, making coherence simpler to maintain. But there’s a catch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every write results in a memory operation, which increases &lt;strong&gt;memory bandwidth usage&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;It introduces &lt;strong&gt;latency&lt;/strong&gt;, as writes must wait for memory.&lt;/li&gt;
&lt;li&gt;And most importantly, it defeats the purpose of having fast, local caches → which is to reduce the need to access slower main memory in the first place.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Write-Back Caching (Used in Most CPUs Today)
&lt;/h3&gt;

&lt;p&gt;In write-back caching, when a core updates a value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The change is made only in the core’s &lt;strong&gt;private&lt;/strong&gt; cache.&lt;/li&gt;
&lt;li&gt;The updated value is not written to &lt;strong&gt;main memory&lt;/strong&gt; immediately.&lt;/li&gt;
&lt;li&gt;Instead, the new value stays in the cache and is written back only when the cache line is evicted or needs to be shared.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s exactly where &lt;strong&gt;cache coherence protocols like MESI&lt;/strong&gt; are needed → to ensure all cores always see the correct, updated data.&lt;/p&gt;
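&lt;p&gt;A write-back cache can be sketched as a map of lines carrying a dirty bit: writes only mark the line dirty, and main memory is updated at eviction time. The class and method names here are illustrative, not from any real implementation:&lt;/p&gt;

```python
# Write-back cache sketch: writes set a dirty bit; main memory is updated
# only when the line is evicted (names and structure are illustrative).
class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                       # addr → [value, dirty]

    def read(self, addr):
        if addr in self.lines:
            return self.lines[addr][0]        # cache hit
        value = self.memory[addr]             # miss → fill from memory, clean
        self.lines[addr] = [value, False]
        return value

    def write(self, addr, value):
        self.lines[addr] = [value, True]      # dirty → memory is now stale

    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.memory[addr] = value         # write back only dirty lines

mem = {0x1000: 10}
cache = WriteBackCache(mem)
cache.write(0x1000, 20)
print(mem[0x1000])      # 10 → memory is stale until eviction
cache.evict(0x1000)
print(mem[0x1000])      # 20 → written back
```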

&lt;h2&gt;
  
  
  MESI Protocol
&lt;/h2&gt;

&lt;p&gt;MESI stands for the four states each cache line can have → &lt;strong&gt;Modified (M)&lt;/strong&gt;, &lt;strong&gt;Exclusive (E)&lt;/strong&gt;, &lt;strong&gt;Shared (S)&lt;/strong&gt; and &lt;strong&gt;Invalid (I)&lt;/strong&gt;. These states help the CPU know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which core has the most recent version of a piece of data,&lt;/li&gt;
&lt;li&gt;Whether that version is the same as what’s in main memory,&lt;/li&gt;
&lt;li&gt;What the CPU should do when a core tries to read or write that data.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Modified (M) → “I changed it, and no one else has it.”
&lt;/h3&gt;

&lt;p&gt;If a CPU core has a cache line in the &lt;strong&gt;Modified&lt;/strong&gt; state:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;That core is the &lt;strong&gt;only&lt;/strong&gt; one with the &lt;strong&gt;latest&lt;/strong&gt; version of the data.&lt;/li&gt;
&lt;li&gt;The data has been changed and no longer matches what’s in main memory.&lt;/li&gt;
&lt;li&gt;If another core needs the data, the CPU can either:

&lt;ul&gt;
&lt;li&gt;Send the updated data directly to the other core (cache-to-cache transfer), or&lt;/li&gt;
&lt;li&gt;Write it back to main memory if needed (e.g., on eviction).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Exclusive (E) → “I have the only clean copy.”
&lt;/h3&gt;

&lt;p&gt;If a CPU core has a cache line in the &lt;strong&gt;Exclusive&lt;/strong&gt; state:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;That core is the &lt;strong&gt;only&lt;/strong&gt; one with the &lt;strong&gt;latest&lt;/strong&gt; version of the data.&lt;/li&gt;
&lt;li&gt;The data &lt;strong&gt;matches&lt;/strong&gt; the main memory → it has not been modified yet.&lt;/li&gt;
&lt;li&gt;The core can:

&lt;ul&gt;
&lt;li&gt;Read the data freely.&lt;/li&gt;
&lt;li&gt;Write to it directly, which promotes the cache line to the &lt;strong&gt;Modified&lt;/strong&gt; state.&lt;/li&gt;
&lt;li&gt;No other core has a copy, so there’s &lt;strong&gt;no need for invalidation&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Shared (S) → “Others may have it too, and it’s read-only.”
&lt;/h3&gt;

&lt;p&gt;If a CPU core has a cache line in the &lt;strong&gt;Shared&lt;/strong&gt; state:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;One or more cores&lt;/strong&gt; may have a copy of the &lt;strong&gt;latest&lt;/strong&gt; version of the data in their caches; &lt;strong&gt;none&lt;/strong&gt; have modified it.&lt;/li&gt;
&lt;li&gt;The data in all those caches &lt;strong&gt;matches&lt;/strong&gt; the main memory → it’s the &lt;strong&gt;latest clean version&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The core can:

&lt;ul&gt;
&lt;li&gt;Read the data freely.&lt;/li&gt;
&lt;li&gt;Not write to it unless it first &lt;strong&gt;invalidates&lt;/strong&gt; all other copies and gains &lt;strong&gt;exclusive&lt;/strong&gt; access.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Invalid (I) → “My copy is stale or gone.”
&lt;/h3&gt;

&lt;p&gt;If a CPU core has a cache line in the &lt;strong&gt;Invalid&lt;/strong&gt; state:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The cache line is &lt;strong&gt;not valid&lt;/strong&gt; → the core cannot use it. It may have been:

&lt;ul&gt;
&lt;li&gt;Evicted due to limited cache space,&lt;/li&gt;
&lt;li&gt;Invalidated by another core’s write,&lt;/li&gt;
&lt;li&gt;Or never loaded at all.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The core:

&lt;ul&gt;
&lt;li&gt;Cannot read or write to this data.&lt;/li&gt;
&lt;li&gt;Must fetch a fresh copy from memory or another core’s cache to use it.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Any access will cause a &lt;strong&gt;cache miss&lt;/strong&gt; and trigger MESI protocol actions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44qjjq9aus2z8txbn0vh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44qjjq9aus2z8txbn0vh.png" alt="MESI State Summary" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;
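&lt;p&gt;The four states above can be sketched as a small state machine for a single cache line. This is a heavily simplified model → real MESI implementations add transient states and bus acknowledgements, but the transitions mirror the descriptions above:&lt;/p&gt;

```python
# Simplified MESI transitions for one cache line, from one core's view.
# (Illustrative only: real protocols add transient states and bus handshakes.)
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

def on_local_read(state, others_have_copy):
    # A miss fills the line: Exclusive if we are alone, Shared otherwise.
    if state == I:
        return S if others_have_copy else E
    return state

def on_local_write(state):
    # Writing requires ownership; a Shared line first invalidates other
    # copies (BusUpgr), then every path lands in Modified.
    return M

def on_remote_read(state):
    # Another core read the line: dirty or exclusive copies degrade to Shared.
    if state in (M, E):
        return S
    return state

def on_remote_write(state):
    # Another core gained ownership: our copy is stale.
    return I

line = on_local_read(I, others_have_copy=False)   # Exclusive
line = on_local_write(line)                       # Modified
line = on_remote_read(line)                       # Shared
line = on_remote_write(line)                      # Invalid
print(line)
```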

&lt;h2&gt;
  
  
  How Caches Communicate: Bus Snooping and Cache-to-Cache Transfer
&lt;/h2&gt;

&lt;p&gt;Caches communicate in two key ways: &lt;strong&gt;Bus Snooping&lt;/strong&gt; and &lt;strong&gt;Cache-to-Cache Transfers&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bus Snooping: A Subscription Mechanism
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mzyuhwrsx02u65b644r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mzyuhwrsx02u65b644r.png" alt="Bus Snooping" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bus snooping&lt;/strong&gt; is a hardware technique where each core monitors the &lt;strong&gt;shared&lt;/strong&gt; system bus to keep an eye on what other cores are doing with memory.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every time a core &lt;strong&gt;reads or writes&lt;/strong&gt; to a memory address, that action is &lt;strong&gt;broadcast&lt;/strong&gt; on the system bus.&lt;/li&gt;
&lt;li&gt;Other cores &lt;strong&gt;snoop (listen)&lt;/strong&gt; to the bus.&lt;/li&gt;
&lt;li&gt;If another core has a copy of the requested data, it can:

&lt;ul&gt;
&lt;li&gt;Respond with the most recent version (in Modified or Exclusive state).&lt;/li&gt;
&lt;li&gt;Invalidate or update its own cached copy if needed.&lt;/li&gt;
&lt;li&gt;Trigger a state change in its MESI cache line.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Cache-to-Cache Transfer: A Reply Mechanism
&lt;/h3&gt;

&lt;p&gt;When a core issues a memory read request, and another core already has the most recent copy of the requested data in its cache, it can respond directly → this is called a &lt;strong&gt;cache-to-cache transfer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of fetching the data from main memory, the owning core:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Snoops the request via the bus,&lt;/li&gt;
&lt;li&gt;Recognizes that it holds the latest copy, and&lt;/li&gt;
&lt;li&gt;Sends the data directly to the requesting core.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This not only satisfies the request more quickly, but it also saves the latency of accessing main memory.&lt;/p&gt;

&lt;p&gt;In this section, we just focused on how caches communicate over the system bus to stay coordinated. The specific actions caches take in various scenarios, will be covered in detail in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  MESI State Transitions
&lt;/h2&gt;

&lt;p&gt;We’ll break it down into five common scenarios that illustrate most of the transitions you’ll encounter. Let’s assume we have 3 cores → &lt;strong&gt;Core 1&lt;/strong&gt;, &lt;strong&gt;Core 2&lt;/strong&gt; and &lt;strong&gt;Core 3&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Read Scenario: Core 1 Reads a Line Not in Any Cache
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Core 1 issues a memory &lt;strong&gt;read request (BusRd)&lt;/strong&gt;, which is &lt;strong&gt;broadcast&lt;/strong&gt; on the system bus.&lt;/li&gt;
&lt;li&gt;Core 2 and Core 3 &lt;strong&gt;snoop&lt;/strong&gt; this request but do &lt;strong&gt;not&lt;/strong&gt; have the line cached. Since no other cache has the line, none respond to the broadcast.&lt;/li&gt;
&lt;li&gt;Core 1 then fetches the data from &lt;strong&gt;main memory&lt;/strong&gt; in &lt;strong&gt;Exclusive (E)&lt;/strong&gt; state, indicating this core is the sole owner.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  State Transitions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38w7l4iyl3c15r2h5yx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38w7l4iyl3c15r2h5yx2.png" alt="State Transitions" width="546" height="154"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Second Read Scenario: Core 2 Reads the Same Line
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Core 2 issues a &lt;strong&gt;read request (BusRd)&lt;/strong&gt;, which is &lt;strong&gt;broadcast&lt;/strong&gt; on the system bus.&lt;/li&gt;
&lt;li&gt;Core 1 snoops and sees it has the line in &lt;strong&gt;Exclusive (E)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 1 supplies the data directly to Core 2 via &lt;strong&gt;cache-to-cache transfer&lt;/strong&gt;, saving a slower memory access.&lt;/li&gt;
&lt;li&gt;Both Core 1 and Core 2 downgrade their cache lines to &lt;strong&gt;Shared (S)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 3 snoops but does not have the line, so doesn’t respond.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  State Transitions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc719l0cwyshdy6n6py2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc719l0cwyshdy6n6py2c.png" alt="State Transitions" width="550" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  First Write Scenario: Core 1 Writes to the Shared Line
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Core 1 issues a &lt;strong&gt;write intent request (BusUpgr)&lt;/strong&gt;, which is broadcast on the bus.&lt;/li&gt;
&lt;li&gt;Core 2 and Core 3 snoop the broadcast:

&lt;ul&gt;
&lt;li&gt;Core 2 invalidates its &lt;strong&gt;Shared (S)&lt;/strong&gt; copy → &lt;strong&gt;Invalid (I)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 3 does nothing (already Invalid).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Core 1 waits for invalidation acknowledgments.&lt;/li&gt;
&lt;li&gt;Core 1 upgrades its cache line directly to &lt;strong&gt;Modified (M)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 1 now has exclusive write access; Core 2 and Core 3 have invalid copies.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  State Transitions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w1t7wfbq4yr8mv5pnzm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w1t7wfbq4yr8mv5pnzm.png" alt="State Transitions" width="564" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Second Write Scenario: Core 3 Writes After Core 1’s Modified
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Core 3 issues a &lt;strong&gt;read-for-ownership request (BusRdX)&lt;/strong&gt;, which is broadcast on the bus.&lt;/li&gt;
&lt;li&gt;Core 1 snoops and sees it holds the line in &lt;strong&gt;Modified (M)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 1 supplies the updated data directly to Core 3 (&lt;strong&gt;cache-to-cache transfer&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Core 1 invalidates its &lt;strong&gt;Modified (M)&lt;/strong&gt; copy → &lt;strong&gt;Invalid (I)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 3 caches the line as &lt;strong&gt;Modified (M)&lt;/strong&gt; and performs the write.&lt;/li&gt;
&lt;li&gt;Core 2 remains Invalid.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  State Transitions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2to7992j900kkw4ojjg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2to7992j900kkw4ojjg.png" alt="State Transitions" width="578" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Concurrent Write Scenario: Core 1 and Core 2 Try to Write Simultaneously
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Both Core 1 and Core 2 issue &lt;strong&gt;write intent requests (BusRdX)&lt;/strong&gt; around the same time.&lt;/li&gt;
&lt;li&gt;The bus &lt;strong&gt;orders&lt;/strong&gt; the write intents, allowing only one core (say Core 1) to perform its MESI transition to &lt;strong&gt;Modified (M)&lt;/strong&gt;. Core 2 must &lt;strong&gt;wait&lt;/strong&gt; for its turn or &lt;strong&gt;retry&lt;/strong&gt; once the bus is available again.&lt;/li&gt;
&lt;li&gt;Core 1 proceeds to upgrade its line to &lt;strong&gt;Modified (M)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Core 2’s request is delayed or retried after Core 1’s invalidations.&lt;/li&gt;
&lt;li&gt;Other cores snoop and invalidate as needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every action, such as a read or write request on the bus, is &lt;strong&gt;completed fully and indivisibly&lt;/strong&gt; before another conflicting request starts. The bus &lt;strong&gt;serializes&lt;/strong&gt; these requests, ensuring no two cores simultaneously hold conflicting states for the same cache line. This atomicity guarantees &lt;strong&gt;data consistency and correctness&lt;/strong&gt; across all caches.&lt;/p&gt;
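&lt;p&gt;The scenarios above can be condensed into a small state machine. The sketch below is a simplified model of the MESI transition rules → it tracks a single cache line’s state in one core, and is purely illustrative, not a hardware design:&lt;/p&gt;

```c
/* Simplified MESI transition rules for one cache line.
   Illustrative model only, not a hardware design. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;
typedef enum { BUS_RD, BUS_RDX, BUS_UPGR } BusEvent;

/* A core reads: Exclusive if no other cache holds the line,
   Shared otherwise. Read hits keep the current state. */
MesiState on_local_read(MesiState s, int other_copies_exist) {
    if (s == INVALID)
        return other_copies_exist ? SHARED : EXCLUSIVE;
    return s;
}

/* A core writes: once BusUpgr/BusRdX has invalidated every
   other copy, the line becomes Modified. */
MesiState on_local_write(MesiState s) {
    (void)s;
    return MODIFIED;
}

/* A snooping core reacts to another core's bus request. */
MesiState on_snoop(MesiState s, BusEvent e) {
    if (s == INVALID)
        return INVALID;   /* no copy, nothing to do */
    if (e == BUS_RD)
        return SHARED;    /* supply data if M or E, then downgrade */
    return INVALID;       /* BUS_RDX or BUS_UPGR: invalidate */
}
```

&lt;p&gt;Replaying the four scenarios against these rules reproduces the transitions in the diagrams: I → E on a sole read, E → S when a second reader appears, S → M after invalidations, and M → I when another core takes ownership.&lt;/p&gt;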

&lt;h2&gt;
  
  
  Limitations of MESI
&lt;/h2&gt;

&lt;p&gt;While MESI effectively maintains cache coherence in many systems, it has some important limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;False Sharing&lt;/strong&gt; → MESI operates at &lt;strong&gt;cache line granularity&lt;/strong&gt;, not variable granularity. Even if two threads access &lt;strong&gt;different variables&lt;/strong&gt;, if those variables fall on the same cache line:

&lt;ul&gt;
&lt;li&gt;MESI treats them as &lt;strong&gt;shared data&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;This causes &lt;strong&gt;unnecessary invalidations&lt;/strong&gt;, even though no real data conflict exists.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability Issues&lt;/strong&gt; → MESI relies on &lt;strong&gt;bus snooping&lt;/strong&gt;, where all cores must snoop every memory transaction:

&lt;ul&gt;
&lt;li&gt;As the number of cores increases, the snooping traffic grows rapidly.&lt;/li&gt;
&lt;li&gt;More cores mean more invalidations, more broadcasts, and more bus congestion.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency on Writes&lt;/strong&gt; → To write to a cache line that’s shared, a core must broadcast a write intent, wait for other cores to invalidate their copies, then perform the write. This adds latency, especially when multiple cores frequently access the same data, or when contention is high.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Built-in Support for Synchronization&lt;/strong&gt; → MESI doesn’t handle higher-level synchronization (like locks or barriers). It only ensures &lt;strong&gt;data coherence&lt;/strong&gt;, not &lt;strong&gt;program correctness&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>cpu</category>
      <category>programming</category>
      <category>memory</category>
      <category>learning</category>
    </item>
    <item>
      <title>Understanding CPU Cache Organization and Structure</title>
      <dc:creator>Sachin Tolay</dc:creator>
      <pubDate>Thu, 19 Jun 2025 15:58:25 +0000</pubDate>
      <link>https://forem.com/sachin_tolay_052a7e539e57/understanding-cpu-cache-organization-and-structure-4o1o</link>
      <guid>https://forem.com/sachin_tolay_052a7e539e57/understanding-cpu-cache-organization-and-structure-4o1o</guid>
      <description>&lt;p&gt;Software performance is deeply influenced by how efficiently memory is accessed. The story behind memory access latency is layered: it begins with CPU caches, traverses through virtual memory translation, and ultimately reaches physical DRAM. Each layer introduces its own overhead and optimization challenges.&lt;/p&gt;

&lt;p&gt;If you’re interested in learning more about how virtual memory and DRAM work, you may want to explore these related articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/sachin_tolay_052a7e539e57/understanding-dram-internals-how-channels-banks-and-dram-access-patterns-impact-performance-3fef"&gt;Understanding DRAM Internals: How Channels, Banks, and DRAM Access Patterns Impact Performance&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/sachin_tolay_052a7e539e57/memory-access-demystified-how-virtual-memory-caches-and-dram-impact-performance-5ec6"&gt;Memory Access Demystified: How Virtual Memory, Caches, and DRAM Impact Performance&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article focuses solely on &lt;strong&gt;CPU caches&lt;/strong&gt;, the second fastest layer in the memory hierarchy after CPU registers. We’ll dive into the structural design of CPU caches, how they manage data placement and lookup, and how this affects the speed of your code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Do We Need CPU Caches?
&lt;/h2&gt;

&lt;p&gt;CPU caches were introduced to bridge the vast speed gap between fast processors and much slower main memory. Without addressing this gap, CPUs would frequently stall, waiting hundreds of CPU cycles for data.&lt;/p&gt;

&lt;p&gt;By the late 1960s, this mismatch had become a major bottleneck. Computer architects considered three possible solutions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Make main memory faster&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Technologies like magnetic core and early &lt;strong&gt;DRAM&lt;/strong&gt; were too slow to match CPU speeds.&lt;/li&gt;
&lt;li&gt;Faster memory (&lt;strong&gt;SRAM&lt;/strong&gt;) was either too expensive or not scalable to large sizes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the CPU more efficiently&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Techniques like pipelining and instruction reordering were emerging.&lt;/li&gt;
&lt;li&gt;While they helped hide latency, they didn’t eliminate it, and the memory bottleneck still remained.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Introduce a small, fast buffer (cache)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SRAM&lt;/strong&gt; wasn’t affordable or scalable for large memory sizes, but it was ideal for implementing small, fast memory layers close to the CPU.&lt;/li&gt;
&lt;li&gt;Studies of real-world workloads → from operating systems and web servers to machine learning and databases → show that a &lt;strong&gt;small portion&lt;/strong&gt; of memory handles the &lt;strong&gt;majority&lt;/strong&gt; of accesses. This concentration of activity enables caches to be highly effective despite their limited size.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Though all three directions have seen continued development, caches offered the best trade-off between performance, cost, and complexity at the time and became a fundamental part of CPU architecture from the 1970s onward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Hierarchy: Where Do Caches Fit
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjsn5j0mwhxo1gi216wd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjsn5j0mwhxo1gi216wd.png" alt="Memory Hierarchy" width="800" height="418"&gt;&lt;/a&gt;&lt;br&gt;
Modern computer systems organize memory into a hierarchy to balance &lt;strong&gt;speed, capacity, and cost&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At the top are &lt;strong&gt;CPU registers&lt;/strong&gt; → tiny, ultra-fast storage tightly coupled with the processor.&lt;/li&gt;
&lt;li&gt;Just below the registers are &lt;strong&gt;CPU caches&lt;/strong&gt; → small, fast buffers that store recently or frequently used data. They are organized into levels:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L1 and L2 caches&lt;/strong&gt; are private to each CPU core.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L3 cache&lt;/strong&gt; is larger and &lt;strong&gt;shared&lt;/strong&gt; across all cores. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;When data is needed, the CPU checks these caches in order: &lt;strong&gt;L1 → L2 → L3&lt;/strong&gt;. If the data is not found (a &lt;strong&gt;cache miss&lt;/strong&gt;), it falls back to &lt;strong&gt;main memory (DRAM)&lt;/strong&gt;, which is significantly slower.&lt;/li&gt;
&lt;li&gt;Each level down the hierarchy offers &lt;strong&gt;more capacity&lt;/strong&gt; and &lt;strong&gt;lower cost per bit&lt;/strong&gt;, but at the cost of &lt;strong&gt;higher latency&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7t0x6kzilp1jin6wv98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7t0x6kzilp1jin6wv98.png" alt="Memory Hierarchy Latency" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Cache Organization And Structure
&lt;/h2&gt;

&lt;p&gt;Caches store data in fixed-size chunks called &lt;strong&gt;cache lines&lt;/strong&gt;, usually &lt;strong&gt;64 bytes&lt;/strong&gt; long. These lines are grouped into &lt;strong&gt;sets&lt;/strong&gt;, and how many cache lines each set can hold depends on the cache’s &lt;strong&gt;associativity&lt;/strong&gt; (also called the number of &lt;strong&gt;ways&lt;/strong&gt;). This design is similar to a hashmap: each set acts like a hash bucket, and the multiple ways within a set are like chained entries for resolving collisions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct-mapped (1-way associative)&lt;/strong&gt;: Each memory address maps to exactly one set and one specific cache line → so during a lookup, only that single line needs to be checked to determine if the memory address’s data is present.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdmla0auakku5qhi9a1q.png" alt="Direct-mapped" width="601" height="379"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;N-way set-associative (e.g., 2-way associative, 4-way associative)&lt;/strong&gt;: Each set holds multiple cache lines → for example, 4 lines in a 4-way associative cache. A memory address can be stored in any of these lines, so during a lookup, all lines in the set are checked to determine if the memory address’s data is present in the cache.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6le97mwyilv07g065ckk.png" alt="N-way set associative" width="682" height="727"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fully associative&lt;/strong&gt;: There are no set divisions → any memory address can be stored in any cache line. During a lookup, all cache lines must be checked to determine if the memory address’s data is present in the cache.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuajdahhoq5xd89p2sdgj.png" alt="Fully Associative" width="651" height="225"&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How a Memory Address Maps to Cache: A Hashmap Analogy
&lt;/h2&gt;

&lt;p&gt;CPU caches work a lot like hashmaps. Just as hashmaps use &lt;strong&gt;keys&lt;/strong&gt; to store and retrieve values efficiently, caches use &lt;strong&gt;memory addresses&lt;/strong&gt; to store and find data quickly. Let’s break it down.&lt;/p&gt;
&lt;h3&gt;
  
  
  Cache Lookup: Using Index, Tag, and Offset from a Physical Address
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foivvdxtrtnypqzovhgq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foivvdxtrtnypqzovhgq2.png" alt="Physical Address Breakdown" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Index → Finding the Right Set
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Hashmap analogy: this is like using a hash function to find the right bucket.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As mentioned earlier, the cache is divided into sets, and each set holds multiple cache lines (depending on the associativity). The &lt;strong&gt;index&lt;/strong&gt; bits of the memory address are used to select the corresponding set.&lt;/p&gt;

&lt;p&gt;For example → if your cache has 256 sets, 8 bits of the address would serve as the index (since 2⁸ = 256). This tells the CPU which set to search in → just like a hashmap uses a hash of the key to find a bucket.&lt;/p&gt;
&lt;h4&gt;
  
  
  Tag → Matching the Entry Inside the Set (Handling Collisions)
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Hashmap analogy: once you’re in the right bucket, compare the stored keys to resolve collisions.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because many memory addresses can share the same index (i.e., they map to the same set), the cache must resolve these &lt;strong&gt;collisions&lt;/strong&gt;. This is where the &lt;strong&gt;tag&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;As mentioned earlier, each set in an N-way set-associative cache contains &lt;strong&gt;N&lt;/strong&gt; cache lines. When the CPU accesses a memory address, it uses the &lt;strong&gt;index&lt;/strong&gt; to locate the set, and then compares the &lt;strong&gt;tag&lt;/strong&gt; from the incoming address against the tags stored in all &lt;strong&gt;N cache lines of that set&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a match is found → cache hit&lt;/li&gt;
&lt;li&gt;If no match → cache miss, and one of the existing lines may be evicted to make room&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Offset → Extracting the Right Byte
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Hashmap analogy: retrieving a specific part of the value associated with the key.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As mentioned earlier, each cache line typically holds a block of contiguous memory, such as &lt;strong&gt;64 bytes&lt;/strong&gt;. Once the correct cache line is identified through a &lt;strong&gt;cache hit&lt;/strong&gt; (via index and tag comparison), the CPU uses the &lt;strong&gt;offset&lt;/strong&gt; bits from the address to select the &lt;strong&gt;exact byte&lt;/strong&gt; within that cache line. For a 64-byte cache line, we need &lt;strong&gt;6 bits&lt;/strong&gt; for the offset → since 2⁶ = 64.&lt;/p&gt;

&lt;p&gt;Now, we have seen how the CPU uses the index, tag, and offset to find data in the cache on a cache hit. But what if none of the tags in the set match? This is a &lt;strong&gt;cache miss&lt;/strong&gt;, and the CPU must fetch the entire &lt;strong&gt;data block&lt;/strong&gt; (the size of a cache line) from slower main memory. Since a 64-bit data bus transfers 8 bytes (64 bits) per bus cycle, it takes roughly 8 bus transfers to move a 64-byte &lt;strong&gt;data block&lt;/strong&gt; into the &lt;strong&gt;cache line&lt;/strong&gt;.&lt;/p&gt;
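&lt;p&gt;The index/tag/offset split can be expressed directly in code. A small sketch for the illustrative geometry used in this article → 64-byte lines (6 offset bits) and 256 sets (8 index bits); real caches vary:&lt;/p&gt;

```c
/* Address decomposition for an illustrative cache geometry:
   64-byte lines (6 offset bits), 256 sets (8 index bits). */

unsigned long line_offset(unsigned long addr) {
    return addr % 64;            /* low 6 bits: byte within the line */
}

unsigned long set_index(unsigned long addr) {
    return (addr / 64) % 256;    /* next 8 bits: which set to search */
}

unsigned long line_tag(unsigned long addr) {
    return addr / (64 * 256);    /* remaining high bits: the tag */
}

unsigned long block_base(unsigned long addr) {
    return addr - (addr % 64);   /* aligned 64-byte block address */
}
```

&lt;p&gt;For address 0x1A3F this gives offset 0x3F, set index 104, tag 0, and an aligned block base of 0x1A00.&lt;/p&gt;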
&lt;h3&gt;
  
  
  Cache Update: What Happens on a Cache Miss
&lt;/h3&gt;

&lt;p&gt;When a cache miss occurs, the cache must be updated with the data block containing the requested memory address. Here’s how this process works step-by-step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fetching Data from Main Memory&lt;/strong&gt; → The CPU reads the entire data block (e.g., 64 bytes) starting at the &lt;strong&gt;aligned&lt;/strong&gt; memory address from main memory. For example, if the cache line size is 64 bytes and the CPU requests address &lt;strong&gt;0x1A3F&lt;/strong&gt;, the cache loads the whole 64-byte block starting at the aligned base address &lt;strong&gt;0x1A00&lt;/strong&gt; (since 64 bytes require 6 offset bits, the lower 6 bits are cleared).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locating a Slot in the Set&lt;/strong&gt; → The block is loaded into the cache set identified by the index bits. If the cache is N-way set associative, the data can be placed in any of the N cache lines within that set. If there’s a free line available, the new block is stored there directly. Otherwise, the cache must evict an existing line based on its replacement policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evicting an Existing Line if Needed&lt;/strong&gt; → If the set is full (all lines are occupied), the cache uses a replacement policy, like Least Recently Used (LRU), to decide which existing cache line to evict to make room for the new block.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Updating the Tag&lt;/strong&gt; → The tag field of the selected cache line is updated to reflect the new block’s address, enabling future hits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Populating the Cache Hierarchy&lt;/strong&gt; → A cache miss in L1 doesn’t just populate L1 → in typical (inclusive) hierarchies, the fetched block is inserted into &lt;strong&gt;all relevant levels of the cache hierarchy&lt;/strong&gt; (L1, L2, and L3).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resuming Access&lt;/strong&gt; → After the update, the CPU can access the requested byte in the cache line using the offset bits, just as it would on a cache hit.&lt;/li&gt;
&lt;/ol&gt;
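&lt;p&gt;The steps above can be sketched as a toy direct-mapped cache that models only the tag/valid bookkeeping → the data itself, the lower cache levels, and the replacement policy (trivial here, since each set holds a single line) are omitted, and the sizes are illustrative:&lt;/p&gt;

```c
/* Toy direct-mapped cache: 256 sets, one 64-byte line each.
   Models only tag/valid bookkeeping, not the data. */
enum { LINE = 64, NSETS = 256 };

struct CacheLine { int valid; unsigned long tag; };
struct CacheLine sets[NSETS];
long hits = 0, misses = 0;

/* Returns 1 on a hit, 0 on a miss (which fills the line). */
int cache_access(unsigned long addr) {
    unsigned long index = (addr / LINE) % NSETS;
    unsigned long tag = addr / (LINE * NSETS);
    if (sets[index].valid) {
        if (sets[index].tag == tag) {
            hits++;
            return 1;
        }
    }
    /* Miss: fetch the aligned block, evict any occupant by
       overwriting it, and update the tag for future hits. */
    sets[index].valid = 1;
    sets[index].tag = tag;
    misses++;
    return 0;
}
```

&lt;p&gt;Two addresses inside the same 64-byte block hit the same line, while two blocks that share an index but differ in tag evict each other → a conflict miss.&lt;/p&gt;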
&lt;h2&gt;
  
  
  Comparing Different Set Associative Types
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft49kuhg9dw61lyt3woo5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft49kuhg9dw61lyt3woo5.png" alt="Set Associativity Comparison" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Cache Replacement Policies
&lt;/h2&gt;

&lt;p&gt;During a cache miss, when a new memory block needs to be brought into the cache, and the set it maps to is already full, the cache must decide which existing block to evict. This decision is made using a replacement policy. The most common policies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LRU (Least Recently Used) → Evicts the block that hasn’t been used for the longest time.&lt;/li&gt;
&lt;li&gt;PLRU (Pseudo-LRU) →  An approximation of LRU designed for efficient hardware implementation, often used in modern CPUs to balance complexity and performance.&lt;/li&gt;
&lt;li&gt;FIFO (First-In, First-Out) → Evicts the block that has been in the cache the longest, regardless of usage.&lt;/li&gt;
&lt;li&gt;Random → Chooses a block to evict at random; simple to implement in hardware, though less predictable.&lt;/li&gt;
&lt;/ul&gt;
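&lt;p&gt;As an illustration, true LRU for a single 4-way set can be kept as an explicit recency list. The sketch below is purely didactic → hardware typically uses per-set status bits and approximations like PLRU rather than shifting a list on every access:&lt;/p&gt;

```c
/* True-LRU bookkeeping for one 4-way set, as a recency list:
   order[0] is the most recently used way, order[3] the least. */
enum { WAYS = 4 };
int order[WAYS] = { 0, 1, 2, 3 };

/* Called on every access that hits (or fills) a given way. */
void touch(int way) {
    int i, pos = 0;
    for (i = 0; i != WAYS; i++)
        if (order[i] == way)
            pos = i;                 /* find the way's position */
    for (i = pos; i != 0; i--)
        order[i] = order[i - 1];     /* shift the rest down */
    order[0] = way;                  /* move to the front */
}

/* On a miss with a full set, evict the least recently used way. */
int victim(void) {
    return order[WAYS - 1];
}
```

&lt;p&gt;On a hit, touch() refreshes the line’s recency; on a miss into a full set, victim() names the way to evict before the new block is installed.&lt;/p&gt;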
&lt;h2&gt;
  
  
  How Cache Structure Affects Performance
&lt;/h2&gt;

&lt;p&gt;The way caches are organized → including their line size, associativity, and replacement policies → directly influences how well your code performs. Caches are designed to exploit two key principles of memory access: &lt;strong&gt;Temporal Locality&lt;/strong&gt; and &lt;strong&gt;Spatial Locality&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Temporal Locality
&lt;/h3&gt;

&lt;p&gt;Programs tend to reuse data they accessed recently. If a value is used in quick succession, keeping it in the cache can significantly reduce memory latency.&lt;/p&gt;
&lt;h4&gt;
  
  
  Mechanisms Targeting Temporal Locality
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retaining recently accessed data&lt;/strong&gt;: Frequently used values → like loop counters, accumulator variables, or function arguments → are kept in the L1 cache, where access is fastest. As long as these values are reused quickly, they remain cached.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart replacement policies&lt;/strong&gt;: When a cache needs to evict data, it uses policies like LRU or PLRU to decide which data is least likely to be reused. This helps preserve recently accessed (and likely-to-be-reused) blocks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-level cache hierarchy&lt;/strong&gt;: If a block is evicted from L1 due to pressure, it may still exist in L2 or L3 → larger, slightly slower caches that act as a backup. This layered design improves hit rates across varying temporal reuse patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefetching with reuse prediction&lt;/strong&gt;: Modern CPUs can predict reuse based on past access patterns. If the processor or compiler detects that certain memory is accessed repeatedly, it can preemptively fetch or retain that data to improve hit rates.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Example → Local variables in a loop:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int sum = 0;
for (int i = 0; i &amp;lt; 1000; ++i) {
   sum += i;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The variables &lt;strong&gt;sum&lt;/strong&gt; and &lt;strong&gt;i&lt;/strong&gt; are updated on every iteration. They remain in registers or the L1 cache because of temporal locality.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example → Global counters or state flags:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int error_count = 0;
void log_error() {
  error_count++;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If log_error() is called frequently, the global variable error_count is repeatedly accessed and benefits from temporal locality.&lt;/p&gt;
&lt;h4&gt;
  
  
  Pitfalls That Break Temporal Locality
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Large working sets&lt;/strong&gt;: If your program uses more data than can fit in the cache, older data gets evicted before it can be reused.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Irregular access patterns&lt;/strong&gt;: Frequently jumping between unrelated memory regions (e.g., random linked list traversal) prevents the cache from predicting reuse.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Spatial Locality
&lt;/h3&gt;

&lt;p&gt;Programs also tend to access memory locations that are close together. That’s why caches load entire cache lines (e.g., 64 bytes). If you read one byte, there’s a good chance nearby bytes will be accessed soon → so pre-loading them pays off.&lt;/p&gt;
&lt;h4&gt;
  
  
  Mechanisms Targeting Spatial Locality
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cache lines store blocks of data&lt;/strong&gt;: When a CPU accesses one memory address, the cache loads an entire cache line from main memory. This block includes the requested byte and adjacent ones, anticipating that they’ll be accessed soon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Block-based memory transfer&lt;/strong&gt;: Since a single cache line holds 64 bytes, accessing nearby addresses (like a[i+1], a[i+2], etc.) doesn’t require additional memory accesses → they’re already in the same line or adjacent lines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware prefetching&lt;/strong&gt;: Modern CPUs automatically detect sequential access patterns (like array iteration) and start fetching the &lt;strong&gt;next cache lines ahead of time&lt;/strong&gt;, reducing wait time for upcoming memory accesses.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Example → Array iteration
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (int i = 0; i &amp;lt; N; ++i) {
  a[i] = a[i] * 2;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Elements a[0], a[1], a[2], … are stored contiguously in memory. Accessing them in order leverages spatial locality → once a cache line is loaded, the next few elements are likely already there.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example → Struct field access
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;struct Point { int x; int y; } p;
int sum = p.x + p.y;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Both x and y are stored next to each other in memory. Accessing them together benefits from spatial locality.&lt;/p&gt;
&lt;h4&gt;
  
  
  Example → Matrix access (row-wise)
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (int i = 0; i &amp;lt; rows; ++i)
  for (int j = 0; j &amp;lt; cols; ++j)
    matrix[i][j] += 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If the matrix is stored in row-major order (as in C/C++), this access pattern walks through memory linearly, maximizing spatial locality.&lt;/p&gt;
&lt;h4&gt;
  
  
  Pitfalls That Break Spatial Locality
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Strided accesses with large gaps&lt;/strong&gt;: Accessing data with large steps skips over useful cache lines.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (int i = 0; i &amp;lt; N; i += 64) {
  process(a[i]);
}
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Each access lands on a new cache line, wasting preloaded data.&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Accessing only one field in a wide struct repeatedly&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;struct Big { int a; char padding[60]; int b; };
Big arr[100];
for (int i = 0; i &amp;lt; 100; ++i) {
  process(arr[i].a);
}
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Here, each cache line holds mostly unused data, leading to &lt;strong&gt;cache pollution&lt;/strong&gt;.&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unaligned data structures&lt;/strong&gt;: If structures span multiple cache lines due to misalignment, accessing even a single field can trigger multiple memory accesses.&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;




&lt;p&gt;If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!&lt;/p&gt;

</description>
      <category>memory</category>
      <category>ram</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
