<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: priteshsurana</title>
    <description>The latest articles on Forem by priteshsurana (@priteshsurana).</description>
    <link>https://forem.com/priteshsurana</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3131472%2Fe8fe27c3-b58c-452e-8d0d-4d71bf353bc7.png</url>
      <title>Forem: priteshsurana</title>
      <link>https://forem.com/priteshsurana</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/priteshsurana"/>
    <language>en</language>
    <item>
      <title>Cassandra Internals: LSM Tree, SSTables, and Compaction</title>
      <dc:creator>priteshsurana</dc:creator>
      <pubDate>Wed, 15 Apr 2026 22:16:07 +0000</pubDate>
      <link>https://forem.com/priteshsurana/cassandra-internals-lsm-tree-sstables-and-compaction-2ai8</link>
      <guid>https://forem.com/priteshsurana/cassandra-internals-lsm-tree-sstables-and-compaction-2ai8</guid>
      <description>&lt;p&gt;Post 3 and 4 traced writes and reads through PostgreSQL and MongoDB. Both engines use B-Tree variants. Both optimize for reads - maintaining sorted indexes, linking leaf nodes, storing heap pointers or link to primary index and pay for that optimization with write complexity: page splits, locking, dead tuples, in-place update overhead.&lt;/p&gt;

&lt;p&gt;Cassandra makes the opposite bet. It never modifies anything on disk. Every write is an append. Every file, once written, is immutable until compaction removes it. The read path pays for this. It has to reconcile data across potentially many files to find the latest version of a row. Understanding Cassandra means understanding why that tradeoff is worth making for certain workloads, and how the engine manages the read cost through Bloom filters and compaction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Two Tables instead of One
&lt;/h2&gt;

&lt;p&gt;Before any internals, the schema needs explaining. This post continues the series, so see the earlier parts for the original Orders table schema and why it has to change for Cassandra. Here it is again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;orders_by_id&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;orders_by_user&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In PostgreSQL, you'd have one &lt;code&gt;orders&lt;/code&gt; table with a secondary index on &lt;code&gt;user_id&lt;/code&gt;. The engine maintains that index; you write a row once, and PostgreSQL handles the index update. In Cassandra, you write the data &lt;strong&gt;twice&lt;/strong&gt;, in two different shapes, into two different tables.&lt;/p&gt;

&lt;p&gt;This isn't a design quirk. It follows directly from how &lt;a href="https://dev.to/priteshsurana/btree-vs-lsm-tree-why-your-databases-data-structure-is-everything-94c"&gt;LSM storage works&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;LSM's write strength is sequential appends to known partition keys. Given a partition key, Cassandra can write to the right Memtable instantly - no page to find, no B-Tree to traverse, no lock to acquire. But if you want to query by a different field than the partition key, you need a different table with that field as the partition key. Cassandra does have secondary indexes, but they're implemented as hidden tables under the hood and carry significant read cost - scanning across SSTables that weren't organized for that access pattern. For production workloads, the idiomatic solution is: &lt;strong&gt;write the data multiple times, once per access pattern&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The consequence is that every INSERT into &lt;code&gt;orders&lt;/code&gt; triggers two writes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT event:
  → write to orders_by_id   (keyed by order_id)
  → write to orders_by_user (keyed by user_id + created_at)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is write amplification - paying more on the write side to make reads efficient. The tradeoff is explicit and application-owned. PostgreSQL maintains its secondary indexes for you, transparently. Cassandra requires you to maintain your denormalized tables explicitly. If &lt;code&gt;orders_by_user&lt;/code&gt; gets out of sync with &lt;code&gt;orders_by_id&lt;/code&gt;, that's your problem, not the engine's.&lt;/p&gt;
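&lt;p&gt;A minimal sketch of what that application-owned dual write looks like. The helper name &lt;code&gt;record_order&lt;/code&gt; and the &lt;code&gt;FakeSession&lt;/code&gt; are illustrative, not a real driver API - only the common &lt;code&gt;execute(query, params)&lt;/code&gt; shape is assumed:&lt;/p&gt;

```python
# Hypothetical helper: the application, not the engine, keeps
# orders_by_id and orders_by_user in sync by writing both tables.
# "session" stands in for any CQL client exposing execute(query, params).

def record_order(session, order):
    """Write one logical order event into both query tables."""
    session.execute(
        "INSERT INTO orders_by_id (order_id, user_id, status, amount, created_at) "
        "VALUES (%s, %s, %s, %s, %s)",
        (order["order_id"], order["user_id"], order["status"],
         order["amount"], order["created_at"]))
    session.execute(
        "INSERT INTO orders_by_user (user_id, created_at, order_id, status, amount) "
        "VALUES (%s, %s, %s, %s, %s)",
        (order["user_id"], order["created_at"], order["order_id"],
         order["status"], order["amount"]))

class FakeSession:
    """Records statements instead of talking to a cluster."""
    def __init__(self):
        self.calls = []
    def execute(self, query, params):
        self.calls.append((query, params))
```

&lt;p&gt;If the second &lt;code&gt;execute&lt;/code&gt; fails, the two tables diverge - which is exactly the sync burden described above.&lt;/p&gt;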

&lt;p&gt;Why accept this burden? Because at high write throughput - millions of inserts per minute - Cassandra's append-only writes stay fast under load in a way that B-Tree engines struggle to match. The write amplification is a known, bounded cost. The alternative - secondary indexes on an LSM engine under heavy write load - is an unbounded performance hazard.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Five components do most of the work in Cassandra's LSM engine:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o1xo520fxsl8x8dzpz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o1xo520fxsl8x8dzpz8.png" alt="Process" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;CommitLog&lt;/strong&gt; is the crash-safety log - write here first, before touching any data structure. The &lt;strong&gt;Memtable&lt;/strong&gt; is the in-memory sorted buffer that accumulates writes. When the Memtable fills, it flushes to disk as an &lt;strong&gt;SSTable&lt;/strong&gt; - an immutable, sorted, self-contained file. Each SSTable has a &lt;strong&gt;Bloom filter&lt;/strong&gt; (to quickly rule out keys that aren't in that file) and a &lt;strong&gt;partition index&lt;/strong&gt; (to locate specific keys within the file). &lt;strong&gt;Compaction&lt;/strong&gt; periodically merges SSTables, consolidating versions and reclaiming space from tombstones.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Write Path
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders_by_id&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s1"&gt;'shipped'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;149&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Order for...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;toTimestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;1. Partition key hashing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cassandra hashes the &lt;code&gt;order_id&lt;/code&gt; value using Murmur3 to produce a &lt;strong&gt;token&lt;/strong&gt; - a number in a fixed 64-bit range that determines where this row lives in the cluster's token space. On a single node, every token maps to the same node, so routing is trivial. In a multi-node cluster (Post 7), this hash determines which node receives the write. No coordinator needs a lookup table - any node can compute the owner from the hash.&lt;/p&gt;
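&lt;p&gt;A toy illustration of the routing idea - hash the partition key to a token, then walk the ring to the owning node. It uses MD5 as a stand-in because Python's standard library has no Murmur3; the mechanics, not the hash function, are the point:&lt;/p&gt;

```python
import hashlib

TOKEN_SPACE = 2**64  # Murmur3Partitioner's token range is 64-bit

def token_for(partition_key):
    # Stand-in hash: real Cassandra uses Murmur3, not MD5.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % TOKEN_SPACE

def owner_node(partition_key, ring):
    """ring: sorted list of (token, node_name) positions.
    The owner is the first node whose ring token is at or above the
    key's token, wrapping around to the first node otherwise."""
    t = token_for(partition_key)
    for ring_token, node in ring:
        if ring_token >= t:
            return node
    return ring[0][1]  # wrap around the ring
```

&lt;p&gt;Any node holding the same ring can run this computation independently - that is why no coordinator lookup table is needed.&lt;/p&gt;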

&lt;p&gt;&lt;strong&gt;2. CommitLog append&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before the row touches the Memtable, Cassandra appends a record to the &lt;strong&gt;CommitLog&lt;/strong&gt; - a sequential, append-only file on disk. This is Cassandra's equivalent of PostgreSQL's WAL, but structurally simpler. There are no page boundaries, no page headers, no B-Tree node structures. It's a flat sequence of mutation records, written front to back. Cassandra can also compress CommitLog segments, reducing the I/O cost relative to PostgreSQL's uncompressed WAL writes.&lt;/p&gt;

&lt;p&gt;The CommitLog exists purely for crash recovery. If Cassandra crashes before the Memtable is flushed to an SSTable, the CommitLog lets it reconstruct the lost Memtable on restart.&lt;/p&gt;

&lt;p&gt;Cassandra offers two CommitLog sync modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Periodic&lt;/strong&gt; (default): the CommitLog is synced to disk every ~10 seconds. Writes are acknowledged before the sync, so up to 10 seconds of data can be lost in a hard crash - but it is fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch&lt;/strong&gt;: the CommitLog is synced before every acknowledgment. No data loss window. Slower. Every write pays a disk flush, similar to PostgreSQL's default &lt;code&gt;synchronous_commit = on&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the benchmark in Post 6, the sync mode is called out explicitly because it significantly affects write throughput numbers.&lt;/p&gt;
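&lt;p&gt;The CommitLog's job is easy to sketch: append mutations to a flat file, optionally fsync on every write (the batch-mode behavior), and replay the file on restart. A toy version with hypothetical names, not Cassandra's actual on-disk format:&lt;/p&gt;

```python
import json
import os

class CommitLog:
    """Toy append-only mutation log: one JSON record per line."""
    def __init__(self, path, sync_each_write=False):
        self.f = open(path, "a", encoding="utf-8")
        self.sync_each_write = sync_each_write  # batch-mode analogue

    def append(self, mutation):
        self.f.write(json.dumps(mutation) + "\n")
        if self.sync_each_write:
            self.f.flush()
            os.fsync(self.f.fileno())  # pay a disk flush per write

    @staticmethod
    def replay(path):
        """On restart, rebuild lost Memtables by re-reading the log."""
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```

&lt;p&gt;With &lt;code&gt;sync_each_write=False&lt;/code&gt; the OS decides when bytes hit disk - fast, but a hard crash can lose the unsynced tail, which is exactly the periodic-mode tradeoff.&lt;/p&gt;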

&lt;p&gt;&lt;strong&gt;3. Memtable write - the client is acknowledged here&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the CommitLog append, the row is written to the &lt;strong&gt;Memtable&lt;/strong&gt; for &lt;code&gt;orders_by_id&lt;/code&gt;. The Memtable is a sorted in-memory data structure sorted by partition key, then by clustering key within each partition. For &lt;code&gt;orders_by_id&lt;/code&gt; with only a partition key, the sort is by &lt;code&gt;order_id&lt;/code&gt;. For &lt;code&gt;orders_by_user&lt;/code&gt; with &lt;code&gt;(user_id, created_at)&lt;/code&gt;, rows are sorted first by &lt;code&gt;user_id&lt;/code&gt;, then by &lt;code&gt;created_at&lt;/code&gt; within each user.&lt;/p&gt;

&lt;p&gt;Once the Memtable write is complete, &lt;strong&gt;Cassandra acknowledges the write to the client&lt;/strong&gt;. No disk access to a data file. No B-Tree traversal. No page split. No locking against other writers. The write touched a sequential log file and an in-memory structure. That's it. This is why Cassandra's single-threaded insert throughput in the Post 1 benchmark was ~18,000 rows/sec compared to PostgreSQL's ~8,500.&lt;/p&gt;
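&lt;p&gt;The Memtable's behavior - keep rows sorted on insert, resolve same-key writes by timestamp, hand back a sorted stream at flush time - can be sketched in a few lines (a toy model, not Cassandra's actual structure):&lt;/p&gt;

```python
import bisect

class Memtable:
    """Toy sorted write buffer: keys kept in sorted order on insert."""
    def __init__(self):
        self.keys = []   # partition keys, always sorted
        self.rows = {}   # key mapped to (timestamp, value)

    def put(self, key, value, timestamp):
        if key not in self.rows:
            bisect.insort(self.keys, key)        # insert at sorted position
        current = self.rows.get(key)
        if current is None or timestamp > current[0]:
            self.rows[key] = (timestamp, value)  # newest timestamp wins

    def items_sorted(self):
        """Flush order: already sorted, so no extra sort pass is needed."""
        return [(k,) + self.rows[k] for k in self.keys]
```

&lt;p&gt;Note there is no disk access, no tree traversal, and no lock in &lt;code&gt;put&lt;/code&gt; - the in-memory simplicity is the whole point of the write path.&lt;/p&gt;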

&lt;p&gt;&lt;strong&gt;4. The second Memtable write — write amplification in practice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The application now sends the second INSERT for the same order event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders_by_user&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;toTimestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s1"&gt;'shipped'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;149&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This goes through the same CommitLog + Memtable path, but to the &lt;code&gt;orders_by_user&lt;/code&gt; Memtable. Two CommitLog appends, two Memtable writes, two eventual SSTable entries for one logical business event. The write amplification is real. It's the cost of Cassandra's access pattern design, and it's visible in disk space usage and write throughput measurements on multi-table schemas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Memtable flush → the SSTable is born&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the Memtable reaches its size threshold (configurable, typically 256MB–1GB), Cassandra flushes it to disk as a new &lt;strong&gt;SSTable&lt;/strong&gt;. The flush is a single sequential write pass from beginning to end. The Memtable is already sorted, so no sort step is needed: Cassandra writes the entire sorted buffer to a new file in one pass. Sequential. Fast. The best possible disk write pattern.&lt;/p&gt;

&lt;p&gt;The resulting SSTable is &lt;strong&gt;immutable&lt;/strong&gt;. It will never be modified after being written. If a row is later updated, the update goes to a new Memtable and eventually a new SSTable. The old SSTable keeps its old version.&lt;/p&gt;

&lt;p&gt;Three files are written alongside the SSTable data file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bloom filter&lt;/strong&gt;: a compact probabilistic structure that, given a partition key, can definitively answer "this key is NOT in this SSTable" (no false negatives). It answers "maybe yes" for keys that are there, and occasionally for keys that aren't (false positives). Kept in memory. Eliminates most unnecessary SSTable reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition index&lt;/strong&gt;: maps each partition key in this SSTable to its byte offset in the data file. Used to seek directly to a partition without reading the whole file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition summary&lt;/strong&gt;: a sparse sample of the partition index, kept in memory. Used to narrow down the range to read from the partition index itself, avoiding a full index scan.&lt;/li&gt;
&lt;/ul&gt;
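&lt;p&gt;The Bloom filter's guarantee - "definitely not here" is always right, "maybe here" is occasionally wrong - falls out of its construction. A minimal sketch (hypothetical class, using SHA-256 to derive the k hash positions):&lt;/p&gt;

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over a bit array.
    Answers "definitely absent" or "maybe present" - never a false negative."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # All k bits set: maybe present. Any bit clear: definitely absent.
        return all(self.bits[p] for p in self._positions(key))
```

&lt;p&gt;A key that was added always sets all of its bits, so it can never be missed; an absent key only collides with already-set bits occasionally, which is the false-positive rate the read path tolerates.&lt;/p&gt;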

&lt;p&gt;Once the SSTable is written and fsynced, the CommitLog segments that covered those writes are eligible for deletion as the data is now safe in the SSTable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's durable at each step:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Moment&lt;/th&gt;
&lt;th&gt;What's safe&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;After CommitLog append&lt;/td&gt;
&lt;td&gt;The write survives a crash, replay reconstructs the Memtable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After Memtable write&lt;/td&gt;
&lt;td&gt;Same guarantee, Memtable is in RAM, CommitLog is the safety net&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After SSTable flush&lt;/td&gt;
&lt;td&gt;Doubly safe, SSTable on disk, CommitLog segment now disposable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After compaction&lt;/td&gt;
&lt;td&gt;Cleaned up, old versions and tombstones removed, read path faster&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Updates and Deletes: The Immutability Consequence
&lt;/h2&gt;

&lt;p&gt;Since SSTables are never modified, Cassandra cannot update or delete a row the way PostgreSQL does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Updates&lt;/strong&gt; write a new version. If you update &lt;code&gt;status&lt;/code&gt; from &lt;code&gt;'shipped'&lt;/code&gt; to &lt;code&gt;'delivered'&lt;/code&gt; on &lt;code&gt;order_id = X&lt;/code&gt;, Cassandra writes a new row to the current Memtable with &lt;code&gt;status = 'delivered'&lt;/code&gt; and a newer timestamp. The old row with &lt;code&gt;status = 'shipped'&lt;/code&gt; still exists in an older SSTable. Before compaction runs, &lt;strong&gt;both versions are on disk&lt;/strong&gt;. Reads resolve this by comparing timestamps and newest wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deletes&lt;/strong&gt; write a &lt;strong&gt;tombstone&lt;/strong&gt;, a special record that marks a partition key (or specific row or column) as deleted at a particular timestamp. The tombstone goes to the Memtable and eventually an SSTable. The original data still sits in its original SSTable. Before compaction, the data is still there on disk; reads see the tombstone, find that it's newer than the data, and return nothing.&lt;/p&gt;

&lt;p&gt;This means a table that has seen many updates looks like this on disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SSTable-1 (oldest):
  order X: status=shipped,   ts=1000
  order Y: status=pending,   ts=1001

SSTable-3 (newer):
  order X: status=delivered, ts=1500  ← newer version
  order Z: [tombstone]       ts=1600  ← delete marker

SSTable-5 (newest):
  order Y: status=shipped,   ts=2000  ← another update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The disk footprint of an &lt;code&gt;orders&lt;/code&gt; table with a busy update pattern is larger than the logical size of the data. Every historical version of every row exists in some SSTable until compaction removes it. Monitoring SSTable count and disk amplification is part of running Cassandra in production.&lt;/p&gt;

&lt;p&gt;Compare this to PostgreSQL's dead tuples: PostgreSQL also keeps old row versions around (in heap pages) and cleans them with VACUUM. Different mechanism, same root cause - both engines must keep old versions available for concurrent readers or crash recovery, and both accumulate waste that a background process cleans up. The specifics differ, but neither is free.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Read Path
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario A: &lt;code&gt;SELECT * FROM orders_by_id WHERE order_id = ?&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Check the Memtable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The read path starts in memory. Is the target &lt;code&gt;order_id&lt;/code&gt; in the current Memtable? If yes, return that version. It's the newest possible. If no, continue to SSTables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Bloom filter check for each SSTable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cassandra checks the Bloom filter for every SSTable on disk. A Bloom filter check is a memory operation - the filters are loaded into RAM. For each SSTable, the result is one of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Definitely not here&lt;/strong&gt;: skip this SSTable entirely. No disk read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maybe here&lt;/strong&gt;: proceed to check the partition index.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With 20 SSTables and a Bloom filter false positive rate of ~1%, most SSTables are eliminated with zero disk I/O. The few that pass the filter get a partition index lookup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Partition index lookup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each SSTable that passes its Bloom filter, Cassandra checks the &lt;strong&gt;partition summary&lt;/strong&gt; (in memory) to narrow the range, then reads the relevant portion of the &lt;strong&gt;partition index&lt;/strong&gt; from disk to find the exact byte offset of this &lt;code&gt;order_id&lt;/code&gt; in the SSTable data file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Read the partition from the SSTable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cassandra reads the partition data from the byte offset identified in step 3. This is the actual disk read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Merge versions across SSTables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the key appeared in multiple SSTables, Cassandra now has multiple versions of rows or cells with different timestamps. It merges them: for each column, the version with the highest timestamp wins. Tombstones suppress any data with an older timestamp. The result is the most recent consistent version of the row.&lt;/p&gt;
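&lt;p&gt;The merge rule itself is small enough to sketch directly - newest timestamp wins per cell, and a winning tombstone suppresses the value (a toy model of the reconciliation, not Cassandra's code):&lt;/p&gt;

```python
TOMBSTONE = object()  # sentinel delete marker

def reconcile(versions):
    """versions: list of (timestamp, value) pairs for one cell, gathered
    from the Memtable and every SSTable that held the key.
    The version with the highest timestamp wins; if that winner is a
    tombstone, the cell reads as deleted."""
    if not versions:
        return None
    ts, value = max(versions, key=lambda v: v[0])
    return None if value is TOMBSTONE else value
```

&lt;p&gt;Run against the SSTable-1/3/5 example above: order X resolves to &lt;code&gt;delivered&lt;/code&gt; (ts=1500 beats ts=1000), and order Z resolves to nothing because the tombstone is newest.&lt;/p&gt;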

&lt;p&gt;&lt;strong&gt;What read amplification looks like in practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;After 1 flush (1 SSTable):
  → 1 Bloom filter check
  → 1 partition index lookup
  → 1 data read
  → No merge needed
  ≈ fast

After 20 flushes (20 SSTables), before compaction:
  → 20 Bloom filter checks (memory, fast)
  → ~1-3 partition index lookups (most filtered by Bloom)
  → 1-3 data reads (disk)
  → Merge step across matched versions
  ≈ slower, and the p99 tail grows with SSTable count
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the compaction effect Post 6's benchmark will put exact numbers on: the same cold read query, the same hardware, the same data - and a p99 latency that drops by more than 7× once the SSTable count falls from 8 to 1.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scenario B: &lt;code&gt;SELECT * FROM orders_by_user WHERE user_id = ?&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This query goes to the &lt;code&gt;orders_by_user&lt;/code&gt; table - a completely separate set of SSTables with &lt;code&gt;user_id&lt;/code&gt; as the partition key. The read path is identical to Scenario A: Memtable check, Bloom filters, partition index, data read, merge.&lt;/p&gt;

&lt;p&gt;Here's the thing to notice: &lt;strong&gt;this is a primary key lookup on &lt;code&gt;orders_by_user&lt;/code&gt;, not a secondary index lookup&lt;/strong&gt;. The cost was paid at write time, when the application wrote to both tables. The read is as efficient as any partition key read on any table. There's no equivalent of PostgreSQL's heap fetch, no secondary B-Tree traversal, no ctid resolution step.&lt;/p&gt;

&lt;p&gt;This is the core architectural contrast with PostgreSQL:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;Cassandra&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write one order&lt;/td&gt;
&lt;td&gt;1 heap write + N index updates&lt;/td&gt;
&lt;td&gt;2 table writes (explicit, application-owned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read by &lt;code&gt;order_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Primary index → heap fetch&lt;/td&gt;
&lt;td&gt;Partition lookup on &lt;code&gt;orders_by_id&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read by &lt;code&gt;user_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Secondary index → heap fetch&lt;/td&gt;
&lt;td&gt;Partition lookup on &lt;code&gt;orders_by_user&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Who pays for the secondary access&lt;/td&gt;
&lt;td&gt;Engine, at read time&lt;/td&gt;
&lt;td&gt;Application, at write time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;PostgreSQL does the secondary access work at read time. The heap fetch and index maintenance are handled transparently. Cassandra moves that cost to write time - you write twice, but both reads are primary lookups. Same total work, different distribution across the write/read boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Compaction: Where the Magic Happens
&lt;/h2&gt;

&lt;p&gt;Every post about Cassandra mentions compaction. Most treat it as an operational detail. It's not. Compaction is the corrective force that makes the LSM design sustainable. Without it, SSTables would accumulate indefinitely, reads would get progressively slower, and tombstones would never be reclaimed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why compaction exists
&lt;/h3&gt;

&lt;p&gt;Every Memtable flush produces a new SSTable. Updates and deletes produce additional versions and tombstones in newer SSTables. Over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSTable count grows → read amplification grows&lt;/li&gt;
&lt;li&gt;Disk space grows → old versions and tombstones consume space that no longer represents live data&lt;/li&gt;
&lt;li&gt;Read latency grows → more files to check, more merging at read time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compaction is the mechanism that reverses all three.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens during compaction
&lt;/h3&gt;

&lt;p&gt;Cassandra selects a set of SSTables to compact (which ones depends on the strategy). Then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open all selected SSTables simultaneously and read them in sorted partition key order. Because each SSTable is individually sorted, merging them is an efficient multi-way merge.&lt;/li&gt;
&lt;li&gt;For each partition key, collect all versions and cells from all selected SSTables.&lt;/li&gt;
&lt;li&gt;For each cell, keep only the version with the highest timestamp.&lt;/li&gt;
&lt;li&gt;For tombstones: if the tombstone is older than &lt;code&gt;gc_grace_seconds&lt;/code&gt; (default: 10 days), drop both the tombstone and the data it deletes. If it's newer, keep the tombstone in the output, because it may still be needed.&lt;/li&gt;
&lt;li&gt;Write the merged, deduplicated, tombstone-cleaned result as a new SSTable.&lt;/li&gt;
&lt;li&gt;Delete the input SSTables.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The output is a single, clean SSTable with no duplicate versions, no stale tombstones, and a fresh Bloom filter and partition index reflecting only live data.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;gc_grace_seconds&lt;/code&gt; window exists for multi-node safety: in a cluster, a tombstone needs time to propagate to all replicas. If compaction removed a tombstone before all replicas saw it, a replica that missed the deletion could serve the deleted data as if it were live. Ten days is the conservative window to ensure propagation completes.&lt;/p&gt;
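&lt;p&gt;The six steps above can be condensed into a sketch: merge the inputs by key, keep the newest version per key, and drop tombstones (plus the data they shadow) once they are past the grace window. A toy model with flattened inputs, not Cassandra's implementation:&lt;/p&gt;

```python
TOMBSTONE = "TOMBSTONE"  # toy delete marker

def compact(sstables, now, gc_grace_seconds):
    """sstables: list of dicts, each mapping key to (timestamp, value).
    Returns the single merged output SSTable as one dict."""
    out = {}
    for key in sorted(set().union(*sstables)):       # merge pass, key order
        versions = [t[key] for t in sstables if key in t]
        ts, value = max(versions, key=lambda v: v[0])  # newest wins
        if value == TOMBSTONE and now - ts > gc_grace_seconds:
            continue  # tombstone expired: drop it and the data it shadowed
        out[key] = (ts, value)
    return out
```

&lt;p&gt;The output has one version per live key and no stale tombstones - which is exactly why the post-compaction read path is shorter.&lt;/p&gt;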

&lt;h3&gt;
  
  
  The direct effect on reads
&lt;/h3&gt;

&lt;p&gt;After compaction, the SSTable count drops. Fewer SSTables means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fewer Bloom filter checks per read&lt;/li&gt;
&lt;li&gt;Fewer partition index lookups&lt;/li&gt;
&lt;li&gt;Less data to merge&lt;/li&gt;
&lt;li&gt;Shorter, faster read path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the benchmark result from Post 1 made concrete. Cassandra read p99 went from ~4.1ms to ~1.4ms after compaction. The engine cleaned up after itself, and reads got proportionally faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  The three compaction strategies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;STCS - Size-Tiered Compaction Strategy&lt;/strong&gt; (the default)&lt;/p&gt;

&lt;p&gt;Groups SSTables by size and merges groups of similarly-sized ones together. Think of it as bins: small SSTables merge into medium ones, medium into large, large into very large. Write-optimized: each byte of data participates in few compaction passes. The downside: at any moment you can have many SSTables at the small tier, which means read amplification spikes under heavy write load before a tier-level merge runs.&lt;/p&gt;

&lt;p&gt;Use STCS for write-heavy workloads where compaction I/O budget is limited.&lt;/p&gt;
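&lt;p&gt;The bin idea can be sketched as grouping sizes into tiers of similar magnitude, with a tier becoming a compaction candidate once it holds enough SSTables. This is a toy version of the bucketing, not Cassandra's exact STCS algorithm or its default parameters:&lt;/p&gt;

```python
def size_tiered_buckets(sstable_sizes, ratio=2.0, min_threshold=4):
    """Group SSTable sizes into tiers where each member is within
    `ratio` of the bucket's running average. Tiers holding at least
    min_threshold SSTables are candidates for one compaction pass."""
    buckets = []
    for size in sorted(sstable_sizes):
        placed = False
        for b in buckets:
            avg = sum(b) / len(b)
            if size >= avg / ratio and avg * ratio >= size:
                b.append(size)      # similar size: same tier
                placed = True
                break
        if not placed:
            buckets.append([size])  # start a new tier
    return [b for b in buckets if len(b) >= min_threshold]
```

&lt;p&gt;Under heavy write load the smallest tier fills fastest, which is where the read amplification spike described above comes from.&lt;/p&gt;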

&lt;p&gt;&lt;strong&gt;LCS - Leveled Compaction Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Organizes SSTables into levels (L0, L1, L2, ...), where each level is 10× larger than the previous. From L1 onward &lt;strong&gt;no two SSTables at the same level overlap in key range&lt;/strong&gt;. A read needs to check at most one SSTable per level from L1 up. With 5 levels, that's at most 5 SSTables for any read, regardless of how many total SSTables exist.&lt;/p&gt;

&lt;p&gt;The exception is L0. L0 receives Memtable flushes directly, and SSTables here &lt;em&gt;can&lt;/em&gt; have overlapping key ranges - they arrive as flushed, sorted internally but unsorted relative to each other. A read must check every L0 SSTable. This is why L0 SSTable count is the critical operational metric for LCS: under heavy write load, L0 accumulates faster than compaction can promote files to L1, and read amplification rises until the backlog clears. A healthy LCS table keeps L0 small - typically under 4 files.&lt;/p&gt;

&lt;p&gt;Read-optimized above L0. Bounded read amplification once L0 is under control. The cost: compaction is more frequent and more I/O-intensive because every write eventually needs to be organized into non-overlapping ranges at each level.&lt;/p&gt;

&lt;p&gt;Use LCS for read-heavy workloads where predictable read latency matters more than compaction overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TWCS - Time-Window Compaction Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Divides SSTables into time windows (e.g., one window per day). SSTables within a window are compacted together; windows don't compact across each other. When a window's TTL expires, the entire SSTable for that window is deleted as a unit - no need to read the data, just drop the file.&lt;/p&gt;

&lt;p&gt;Built for time-series data with TTL. Extremely efficient for the "write once, read briefly, expire in bulk" pattern. Breaks down for workloads that update historical data, because updates write new timestamps that cross window boundaries.&lt;/p&gt;

&lt;p&gt;Use TWCS for time-series tables, event logs, anything with uniform TTL.&lt;/p&gt;

&lt;h3&gt;
  
  
  The operational cost
&lt;/h3&gt;

&lt;p&gt;During compaction, both the input SSTables (being read) and the output SSTable (being written) exist on disk simultaneously. Peak disk usage during a compaction can be roughly 2× the size of the data being compacted. Cassandra nodes should never run above ~50% disk utilization, or compaction may fail for lack of space.&lt;/p&gt;

&lt;p&gt;Compaction also competes with live reads and writes for disk I/O. Cassandra has throughput throttling for compaction, but under heavy write load that produces SSTables faster than compaction can consume them, read amplification can climb even with throttling. Monitoring SSTable count per table is a core Cassandra operational metric.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cassandra Production Footgun: Tombstones
&lt;/h2&gt;

&lt;p&gt;Here's the scenario that causes real production incidents.&lt;/p&gt;

&lt;p&gt;Your application deletes all orders with &lt;code&gt;status = 'cancelled'&lt;/code&gt; from a partition - say, all orders for a specific user in &lt;code&gt;orders_by_user&lt;/code&gt;. Each delete writes a tombstone. The data still exists in the older SSTables. For the next &lt;code&gt;gc_grace_seconds&lt;/code&gt; (10 days by default), &lt;strong&gt;every read of that partition must process every tombstone&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now imagine a partition with 100,000 cancelled orders, all tombstoned. A read for that user's current orders must scan through 100,000 tombstones to find the handful of live rows. Even if the application considers those orders "deleted," Cassandra is reading every tombstone at query time. Reads that should return 5 rows in milliseconds take seconds because of tombstone scanning.&lt;/p&gt;
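&lt;p&gt;The shape of the problem is simple to demonstrate: the read must walk every tombstoned cell on the way to the few live rows. A toy model (mirroring the scan cost, not Cassandra's actual read code):&lt;/p&gt;

```python
def read_partition(cells, warn_threshold=1000):
    """cells: (key, value) pairs for one partition, value None meaning
    a tombstone. Returns the live rows plus how many tombstones the
    scan had to touch, and whether the warn threshold was crossed."""
    live, tombstones_scanned = [], 0
    for key, value in cells:
        if value is None:
            tombstones_scanned += 1   # dead weight, still costs work
        else:
            live.append((key, value))
    warned = tombstones_scanned > warn_threshold
    return live, tombstones_scanned, warned
```

&lt;p&gt;Returning 2 live rows cost a 100,000-element scan in the test below - that gap between result size and work done is the production incident.&lt;/p&gt;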

&lt;p&gt;Cassandra guards against this with a &lt;code&gt;tombstone_warn_threshold&lt;/code&gt; (default: 1,000 tombstones per read) and a &lt;code&gt;tombstone_failure_threshold&lt;/code&gt; (default: 100,000). Hitting the failure threshold aborts the read entirely. Both thresholds show up in real production incidents at companies running Cassandra at scale.&lt;/p&gt;
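&lt;p&gt;A toy sketch of that threshold behavior. The constant names mirror the cassandra.yaml settings, but the scan logic and the exception name are simplified stand-ins:&lt;/p&gt;

```python
# Defaults from cassandra.yaml; the scan below is a simplification.
TOMBSTONE_WARN_THRESHOLD = 1_000
TOMBSTONE_FAILURE_THRESHOLD = 100_000

class TombstoneOverwhelmingError(Exception):
    """Stand-in for Cassandra's tombstone-overwhelming read failure."""

def scan_partition(cells):
    """Return live cells, counting every tombstone touched on the way."""
    live, tombstones = [], 0
    for value, is_tombstone in cells:
        if is_tombstone:
            tombstones += 1
            if tombstones > TOMBSTONE_FAILURE_THRESHOLD:
                raise TombstoneOverwhelmingError(tombstones)
        else:
            live.append(value)
    if tombstones > TOMBSTONE_WARN_THRESHOLD:
        print(f"WARN: read {tombstones} tombstones")
    return live

# 100,000 cancelled (tombstoned) orders hiding 5 live rows:
cells = [(None, True)] * 100_000 + [(f"order{i}", False) for i in range(5)]
print(len(scan_partition(cells)))  # 5 live rows, after touching 100,000 tombstones
```

Five rows come back, but only after the read touches every one of the 100,000 tombstones; one more tombstone and the read fails outright.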

&lt;p&gt;The mitigation: set TTLs on data instead of deleting it, use TWCS so expired data is dropped as whole SSTables rather than tombstoned row by row, and monitor tombstone metrics actively. The problem doesn't appear in development because development data volumes are too small to trigger it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Engines, One Comparison
&lt;/h2&gt;

&lt;p&gt;Now that you've seen all three storage paths, here's how they line up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where the write lands first&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL and MongoDB both write to a sequential log first (WAL / WiredTiger journal), then modify in-memory page structures (shared_buffers / WiredTiger cache). Cassandra also writes to a sequential log first (CommitLog), then writes to an in-memory sorted buffer (Memtable). The durability pattern is the same - log first, then memory, then data files. The difference is what the in-memory structure is: a page cache holding B-Tree nodes (PG/Mongo) vs a sorted write buffer that will become an immutable file (Cassandra).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What makes writes fast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL and MongoDB writes involve finding the right position in a B-Tree, acquiring page or document locks, potentially splitting pages, and writing WAL records for modified pages. Under sustained write load, page splits and locking create latency variance. Cassandra writes append to a log and insert into a sorted RAM buffer. No page to find, no lock to acquire, no split to handle. The write path is maximally simple. The price is paid elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What makes reads complex&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL and MongoDB reads follow a tree to a single authoritative location. The heap or the B-Tree leaf. The data is there, in one place. Cassandra reads must check multiple SSTables, each of which may contain a version of the requested row. Bloom filters eliminate most checks, but the merge step is always present when multiple versions exist. Read complexity grows with SSTable count and shrinks after compaction.&lt;/p&gt;
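&lt;p&gt;The merge step can be sketched as last-write-wins reconciliation over per-column timestamps. This is a simplification of Cassandra's actual cell-level resolution:&lt;/p&gt;

```python
# Minimal sketch of read-time reconciliation: each SSTable may hold an
# older version of the row; the newest write timestamp per column wins.
def reconcile(versions):
    """versions: list of (write_timestamp, {column: value}) pairs from
    the memtable and every SSTable the Bloom filters did not rule out."""
    merged = {}  # column -> (timestamp, value)
    for ts, columns in versions:
        for col, val in columns.items():
            if col not in merged or ts > merged[col][0]:
                merged[col] = (ts, val)
    return {col: val for col, (ts, val) in merged.items()}

row = reconcile([
    (100, {"status": "placed", "amount": 149.99}),  # oldest SSTable
    (200, {"status": "shipped"}),                   # newer SSTable
    (300, {"status": "delivered"}),                 # memtable
])
print(row)  # {'status': 'delivered', 'amount': 149.99}
```

Note that the result combines columns from different files: the row a client sees may never have existed whole in any single SSTable.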

&lt;p&gt;&lt;strong&gt;Who owns the secondary access pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL maintains secondary indexes automatically. MongoDB maintains them automatically. The application writes once and the engine handles multiple access paths. In Cassandra, the application owns the secondary access pattern by writing to multiple tables. This is more work for the application developer and more disk space consumed. But both reads end up as efficient primary key lookups on their respective tables, which is not true of B-Tree secondary indexes that require heap fetches.&lt;/p&gt;
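&lt;p&gt;The dual-write pattern, sketched with dicts standing in for tables. Table and column names are illustrative:&lt;/p&gt;

```python
# Sketch of Cassandra-style denormalization: the application writes the
# same order into two tables, each keyed for one access path.
orders_by_id = {}    # partition key: order_id
orders_by_user = {}  # partition key: user_id, one row per order

def insert_order(order):
    # Two writes, one per access pattern; in real CQL this would be a
    # logged batch or two independent INSERTs.
    orders_by_id[order["order_id"]] = order
    orders_by_user.setdefault(order["user_id"], []).append(order)

insert_order({"order_id": "a1b2", "user_id": "u9x8", "status": "shipped"})

# Both reads are now primary-key lookups on their own table:
print(orders_by_id["a1b2"]["status"])  # shipped
print(len(orders_by_user["u9x8"]))     # 1
```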




&lt;h2&gt;
  
  
  What's Next: The Numbers
&lt;/h2&gt;

&lt;p&gt;Posts 2, 3, 4 and 5 have been entirely about understanding &lt;em&gt;why&lt;/em&gt; each engine behaves the way it does. Post 6 is where that understanding meets measurement.&lt;/p&gt;

&lt;p&gt;Real C++ clients. 1 million rows. Identical hardware. Cold reads and warm reads. Write throughput under sustained load. Pre-compaction and post-compaction latency distributions. The full picture.&lt;/p&gt;

&lt;p&gt;The most surprising result in the benchmark is not the one you'd predict from the theory alone. You'll have to read the next post to find out what it is.&lt;/p&gt;

</description>
      <category>database</category>
      <category>cassandra</category>
      <category>backend</category>
      <category>performance</category>
    </item>
    <item>
      <title>MongoDB Internals: Inside the Storage Engine and How It Differs from PostgreSQL</title>
      <dc:creator>priteshsurana</dc:creator>
      <pubDate>Thu, 09 Apr 2026 00:05:04 +0000</pubDate>
      <link>https://forem.com/priteshsurana/mongodb-internals-inside-the-storage-engine-2c9b</link>
      <guid>https://forem.com/priteshsurana/mongodb-internals-inside-the-storage-engine-2c9b</guid>
      <description>&lt;p&gt;Post 3 explained the flow of INSERT and SELECT from PostgreSQL lense. Now its time for &lt;code&gt;insertOne&lt;/code&gt;/&lt;code&gt;insertMany&lt;/code&gt; and &lt;code&gt;find&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  MongoDB
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How MongoDB is different before you start
&lt;/h3&gt;

&lt;p&gt;There are three major differences from PostgreSQL, and this section visits each of them.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;WiredTiger is a separate, pluggable storage engine&lt;/strong&gt; underneath MongoDB. PostgreSQL's storage is tightly integrated with the query engine. WiredTiger is a standalone embeddable key-value store that MongoDB sits on top of. This matters because WiredTiger has its own caching, its own journal, its own compression, and its own concurrency model, somewhat independent of MongoDB's query layer.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;documents are stored as BSON&lt;/strong&gt;: a binary encoding in which field names are stored as strings inside every document on disk. PostgreSQL's heap rows contain only values; column names live once in the catalog. BSON's field-name overhead matters at scale.&lt;/p&gt;

&lt;p&gt;Third, MongoDB provides &lt;strong&gt;document-level concurrency&lt;/strong&gt;, implemented using WiredTiger’s &lt;strong&gt;optimistic concurrency control and fine-grained locking&lt;/strong&gt;, not page-level locking. Two concurrent writes to different documents in the &lt;code&gt;orders&lt;/code&gt; collection never block each other, even if they land on the same internal storage page. PostgreSQL's page-level LWLocks can cause contention between concurrent writers targeting the same page.&lt;/p&gt;




&lt;h3&gt;
  
  
  The architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1cuur7r2vxx7j1vcire.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1cuur7r2vxx7j1vcire.png" alt="Process" width="800" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The write path: &lt;code&gt;insertOne&lt;/code&gt; into orders
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insertOne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;a1b2...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;u9x8...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;shipped&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;149.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Order for...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;1. BSON serialization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before anything reaches WiredTiger, the document is serialized to BSON. In BSON, each field is encoded as: a type byte, the field name as a null-terminated string, then the value. For our &lt;code&gt;orders&lt;/code&gt; document with six fields, the field names themselves (&lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;created_at&lt;/code&gt;) add roughly 50–70 bytes of overhead per document.&lt;/p&gt;

&lt;p&gt;That overhead exists for every document in the collection. For 1 million orders, that's 50–70MB of field name data that PostgreSQL simply doesn't have, because PostgreSQL stores column names once in &lt;code&gt;pg_attribute&lt;/code&gt;. For a collection with short values and many fields, BSON overhead is a meaningful fraction of total storage. For a collection dominated by large field values (like the 500-character &lt;code&gt;description&lt;/code&gt;), it's a smaller percentage but never zero.&lt;/p&gt;
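&lt;p&gt;The estimate above can be checked with back-of-the-envelope arithmetic: each BSON element carries one type byte, the field name bytes, and a trailing NUL. This sketch ignores the value encodings and the document length header:&lt;/p&gt;

```python
# Per-document field-name overhead for the six orders fields:
# one type byte + the name as a C string (name bytes + trailing NUL).
field_names = ["order_id", "user_id", "status",
               "amount", "description", "created_at"]

overhead = sum(1 + len(name) + 1 for name in field_names)  # type + name + NUL
print(overhead)                    # 60 bytes per document
print(overhead * 1_000_000 / 1e6)  # 60.0 MB across 1 million orders
```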

&lt;p&gt;This is a fundamental consequence of the schema-free document model: the schema travels with the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. WiredTiger cache - the document lands here first&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;WiredTiger maintains its own in-memory cache (configured via &lt;code&gt;wiredTigerCacheSizeGB&lt;/code&gt;). This is conceptually similar to PostgreSQL's shared_buffers - a pool of in-memory pages that buffer both reads and writes.&lt;/p&gt;

&lt;p&gt;The key difference: &lt;strong&gt;WiredTiger stores data compressed on disk but uncompressed in cache&lt;/strong&gt;. When a document is written to the WiredTiger cache, it lives there uncompressed. When it's evicted to disk (during a checkpoint), WiredTiger compresses it using Snappy by default. When it's read back from disk (a cold read), it's decompressed as it loads into cache.&lt;/p&gt;

&lt;p&gt;This means your configured cache size represents uncompressed data, while your disk usage reflects compressed data. A 4GB WiredTiger cache might correspond to 8–12GB of data on disk, depending on compression ratio. Cold reads pay a decompression cost that PostgreSQL doesn't have in its default configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Journal write - durability before acknowledgment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;WiredTiger has its own journal, conceptually equivalent to PostgreSQL's WAL. It's a sequential, append-only log that describes changes before they're applied to data files.&lt;/p&gt;

&lt;p&gt;The key behavioral difference from PostgreSQL is in the default durability setting. PostgreSQL's default &lt;code&gt;synchronous_commit = on&lt;/code&gt; fsyncs the WAL before every commit. WiredTiger's default journal sync interval is &lt;strong&gt;100 milliseconds&lt;/strong&gt;. Acknowledgment can happen before the journal is fsynced, accepting up to 100ms of potential data loss in a hard crash.&lt;/p&gt;

&lt;p&gt;MongoDB exposes this to the application as &lt;strong&gt;write concern&lt;/strong&gt;. With &lt;code&gt;j: false&lt;/code&gt;, MongoDB acknowledges the write as soon as WiredTiger's cache accepts it. With &lt;code&gt;j: true&lt;/code&gt;, MongoDB waits for the journal to be fsynced before acknowledging. The latency difference between these two settings is measurable; &lt;code&gt;j: true&lt;/code&gt; adds the cost of a synchronous disk flush to every write, similar to what PostgreSQL pays by default.&lt;/p&gt;

&lt;p&gt;For the benchmark in Post 6, write concern settings will be explicitly called out because they significantly affect the numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Primary B-Tree index update on &lt;code&gt;order_id&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;WiredTiger updates the &lt;code&gt;order_id&lt;/code&gt; B-Tree index. The traversal is the same pattern as PostgreSQL's B+Tree - root to internal nodes to leaf, finding the right position for the new UUID. The same random-UUID page-split problem applies: UUIDs land at random positions, any leaf can be full, splits are frequent.&lt;/p&gt;

&lt;p&gt;One architectural distinction worth noting: in MongoDB, the &lt;strong&gt;collection storage itself is a WiredTiger B-Tree keyed by &lt;code&gt;order_id&lt;/code&gt;&lt;/strong&gt;. There isn't a separate heap file and a separate primary index. The collection B-Tree serves as both. The document data lives in the B-Tree's leaf nodes. This is different from PostgreSQL, where the heap is unordered storage and the primary index is a separate B+Tree pointing into it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Secondary index updates on &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;created_at&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MongoDB secondary indexes differ from PostgreSQL's in one important design choice: &lt;strong&gt;MongoDB secondary index entries contain the document's &lt;code&gt;order_id&lt;/code&gt; value, not a physical storage location&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In PostgreSQL, a secondary index leaf entry holds a &lt;code&gt;ctid&lt;/code&gt;- a literal page number and slot number. This is a direct physical pointer into the heap. It's fast to follow at read time, but it becomes stale if the row moves (which can happen during certain heap operations).&lt;/p&gt;

&lt;p&gt;MongoDB chose to store &lt;code&gt;order_id&lt;/code&gt; in secondary index entries instead. The lookup then requires a second step: use the &lt;code&gt;order_id&lt;/code&gt; to look up the document in the collection's primary B-Tree. This is effectively two B-Tree traversals for a secondary index read.&lt;/p&gt;

&lt;p&gt;The reason for this choice: documents in WiredTiger can move within storage during compaction and internal page restructuring. If secondary indexes contained physical locations, every document move would require updating every secondary index entry pointing to it which is potentially expensive. Storing &lt;code&gt;order_id&lt;/code&gt; instead means document moves never invalidate secondary index entries. The tradeoff is paid at read time with the double traversal.&lt;/p&gt;
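&lt;p&gt;A sketch of the two-step lookup, with dicts standing in for the B-Trees. The point is what the secondary index entry stores: a logical key, not a location:&lt;/p&gt;

```python
# Dicts stand in for WiredTiger B-Trees; names are illustrative.
collection = {"a1b2": {"user_id": "u9x8", "status": "shipped"}}  # primary tree
secondary = {"u9x8": ["a1b2"]}  # MongoDB-style entry: order_id, not a location

def find_by_user(user_id):
    # Step 1: the secondary index yields order_ids.
    # Step 2: each order_id is resolved through the primary tree.
    return [collection[oid] for oid in secondary.get(user_id, [])]

docs = find_by_user("u9x8")
print(docs[0]["status"])  # shipped
```

If the document moves within storage, the secondary entry stays valid because it names the key, not the slot; the cost is the second traversal on every read.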

&lt;p&gt;&lt;strong&gt;6. Document-level locking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When two concurrent inserts arrive, WiredTiger acquires a document-level lock for each, not a page-level lock. If the two documents happen to land on the same internal B-Tree page, they still don't block each other. WiredTiger's concurrency model ensures that writes to independent documents proceed in parallel even when they share a storage page.&lt;/p&gt;

&lt;p&gt;PostgreSQL's LWLocks are page-level: if two concurrent inserts target the same B+Tree leaf page, one must wait for the other to release the lock before proceeding. Under high concurrent write load to the same key range (like sequential timestamps), this becomes measurable contention.&lt;/p&gt;




&lt;h3&gt;
  
  
  The read path: primary and secondary key
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Scenario A: &lt;code&gt;db.orders.findOne({ order_id: "a1b2..." })&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. B-Tree traversal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MongoDB traverses the &lt;code&gt;order_id&lt;/code&gt; B-Tree from root to the leaf containing the target UUID. Same depth as PostgreSQL for the same data volume - 3–4 levels for 1M documents. The query planner identifies this as a primary key lookup and routes it through the &lt;code&gt;order_id&lt;/code&gt; index without deliberation.&lt;/p&gt;

&lt;p&gt;MongoDB caches which index to use per &lt;strong&gt;query shape&lt;/strong&gt;: the structure of the filter, sort, and projection, without the specific values. If you've run &lt;code&gt;findOne({ order_id: ... })&lt;/code&gt; before, the planner uses the cached plan. PostgreSQL re-evaluates cost-based plans per query (with plan caching for prepared statements); MongoDB's trial-based plan caching behaves differently when the data distribution changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Document fetch and decompression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because the collection data lives in the &lt;code&gt;order_id&lt;/code&gt; B-Tree itself (not a separate heap), the document is retrieved directly from the leaf node. There's no separate heap fetch step. The B-Tree traversal ends with the document.&lt;/p&gt;

&lt;p&gt;If the page is in the WiredTiger cache, the document is already uncompressed in memory. If not, a cold read - WiredTiger reads the compressed page from disk, decompresses it into cache, and returns the document. The decompression step adds CPU work to cold reads that PostgreSQL's default configuration doesn't have.&lt;/p&gt;

&lt;p&gt;For the benchmark's cold read numbers, this decompression cost is part of why MongoDB's p99 is slightly higher than PostgreSQL's - the disk I/O is similar, but MongoDB adds decompression.&lt;/p&gt;




&lt;h4&gt;
  
  
  Scenario B: &lt;code&gt;db.orders.find({ user_id: "u9x8..." })&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Secondary index traversal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MongoDB traverses the &lt;code&gt;user_id&lt;/code&gt; secondary index B-Tree to find all entries matching the target UUID. Each matching leaf entry contains &lt;code&gt;(user_id_value, order_id_value)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Document lookups by &lt;code&gt;order_id&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each matching &lt;code&gt;order_id&lt;/code&gt;, MongoDB performs a second B-Tree traversal, this time on the collection's &lt;code&gt;order_id&lt;/code&gt; B-Tree, to retrieve the full document. If a user has 20 orders, that's one secondary index scan yielding 20 entries plus 20 &lt;code&gt;order_id&lt;/code&gt; lookups, each a full tree traversal from root to leaf.&lt;/p&gt;

&lt;p&gt;This double-traversal is the same fundamental cost as PostgreSQL's secondary index + heap fetch, but the mechanism differs. PostgreSQL follows a ctid (direct physical pointer, one disk seek). MongoDB follows an &lt;code&gt;order_id&lt;/code&gt; (logical key, full tree traversal). The logical key approach is more robust to storage reorganization; the physical pointer approach is faster per lookup.&lt;/p&gt;

&lt;p&gt;Both databases pay the secondary-index penalty. The performance gap between primary key reads and secondary index reads is visible in both engines.&lt;/p&gt;




&lt;h3&gt;
  
  
  The MongoDB surprise: WiredTiger cache and compression
&lt;/h3&gt;

&lt;p&gt;Engineers who come from PostgreSQL often assume that "cache size" and "data size" are in the same units. In WiredTiger, they're not.&lt;/p&gt;

&lt;p&gt;WiredTiger compresses data on disk (Snappy by default achieves 2–4× compression on typical BSON documents). The WiredTiger cache holds data &lt;strong&gt;uncompressed&lt;/strong&gt;. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The WiredTiger cache holds data uncompressed, but disk stores it compressed.&lt;/strong&gt; This means the cache must be sized for the &lt;em&gt;uncompressed&lt;/em&gt; working set, which is 2–4× larger than the on-disk footprint. A working set that occupies 4GB compressed on disk expands to 8–16GB in the WiredTiger cache. Under-sizing the cache relative to this uncompressed working set is a common MongoDB performance issue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every cold read involves a decompression step.&lt;/strong&gt; Reading from disk means reading compressed bytes, then spending CPU cycles to decompress them into cache. On modern hardware this is fast (Snappy decompresses at multiple GB/s), but it's not free, and it doesn't exist in PostgreSQL's default storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache pressure is measured in uncompressed bytes.&lt;/strong&gt; To hold 8GB of compressed on-disk data in the WiredTiger cache, you need roughly 8GB × compression_ratio of cache, somewhere between 16GB and 32GB. More cache is required than the raw disk size suggests, not less. This is the inverse of what most engineers assume when first sizing a MongoDB deployment.&lt;/li&gt;
&lt;/ul&gt;
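&lt;p&gt;The sizing arithmetic from the list above, as a sketch; the 2–4× ratio is the typical Snappy range quoted earlier, not a guarantee:&lt;/p&gt;

```python
# WiredTiger cache is sized in uncompressed bytes, so the cache needed
# to hold a compressed on-disk working set is disk size * compression ratio.
def cache_needed_gb(on_disk_gb, compression_ratio):
    return on_disk_gb * compression_ratio

print(cache_needed_gb(8, 2))  # 16 GB of cache at 2x compression
print(cache_needed_gb(8, 4))  # 32 GB of cache at 4x compression
```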

&lt;p&gt;The compression tradeoff is generally positive. You get more effective cache and less disk I/O. But it changes the cost model for cold reads in a way that's easy to overlook when sizing hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  Direct comparison
&lt;/h2&gt;

&lt;p&gt;After tracing both engines, here's where they converge and where they diverge:&lt;/p&gt;

&lt;h3&gt;
  
  
  Where the first write lands
&lt;/h3&gt;

&lt;p&gt;Both PostgreSQL (WAL) and MongoDB (journal) write to a sequential log before touching data structures. The principle is identical - write-ahead logging for crash safety. The difference is the default sync behavior: PostgreSQL fsyncs per commit by default; WiredTiger's journal syncs every 100ms by default. PostgreSQL's default is safer; MongoDB's default is faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Locking granularity
&lt;/h3&gt;

&lt;p&gt;PostgreSQL uses page-level LWLocks - concurrent writes to the same B+Tree leaf page serialize. WiredTiger uses document-level locks - concurrent writes to different documents never block each other. Under high concurrent write load to overlapping key ranges, WiredTiger's finer granularity shows up as better throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secondary index design
&lt;/h3&gt;

&lt;p&gt;PostgreSQL secondary indexes store ctids - physical heap pointers. Fast to follow, but tied to physical location. MongoDB secondary indexes store &lt;code&gt;order_id&lt;/code&gt; - logical keys. Requires a second B-Tree traversal to fetch the document, but immune to document moves. Both choices are deliberate; both have real costs at read time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage overhead
&lt;/h3&gt;

&lt;p&gt;PostgreSQL stores column names once in the system catalog; heap rows contain only values. MongoDB embeds field names in every BSON document on disk. For our &lt;code&gt;orders&lt;/code&gt; collection with six fields and 1 million documents, BSON overhead adds roughly 50–70MB that PostgreSQL doesn't have. For collections with larger values, this is a smaller percentage; for collections with many small fields, it's significant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where each engine is faster
&lt;/h3&gt;

&lt;p&gt;PostgreSQL's linked B+Tree leaf nodes make range scans faster - follow the list, read sequentially. MongoDB's document-level locking makes high-concurrency writes more scalable. PostgreSQL's direct ctid heap fetch is faster per secondary index lookup than MongoDB's double B-Tree traversal. WiredTiger's compression means more data fits in cache per GB of RAM.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next: when B-Tree assumptions break down
&lt;/h2&gt;

&lt;p&gt;Both PostgreSQL and MongoDB are built on B-Tree variants. They organize data in sorted pages, they update those pages in place, they use WAL or journal for crash recovery, and they handle MVCC by keeping old versions around for concurrent readers. The details differ, but the fundamental bias is the same: &lt;strong&gt;optimize for reads at the cost of write complexity and in-place update overhead&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Next introduces Cassandra, which starts from the opposite premise. Its storage is append-only and immutable. There are no pages to split, no tuples to vacuum, no in-place updates. Writes are always sequential appends to a log and an in-memory buffer. Reads are more complex because data may be spread across multiple immutable files.&lt;/p&gt;

&lt;p&gt;Every performance characteristic that differs between Cassandra and the two engines you just traced flows from that single architectural inversion.&lt;/p&gt;




</description>
      <category>database</category>
      <category>postgres</category>
      <category>mongodb</category>
      <category>performance</category>
    </item>
    <item>
      <title>PostgreSQL Internals: Inside the Storage Engine</title>
      <dc:creator>priteshsurana</dc:creator>
      <pubDate>Sun, 05 Apr 2026 04:29:42 +0000</pubDate>
      <link>https://forem.com/priteshsurana/postgresql-internals-inside-the-storage-engine-1bhm</link>
      <guid>https://forem.com/priteshsurana/postgresql-internals-inside-the-storage-engine-1bhm</guid>
      <description>&lt;p&gt;Post 2 gave you the data structures: B+Tree, B-Tree, LSM Tree - how they're shaped and the tradeoffs they do.&lt;/p&gt;

&lt;p&gt;This post traces an &lt;code&gt;INSERT&lt;/code&gt; and a &lt;code&gt;SELECT&lt;/code&gt; through PostgreSQL, step by step, using the same &lt;code&gt;orders&lt;/code&gt; schema throughout.  &lt;/p&gt;




&lt;h2&gt;
  
  
  PostgreSQL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The architecture
&lt;/h3&gt;

&lt;p&gt;Before tracing any queries, map the terrain. PostgreSQL has four major components you'll encounter on every operation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbti2igcu2sl020y1xdah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbti2igcu2sl020y1xdah.png" alt="PostgreSQL Process" width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;shared_buffers&lt;/strong&gt; is PostgreSQL's in-memory page cache; every read and write touches it first. The &lt;strong&gt;WAL&lt;/strong&gt; is an append-only sequential log on disk; it's how PostgreSQL survives crashes. The &lt;strong&gt;heap file&lt;/strong&gt; is where rows actually live, in 8KB pages, in roughly insertion order. The &lt;strong&gt;index files&lt;/strong&gt; are separate B+Tree structures, also stored as 8KB pages, pointing into the heap. Writes go to the WAL first, then into shared_buffers, and eventually to the heap and index files on disk. Reads check shared_buffers first; if the page isn't there, it comes from disk.&lt;/p&gt;




&lt;h3&gt;
  
  
  The write path: &lt;code&gt;INSERT INTO orders&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'a1b2...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'u9x8...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'shipped'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;149&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Order for...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;1. Parsing and planning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL parses the SQL into an AST, resolves table and column names against the catalog, and produces a trivial plan: "insert one row into the orders heap, update the primary index on &lt;code&gt;order_id&lt;/code&gt;, update secondary indexes on &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;created_at&lt;/code&gt;." No interesting planning happens for a simple insert; the executor takes over immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The WAL write happens before anything else&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before PostgreSQL touches shared_buffers, before it modifies a single heap page, it writes a WAL record describing this insert. The WAL record contains enough information to reconstruct the change: which relation, which page, what was written.&lt;/p&gt;

&lt;p&gt;Why first? Because disk writes are not atomic. If PostgreSQL wrote to the heap file and then crashed before finishing, you'd have a partially written page with no way to know what it should contain. The WAL is written sequentially. On crash, PostgreSQL replays WAL to reconstruct any changes that didn't make it to the heap.&lt;/p&gt;

&lt;p&gt;The WAL write is fsynced to disk before the transaction is acknowledged to your application. That fsync is real I/O and it is one of the biggest contributors to PostgreSQL write latency.&lt;/p&gt;

&lt;p&gt;This behavior is controlled by &lt;code&gt;synchronous_commit&lt;/code&gt;. The default (&lt;code&gt;on&lt;/code&gt;) fsyncs the WAL before acknowledging. Setting it to &lt;code&gt;off&lt;/code&gt; lets PostgreSQL acknowledge before the fsync, reducing write latency significantly, but accepting up to &lt;code&gt;wal_writer_delay&lt;/code&gt; (default 200ms) of potential data loss on a hard crash. In the benchmark in Post 5, you'll see exactly how much latency this setting saves. The difference is substantial.&lt;/p&gt;
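&lt;p&gt;The write-ahead contract can be sketched in a few lines: log first, apply in memory, rebuild from the log after a crash. This is the idea only, not PostgreSQL's record format or recovery code:&lt;/p&gt;

```python
# Minimal write-ahead logging sketch.
wal = []   # append-only, fsynced before acknowledging the commit
heap = {}  # data pages, flushed lazily by the checkpointer

def insert(key, row):
    wal.append(("insert", key, row))  # 1. WAL record first (then fsync)
    heap[key] = row                   # 2. modify the in-memory page

def recover(log):
    """After a crash the heap may be stale or empty; replaying the log
    reconstructs every acknowledged change."""
    rebuilt = {}
    for op, key, row in log:
        if op == "insert":
            rebuilt[key] = row
    return rebuilt

insert("a1b2", {"status": "shipped"})
heap.clear()                 # simulate losing unflushed pages in a crash
print(recover(wal)["a1b2"])  # the insert survives via WAL replay
```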

&lt;p&gt;&lt;strong&gt;3. The row lands in shared_buffers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL now needs a heap page with enough free space for the new row. The &lt;code&gt;orders&lt;/code&gt; table with a 500-character &lt;code&gt;description&lt;/code&gt; field has rows of roughly 600–700 bytes. An 8KB page holds about 10–12 of these rows.&lt;/p&gt;

&lt;p&gt;PostgreSQL consults the &lt;strong&gt;Free Space Map (FSM)&lt;/strong&gt;, a structure that tracks how much free space exists in each heap page, to find a suitable page. It loads that page into shared_buffers if it isn't already there, and writes the new row into the page's free space. The page is now &lt;strong&gt;dirty&lt;/strong&gt;; its in-memory version differs from what's on disk. It will be flushed to the heap file eventually, by the checkpointer background process.&lt;/p&gt;

&lt;p&gt;The heap is unordered by design. PostgreSQL doesn't store rows sorted by &lt;code&gt;order_id&lt;/code&gt; or any other key. New rows go wherever there's space. This is what enables fast inserts; you never need to find a sorted position in the heap. The tradeoff is that reads by non-indexed fields require a full sequential scan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So what's durable right now:&lt;/strong&gt; The WAL record is on disk. If the server crashes at this exact moment, PostgreSQL will replay the WAL on restart and re-apply this insert. The heap page is only in shared_buffers. But that's fine, because the WAL has it covered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The primary B+Tree index update on &lt;code&gt;order_id&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL now updates the &lt;code&gt;order_id&lt;/code&gt; index. It traverses the B+Tree from the root page to find the leaf page where the new UUID belongs.&lt;/p&gt;

&lt;p&gt;For a table with 1 million rows, the B+Tree is typically 3–4 levels deep. Each level is a page read. So if the page is in shared_buffers, it's a memory access. If not, it's a disk read. Root and upper internal pages stay hot in shared_buffers because they're accessed on every operation; leaf pages are the cold part.&lt;/p&gt;

&lt;p&gt;UUID keys are random. They don't arrive in sorted order, so each new key lands at a random position in the leaf level. Because of this, &lt;strong&gt;any leaf page can be the target of any insert&lt;/strong&gt; and any leaf page can be full. Page splits happen frequently with UUID primary keys. When a leaf page is full, it splits: half its entries move to a new sibling page, and the parent node gets a new routing key. This is two page writes instead of one, plus a parent modification. Under high insert load with UUID keys, this is a real source of write amplification and p99 latency spikes.&lt;/p&gt;

&lt;p&gt;Sequential or time-ordered keys (monotonically increasing integers) avoid this almost entirely. Splits only happen at the rightmost leaf as the tree grows forward. If write performance matters for your schema, key selection matters.&lt;/p&gt;
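&lt;p&gt;A toy simulation makes the key-choice effect concrete. The leaf model below - fixed-capacity sorted buckets plus a simplified rightmost-leaf fastpath for ascending keys - is an illustrative sketch, not real B-Tree code:&lt;/p&gt;

```python
import bisect
import random

LEAF_CAPACITY = 100

def insert_all(keys):
    leaves = [[]]      # sorted leaf pages
    bounds = []        # bounds[j] = first key of leaves[j + 1]
    splits = 0
    for k in keys:
        i = bisect.bisect_right(bounds, k)      # route to the target leaf
        leaf = leaves[i]
        bisect.insort(leaf, k)
        if len(leaf) > LEAF_CAPACITY:           # page overflow: split
            splits += 1
            if i == len(leaves) - 1 and k == leaf[-1]:
                # rightmost-leaf fastpath: ascending keys just grow a new page
                leaves.append([leaf.pop()])
                bounds.append(leaves[-1][0])
            else:
                mid = len(leaf) // 2            # normal split: half the entries move
                leaves[i] = leaf[:mid]
                leaves.insert(i + 1, leaf[mid:])
                bounds.insert(i, leaf[mid])
    return splits

n = 20_000
seq_splits = insert_all(range(n))                             # time-ordered keys
rnd_splits = insert_all([random.random() for _ in range(n)])  # UUID-like random keys
print(seq_splits, rnd_splits)  # random keys split more pages, scattered across the tree
```

&lt;p&gt;Sequential keys only ever split the rightmost page (which also stays full), while random keys leave every leaf about two-thirds full and split pages all over the file.&lt;/p&gt;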

&lt;p&gt;&lt;strong&gt;5. Secondary index updates on &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;created_at&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each secondary index is a separate B+Tree on disk. After the heap write, PostgreSQL updates both of them. The &lt;code&gt;user_id&lt;/code&gt; index leaf entries contain &lt;code&gt;(user_id_value, ctid)&lt;/code&gt;, the indexed value plus a &lt;strong&gt;tuple ID&lt;/strong&gt; pointing to the physical location of the row in the heap (page number + slot number). The &lt;code&gt;created_at&lt;/code&gt; index works identically.&lt;/p&gt;

&lt;p&gt;This means a single &lt;code&gt;INSERT&lt;/code&gt; into &lt;code&gt;orders&lt;/code&gt; touches: 1 heap page + 1 primary index leaf page + 1 &lt;code&gt;user_id&lt;/code&gt; index leaf page + 1 &lt;code&gt;created_at&lt;/code&gt; index leaf page + WAL records for all of them. That's the minimum. Page splits add more.&lt;/p&gt;

&lt;p&gt;This is why more secondary indexes means slower writes. Each index is another B+Tree traversal, another page modification, another WAL record. The cost is O(k) in the number of indexes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's durable right now:&lt;/strong&gt; WAL records for all modifications have been fsynced. All four sets of page changes are in shared_buffers, not yet on disk in the heap and index files. None of that matters for durability; the WAL has everything. The checkpointer will flush the pages to disk in the background, at which point the corresponding WAL segments become eligible for recycling.&lt;/p&gt;




&lt;h3&gt;
  
  
  The read path: primary key and secondary index
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Scenario A: &lt;code&gt;SELECT * FROM orders WHERE order_id = &amp;lt;some order id&amp;gt;&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Planning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The query planner sees a filter on &lt;code&gt;order_id&lt;/code&gt;, which is the primary key. It knows there's a B+Tree index on this column. For an equality predicate on an indexed column with high selectivity (one specific UUID), an index scan is the obvious choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. B+Tree traversal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL starts at the root page of the &lt;code&gt;order_id&lt;/code&gt; index. Let's say with 1 million rows and 8KB pages, the tree is about 3–4 levels deep. At each level, it reads the node and follows the pointer toward the target UUID. This takes 3–4 page reads to reach the leaf.&lt;/p&gt;

&lt;p&gt;Each page read checks shared_buffers first. Root and upper internal pages are almost always warm; they're tiny (a few pages) and accessed constantly. Leaf pages may or may not be cached depending on your workload and how recently this specific range was accessed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The heap fetch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The leaf node entry contains a &lt;strong&gt;ctid&lt;/strong&gt;, a physical pointer to a specific page and slot in the heap file. PostgreSQL takes that ctid and fetches the heap page. This is a second disk access (or shared_buffers hit) beyond the index traversal.&lt;/p&gt;

&lt;p&gt;This two-step structure - index lookup to get a pointer, then heap fetch to get the actual row - is the fundamental cost. And it's worth understanding clearly: even the primary index doesn't contain the row data. Rows live in the heap. Indexes are always pointers into the heap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. MVCC - finding the right version&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When PostgreSQL finds the row in the heap page, it may find multiple versions of the same logical row. This is MVCC (Multi-Version Concurrency Control). Every row version has two hidden fields: &lt;code&gt;xmin&lt;/code&gt; (the transaction ID that created this version) and &lt;code&gt;xmax&lt;/code&gt; (the transaction ID that deleted or superseded it, or 0 if still live).&lt;/p&gt;

&lt;p&gt;PostgreSQL checks these against your transaction's snapshot - the set of transaction IDs that were committed when your query started. If &lt;code&gt;xmin&lt;/code&gt; is committed and visible to your snapshot, and &lt;code&gt;xmax&lt;/code&gt; is 0 or not yet committed, this is your row. If another transaction is currently updating this row, you'll find its old version without blocking: MVCC means readers never wait for writers.&lt;/p&gt;
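&lt;p&gt;The visibility rule can be sketched in a few lines. The snapshot model below is heavily simplified (a bare set of committed transaction IDs); the field names just mirror the hidden columns:&lt;/p&gt;

```python
# A minimal MVCC visibility sketch: a version is visible if its creator
# (xmin) committed before the snapshot and its deleter (xmax), if any,
# had not yet committed.

def visible(version, snapshot):
    """snapshot: the set of transaction IDs committed when the query began."""
    if version["xmin"] not in snapshot:
        return False                         # creator not committed yet: invisible
    if version["xmax"] == 0:
        return True                          # live row, never superseded
    return version["xmax"] not in snapshot   # deleter not committed: still visible

# Two versions of one logical row: txn 20 is mid-update, not yet committed.
versions = [
    {"xmin": 10, "xmax": 20, "status": "pending"},    # old version
    {"xmin": 20, "xmax": 0,  "status": "delivered"},  # new, uncommitted
]
snapshot = {10}  # txn 20 had not committed when our query started
rows = [v for v in versions if visible(v, snapshot)]
print(rows)  # the reader sees the old 'pending' version, without blocking
```

&lt;p&gt;Once txn 20 commits and a new query starts with it in the snapshot, the same check flips: the old version becomes invisible and the new one is returned.&lt;/p&gt;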

&lt;p&gt;&lt;strong&gt;Warm vs cold read:&lt;/strong&gt; If the heap page is in shared_buffers, the whole operation takes microseconds. If it's not, it's a cold read: PostgreSQL reads the page from disk, which is where that ~1.2ms p99 in the benchmark comes from. The OS page cache may have it buffered below the database level, which is faster than physical disk but slower than shared_buffers.&lt;/p&gt;




&lt;h4&gt;
  
  
  Scenario B: &lt;code&gt;SELECT * FROM orders WHERE user_id = $1&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Planning and index choice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;user_id&lt;/code&gt; is not the primary key — it's a secondary index column. The planner estimates how many rows match this &lt;code&gt;user_id&lt;/code&gt; value. If it's highly selective (one user with a few orders out of a million), an index scan on the &lt;code&gt;user_id&lt;/code&gt; B+Tree is the right call. If it's low selectivity (a user with 50,000 orders), a sequential scan of the heap might actually be faster because random heap fetches at scale are slower than a sequential read.&lt;/p&gt;

&lt;p&gt;One parameter the planner uses is &lt;code&gt;random_page_cost&lt;/code&gt; — the estimated cost of a random page read relative to a sequential read. The default is 4.0, which reflects spinning disk characteristics. On SSDs, it should be closer to 1.1–1.5. If &lt;code&gt;random_page_cost&lt;/code&gt; is set too high for your hardware, the planner over-penalizes index scans and may choose a sequential scan when an index scan would be faster. This is a common tuning issue on SSD-backed databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Secondary index traversal + heap fetches&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL traverses the &lt;code&gt;user_id&lt;/code&gt; B+Tree to find all leaf entries matching the target UUID. Each matching entry contains a ctid. For a user with 20 orders, that's 20 ctids. PostgreSQL then fetches each corresponding heap page.&lt;/p&gt;

&lt;p&gt;The critical issue: those 20 heap pages are likely scattered randomly across the heap file, because rows were inserted in time order, not user order. That's 20 potentially non-sequential disk reads. This is why secondary index reads are more expensive than primary key reads at scale. Not because the index traversal is slower, but because the heap fetches are random.&lt;/p&gt;

&lt;p&gt;PostgreSQL has an optimization called a &lt;strong&gt;bitmap index scan&lt;/strong&gt; for this case: it collects all matching ctids first, sorts them by physical page order, then fetches heap pages in order. This converts random reads into something closer to sequential reads. The planner chooses this strategy automatically when it estimates enough matching rows to make the sort worthwhile.&lt;/p&gt;
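&lt;p&gt;The core of the bitmap trick fits in a few lines. The ctids below are invented page/slot pairs; this is a sketch of the ordering idea, not the planner's actual bitmap machinery:&lt;/p&gt;

```python
# Sketch of the bitmap-scan idea: collect all matching ctids first, sort
# them by physical page, then fetch each heap page exactly once, in page
# order - converting scattered random reads into near-sequential I/O.

def bitmap_fetch_order(ctids):
    """ctids: (page, slot) pairs as returned by the index, in key order."""
    by_page = {}
    for page, slot in sorted(ctids):          # the sort is the whole trick
        by_page.setdefault(page, []).append(slot)
    return by_page                            # each heap page read once

# Matches scattered across the heap (the index returned them in key order):
matches = [(812, 3), (14, 1), (812, 7), (530, 2), (14, 9), (3021, 4)]
plan = bitmap_fetch_order(matches)
print(list(plan))   # pages visited in ascending physical order: [14, 530, 812, 3021]
print(plan[812])    # both slots on page 812 served by one page read: [3, 7]
```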




&lt;h3&gt;
  
  
  Crash recovery in summary
&lt;/h3&gt;

&lt;p&gt;When PostgreSQL restarts after a crash, it reads the control file to find the position of the last successful &lt;strong&gt;checkpoint&lt;/strong&gt; - a moment when all dirty shared_buffers pages were flushed to the heap and index files. From that position forward, PostgreSQL replays every WAL record, re-applying all changes that happened after the checkpoint. Any transaction with a WAL commit record is replayed to completion; any transaction without a commit record is effectively rolled back. When replay finishes, the heap and index files are in a consistent state and the database opens for connections.&lt;/p&gt;




&lt;h3&gt;
  
  
  The PostgreSQL surprise: dead tuples
&lt;/h3&gt;

&lt;p&gt;Here's something that surprises most engineers when they first encounter it: &lt;strong&gt;PostgreSQL never updates a row in place.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you execute &lt;code&gt;UPDATE orders SET status = 'delivered' WHERE order_id = 1234&lt;/code&gt;, PostgreSQL does not find the existing row and modify it. It writes a &lt;strong&gt;new version&lt;/strong&gt; of the row into the heap (in a free slot on the same or a different page), sets &lt;code&gt;xmax&lt;/code&gt; on the old version to the current transaction ID, and leaves the old version in the heap page. The old version is now a &lt;strong&gt;dead tuple&lt;/strong&gt; - invisible to future transactions but still occupying space.&lt;/p&gt;

&lt;p&gt;The heap page now contains more data than it represents. Over time, on a table with frequent updates, heap pages can be mostly dead tuples. This is called &lt;strong&gt;heap bloat&lt;/strong&gt;. It wastes disk space and, more importantly, causes reads to do more I/O to find live rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VACUUM&lt;/strong&gt; is the background process that reclaims dead tuples. It scans heap pages, identifies tuples whose &lt;code&gt;xmax&lt;/code&gt; is old enough that no active transaction could ever need them, and marks their space as reusable. &lt;code&gt;autovacuum&lt;/code&gt; runs this automatically, but under heavy update load it can fall behind.&lt;/p&gt;

&lt;p&gt;The reason PostgreSQL writes new versions rather than modifying in place is MVCC. Concurrent readers may need the old version of a row while a writer is updating it. Both versions need to coexist in the heap until no reader needs the old one. The dead tuple overhead is the cost of non-blocking reads.&lt;/p&gt;
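&lt;p&gt;A toy heap page shows the whole lifecycle: the update creates a dead tuple, and VACUUM later reclaims it. The model is heavily simplified - &lt;code&gt;xmin&lt;/code&gt;/&lt;code&gt;xmax&lt;/code&gt; are plain numbers and the reclaim cutoff is a single "oldest active transaction" ID:&lt;/p&gt;

```python
# Toy heap page: UPDATE writes a new version and marks the old one with
# xmax; VACUUM removes versions no active transaction could still need.

def update(page, slot, new_fields, txid):
    old = page[slot]
    old["xmax"] = txid      # old version marked superseded, not erased
    page.append({**old, **new_fields, "xmin": txid, "xmax": 0})

def vacuum(page, oldest_active_txid):
    # A version is reclaimable once its deleter committed before every active txn.
    return [v for v in page if v["xmax"] == 0 or v["xmax"] >= oldest_active_txid]

page = [{"order_id": 1234, "status": "pending", "xmin": 10, "xmax": 0}]
update(page, 0, {"status": "delivered"}, txid=20)
print(len(page))                     # 2 - a dead tuple and the live version coexist
page = vacuum(page, oldest_active_txid=30)
print(len(page), page[0]["status"])  # 1 delivered - the dead tuple was reclaimed
```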

&lt;p&gt;Cassandra has a different but equivalent cost: it also writes new versions and marks deletions with tombstones, and the cleanup (compaction) is also asynchronous.&lt;/p&gt;




</description>
      <category>database</category>
      <category>postgres</category>
      <category>storage</category>
      <category>performance</category>
    </item>
    <item>
      <title>B+Tree vs LSM Tree: Why Your Database's Data Structure Is Everything</title>
      <dc:creator>priteshsurana</dc:creator>
      <pubDate>Wed, 01 Apr 2026 23:45:41 +0000</pubDate>
      <link>https://forem.com/priteshsurana/btree-vs-lsm-tree-why-your-databases-data-structure-is-everything-94c</link>
      <guid>https://forem.com/priteshsurana/btree-vs-lsm-tree-why-your-databases-data-structure-is-everything-94c</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/priteshsurana/what-actually-happens-when-you-call-insert-a93"&gt;Post 1&lt;/a&gt;, we looked at a benchmark result where Cassandra wrote 2× faster than PostgreSQL — then read 3× slower before compaction ran. Same hardware. Same data. Wildly different numbers in both directions.&lt;/p&gt;

&lt;p&gt;That result is a direct consequence of the data structures each engine is built on. Cassandra and PostgreSQL made opposite choices at the foundation level, and those choices ripple through every read, every write, and every latency number you'll ever measure.&lt;/p&gt;

&lt;p&gt;This post explains those choices.  &lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem All Database Indexes Must Solve
&lt;/h2&gt;

&lt;p&gt;Before we look at any specific data structure, let's talk about why this problem is hard.&lt;/p&gt;

&lt;p&gt;You want three things from a database index:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fast writes.&lt;/strong&gt; When you insert an order, the index should update quickly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast reads.&lt;/strong&gt; When you query by &lt;code&gt;order_id&lt;/code&gt;, finding the right row should take as few disk operations as possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient range scans.&lt;/strong&gt; When you query orders between two dates, the engine should be able to find the start of the range and read forward — not scatter-gather across random disk locations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every storage engine is making a tradeoff between these goals, and the tradeoff it makes determines its entire performance profile.&lt;/p&gt;

&lt;p&gt;Data stored in sorted order is fast to read sequentially and fast to scan in ranges. But keeping data sorted as new writes arrive requires finding the right position for each new key - which means touching existing data structures and reading before you can write, and that adds latency. If instead you just append new data without sorting, writes are fast but reads become expensive because you have to search unsorted data.&lt;/p&gt;

&lt;p&gt;Every storage engine resolves this tension differently. Let's look at each approach.&lt;/p&gt;




&lt;h2&gt;
  
  
  B+Tree - The Structure That Powers PostgreSQL
&lt;/h2&gt;

&lt;p&gt;The B+Tree is the dominant data structure in relational databases and has been for decades. PostgreSQL uses it for every index. It's a tree, and to understand it, you need to picture what that tree actually looks like.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nodes, leaves, and the shape of the tree
&lt;/h3&gt;

&lt;p&gt;Imagine you're storing &lt;code&gt;order_id&lt;/code&gt; values (UUIDs) in an index. The B+Tree organizes these into a hierarchy of &lt;strong&gt;nodes&lt;/strong&gt;. Each node holds a sorted list of keys and pointers.&lt;/p&gt;

&lt;p&gt;There are two kinds of nodes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Internal nodes&lt;/strong&gt; hold keys and pointers to child nodes. They exist purely for navigation - you use them to route your search toward the right leaf. They don't hold the actual row data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaf nodes&lt;/strong&gt; hold the actual data (or pointers to it). &lt;br&gt;
In PostgreSQL's case, leaf nodes in a secondary index hold &lt;code&gt;(key, heap tuple ID)&lt;/code&gt; pairs - the key you indexed, and a pointer to where the actual row lives in the heap file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hu7l4hb9psngzagcar6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hu7l4hb9psngzagcar6.png" alt="Tree" width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The root and internal nodes are small. The leaves are where the bulk of the data lives. For a table with millions of rows, the tree might be only 3-4 levels deep. That's 3-4 node reads to find any row, regardless of table size. This is what makes B+Tree reads fast.&lt;/p&gt;
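&lt;p&gt;The depth claim is easy to sanity-check. Assuming a fanout of roughly 300 keys per 8KB internal page (the real number depends on key and pointer size):&lt;/p&gt;

```python
# Back-of-envelope for "3-4 levels deep": a tree of depth d with fanout f
# addresses about f**d leaf entries, so the depth grows with log_f(rows).
import math

def levels_needed(rows, fanout):
    return math.ceil(math.log(rows, fanout))

print(levels_needed(1_000_000, 300))      # 3 page reads reach any of 1M keys
print(levels_needed(1_000_000_000, 300))  # a billion rows still needs only 4
```

&lt;p&gt;This logarithm is why lookup cost barely moves as the table grows: multiplying the row count by 1,000 adds roughly one level.&lt;/p&gt;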
&lt;h3&gt;
  
  
  Leaf nodes form a linked list - and this matters for range scans
&lt;/h3&gt;

&lt;p&gt;Here's the detail that makes B+Trees especially good for range queries: &lt;strong&gt;all leaf nodes are linked together in a doubly linked list&lt;/strong&gt;, in sorted key order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[A-B] ↔ [C-D] ↔ [E-F] ↔ [G-H] ↔ [I-J] → ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why does this matter? Consider a query like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-31'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The engine traverses the tree from the root to find the leaf containing &lt;code&gt;2026-01-01&lt;/code&gt;. It reads that leaf. Then, instead of going back to the root to find the next range, it just follows the linked list pointer to the next leaf page, then the next, reading forward sequentially until it passes &lt;code&gt;2026-03-31&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Range scans on a B+Tree are essentially sequential reads through a linked list once you've found the starting point. On modern storage, sequential reads are dramatically faster than random reads. This is a core reason B+Trees are the default for databases with complex querying needs.&lt;/p&gt;
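&lt;p&gt;The scan pattern can be sketched with a handful of linked leaf pages. The &lt;code&gt;Leaf&lt;/code&gt; class and key values below are illustrative, and the root-to-leaf descent that finds the starting page is elided:&lt;/p&gt;

```python
# Sketch of a range scan over linked leaves: one tree descent finds the
# starting page, then the scan follows next-pointers instead of
# re-entering the tree for each successive page.

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None     # pointer to the next leaf in sorted order

def link(leaves):
    for a, b in zip(leaves, leaves[1:]):
        a.next = b
    return leaves[0]

def range_scan(first_leaf, start, stop):
    leaf, out = first_leaf, []
    while leaf is not None:
        for k in leaf.keys:
            if k >= start and stop >= k:
                out.append(k)
        if leaf.keys and leaf.keys[-1] >= stop:
            break            # passed the end of the range: stop following pointers
        leaf = leaf.next
    return out

head = link([Leaf(["A", "B"]), Leaf(["C", "D"]), Leaf(["E", "F"]), Leaf(["G", "H"])])
print(range_scan(head, "B", "E"))  # ['B', 'C', 'D', 'E'] - pure pointer-chasing
```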

&lt;h3&gt;
  
  
  Pages - the unit of disk I/O
&lt;/h3&gt;

&lt;p&gt;Each node in the tree maps to a &lt;strong&gt;page&lt;/strong&gt; — a fixed-size chunk of data, typically 8KB in PostgreSQL. A page is the smallest unit the database reads from or writes to disk. If you need one row from a leaf node, you read the entire 8KB page that contains it.&lt;/p&gt;

&lt;p&gt;This is important for understanding write cost. When you modify a node — say, inserting a new key into a leaf — you read the 8KB page, modify it in memory, and write the full 8KB page back to disk. Even if your change was one row.&lt;/p&gt;

&lt;h3&gt;
  
  
  Page splits - the hidden cost of inserts
&lt;/h3&gt;

&lt;p&gt;Here's where B+Tree write performance gets interesting.&lt;/p&gt;

&lt;p&gt;Every leaf page has a finite capacity. When a leaf page is full and a new key must be inserted into it, the page must &lt;strong&gt;split&lt;/strong&gt;: the existing entries are divided between the old page and a new sibling page, and the parent internal node gets a new key pointing to the new sibling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dq23w9fe8pxyztoqkxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dq23w9fe8pxyztoqkxv.png" alt="Split" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This split writes two pages instead of one, and also modifies the parent node. If the parent is also full, it splits too — and the cascade can propagate up to the root. Root splits are rare but expensive.&lt;/p&gt;

&lt;p&gt;For sequential integer keys (1, 2, 3, ...), splits only happen at the rightmost leaf - the tree just grows a new page at the end. Predictable and cheap.&lt;/p&gt;

&lt;p&gt;For random UUID keys, each insert lands at a random position in the key space. Any leaf can be the target. Any leaf can be full. &lt;strong&gt;Splits happen frequently and unpredictably.&lt;/strong&gt; This is why p99 write latency for UUID primary keys sits well above p50 - most inserts are fast, but the splits that hit full pages cause latency spikes that show up in the tail.&lt;/p&gt;

&lt;p&gt;This is also write amplification: one logical insert can cause two or more physical page writes.&lt;/p&gt;
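&lt;p&gt;The split step itself can be sketched in isolation - one full leaf, one new sibling, one new routing key in the parent. This simplifies heavily (real splits also rewrite page headers and sibling pointers):&lt;/p&gt;

```python
# One leaf split: the full page divides its entries with a new sibling,
# and the parent internal node gains a routing key for the sibling.

def split_leaf(leaf, parent_keys):
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    parent_keys.append(right[0])   # parent now routes keys from right[0] to the sibling
    parent_keys.sort()
    return left, right             # two page writes where one would have sufficed

full_leaf = ["c1", "c2", "c3", "c4", "c5", "c6"]
parent = ["c1", "g1"]              # existing routing keys in the internal node
left, right = split_leaf(full_leaf, parent)
print(left, right)   # ['c1', 'c2', 'c3'] ['c4', 'c5', 'c6']
print(parent)        # ['c1', 'c4', 'g1'] - new routing key for the new sibling
```

&lt;p&gt;If appending the routing key overflows the parent as well, the same operation repeats one level up - that's the cascade described above.&lt;/p&gt;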

&lt;h3&gt;
  
  
  Why B+Tree reads are fast, writes have variance
&lt;/h3&gt;

&lt;p&gt;To summarize the B+Tree:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reads:&lt;/strong&gt; 3–4 page reads to find any row in a million-row table. Fast, predictable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Range reads:&lt;/strong&gt; Follow the leaf linked list sequentially. Very fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writes:&lt;/strong&gt; Usually fast, but page splits cause write amplification and latency spikes on random keys. The variance shows up at p99.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  B-Tree - What MongoDB Uses and How It Differs
&lt;/h2&gt;

&lt;p&gt;MongoDB's WiredTiger storage engine uses a B-Tree, and most engineers use the terms B-Tree and B+Tree interchangeably. They're not the same - but the difference that matters for WiredTiger isn't the one most textbooks describe.&lt;/p&gt;

&lt;p&gt;The key difference is in the leaf layer.&lt;/p&gt;

&lt;p&gt;In a &lt;strong&gt;B+Tree&lt;/strong&gt;, all leaf nodes are linked together in a doubly linked list in sorted key order. Once you find the start of a range, you follow the chain forward page by page without re-entering the tree. This is what makes PostgreSQL range scans fast: a &lt;code&gt;BETWEEN&lt;/code&gt; query traverses the tree once to find the starting leaf, then reads forward along the linked list to the end of the range.&lt;/p&gt;

&lt;p&gt;In WiredTiger's B-Tree, leaf nodes hold the actual document data, same as a B+Tree so far. The structural difference is that &lt;strong&gt;leaf nodes are not linked&lt;/strong&gt;. There is no chain to follow between adjacent leaves. To advance from one leaf to the next during a range scan, the engine must re-enter the tree from a higher level to find the next page.&lt;/p&gt;

&lt;p&gt;What this means in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Range scans are more expensive per step.&lt;/strong&gt; A PostgreSQL range scan follows linked leaf pages sequentially. A WiredTiger range scan re-traverses the tree structure to reach each successive page. For small range scans the difference is minimal. For large ones, the linked list wins — which is why the direct benchmark comparison in a later post shows PostgreSQL's linked B+Tree leaf nodes as a range-scan advantage over MongoDB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Point reads are equivalent.&lt;/strong&gt; Both structures find a single key in O(log n) page reads. Neither has a meaningful advantage for typical table sizes.&lt;/p&gt;

&lt;p&gt;For most transactional workloads — point lookups, small range scans, mixed reads and writes — the practical performance difference between WiredTiger's B-Tree and PostgreSQL's B+Tree is small. WiredTiger's implementation is heavily optimized and the engine adds its own caching and concurrency mechanisms on top.&lt;/p&gt;

&lt;p&gt;The important comparison is not B-Tree vs B+Tree. It's both of them versus what comes next.&lt;/p&gt;




&lt;h2&gt;
  
  
  LSM Tree - The Structure That Powers Cassandra
&lt;/h2&gt;

&lt;p&gt;The Log Structured Merge Tree (LSM Tree) starts from a completely different premise: what if we never modified data on disk at all?&lt;/p&gt;

&lt;p&gt;This is the key insight that makes Cassandra's write throughput possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  The core insight: sequential writes beat random writes
&lt;/h3&gt;

&lt;p&gt;On any storage medium — spinning disk, SSD, NVMe — sequential writes are faster than random writes. On a spinning disk, the difference is enormous (the head doesn't need to seek). On an SSD, it's smaller but real (flash write amplification is lower for sequential patterns). The gap narrows on modern NVMe, but it never disappears.&lt;/p&gt;

&lt;p&gt;B+Tree writes are fundamentally random: each insert must find its exact position in the tree and modify the page at that location. The page could be anywhere on disk.&lt;/p&gt;

&lt;p&gt;LSM asks: what if we turned all writes into sequential appends?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Memtable -- all writes go here first
&lt;/h3&gt;

&lt;p&gt;When Cassandra receives an insert for an &lt;code&gt;orders&lt;/code&gt; row, it writes to the &lt;strong&gt;Memtable&lt;/strong&gt;, an in-memory sorted data structure. Think of it as a sorted list living entirely in RAM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Memtable (in memory):
┌──────────────────────────────────────────────┐
│ order: aaa-111, user: x, status: pending ... │
│ order: bbb-222, user: y, status: shipped ... │
│ order: ccc-333, user: x, status: delivered.. │
│ order: ddd-444, user: z, status: pending ... │
│                  ↑ sorted by order_id        │
└──────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Writing to the Memtable is fast because it's just RAM. It's sorted because reads need to find data efficiently. New inserts go to their sorted position in memory - no random disk I/O in the write path at all, just the sequential append to the CommitLog (Cassandra's crash-recovery log), which is extremely fast.&lt;/p&gt;
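&lt;p&gt;A Memtable can be sketched as a sorted array plus a sequential log. The class below is illustrative - the names and structure are assumptions for the sketch, not Cassandra's actual implementation:&lt;/p&gt;

```python
# Memtable sketch: every write appends to a commit log (sequential) and
# lands in sorted position in memory; a flush emits the whole structure
# in one sorted pass.
import bisect

class Memtable:
    def __init__(self):
        self.keys, self.rows = [], []
        self.commit_log = []                 # stands in for the sequential CommitLog

    def write(self, key, row):
        self.commit_log.append((key, row))   # durability: sequential append
        i = bisect.bisect_left(self.keys, key)
        if i != len(self.keys) and self.keys[i] == key:
            self.rows[i] = row               # newer version replaces older in memory
        else:
            self.keys.insert(i, key)
            self.rows.insert(i, row)

    def flush(self):
        """One sequential pass produces an immutable, sorted SSTable."""
        return list(zip(self.keys, self.rows))

mt = Memtable()
mt.write("ccc-333", {"user": "x"})
mt.write("aaa-111", {"user": "y"})
mt.write("bbb-222", {"user": "z"})
print([k for k, _ in mt.flush()])  # ['aaa-111', 'bbb-222', 'ccc-333'] - sorted on flush
```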

&lt;h3&gt;
  
  
  The SSTable - immutable and sorted
&lt;/h3&gt;

&lt;p&gt;When the Memtable fills up, Cassandra flushes it to disk as an &lt;strong&gt;SSTable&lt;/strong&gt; (Sorted String Table). This flush is a single sequential write: the entire sorted Memtable gets written from beginning to end in one pass. No random writes. No page modifications. Just a stream of sorted data to a new file.&lt;/p&gt;

&lt;p&gt;Once written, &lt;strong&gt;an SSTable is never modified&lt;/strong&gt;. It is immutable. Future writes don't touch it. This immutability is what makes the write path so clean: you never need to find a specific location on disk and modify it. You only ever write new files.&lt;/p&gt;

&lt;p&gt;The immutability has an important consequence for updates and deletes. In a B+Tree, an update modifies the existing row in place. In an LSM Tree, an update writes a new version of the row to the current Memtable, which eventually becomes a new SSTable. Both the old version and the new version coexist on disk until compaction runs. Similarly, a delete doesn't remove data; it writes a &lt;strong&gt;tombstone record&lt;/strong&gt; that marks the row as deleted. The old data persists until compaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  The read problem — read amplification
&lt;/h3&gt;

&lt;p&gt;Over time, you accumulate multiple SSTables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Disk state after several Memtable flushes:

SSTable-1 (oldest): [aaa-111] [ccc-333] [eee-555] [ggg-777]
SSTable-2:          [bbb-222] [ddd-444] [fff-666]
SSTable-3:          [aaa-111] [hhh-888]  ← updated version of aaa-111
SSTable-4 (newest): [iii-999] [jjj-000]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you query for &lt;code&gt;order_id = aaa-111&lt;/code&gt;. Where is the latest version?&lt;/p&gt;

&lt;p&gt;It could be in any SSTable. SSTable-1 has an old version. SSTable-3 has a newer version. To find the latest, the read path must check all four SSTables, compare the timestamps on any matching rows, and return the most recent one. This is &lt;strong&gt;read amplification&lt;/strong&gt; — one logical read requires multiple physical reads across multiple files.&lt;/p&gt;

&lt;p&gt;Before compaction, with many SSTables, reads are slow. A later post's benchmark puts a precise number on exactly this: Cassandra's cold read p99 before compaction is dramatically higher than PostgreSQL's — a gap that closes almost entirely once compaction runs and SSTable count drops to one.&lt;/p&gt;
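&lt;p&gt;The reconciliation rule itself is simple: check every candidate file, and the highest write timestamp wins. A sketch, with SSTables modeled as plain dicts:&lt;/p&gt;

```python
# Read reconciliation across SSTables: every file that may hold the key
# is checked, and the version with the highest write timestamp wins.

def read(key, sstables):
    best = None
    for table in sstables:                   # every candidate file is a disk touch
        row = table.get(key)
        if row is not None:
            if best is None or row["ts"] > best["ts"]:
                best = row                   # highest timestamp wins
    return best

sstables = [
    {"aaa-111": {"status": "pending",   "ts": 1}},   # SSTable-1 (oldest)
    {"bbb-222": {"status": "shipped",   "ts": 2}},   # SSTable-2
    {"aaa-111": {"status": "delivered", "ts": 3}},   # SSTable-3 (newer version)
]
print(read("aaa-111", sstables))  # {'status': 'delivered', 'ts': 3}
```

&lt;p&gt;The loop over every SSTable is the read amplification; the next section shows how Bloom filters let most of those checks be skipped.&lt;/p&gt;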

&lt;h3&gt;
  
  
  Bloom filters - the read shortcut
&lt;/h3&gt;

&lt;p&gt;Reading every SSTable for every query would be unacceptably slow. Cassandra uses &lt;strong&gt;Bloom filters&lt;/strong&gt; to short-circuit most of these checks.&lt;/p&gt;

&lt;p&gt;A Bloom filter is a small, probabilistic data structure that answers one question: is this key &lt;em&gt;definitely not&lt;/em&gt; in this SSTable?&lt;/p&gt;

&lt;p&gt;The key word is &lt;em&gt;definitely not&lt;/em&gt;. A Bloom filter can give you a false positive (it says "maybe yes" when the key isn't there) but it never gives you a false negative (if it says "definitely not," you can trust it). So before reading an SSTable, Cassandra checks the Bloom filter for that SSTable — if it says the key isn't there, you skip the entire SSTable. No disk read required.&lt;/p&gt;

&lt;p&gt;In practice, Bloom filters eliminate most SSTable checks for most reads. Instead of reading 10 SSTables to find a key, you might read 1 or 2, the ones whose Bloom filters said "maybe." The filters live in memory, so checking them costs microseconds.&lt;/p&gt;

&lt;p&gt;Bloom filters don't eliminate read amplification entirely, but they dramatically reduce it. They're why LSM reads are "slow but not catastrophic" rather than "completely unusable."&lt;/p&gt;
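&lt;p&gt;A minimal Bloom filter fits in a dozen lines. The two salted hashes below are a simplification - real filters size the bit array and hash count to a target false-positive rate:&lt;/p&gt;

```python
# Toy Bloom filter: adding a key sets a few bit positions; a lookup that
# finds any of those bits clear proves the key was never added.

class BloomFilter:
    def __init__(self, bits=1024):
        self.bits = [False] * bits

    def _positions(self, key):
        n = len(self.bits)
        return [hash((salt, key)) % n for salt in (1, 2)]  # two salted hashes

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False means definitely absent; True means "maybe" (false positives possible)
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("aaa-111")
print(bf.might_contain("aaa-111"))  # True - added keys always report "maybe"
# a miss lets the read path skip this SSTable entirely, with no disk read:
print(bf.might_contain("zzz-999"))  # almost certainly False
```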

&lt;h3&gt;
  
  
  Compaction - the background process that makes reads fast again
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Compaction&lt;/strong&gt; is the merge step that gives LSM its name. At regular intervals, Cassandra picks a set of SSTables and merges them into a single new SSTable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before compaction:
SSTable-1: [aaa-111 v1] [ccc-333]
SSTable-3: [aaa-111 v2] [hhh-888]

After compaction:
SSTable-merged: [aaa-111 v2] [ccc-333] [hhh-888]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During compaction, for any key that appears in multiple SSTables, only the latest version (highest timestamp) survives. Tombstones are eventually resolved and the deleted data is removed. Old SSTables are deleted after the merge completes.&lt;/p&gt;

&lt;p&gt;After compaction, there are fewer SSTables, so reads check fewer files. Bloom filters cover fewer files. Read latency drops.&lt;/p&gt;

&lt;p&gt;This is the mechanism behind the benchmark number: Cassandra's read latency fell from 4.1ms to 1.4ms after compaction. Same data, same hardware, fewer SSTables to check. The engine cleaned up after itself, and reads got faster as a result.&lt;/p&gt;

&lt;p&gt;There are different compaction strategies with different tradeoffs — some optimize for write throughput, others for read consistency, others for time-series data. Post 4 covers these in detail.&lt;/p&gt;
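&lt;p&gt;The merge rule can be sketched directly: per key, the newest timestamp wins, and keys whose newest version is a tombstone drop out entirely. SSTables are modeled as plain dicts, and strategy selection is ignored:&lt;/p&gt;

```python
# Compaction sketch: merge several SSTables into one, keeping only the
# latest version of each key and resolving tombstones.

def compact(sstables):
    merged = {}
    for table in sstables:
        for key, row in table.items():
            current = merged.get(key)
            if current is None or row["ts"] > current["ts"]:
                merged[key] = row                # highest timestamp survives
    live = {k: v for k, v in merged.items() if not v.get("tombstone")}
    return dict(sorted(live.items()))            # the new SSTable is in key order

before = [
    {"aaa-111": {"status": "pending", "ts": 1}, "ccc-333": {"status": "paid", "ts": 1}},
    {"aaa-111": {"status": "delivered", "ts": 2}, "hhh-888": {"status": "new", "ts": 2}},
    {"ccc-333": {"tombstone": True, "ts": 3}},   # a delete, recorded as a tombstone
]
after = compact(before)
print(list(after))                 # ['aaa-111', 'hhh-888'] - one file, latest versions only
print(after["aaa-111"]["status"])  # delivered
```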




&lt;h2&gt;
  
  
  The Tradeoff Table
&lt;/h2&gt;

&lt;p&gt;Here's how the three structures compare across the dimensions that matter:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;B+Tree (PostgreSQL)&lt;/th&gt;
&lt;th&gt;B-Tree (MongoDB)&lt;/th&gt;
&lt;th&gt;LSM Tree (Cassandra)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate - random page writes&lt;/td&gt;
&lt;td&gt;Moderate - similar to B+Tree&lt;/td&gt;
&lt;td&gt;High - sequential appends&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write variance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High - page splits cause p99 spikes&lt;/td&gt;
&lt;td&gt;High - same mechanism&lt;/td&gt;
&lt;td&gt;Low - appends are uniform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Point read speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast - log(n) tree traversal&lt;/td&gt;
&lt;td&gt;Fast - similar, sometimes shortcut&lt;/td&gt;
&lt;td&gt;Variable - depends on SSTable count and compaction state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Range scan speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast - linked leaf list&lt;/td&gt;
&lt;td&gt;Moderate - no linked leaves&lt;/td&gt;
&lt;td&gt;Moderate - needs Bloom filter + SSTable scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Updates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In-place modification&lt;/td&gt;
&lt;td&gt;In-place modification&lt;/td&gt;
&lt;td&gt;New write (old version persists until compaction)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deletes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In-place (mark deleted)&lt;/td&gt;
&lt;td&gt;In-place (mark deleted)&lt;/td&gt;
&lt;td&gt;Tombstone write (data persists until compaction)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Disk space&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compact, matches live data&lt;/td&gt;
&lt;td&gt;Compact, matches live data&lt;/td&gt;
&lt;td&gt;Can exceed live data size (historical versions + tombstones)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read after heavy writes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Consistent&lt;/td&gt;
&lt;td&gt;Consistent&lt;/td&gt;
&lt;td&gt;Degrades until compaction runs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No single structure is universally better. The table describes tradeoffs, not rankings.&lt;/p&gt;
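&lt;p&gt;The read-path rows in that table can be made concrete. Here's a minimal sketch in plain Python - illustrative only, not Cassandra's actual implementation - of why LSM point reads get slower as SSTables accumulate: every file that compaction hasn't merged away is one more place the key might live.&lt;/p&gt;

```python
# Illustrative sketch of an LSM point read - not Cassandra's real code.
# A read checks the in-memory memtable first, then each on-disk SSTable
# from newest to oldest, stopping at the first hit.

memtable = {"order:3": "shipped"}
sstables = [  # newest first; each is an immutable key-to-value snapshot
    {"order:2": "paid"},
    {"order:1": "created", "order:2": "created"},  # stale version of order:2
]

def lsm_get(key):
    if key in memtable:
        return memtable[key]
    for table in sstables:          # every extra SSTable adds one more lookup
        if key in table:
            return table[key]       # the newest version wins
    return None

print(lsm_get("order:2"))  # "paid" - the older "created" is never reached
```

&lt;p&gt;After compaction merges those two SSTables into one, the worst-case lookup touches a single file - which is exactly the read-latency recovery the table describes.&lt;/p&gt;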




&lt;h2&gt;
  
  
  Why This Matters for Your Application
&lt;/h2&gt;

&lt;p&gt;The theory translates directly to system design choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your workload is read-heavy with complex queries&lt;/strong&gt; - e-commerce product search, financial reporting, analytics over order history. Users query by multiple fields, run date range scans, join across tables. B+Tree wins here. PostgreSQL's linked leaf nodes make range scans fast. The write cost is acceptable because writes are infrequent relative to reads. The predictable read latency matters more than maximum write throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your workload is write-heavy with known, fixed access patterns&lt;/strong&gt; - event ingestion, IoT sensor data, activity logs, append-heavy order pipelines. You're inserting millions of rows and reading them back by a known key pattern. LSM wins here. Cassandra's sequential write path handles sustained high-throughput inserts without the page-split variance that B+Tree would introduce. You pay in read complexity, but if your reads are simple key lookups or partition scans, Bloom filters and compaction keep that cost manageable.&lt;/p&gt;

&lt;p&gt;Most real workloads are somewhere in the middle, which is why MongoDB exists in the space between the two extremes - a B-Tree engine with flexible documents, good write performance, and reasonable query flexibility.&lt;/p&gt;

&lt;p&gt;Post 6 will put real numbers on these tradeoffs. When you see Cassandra's write throughput compared to PostgreSQL's on the same hardware, the gap will be exactly what the data structure analysis predicts.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next: The Theory Gets Concrete
&lt;/h2&gt;

&lt;p&gt;You now have the foundation: B+Tree for reads, LSM for writes, and a B-Tree sitting comfortably in between. You know why page splits cause latency spikes. You know why Cassandra reads degrade with SSTable accumulation and recover after compaction.&lt;/p&gt;

&lt;p&gt;The next post takes everything you just learned and shows it working inside real database engines.&lt;/p&gt;

&lt;p&gt;We're going inside PostgreSQL and MongoDB - from the moment a &lt;code&gt;SELECT&lt;/code&gt; arrives, through the buffer pool, down the B+Tree, out to the heap file, and back. We'll trace a write through the WAL and watch it land in shared_buffers. We'll see what MVCC actually looks like inside a heap page, and why a PostgreSQL UPDATE doesn't modify a row - it writes a new one.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>cassandra</category>
      <category>performance</category>
    </item>
    <item>
      <title>What Actually Happens When You Call INSERT?</title>
      <dc:creator>priteshsurana</dc:creator>
      <pubDate>Tue, 31 Mar 2026 01:56:30 +0000</pubDate>
      <link>https://forem.com/priteshsurana/what-actually-happens-when-you-call-insert-a93</link>
      <guid>https://forem.com/priteshsurana/what-actually-happens-when-you-call-insert-a93</guid>
      <description>&lt;p&gt;You call &lt;code&gt;INSERT&lt;/code&gt;. The database says &lt;code&gt;OK&lt;/code&gt;. You move on.&lt;/p&gt;

&lt;p&gt;That acknowledgment feels instant. It feels like the database just... wrote something down. But between your &lt;code&gt;INSERT&lt;/code&gt; and that &lt;code&gt;OK&lt;/code&gt;, at minimum four distinct things happened that most engineers who use databases every day have never thought about:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The write was recorded in a sequential log before it touched any data structure, so a crash wouldn't lose it&lt;/li&gt;
&lt;li&gt;At least one index was updated - and on some engines that update is more expensive than the insert itself&lt;/li&gt;
&lt;li&gt;A decision was made about whether to hit the disk synchronously or defer it; a tradeoff with real latency consequences&lt;/li&gt;
&lt;li&gt;The data was placed into a structure that was chosen years ago by the database designers, and that choice explains almost every performance characteristic you've ever observed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This series is about those four things.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Moment of Insertion
&lt;/h2&gt;

&lt;p&gt;Let's start with the most basic question: where does the data actually go first?&lt;/p&gt;

&lt;p&gt;The answer is different for each database, and the differences are not superficial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; receives your &lt;code&gt;INSERT&lt;/code&gt; and immediately writes a record to something called the &lt;strong&gt;WAL&lt;/strong&gt; - the Write-Ahead Log. This is a sequential append-only file on disk. Only after that WAL record is safely written does PostgreSQL modify the in-memory structure called &lt;strong&gt;shared_buffers&lt;/strong&gt;, and eventually write the row to its final resting place: a &lt;strong&gt;heap file&lt;/strong&gt;. The heap is just what it sounds like: rows stored roughly in the order they arrived, in fixed 8KB pages. The primary key index is a completely separate structure, updated separately.&lt;/p&gt;
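&lt;p&gt;The ordering rule here - log first, data structure second - is the whole idea of write-ahead logging. A toy sketch in Python (illustrative, nothing like PostgreSQL's real code) shows why that ordering makes crashes survivable:&lt;/p&gt;

```python
# Minimal write-ahead-logging sketch - illustrative, not PostgreSQL's code.
# The rule: the log record must be durable on disk BEFORE the in-memory
# page changes, so a crash can always be repaired by replaying the log.
import json, os, tempfile

LOG = os.path.join(tempfile.mkdtemp(), "wal.log")
pages = {}  # stands in for shared_buffers

def insert(key, row):
    record = json.dumps({"key": key, "row": row})
    with open(LOG, "a") as wal:
        wal.write(record + "\n")
        wal.flush()
        os.fsync(wal.fileno())  # durable before any data structure changes
    pages[key] = row            # only now touch the in-memory page

def recover():
    # After a crash, replaying the log rebuilds every acknowledged write.
    recovered = {}
    with open(LOG) as wal:
        for line in wal:
            rec = json.loads(line)
            recovered[rec["key"]] = rec["row"]
    return recovered

insert("order:1", {"amount": 42})
pages.clear()    # simulate losing all in-memory state in a crash
print(recover()) # the acknowledged insert survives
```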

&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt; receives your document and routes it through &lt;strong&gt;WiredTiger&lt;/strong&gt;, its storage engine. WiredTiger writes to its own &lt;strong&gt;journal&lt;/strong&gt; (similar in spirit to PostgreSQL's WAL but different in format and behavior) and places the document into the &lt;strong&gt;WiredTiger cache&lt;/strong&gt;, an in-memory buffer. Here's the twist: MongoDB's primary storage structure is a B-Tree, and unlike PostgreSQL's heap, the document data lives &lt;em&gt;inside&lt;/em&gt; the B-Tree itself. There is no separate heap. The collection and the primary index are the same structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cassandra&lt;/strong&gt; does something neither of the others does. It writes to a &lt;strong&gt;CommitLog&lt;/strong&gt; (its crash-recovery log) and then places the data into a &lt;strong&gt;Memtable&lt;/strong&gt;, a sorted in-memory buffer that accumulates writes until it's full, at which point it gets flushed to disk as an immutable file called an &lt;strong&gt;SSTable&lt;/strong&gt;. Cassandra never modifies existing files. Every write is an append. Every update is a new version. Every delete is a new record saying "this data is gone."&lt;/p&gt;
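&lt;p&gt;A toy version of that memtable-flush cycle, in Python (illustrative only - Cassandra's real memtable is a concurrent sorted structure, and real SSTables carry indexes and metadata alongside the data):&lt;/p&gt;

```python
# Illustrative memtable-flush sketch - not Cassandra's real implementation.
# Writes land in an in-memory buffer; when it exceeds a size threshold it
# is frozen, written out as an immutable sorted file (an "SSTable"), and
# a fresh empty memtable takes its place. Disk files are never modified.

FLUSH_THRESHOLD = 3
memtable = {}
sstables = []  # immutable sorted snapshots, in flush order

def write(key, value):
    memtable[key] = value                        # never touches disk files
    if len(memtable) >= FLUSH_THRESHOLD:
        frozen = dict(sorted(memtable.items()))  # SSTable = Sorted String Table
        sstables.append(frozen)
        memtable.clear()

for i in range(1, 5):
    write(f"sensor:{i}", i * 10)

print(len(sstables), list(sstables[0]))  # one flushed file, keys sorted
```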

&lt;p&gt;Three databases. Three fundamentally different answers to the question "where does the data go first."&lt;/p&gt;

&lt;p&gt;Why does this matter? Because each approach has consequences that ripple through everything - write speed, read speed, crash recovery time, compaction behavior, and what happens to your p99 latency under load. Posts 2 through 4 go deep on each engine. But first, let's look at the structure that sits underneath all of this.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Index Is Not What You Think
&lt;/h2&gt;

&lt;p&gt;Most engineers think about indexes when they're writing queries. "This query is slow, I need an index." The mental model is: indexes exist to make reads fast.&lt;/p&gt;

&lt;p&gt;That's not wrong. But it's half the picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every index is updated on every write.&lt;/strong&gt; Throughout this series we'll use a concrete example: an orders table with fields like order_id, user_id, status, amount, and created_at. When you insert an order, PostgreSQL doesn't just write the row to the heap; it also updates the B+Tree index on &lt;code&gt;order_id&lt;/code&gt;, the B+Tree index on &lt;code&gt;user_id&lt;/code&gt;, and the B+Tree index on &lt;code&gt;created_at&lt;/code&gt;. Three indexes mean four write operations - one heap write plus three index updates - for a single &lt;code&gt;INSERT&lt;/code&gt;. The index update cost is not a footnote. On tables with several indexes under heavy write load, index maintenance is often the dominant cost of the write path.&lt;/p&gt;
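&lt;p&gt;You can see the amplification in a toy model. This sketch (plain Python, purely illustrative - real B+Tree maintenance is far more involved) treats each index as a separate sorted list that must be maintained on every insert:&lt;/p&gt;

```python
# Illustrative sketch of index write amplification - not PostgreSQL's code.
# Every secondary index is a separate sorted structure, and every INSERT
# must update all of them, not just the heap.
import bisect

heap = []                                    # rows in arrival order
indexes = {"order_id": [], "user_id": [], "created_at": []}

def insert(row):
    heap.append(row)                         # 1 heap write
    pos = len(heap) - 1
    for col, idx in indexes.items():         # plus 1 write per index
        bisect.insort(idx, (row[col], pos))  # keep each index sorted
    return 1 + len(indexes)                  # total write operations

row = {"order_id": 7, "user_id": 3, "created_at": "2026-03-31", "amount": 42}
print(insert(row))  # 4 - one INSERT, four write operations
```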

&lt;p&gt;Now here's the part that most engineers have never considered: the three databases use fundamentally different data structures for those indexes, and the choice of structure explains almost everything about their performance profiles.&lt;/p&gt;

&lt;p&gt;PostgreSQL uses a &lt;strong&gt;B+Tree&lt;/strong&gt;. MongoDB's WiredTiger also uses a &lt;strong&gt;B-Tree&lt;/strong&gt; (a close relative). Cassandra uses an &lt;strong&gt;LSM Tree&lt;/strong&gt;, a completely different class of structure that doesn't modify anything in place, ever.&lt;/p&gt;

&lt;p&gt;The B+Tree and B-Tree are optimized for reads. You can find any key in a handful of disk reads, and the structure stays balanced automatically. The cost is that every write has to find the right place in the tree and modify it, which can mean reading pages from disk, modifying them, and writing them back. Random writes. Slow on spinning disks, less slow on SSDs, but never free.&lt;/p&gt;

&lt;p&gt;The LSM Tree is optimized for writes. New data always goes to the end of something — a log, a buffer, a new file. There are no random writes. The cost is that reads become more complex, because the data you're looking for might be in any of several files that were written at different times.&lt;/p&gt;

&lt;p&gt;This is the reason Cassandra writes faster than PostgreSQL under sustained load. And it's the reason Cassandra reads &lt;em&gt;slower&lt;/em&gt; before a process called &lt;strong&gt;compaction&lt;/strong&gt; runs.&lt;/p&gt;

&lt;p&gt;Post 2 breaks this down completely. What each structure looks like, how it works, and why the choice of structure is the single most important decision a database designer makes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "Durability" Actually Costs
&lt;/h2&gt;

&lt;p&gt;When someone says a database is durable, they mean: if the server loses power the instant after your &lt;code&gt;INSERT&lt;/code&gt; is acknowledged, your data will still be there when the server comes back.&lt;/p&gt;

&lt;p&gt;That's a strong guarantee. And it has a real cost.&lt;/p&gt;

&lt;p&gt;To survive a power failure, data has to reach physical storage - magnetic disk or flash cells - before the acknowledgment goes out. Not the OS's buffer. Not the CPU cache. The actual hardware. The system call that forces this is called &lt;code&gt;fsync&lt;/code&gt;, and it is one of the most expensive operations a database performs. On a typical NVMe SSD, a single &lt;code&gt;fsync&lt;/code&gt; takes on the order of tens to hundreds of microseconds. On a networked file system or a busy system under load, it can take milliseconds. And some workloads trigger thousands of them per second.&lt;/p&gt;
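&lt;p&gt;You can feel this cost directly. The snippet below (plain Python; the absolute numbers depend entirely on your hardware) compares a buffered write, which returns as soon as the OS page cache accepts it, with a write followed by &lt;code&gt;fsync&lt;/code&gt;:&lt;/p&gt;

```python
# Rough illustration of what fsync costs relative to a buffered write.
# Absolute numbers vary wildly by hardware and filesystem; the point is
# the ratio - the buffered write never waits for physical storage.
import os, time, tempfile

fd = os.open(os.path.join(tempfile.mkdtemp(), "wal"), os.O_WRONLY | os.O_CREAT)

t0 = time.perf_counter()
os.write(fd, b"record\n")   # lands in the OS page cache only
buffered = time.perf_counter() - t0

t0 = time.perf_counter()
os.write(fd, b"record\n")
os.fsync(fd)                # forces the data down to physical storage
synced = time.perf_counter() - t0
os.close(fd)

print(f"buffered: {buffered*1e6:.0f}us  with fsync: {synced*1e6:.0f}us")
```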

&lt;p&gt;Here's where the databases diverge sharply.&lt;/p&gt;

&lt;p&gt;PostgreSQL's default behavior is &lt;strong&gt;synchronous durability&lt;/strong&gt;: every committed transaction waits for its WAL record to be &lt;code&gt;fsync&lt;/code&gt;'d to disk before the acknowledgment goes out. You get the strongest possible guarantee. You also pay for every write with a disk flush.&lt;/p&gt;

&lt;p&gt;MongoDB's default behavior has historically been more relaxed. By default, the journal syncs every 100 milliseconds. Your write is acknowledged as soon as WiredTiger's in-memory cache accepts it. You get lower latency. You accept that up to 100ms of writes could be lost in a hard crash. This is configurable - &lt;code&gt;j: true&lt;/code&gt; forces a journal flush per write. But the default trades some durability for speed.&lt;/p&gt;

&lt;p&gt;Cassandra's CommitLog syncs periodically too, every ~10 seconds by default. Same tradeoff, pushed even further toward throughput. If you need per-write durability in Cassandra, you configure it, and you pay the latency cost.&lt;/p&gt;
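&lt;p&gt;The arithmetic behind these policies is simple enough to sketch. This toy calculation (illustrative - real group commit batches concurrent transactions more cleverly than this) counts how many flushes a ten-second burst of writes triggers under each policy:&lt;/p&gt;

```python
# Illustrative comparison of durability policies - not real database code.
# Per-write sync resembles PostgreSQL's default posture; periodic sync
# resembles MongoDB's ~100ms journal interval and Cassandra's ~10s
# CommitLog window.

def fsyncs_needed(num_writes, writes_per_second, sync_interval_ms):
    if sync_interval_ms == 0:        # sync on every commit
        return num_writes
    duration_ms = num_writes / writes_per_second * 1000
    return max(1, int(duration_ms / sync_interval_ms))

writes = 100_000
rate = 10_000  # writes/sec, i.e. a 10-second run
print("per-write:   ", fsyncs_needed(writes, rate, 0))       # 100000 fsyncs
print("100ms window:", fsyncs_needed(writes, rate, 100))     # 100 fsyncs
print("10s window:  ", fsyncs_needed(writes, rate, 10_000))  # 1 fsync
```

&lt;p&gt;Five orders of magnitude fewer flushes is where the periodic-sync throughput comes from - and the window between flushes is exactly the data you're agreeing to lose in a hard crash.&lt;/p&gt;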

&lt;p&gt;These aren't just configuration details. They're architectural philosophies. The benchmark numbers in later posts will show you exactly what each philosophy costs in microseconds. The difference between synchronous and asynchronous durability shows up plainly in the latency measurements — and the gap is bigger than most engineers expect.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Benchmark Teaser
&lt;/h2&gt;

&lt;p&gt;Here's the shape of what you'll see when you run 1 million inserts and then reads against all three databases on identical hardware. These numbers are illustrative - directionally accurate - and the real C++ benchmark results are covered in a later post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single-threaded insert throughput
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;th&gt;Rows/sec (approx)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cassandra&lt;/td&gt;
&lt;td&gt;~18,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;~12,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;~8,500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cassandra is more than 2× faster than PostgreSQL on pure insert throughput. If you know about LSM Trees, this makes sense. If you don't, it looks like magic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cold primary key read latency (p99)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;th&gt;p99 latency (approx)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;~1.2ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;~1.8ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cassandra (before compaction)&lt;/td&gt;
&lt;td&gt;~4.1ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cassandra (after compaction)&lt;/td&gt;
&lt;td&gt;~1.4ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now look at Cassandra. Before compaction, it's the slowest reader by a factor of 3. After compaction, it's competitive with PostgreSQL. Same database. Same data. Same query. The difference is whether a background merge process has run.&lt;/p&gt;

&lt;p&gt;This is the central tension of LSM-based storage: you pay for write speed with read complexity, and you recover that complexity through compaction. The "after compaction" number is not a cheat — it reflects real production behavior. But it tells you something important: Cassandra's read latency is not a fixed property. It depends on what state the engine is in.&lt;/p&gt;
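&lt;p&gt;Compaction itself is conceptually simple, even though real strategies differ in which files they pick and when. A toy merge in Python (illustrative only - not any real Cassandra compaction strategy):&lt;/p&gt;

```python
# Illustrative compaction sketch - not a real Cassandra strategy.
# Compaction merges several SSTables into one: the newest version of
# each key wins, and tombstones let deleted keys drop out entirely.

TOMBSTONE = object()

def compact(sstables_newest_first):
    merged = {}
    for table in sstables_newest_first:
        for key, value in table.items():
            merged.setdefault(key, value)  # first-seen (newest) version wins
    # Tombstoned keys are purged, reclaiming their space.
    return {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}

sstables = [                   # newest first
    {"order:1": TOMBSTONE},    # order:1 was deleted
    {"order:1": "paid", "order:2": "paid"},
    {"order:2": "created", "order:3": "created"},
]
print(compact(sstables))  # {'order:2': 'paid', 'order:3': 'created'}
```

&lt;p&gt;Three files become one, stale versions vanish, and a point read that previously had to consult every file now checks a single one - which is why the p99 recovers.&lt;/p&gt;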

&lt;p&gt;These numbers also tell you something about the write-read tradeoff that runs through this entire series. The engine that writes fastest reads slowest before it cleans up after itself. The engine that reads fastest — PostgreSQL — is also the one paying the highest cost per write to keep its B+Tree in a consistent, readable state.&lt;/p&gt;

&lt;p&gt;The real numbers from the C++ benchmark, with methodology, hardware specs, and full latency distributions, are in a later post. But this is the shape of what you'll find.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Series Will Cover
&lt;/h2&gt;

&lt;p&gt;Here's the full map. Each post answers one question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post 1 - What is actually happening when you insert a row?&lt;/strong&gt;&lt;br&gt;
This post. The wide-angle view. Why the insert isn't simple, and why the differences between engines matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post 2 - Why do databases use B+Tree, B-Tree, or LSM and what's the real difference?&lt;/strong&gt;&lt;br&gt;
The data structures underneath the storage engines, explained from first principles. Why the choice of structure determines almost everything about read and write performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post 3 and 4 - How do PostgreSQL and MongoDB store and retrieve data internally?&lt;/strong&gt;&lt;br&gt;
Deep dive into the heap file, shared_buffers, WAL, WiredTiger's B-Tree, the journal, MVCC, and what a real insert and read look like step by step in each engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post 5 - How does Cassandra's LSM-based storage work end to end?&lt;/strong&gt;&lt;br&gt;
Memtables, SSTables, Bloom filters, compaction strategies, tombstones, and why deleting data in Cassandra is more complicated than it sounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post 6 - What do the benchmark numbers actually show - and why?&lt;/strong&gt;&lt;br&gt;
The C++ benchmark: 1 million records, controlled hardware, full methodology. Write throughput, read latency distributions, the compaction effect, and what explains every number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post 7 - How does everything change when you go multi-node?&lt;/strong&gt;&lt;br&gt;
Replication, sharding, consistency levels, and why the single-node storage engine behavior is just the beginning. CAP theorem applied to real production decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Before we can explain why Cassandra writes twice as fast as PostgreSQL but reads three times slower before compaction, we need to understand the structures that cause those numbers.&lt;/p&gt;

&lt;p&gt;The next post is entirely about data structures - B+Tree, B-Tree, and LSM Tree. You'll understand why B+Trees are the default for read-heavy databases, why LSM Trees dominate write-heavy systems, and why the choice of structure is the single decision that every other database behavior flows from.&lt;/p&gt;

&lt;p&gt;If you've ever wondered why adding a sixth index to a table hurt write performance more than the fifth index did, the next post will have the answer.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>mongodb</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
